comp.lang.ruby
Cut pages for OCR with RMagick?
Axel Etzold
9/29/2007 2:45:00 PM
Dear all,
I have many scanned pages which I'd like to cut to prepare them
for OCR.
There are two things I'd like to do:
1.) Cut off a header of each page containing the page number,
2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:
  Chapter1's text     Chapter1's text
  Chapter1's text     Chapter1's text
  Chapter1's text     Chapter1's text
  Chapter1's text     Chapter1's text
                                        <---- cut here, at this blank
  Chapter2's text     Chapter2's text
  Chapter2's text     Chapter2's text
  Chapter2's text     Chapter2's text
  Chapter2's text     Chapter2's text
                   ^
                   |
                   --- (Then cut vertically)
I have tried to convert my pages, which are A4 and 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?
Thank you very much,
Axel
2 Answers
Ilmari Heikkinen
9/29/2007 8:14:00 PM
On 9/29/07, Axel Etzold <AEtzold@gmx.de> wrote:
> Dear all,
>
> I have many scanned pages which I'd like to cut to prepare them
> for OCR.
> There are two things I'd like to do:
>
> 1.) Cut off a header of each page containing the page number,
>
> 2.) Find the largest horizontal blanks in a page (which are supposed
> to separate chapters) like this:
If you have a binary string of the pixel data in the image
(I guess to_blob gives that), you can do something like this
for cutting the vertical spans of non-white pixels:
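  # pixels is assumed to be a raw binary pixel string, one scanline after another
  scanline_bytes = image_width * bytes_per_pixel
  # /m lets . match newline bytes too, /n treats the data as raw bytes
  scanlines = pixels.scan(/.{#{scanline_bytes}}/mn)
  chapters = [[]]
  scanlines.each{|sl|
    if is_white(sl)
      # a white scanline closes the current chapter, unless it is still empty
      chapters << [] unless chapters.last.empty?
    else
      chapters.last << sl
    end
  }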
To find larger spans of white, keep a count of the consecutive white
scanlines seen so far. #is_white can well be something that returns
true if fewer than 50 pixels on a scanline are non-white, or some such.
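As a rough sketch of the idea above (not code from this thread), the following splits a raw pixel string into blocks separated by runs of blank scanlines. It assumes 8-bit grayscale data with one byte per pixel, such as RMagick's export_pixels_to_str might give with a one-channel map. WHITE_THRESHOLD, blank_line? and split_blocks are made-up names, and the threshold of 200 is a guess to tune per scan.

```ruby
# Sketch: split a raw 8-bit grayscale pixel string into text blocks
# separated by runs of blank scanlines. All names here are illustrative.

WHITE_THRESHOLD = 200  # assumed: byte values >= 200 count as white

# A scanline is "blank" if it has at most max_dark dark pixels (scan noise).
def blank_line?(scanline, max_dark = 0)
  scanline.each_byte.count { |b| b < WHITE_THRESHOLD } <= max_dark
end

# Returns an array of [first_row, last_row] pairs, one per text block.
# Blank runs shorter than min_gap rows are treated as ordinary line spacing
# and do not split a block.
def split_blocks(pixels, width, min_gap = 1, max_dark = 0)
  rows = pixels.bytesize / width
  blocks = []
  start = nil      # first row of the block being built
  last_ink = nil   # last non-blank row seen so far
  rows.times do |y|
    line = pixels.byteslice(y * width, width)
    next if blank_line?(line, max_dark)
    if start && (y - last_ink - 1) >= min_gap
      blocks << [start, last_ink]
      start = y
    end
    start ||= y
    last_ink = y
  end
  blocks << [start, last_ink] if start
  blocks
end
```

Raising min_gap until only the chapter breaks remain is one way to tell them apart from normal line spacing. The page-number header would then simply be the first block returned.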
To crop the margins off the chapter scanlines:
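  # offset of the first non-white byte in each scanline (nil for blank lines)
  left_border = chapter_scanlines.map{|sl| sl =~ /#{non_white_pixel}/ }.compact.min
  left_border -= left_border % bytes_per_pixel
  # same search from the right-hand end, giving an offset from the end of the line
  right_margin = chapter_scanlines.map{|sl| sl.reverse =~ /#{non_white_pixel}/ }.compact.min
  right_margin -= right_margin % bytes_per_pixel
  chapter_scanlines.map!{|sl| sl[left_border...(sl.size - right_margin)] }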
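A byte-level variant of the margin crop, without regexps, might look like the sketch below. It again assumes 8-bit grayscale scanlines (one byte per pixel). WHITE_LEVEL, ink_bounds and crop_margins are hypothetical names chosen for this example.

```ruby
# Sketch: crop the left and right white margins off a block of scanlines.
# Assumes 8-bit grayscale (one byte per pixel); names are illustrative.

WHITE_LEVEL = 200  # assumed: byte values >= 200 count as white

# Returns [left, right], the first and last column containing a dark pixel
# in any scanline, or nil if every scanline is blank.
def ink_bounds(scanlines)
  lefts  = []
  rights = []
  scanlines.each do |sl|
    bytes = sl.bytes
    first = bytes.index { |b| b < WHITE_LEVEL }
    next unless first            # blank scanline, contributes nothing
    lefts  << first
    rights << bytes.rindex { |b| b < WHITE_LEVEL }
  end
  lefts.empty? ? nil : [lefts.min, rights.max]
end

# Returns new scanlines cut down to the columns between the ink bounds.
def crop_margins(scanlines)
  left, right = ink_bounds(scanlines)
  return scanlines unless left
  scanlines.map { |sl| sl.byteslice(left, right - left + 1) }
end
```

Working on column indices rather than byte offsets from a reversed string avoids the index arithmetic for the right-hand edge, at the cost of being slower in pure Ruby.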
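The middle whitespace can be had by searching for a run of white pixels
(tune the magic number so the run is too wide to be a space between
characters; white_pixel stands for a pattern matching one white pixel):
  left_border = chapter_scanlines.map{|sl| sl =~ /(?:#{white_pixel}){20}/ }.compact.max
and likewise with the reversed scanline for the right border.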
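Another way to locate the middle whitespace is to look for columns that are blank in every scanline and take the widest run of them. This is a minimal sketch of that idea rather than the regexp approach above. It assumes 8-bit grayscale scanlines of equal length, and INK_BELOW and find_gutter are invented names.

```ruby
# Sketch: find the widest run of blank columns inside a block of scanlines,
# e.g. the gutter between two text columns. Assumes 8-bit grayscale.

INK_BELOW = 200  # assumed: byte values below 200 count as ink

# Returns [first_col, last_col] of the widest interior blank run that is at
# least min_width columns wide, or nil if there is none.
def find_gutter(scanlines, min_width = 20)
  width = scanlines.first.bytesize
  has_ink = Array.new(width, false)
  scanlines.each do |sl|
    sl.each_byte.with_index { |b, x| has_ink[x] = true if b < INK_BELOW }
  end

  first_ink = has_ink.index(true)
  last_ink  = has_ink.rindex(true)
  return nil unless first_ink    # completely blank block

  best = nil
  run_start = nil
  # Only look between the outermost ink columns, so margins are ignored.
  (first_ink..last_ink).each do |x|
    if has_ink[x]
      if run_start
        run_width = x - run_start
        if run_width >= min_width &&
           (best.nil? || run_width > best[1] - best[0] + 1)
          best = [run_start, x - 1]
        end
        run_start = nil
      end
    else
      run_start ||= x
    end
  end
  best
end
```

As with the magic number above, min_width needs to be larger than any gap between characters or words for the result to be the gutter and not an ordinary space.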
</imaging regexps for fun and profit>
HTH,
--
Ilmari Heikkinen
http://fhtr.bl...
Tim Hunter
9/29/2007 8:40:00 PM
Axel Etzold wrote:
> I have tried to convert my pages, which are A4 and 600 dpi, to pixel arrays,
> but this is quite slow. Is there a better method, e.g. using to_blob?
to_blob just gives you an in-memory copy of the image file. If the image
is in JPG format, for example, then the blob is an in-memory JPG file.
So, there's no help there.
Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they'd be much faster. Offhand I can't think of any such
methods, but then I'm not very clever at that sort of thing.
You might try asking the ImageMagick gurus
(http://www.imagemagick.org/discour...) if there's a way to do it
with the command-line utilities. If so, you can usually translate the
commands and options into RMagick methods. See
http://www.simplesystems.org/RMagick/doc/opt... for help with that.
--
RMagick OS X Installer [http://rubyforge.org/project...]
RMagick Hints & Tips [http://rubyforge.org/forum/forum.php?for...]
RMagick Installation FAQ [http://rmagick.rubyforge.org/instal...]