Asp Forum - Cut pages for OCR with RMagick?

Axel Etzold

9/29/2007 2:45:00 PM

Dear all,

I have many scanned pages which I'd like to cut to prepare them
for OCR.
There are two things I'd like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:

Chapter1's text Chapter1's text
Chapter1's text Chapter1's text
Chapter1's text Chapter1's text
Chapter1's text Chapter1's text
<---- cut here, at this blank
Chapter2's text Chapter2's text
Chapter2's text Chapter2's text
Chapter2's text Chapter2's text
Chapter2's text Chapter2's text
^
|
--- (Then cut vertically)

I have tried to convert my pages, which are A4 and 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?

Thank you very much,

Axel


2 Answers

Ilmari Heikkinen

9/29/2007 8:14:00 PM

On 9/29/07, Axel Etzold <AEtzold@gmx.de> wrote:
> Dear all,
>
> I have many scanned pages which I'd like to cut to prepare them
> for OCR.
> There are two things I'd like to do:
>
> 1.) Cut off a header of each page containing the page number,
>
> 2.) Find the largest horizontal blanks in a page (which are supposed
> to separate chapters) like this:

If you have a binary string of the pixel data in the image
(I guess to_blob gives that), you can do something like this
for cutting the vertical spans of non-white pixels:

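scanline_bytes = image_width * bytes_per_pixel
# /m lets . match newline bytes (0x0A) too, which raw pixel data may contain
scanlines = pixels.scan(/.{#{scanline_bytes}}/m)
chapters = [[]]
scanlines.each{|sl|
  if is_white(sl)
    # a white scanline closes the current chapter, if it has any content
    chapters << [] unless chapters.last.empty?
  else
    chapters.last << sl
  end
}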

To find only the larger spans of white, keep a count of the consecutive
white scanlines seen so far. #is_white can well be something that returns
true if fewer than 50 pixels on a scanline are non-white, or some such.
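
A rough sketch of that bookkeeping, in case it helps. It assumes the pixel data is 8-bit grayscale (one byte per pixel, 255 = white); the names white_line? and largest_blank_band and the threshold values are made up for illustration and are not part of RMagick:

```ruby
# A scanline counts as blank if it has at most max_dark "ink" bytes,
# which tolerates a few specks of scanner noise.
def white_line?(scanline, threshold: 200, max_dark: 2)
  scanline.each_byte.count { |b| b < threshold } <= max_dark
end

# Returns [first_row, height] of the tallest run of blank scanlines,
# or nil if the page has no blank scanline at all.
def largest_blank_band(pixels, width)
  best = nil
  run_start = nil
  rows = pixels.bytesize / width
  # Iterate one step past the last row so a run touching the bottom is closed.
  (0..rows).each do |y|
    blank = y < rows && white_line?(pixels.byteslice(y * width, width))
    if blank
      run_start ||= y
    elsif run_start
      height = y - run_start
      best = [run_start, height] if best.nil? || height > best[1]
      run_start = nil
    end
  end
  best
end
```

The cut between chapters would then go somewhere inside the returned band, for example at first_row + height / 2.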

To crop the margins off the chapter scanlines:

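# index of the first non-white byte in each scanline; the smallest is the left border
left_border = chapter_scanlines.map{|sl| sl =~ /#{non_white_pixel}/ }.compact.min
left_border -= left_border % bytes_per_pixel
# same from the right end, then converted back to an index from the left
right_margin = chapter_scanlines.map{|sl| sl.reverse =~ /#{non_white_pixel}/ }.compact.min
right_border = scanline_bytes - 1 - right_margin
# round up to the last byte of that pixel
right_border += bytes_per_pixel - 1 - right_border % bytes_per_pixel

chapter_scanlines.map!{|sl| sl[left_border..right_border] }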

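The middle whitespace can be found like this (tune the magic number so that
the run of white pixels is too wide to be a space between characters):

# white_pixel is the counterpart of non_white_pixel: a pattern for one white pixel
left_border = chapter_scanlines.map{|sl| sl =~ /(?:#{white_pixel}){20}/ }.compact.max

and likewise with the reversed scanlines for the right border.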
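
For comparison, here is a sketch of the same margin and gutter search done with a per-column projection instead of regexps. It assumes 8-bit grayscale scanlines (one byte per pixel, 255 = white); ink_columns, crop_bounds and widest_gutter are hypothetical helper names, and the threshold and min_width defaults are guesses to be tuned:

```ruby
# Mark every column that contains at least one dark pixel in any scanline.
def ink_columns(scanlines, threshold: 200)
  dark = Array.new(scanlines.first.bytesize, false)
  scanlines.each do |sl|
    sl.each_byte.with_index { |b, x| dark[x] = true if b < threshold }
  end
  dark
end

# Leftmost and rightmost inked column, or nil for a completely blank block.
def crop_bounds(dark)
  left = dark.index(true)
  left && [left, dark.rindex(true)]
end

# Widest run of ink-free columns between the crop bounds, as [start, width].
# Runs narrower than min_width are ignored.
def widest_gutter(dark, min_width: 20)
  left, right = crop_bounds(dark)
  return nil unless left
  best = nil
  run_start = nil
  (left..right).each do |x|
    if !dark[x]
      run_start ||= x
    elsif run_start
      w = x - run_start
      best = [run_start, w] if w >= min_width && (best.nil? || w > best[1])
      run_start = nil
    end
  end
  best
end
```

Cutting vertically in the middle of the returned gutter separates the two text columns.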

</imaging regexps for fun and profit>

HTH,
--
Ilmari Heikkinen
http://fhtr.bl...

Tim Hunter

9/29/2007 8:40:00 PM

Axel Etzold wrote:

> I have tried to convert my pages, which are A4 and 600 dpi, to pixel arrays,
> but this is quite slow. Is there a better method, e.g. using to_blob?

to_blob just gives you an in-memory copy of the image file. If the image
is in JPG format, for example, then the blob is an in-memory JPG file.
So, there's no help there.
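
If what you want is the raw pixel bytes as one string, Image#export_pixels_to_str may be closer to it than to_blob. A minimal sketch, assuming the "I" (intensity) map gives one grayscale byte per pixel in your ImageMagick build; raw_gray_rows and each_scanline are made-up helper names, and the gem may need to be required as 'RMagick' on older installations:

```ruby
# Returns [pixels, width]: the page as raw 8-bit grayscale bytes plus its width.
def raw_gray_rows(path)
  require "rmagick" # loaded lazily; needs the rmagick gem and ImageMagick
  img = Magick::Image.read(path).first
  [img.export_pixels_to_str(0, 0, img.columns, img.rows, "I"), img.columns]
end

# Yields one scanline (width bytes) at a time without regexps.
def each_scanline(pixels, width)
  return enum_for(:each_scanline, pixels, width) unless block_given?
  (0...pixels.bytesize / width).each do |y|
    yield pixels.byteslice(y * width, width)
  end
end
```

Typical use would be pixels, width = raw_gray_rows("page.png") followed by each_scanline(pixels, width) { |sl| ... }, where "page.png" stands for one of your scans.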

Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they'd be much faster. Offhand I can't think of any such
methods, but then I'm not very clever at that sort of thing.

You might try asking the ImageMagick gurus
(http://www.imagemagick.org/discour...) if there's a way to do it
with the command-line utilities. If so, you can usually translate the
commands and options into RMagick methods. See
http://www.simplesystems.org/RMagick/doc/opt... for help with that.

--
RMagick OS X Installer [http://rubyforge.org/project...]
RMagick Hints & Tips [http://rubyforge.org/forum/forum.php?for...]
RMagick Installation FAQ [http://rmagick.rubyforge.org/instal...]

comp.lang.ruby
