comp.lang.ruby

Extract Text from PDF

Mark Dodwell

4/13/2007 12:06:00 PM

Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark

--
Posted via http://www.ruby-....

5 Answers

Robert Klemme

4/13/2007 12:17:00 PM

On 13.04.2007 14:06, Mark Dodwell wrote:
> Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

robert

Chris Lowis

4/13/2007 12:26:00 PM

Robert Klemme wrote:
> On 13.04.2007 14:06, Mark Dodwell wrote:
>> Does anyone know a way to extract plain text from a PDF using Ruby?
>
> IIRC there is a project under way to extend PDFWriter with reading
> capabilities. I don't know the current status of that. HTH

In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
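
For what it's worth, that two-step conversion could be driven from Ruby roughly like this. It is only a sketch: it assumes pdf2ps and ps2ascii are on the PATH and accept the usual "input output" arguments, and the method names (conversion_steps, pdf_to_ascii) are made up for illustration.

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: the two commands to run, in order.
def conversion_steps(pdf_path, ps_path, txt_path, pdf2ps: "pdf2ps", ps2ascii: "ps2ascii")
  [
    [pdf2ps, pdf_path, ps_path],   # PDF -> PostScript
    [ps2ascii, ps_path, txt_path]  # PostScript -> plain text
  ]
end

# Hypothetical helper: returns the extracted text, or nil if a tool is
# missing or one of the steps fails.
def pdf_to_ascii(pdf_path, **tools)
  Dir.mktmpdir do |dir|
    ps_path  = File.join(dir, "out.ps")
    txt_path = File.join(dir, "out.txt")
    conversion_steps(pdf_path, ps_path, txt_path, **tools).each do |cmd|
      # Passing argv as separate strings avoids the shell and its quoting rules.
      _output, status = Open3.capture2e(*cmd)
      return nil unless status.success?
    end
    File.read(txt_path)
  end
rescue Errno::ENOENT
  nil # a tool (or the expected output file) was not found
end

# Usage: text = pdf_to_ascii("report.pdf")
```

The intermediate files live in a temporary directory that is removed when the block exits, so nothing is left behind on failure.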

Regards,


Chris

Kouhei Sutou

4/13/2007 1:19:00 PM

Hi,

2007/4/13, Mark Dodwell <seo@mkdynamic.co.uk>:

> Does anyone know a way to extract plain text from a PDF using Ruby?

You can use Ruby/Poppler:
http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby...

Here is an example to do that:
http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&v...


Thanks,
--
kou

M. Edward (Ed) Borasky

4/13/2007 1:21:00 PM

Robert Klemme wrote:
> On 13.04.2007 14:06, Mark Dodwell wrote:
>> Does anyone know a way to extract plain text from a PDF using Ruby?
>
> IIRC there is a project under way to extend PDFWriter with reading
> capabilities. I don't know the current status of that. HTH
>
> robert

At least on Linux, there is "pdftotext", which is part of the "poppler"
package. So you can simply shell out to it if it's installed. If you're
more ambitious, you could write an extension to use the underlying
libraries in poppler.
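
Shelling out could look something like the following. Treat it as a sketch under a few assumptions: pdftotext is installed and on the PATH, "-" as the output file name makes it write to standard output, and the method names (pdftotext_command, extract_text) are hypothetical.

```ruby
require "open3"

# Hypothetical helper: argv for pdftotext. "-" in place of an output
# file name asks pdftotext to print the text to stdout.
def pdftotext_command(pdf_path, binary: "pdftotext")
  [binary, pdf_path, "-"]
end

# Hypothetical helper: returns the extracted text, or nil if pdftotext
# is not installed or exits with an error.
def extract_text(pdf_path, binary: "pdftotext")
  stdout, _stderr, status = Open3.capture3(*pdftotext_command(pdf_path, binary: binary))
  status.success? ? stdout : nil
rescue Errno::ENOENT
  nil # the binary was not found
end

# Usage: puts extract_text("report.pdf")
```

Returning nil for both "not installed" and "conversion failed" keeps the caller simple; a real script might want to tell the two apart.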

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-res...

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


John Joyce

4/13/2007 11:53:00 PM

The trouble is that "PDF" does not always mean the same thing. Sometimes
there is no text at all in a PDF: it can be all vector-art outlines or
all raster images. There is never a guarantee that you will get any or
all of the text that is otherwise human-readable in a PDF. PDF has
really become a kitchen-sink format, so it is good to anticipate
trouble when parsing PDF files.
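
One cheap way to anticipate that trouble is to sanity-check whatever an extractor hands back before relying on it. The sketch below is illustrative only: usable_text? is a hypothetical name, and "contains at least a few letters or digits" is just one possible heuristic for spotting a PDF that yielded no real text.

```ruby
# Hypothetical check: does the extractor's output contain at least
# min_chars letters or digits? Whitespace and form feeds do not count.
def usable_text?(text, min_chars: 1)
  return false if text.nil?
  # scrub replaces invalid byte sequences so the regexp scan cannot raise.
  text.scrub.scan(/[[:alnum:]]/).length >= min_chars
end

# Usage: warn "no extractable text, maybe a scanned PDF?" unless usable_text?(text)
```

A failed check does not prove the PDF is image-only, but it is a reasonable signal to fall back to something else (or to a human).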