comp.lang.ruby

Re: Text extraction from PDF files (non-European languages)...?

Kouhei Sutou

11/22/2006 12:53:00 AM

Hi,

2006/11/22, Nuralanur@aol.com <Nuralanur@aol.com>:

> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?
> Under Linux, I have been able to use Ruby in conjunction
> with pdftotext for English and other Latin1 encoded texts -
> with some problems sometimes for special characters,
> but it doesn't seem to work for Unicode ...

Which version of pdftotext did you use? Xpdf or poppler?
You need to install character map files for texts in encodings
other than Latin1.
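For reference, here is a minimal sketch of calling pdftotext from Ruby with UTF-8 output. It assumes pdftotext is on the PATH and that the character map files for the language are already installed; the helper names pdftotext_command and extract_text are made up for this example. The -enc option and "-" (write to stdout) are standard pdftotext arguments.

```ruby
require "open3"

# Build the pdftotext argument list (hypothetical helper).
# "-enc" selects the output encoding; "-" sends the text to stdout.
def pdftotext_command(pdf_path, encoding: "UTF-8")
  ["pdftotext", "-enc", encoding, pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a String.
# Assumes pdftotext is installed and on the PATH.
def extract_text(pdf_path, encoding: "UTF-8")
  output, status = Open3.capture2(*pdftotext_command(pdf_path, encoding: encoding))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  output.force_encoding(encoding)
end
```

Calling extract_text("file.pdf") should then return the text in UTF-8, provided pdftotext can map the PDF's fonts to Unicode.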

> Is there a Ruby way to do this ?

You can use Ruby/Poppler if poppler itself handles the file without problems:
http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&v...
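The idea can be sketched roughly as follows. This is not the linked sample itself, only an illustration that assumes the Ruby/Poppler binding is installed and that Poppler::Document yields pages responding to get_text; the helper names pages_to_text and extract_text_with_poppler are made up for this example.

```ruby
# Join the text of every page (hypothetical helper).
# Accepts any object whose #each yields pages responding to #get_text.
def pages_to_text(pages)
  texts = []
  pages.each { |page| texts << page.get_text }
  texts.join("\n")
end

# Open a PDF with Ruby/Poppler and return its text.
# Assumes the poppler binding from Ruby-GNOME2 is installed;
# it is required lazily so the helper above works without it.
def extract_text_with_poppler(pdf_path)
  require "poppler"
  document = Poppler::Document.new(pdf_path)
  pages_to_text(document)
end
```

Ruby/Poppler hands back strings from the library directly, so there is no external process or output encoding option to deal with.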


Thanks,
--
kou