Asp Forum - Re: Text extraction from PDF files (non-European languages)...?

Kouhei Sutou

11/22/2006 12:53:00 AM

Hi,

2006/11/22, Nuralanur@aol.com <Nuralanur@aol.com>:

> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?
> Under Linux, I have been able to use Ruby in conjunction
> with pdftotext for English and other Latin1 encoded texts -
> with some problems sometimes for special characters,
> but it doesn't seem to work for Unicode ...

Which version of pdftotext did you use? Xpdf or poppler?
You need to install character map files for texts in encodings
other than Latin1.
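
For the pdftotext route, a minimal sketch of calling it from Ruby and asking for UTF-8 output might look like the following. The helper names pdftotext_command and extract_text are hypothetical, not part of any library. It assumes pdftotext (Xpdf or poppler) is on the PATH and relies on its -enc option and on "-" as the output file to write to stdout. Whether Arabic or Chinese text actually comes out correctly still depends on the character map files being installed and on how the PDF encodes its fonts.

```ruby
require "open3"

# Hypothetical helper: build the pdftotext argument list.
# "-enc" selects the output encoding; "-" writes the text to stdout.
def pdftotext_command(pdf_path, encoding = "UTF-8")
  ["pdftotext", "-enc", encoding, pdf_path, "-"]
end

# Hypothetical helper: run pdftotext and return the extracted text.
# Raises if pdftotext exits with a non-zero status.
def extract_text(pdf_path, encoding = "UTF-8")
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path, encoding))
  raise "pdftotext failed: #{err}" unless status.success?
  # Tag the raw bytes with the encoding we asked pdftotext to produce.
  out.force_encoding(encoding)
end

# Usage (assumes a file named sample.pdf exists):
#   puts extract_text("sample.pdf")
```

Passing the arguments as an array avoids shell quoting problems with file names that contain spaces or non-ASCII characters.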

> Is there a Ruby way to do this ?

You can use Ruby/Poppler, provided poppler itself has no problem with your PDF:
http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&v...

Thanks,
--
kou

comp.lang.ruby
