Asp Forum - pdf library. - comp.lang.python

Shriphani

12/29/2007 6:55:00 PM

Hi,
I am looking for a pdf library that will give me a list of pages where
new chapters start. Can someone point me to such a module ?
Regards,
Shriphani P.

7 Answers

Benjamin

12/29/2007 9:42:00 PM

On Dec 29, 12:54 pm, Shriphani <shripha...@gmail.com> wrote:
> Hi,
> I am looking for a pdf library that will give me a list of pages where
> new chapters start. Can someone point me to such a module ?
ReportLab might help.
> Regards,
> Shriphani P.

Waldemar Osuch

12/30/2007 12:08:00 AM

On Dec 29, 11:54 am, Shriphani <shripha...@gmail.com> wrote:
> Hi,
> I am looking for a pdf library that will give me a list of pages where
> new chapters start. Can someone point me to such a module ?
> Regards,
> Shriphani P.

pyPdf may help you with that:
http://pybrary....

Shriphani

1/1/2008 6:27:00 AM

On Dec 30 2007, 5:08 am, Waldemar Osuch <waldemar.os...@gmail.com>
wrote:
> On Dec 29, 11:54 am, Shriphani <shripha...@gmail.com> wrote:
>
> > Hi,
> > I am looking for a pdf library that will give me a list of pages where
> > new chapters start. Can someone point me to such a module ?
> > Regards,
> > Shriphani P.
>
> pyPdf may help you with that: http://pybrary....

I tried pyPdf for this and decided to get the pagelinks. The trouble
is that I don't know how to determine whether a particular page is the
first page of a chapter. Can someone tell me how to do this ?

Piet van Oostrum

1/1/2008 11:28:00 AM

>>>>> Shriphani <shriphanip@gmail.com> (S) wrote:

>S> I tried pyPdf for this and decided to get the pagelinks. The trouble
>S> is that I don't know how to determine whether a particular page is the
>S> first page of a chapter. Can someone tell me how to do this ?

AFAIK PDF doesn't have the concept of "Chapter". If the document has an
outline, you could try to use the first level of that hierarchy as the
chapter starting points. But you don't have a guarantee that they really
are chapters.
--
Piet van Oostrum <piet@cs.uu.nl>
URL: http://pietvano... [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org
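
The outline idea above could be sketched roughly as follows. This is only an illustration, and it assumes the present-day pypdf package (a descendant of the pyPdf mentioned in this thread) and a document that actually carries an outline. The function names are made up for the example, and, as noted above, nothing guarantees that the first-level entries really are chapters.

```python
def top_level_entries(outline):
    """Keep only first-level outline items.

    Assumption: the outline is a flat list in which a nested list
    holds the children of the item that precedes it.
    """
    return [item for item in outline if not isinstance(item, list)]


def chapter_start_pages(path):
    """Return (title, 1-based page number) for each first-level outline entry."""
    # Imported here so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    starts = []
    for item in top_level_entries(reader.outline):
        index = reader.get_destination_page_number(item)  # 0-based, may be None
        if index is not None:
            starts.append((item.title, index + 1))
    return starts
```

If the PDF has no outline, the result is simply an empty list, and some other heuristic is needed.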

Shriphani

1/1/2008 12:21:00 PM

On Jan 1, 4:28 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
> >>>>> Shriphani <shripha...@gmail.com> (S) wrote:
> >S> I tried pyPdf for this and decided to get the pagelinks. The trouble
> >S> is that I don't know how to determine whether a particular page is the
> >S> first page of a chapter. Can someone tell me how to do this ?
>
> AFAIK PDF doesn't have the concept of "Chapter". If the document has an
> outline, you could try to use the first level of that hierarchy as the
> chapter starting points. But you don't have a guarantee that they really
> are chapters.
> --
> Piet van Oostrum <p...@cs.uu.nl>
> URL: http://pietvano... [PGP 8DAE142BE17999C4]
> Private email: p...@vanoostrum.org

How would a pdf to html conversion work? I've seen Google's search
engine do it loads of times. It's just that running a 500-odd-page ebook
through one of those scripts might not be such a good idea.

Marc 'BlackJack' Rintsch

1/1/2008 12:38:00 PM

On Tue, 01 Jan 2008 04:21:29 -0800, Shriphani wrote:

> On Jan 1, 4:28 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
>> >>>>> Shriphani <shripha...@gmail.com> (S) wrote:
>> >S> I tried pyPdf for this and decided to get the pagelinks. The trouble
>> >S> is that I don't know how to determine whether a particular page is the
>> >S> first page of a chapter. Can someone tell me how to do this ?
>>
>> AFAIK PDF doesn't have the concept of "Chapter". If the document has an
>> outline, you could try to use the first level of that hierarchy as the
>> chapter starting points. But you don't have a guarantee that they really
>> are chapters.
>
> How would a pdf to html conversion work ? I've seen Google's search
> engine do it loads of times. Just that running a 500odd page ebook
> through one of those scripts might not be such a good idea.

Heuristics? Neither PDF nor HTML know "chapters". So it might be
guesswork or just in your head.

Ciao,
Marc 'BlackJack' Rintsch

Shriphani

1/2/2008 11:23:00 AM

On Jan 1, 5:38 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Tue, 01 Jan 2008 04:21:29 -0800, Shriphani wrote:
> > On Jan 1, 4:28 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
> >> >>>>> Shriphani <shripha...@gmail.com> (S) wrote:
> >> >S> I tried pyPdf for this and decided to get the pagelinks. The trouble
> >> >S> is that I don't know how to determine whether a particular page is the
> >> >S> first page of a chapter. Can someone tell me how to do this ?
>
> >> AFAIK PDF doesn't have the concept of "Chapter". If the document has an
> >> outline, you could try to use the first level of that hierarchy as the
> >> chapter starting points. But you don't have a guarantee that they really
> >> are chapters.
>
> > How would a pdf to html conversion work ? I've seen Google's search
> > engine do it loads of times. Just that running a 500odd page ebook
> > through one of those scripts might not be such a good idea.
>
> Heuristics? Neither PDF nor HTML know "chapters". So it might be
> guesswork or just in your head.
>
> Ciao,
> Marc 'BlackJack' Rintsch

I could parse the html and check for the words "unit" or "chapter" at
the beginning of a page. I am using pdftohtml on Debian and it seems
to generate the html versions of pdfs quite fast. I have yet to run
a 500 page pdf through it, though.
Regards,
Shriphani
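
The keyword check described above might look something like this minimal sketch. It assumes the text of each page has already been extracted into one string per page (for example from the pdftohtml output, with tags stripped). The function name and the pattern are illustrative only, and, as said earlier in the thread, this is guesswork rather than a reliable chapter detector.

```python
import re

# Assumption: a chapter heading begins with the whole word "chapter" or "unit".
HEADING = re.compile(r"^\s*(chapter|unit)\b", re.IGNORECASE)


def guess_chapter_pages(page_texts):
    """Return 1-based numbers of pages whose first non-blank line looks like a heading."""
    starts = []
    for number, text in enumerate(page_texts, start=1):
        # First line on the page that contains something other than whitespace.
        first_line = next((line for line in text.splitlines() if line.strip()), "")
        if HEADING.match(first_line):
            starts.append(number)
    return starts


pages = [
    "Preface\nSome text",
    "\nChapter 1\nIntroduction",
    "In this chapter we continue",
    "UNIT 2\nMore text",
    "United we stand",
]
print(guess_chapter_pages(pages))  # [2, 4]
```

Only the first non-blank line of each page is tested, so a page that merely mentions "chapter" in running text is not flagged, and the word boundary keeps "United" from matching "unit".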
