
comp.lang.ruby

Download only HTTP headers

buunguyen

11/30/2006 12:27:00 AM

Hi folks,

I'm writing a small app that needs to list all HTML pages (only HTML
pages; no images, CSS, PDFs, GIFs, etc.) reachable from a starting page.
Given any link, I use Net::HTTP to download the content and examine the
Content-Type header to decide whether it is an HTML page or something
else (I don't think pattern matching on the URL will work, because
dynamic pages can return arbitrary resources [HTML, images, etc.]).
However, I cannot find any way to read only the headers without
downloading the whole content of the resource (e.g. a PDF). That makes
my app very slow, when all I want is a list of the HTML pages.
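
A minimal sketch of that GET-based check (the URL and the helper name
html_page? are placeholders, not from the original post):

  require 'net/http'
  require 'uri'

  # Fetch the whole resource with GET, then inspect its Content-Type
  # header. The entire body is downloaded even when the resource turns
  # out to be a PDF or image, which is what makes this approach slow.
  def html_page?(url)
    response = Net::HTTP.get_response(URI.parse(url))
    response['Content-Type'].to_s.downcase.include?('text/html')
  end

  puts html_page?('http://example.com/some/link')  # placeholder URL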

Does anyone have any suggestion as to how this problem can be solved?

Thanks in advance

Nguyen

2 Answers

Tom Werner

11/30/2006 12:45:00 AM


Net::HTTP can issue HEAD requests in addition to GETs. You can get just
the headers that way.
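
A minimal sketch of the HEAD-based check, assuming a placeholder URL and
the same hypothetical html_page? helper as above:

  require 'net/http'
  require 'uri'

  # Issue a HEAD request so only the response headers are transferred;
  # the body of large resources (PDFs, images, ...) is never downloaded.
  def html_page?(url)
    uri = URI.parse(url)
    Net::HTTP.start(uri.host, uri.port) do |http|
      response = http.head(uri.request_uri)   # HEAD instead of GET
      response['Content-Type'].to_s.downcase.include?('text/html')
    end
  end

  puts html_page?('http://example.com/index.html')  # placeholder URL

If a particular server happens to mishandle HEAD, falling back to the
GET-based check for that URL is a reasonable workaround.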

Tom

buunguyen

11/30/2006 2:08:00 AM


Works like a charm. Thanks a great deal, Tom.

Nguyen
