comp.lang.ruby

Re: Screen scraping an html text contents into a file

Martin DeMello

12/7/2005 8:49:00 AM

Steve Callaway <sjc2000_uk@yahoo.com> wrote:
> I will throw something like this together in Ruby over
> the next few days when I get some time and post it on
> RubyForge. I have already done this sort of stuff in
> Java and the concepts just really need a port. All we
> are looking at for Basi's initial level of requirements is
> to send an HTTP GET and pipe the response to a file.

Nope, according to the OP's requirements, you also need to render the
HTML and spit out the rendered version as text, which makes lynx --dump
the right tool for the job. It'd be quite a big task to duplicate this
in Ruby, I think.

martin
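
For reference, the lynx approach described above could be driven from Ruby by shelling out. This is only a minimal sketch, assuming lynx is installed and on the PATH; `lynx_dump_command` and `dump_page` are hypothetical helper names, not anything posted in the thread. It uses the single-dash `-dump` spelling of the option.

```ruby
require "open3"

# Build the argument vector for lynx. The -dump option makes lynx write
# the rendered page to stdout as plain text instead of opening its UI.
# (Hypothetical helper name, for illustration only.)
def lynx_dump_command(url)
  ["lynx", "-dump", url]
end

# Run lynx on the given URL and save the rendered text to out_path.
# Raises if lynx exits with a non-zero status; Open3 itself raises
# Errno::ENOENT if lynx is not installed.
def dump_page(url, out_path)
  text, status = Open3.capture2(*lynx_dump_command(url))
  raise "lynx exited with status #{status.exitstatus}" unless status.success?
  File.write(out_path, text)
  text
end

# Example call (needs network access and lynx):
#   dump_page("http://www.ruby-lang.org/", "page.txt")
```

Passing the command as an array rather than a single string avoids going through a shell, so a URL containing shell metacharacters is handed to lynx unchanged.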
5 Answers

Steve Callaway

12/7/2005 9:03:00 AM

My interpretation of "rendering the HTML" was that it is
merely a question of stripping tags etc., which can
quickly be accomplished with gsub. Or am I missing
something?

rgds

Steve
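
The gsub idea mentioned above might look something like the following. This is a rough sketch, not code from the thread: `strip_tags` is a hypothetical name, and the regular expression simply deletes anything between `<` and `>`, so it makes no attempt to handle comments, scripts, entities, or layout.

```ruby
# Naive tag stripping: delete everything that looks like a tag, then
# collapse runs of whitespace into single spaces.
# (Hypothetical helper; assumes tags never contain a literal ">".)
def strip_tags(html)
  html.gsub(/<[^>]*>/, "").gsub(/\s+/, " ").strip
end

puts strip_tags("<p>Hello <b>world</b></p>")  # prints "Hello world"
```

For simple inline markup this gives reasonable output, which is why the approach looks attractive at first.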






Brian Schröder

12/7/2005 10:55:00 AM


On 07/12/05, Steve Callaway <sjc2000_uk@yahoo.com> wrote:
> By rendering the html, my interpretation of this was
> that it is merely a question of stripping tags etc,
> which can quickly be accomplished with gsub. Or am I
> missing something?
>
> rgds
>
> Steve
>

Tables and frames, for example. So it is better to use links2 or w3m for the task.

cheers,

Brian



--
http://ruby.brian-sch...

Stringed instrument chords: http://chordlist.brian-sch...


Steve Callaway

12/7/2005 12:35:00 PM


Ah, yeah, forgot all about those nasty little things.
Not insurmountable, but handling them effectively would
certainly add overhead.

Steve








Martin DeMello

12/7/2005 1:35:00 PM


Steve Callaway <sjc2000_uk@yahoo.com> wrote:
> By rendering the html, my interpretation of this was
> that it is merely a question of stripping tags etc,
> which can quickly be accomplished with gsub. Or am I
> missing something?

Even without things like tables, the significance of the various
whitespace characters (space, tab, newline) in HTML source is very
different from their significance in the rendered page. For example,
the following can't be handled by just stripping tags:

<ul><li>one
two</li><li>three<li>four</ul>

martin
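
To make the list example above concrete, here is a small sketch comparing plain tag stripping with a slightly list-aware version. Both helpers (`naive_strip` and `render_list_items`) are hypothetical names invented for illustration, and the second one only understands `<li>`; it is nowhere near a real renderer such as lynx or w3m.

```ruby
# Plain tag stripping: the item boundaries disappear together with the tags.
def naive_strip(html)
  html.gsub(/<[^>]*>/, "")
end

# Slightly better for this one case: treat every <li> as the start of a
# new line, and collapse whitespace inside each item the way a browser
# would. Unclosed <li> elements are handled because only the opening
# tag is used as a separator.
def render_list_items(html)
  html.split(/<li\b[^>]*>/i).drop(1).map do |chunk|
    text = chunk.gsub(/<[^>]*>/, "").gsub(/\s+/, " ").strip
    "* #{text}"
  end.join("\n")
end

html = "<ul><li>one\ntwo</li><li>three<li>four</ul>"

p naive_strip(html)          # prints "one\ntwothreefour"
puts render_list_items(html)
# prints:
# * one two
# * three
# * four
```

The naive version keeps the newline that a browser would treat as a space and drops the breaks between items that a browser would show, which is the opposite of the rendered page.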

Ryan Leavengood

12/7/2005 4:45:00 PM


WWW::Mechanize can do most of what is needed, except for the dumping
of the HTML as text. As others have said, what we really need is some
kind of HTML-to-text renderer. There have got to be gobs of C or C++
code out there that does this... how hard would it be to make a Ruby C
extension for this? Has anyone ever thought about making a nice Ruby
extension for Gecko or even the HTML renderers in lynx or w3m?

Ryan