Asp Forum - Ruby (X)HTML Parser?

Andrei Maxim

9/25/2006 12:22:00 PM

Hi guys,

I'm starting to learn Ruby and I was thinking about a little app so I can
get things started as quickly as possible. Since I'm an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.

My first choice was regexps, but I'm thinking that my little app may grow a
little bit more in the not-so-distant future and I might be doing more than
just extracting feeds.

I found:

* ymHTML at http://www.yoshidam.net...
* RAA at http://raa.ruby-lang.org/project/html...

but they don't look really standard and RAA doesn't look like it's currently
maintained. I've also heard that there's a Rails HTML parser but I couldn't
find more info (and I'll probably ask on one of the Rails lists).

Is there a more "standard" way to parse HTML pages in Ruby?

Thanks,

Andrei

5 Answers

Jordan Elver

9/25/2006 2:01:00 PM

There's Hpricot. Haven't used it myself though.

http://code.whytheluckystiff.ne...

Alex Young

9/25/2006 2:44:00 PM

Andrei Maxim wrote:
> Is there a more "standard" way to parse HTML pages in Ruby?
The closest you'll find to a standard is REXML, which is an XML parser
that ships in the stdlib. REXML only accepts well-formed XML, though, so
you'll want to run your HTML through Tidy first - but that's an easy install.
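
As a minimal sketch of the REXML route, assuming the page is already
well-formed XHTML (for example after a pass through Tidy), feed
autodiscovery could look something like the following. The sample page, the
FEED_TYPES list and the feed_links helper are made up for illustration; only
the REXML calls come from the library itself.

```ruby
require 'rexml/document'

# Hypothetical sample page; a real one would be fetched and tidied first.
XHTML = <<PAGE
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example blog</title>
    <link rel="stylesheet" type="text/css" href="/style.css" />
    <link rel="alternate" type="application/rss+xml" title="RSS" href="/feed.rss" />
    <link rel="alternate" type="application/atom+xml" title="Atom" href="/feed.atom" />
  </head>
  <body><p>Hello</p></body>
</html>
PAGE

# MIME types that are assumed here to mark a feed link.
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Returns the href of every <link rel="alternate"> that advertises a feed.
def feed_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  # Walk every descendant element instead of relying on a
  # namespace-sensitive XPath expression.
  doc.root.each_recursive do |el|
    next unless el.name == 'link'
    next unless el.attributes['rel'] == 'alternate'
    next unless FEED_TYPES.include?(el.attributes['type'])
    links << el.attributes['href']
  end
  links
end

puts feed_links(XHTML)
```

The hrefs come back exactly as written in the page, so relative ones would
still need to be resolved against the page URL before fetching the feed.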

There are a couple of alternatives: Hpricot and html-parser spring
instantly to mind.

If you're doing feed parsing, you probably also want to check out feedtools.

--
Alex

MonkeeSage

9/25/2006 2:46:00 PM

Jordan Elver wrote:
> There's Hpricot. Haven't used it myself though.
>
> http://code.whytheluckystiff.ne...

Hpricot is *really* nice. Also, there is the standard REXML (built-in
since 1.8). See the tutorial for some ideas on how to use it:
http://www.germane-software.com/software/rexml/docs/tut...

Regards,
Jordan

why the lucky stiff

9/25/2006 4:58:00 PM

On Mon, Sep 25, 2006 at 11:01:18PM +0900, Jordan Elver wrote:
> There's Hpricot. Haven't used it myself though.
>
> http://code.whytheluckystiff.ne...

If you decide to use Hpricot, I'd recommend the latest 0.4.52 gems:

gem install hpricot --source code.whytheluckystiff.net

There's been a good deal of patching over the past week and a new release is
very close.

_why

Bob Aman

9/25/2006 5:50:00 PM

> > Since I'm an avid blog reader,
> > the first thing that went through my mind was a small app that would extract
> > the RSS or Atom feed from a web page, given the URL.
>
> If you're doing feed parsing, you probably also want to check out feedtools.

Well... he probably won't learn much from the FeedTools code, but it is
convenient for this sort of thing:

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'feed_tools'
=> true
irb(main):003:0> feed = FeedTools::Feed.open('http://intertwingly...')
=> #<FeedTools::Feed:0x135d8fe URL:http://www.intertwingly.net/blog/inde...
irb(main):004:0> feed.title
=> "Sam Ruby"
irb(main):005:0> feed.subtitle
=> "It's just data"

Cheers,
Bob Aman
--
AIM: sporkmonger
Jabber: sporkmonger@jabber.org

comp.lang.ruby