Asp Forum - Hpricot/Rubyful Soup comparison

Wes Gamble

11/21/2006 10:27:00 PM

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes

--
Posted via http://www.ruby-....

18 Answers

lrlebron@gmail.com

11/22/2006 2:33:00 AM

I've used both Hpricot and Rubyful Soup to parse the Google News page
and found Hpricot to be much faster.

Luis
Wes Gamble wrote:
> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and b) preserves the original HTML better.
>
> Thanks,
> Wes
>
> --
> Posted via http://www.ruby-....

Peter Szinek

11/22/2006 7:54:00 AM

Wes Gamble wrote:
> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and
I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot. I am absolutely sure about this. I am doing things
with HPricot which should be extremely slow (e.g. traversing the whole
tree and doing expensive operations on all Hpricot::Elements) yet
HPricot is surprisingly fast. Rubyful is nowhere near.

> b) preserves the original HTML better.
Hmm, this I don't know, but I guess the term 'preserves HTML better'
should be defined first with some metrics or something (deviation from
the HTML standard?). There are so many badly formed HTML pages
that even a human would come up with multiple solutions for their
correction.

I think the only real-life quality measure is to process your pages with
both of them and see which one yields better results. I did not play too
much with RubyfulSoup but I am writing quite a serious screen scraping
framework based on Hpricot, and so far I have had no real problems - and I
am doing all kinds of weird things.

Cheers,
Peter

__
http://www.rubyra...

ramalho@gmail.com

11/22/2006 9:12:00 AM

On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:
> I did not do any benchmarks, but I am scraping a lot of relatively big
> pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
> slower than HPricot.

HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

Also, RubyfulSoup aims to be very resilient to malformed markup, so it
must resort to heuristics that have a performance cost. I don't know
how HPricot handles HTML or XML with really serious flaws like tags
that open but never close and so on, but in my experience RubyfulSoup
has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.

Cheers,

Luciano

Peter Szinek

11/22/2006 11:04:00 AM

Luciano Ramalho wrote:
> On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:
>> I did not do any benchmarks, but I am scraping a lot of relatively big
>> pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
>> slower than HPricot.
>
> HPricot is partially written in C, so it should be faster than a
> pure-Ruby lib like RubyfulSoup.
true

> Also, RubyfulSoup aims to be very resilient to malformed markup,
So is HPricot. HPricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
it is a fact that HPricot is handling malformed pages very well.

> so it must resort to heuristics that have a performance cost. I don't
> know how HPricot handles HTML or XML with really serious flaws like
> tags that open but never close and so on,
This particular case is handled absolutely fine. Maybe we would need a
list of serious problems and see how Hpricot vs RubyfulSoup handles them.
From what I have seen, HPricot did not have any problems with any page...

> has managed to deal amazingly well with such problems. If you need to
> parse low quality markup, the performance penalty of RubyfulSoup may
> be well worth the price.
I am still not sure what the added benefits of RubyfulSoup parsing
over HPricot are (although I am not claiming that there are none) - I
would like to see a real serious comparison to decide this...

Peter

__
http://www.rubyra...

Gregory Seidman

11/22/2006 12:41:00 PM

On Wed, Nov 22, 2006 at 08:03:54PM +0900, Peter Szinek wrote:
} Luciano Ramalho wrote:
} > On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:
[...]
} > Also, RubyfulSoup aims to be very resilient to malformed markup,
} So is HPricot. HPricot is not just an HTML parser which can parse
} (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
} whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
} it is a fact that HPricot is handling malformed pages very well.
}
} > so it must resort to heuristics that have a performance cost. I don't
} > know how HPricot handles HTML or XML with really serious flaws like
} > tags that open but never close and so on,
} This concretely is absolutely OK. Maybe we would need a list of serious
} problems and see how Hpricot vs RubyfulSoup is handling them. From what
} I have seen, HPricot did not have any problems with any page...

HPricot even keeps track of when tags are (incorrectly) closed by a
different close tag. This can allow you to track down issues in broken HTML
if that's your intent, but since I am mostly using HPricot for sanitization
I just set the close tags to nil so the output closes with the correct tag.
I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

} > has managed to deal amazingly well with such problems. If you need to
} > parse low quality markup, the performance penalty of RubyfulSoup may
} > be well worth the price.
} I am still not sure what are the added benefits of RubyfulSoup parsing
} over HPricot (although I am not claiming that there are none) - I would
} like to see a real serious comparison to decide this...

I haven't tried RubyfulSoup, but HPricot suits my needs nicely. I am
delighted by its reliance on a bare minimum of HPricot-specific objects. It
doesn't try to behave like a real DOM, which means that it can use arrays
for child lists and ordinary references for parent nodes and hashes for
attributes, all read/write. It is possible to perform significant
transformations with minimal difficulty.

} Peter
--Greg

Ross Bamford

11/22/2006 1:07:00 PM

On Tue, 21 Nov 2006 22:27:15 -0000, Wes Gamble <weyus@att.net> wrote:

> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and b) preserves the original HTML better.
>

I recently did a small head-to-head with RubyfulSoup, Hpricot, and the
up-and-coming (now in CVS, release in a few weeks) libxml-ruby binding to
the libxml2 HTML parser. Running against the RubyfulSoup homepage (perhaps
ironically, it's pretty badly formed) over 100 iterations, the attached
benchmark gave out the following results. Each benchmark is parsing the
original HTML and then getting back a specific node set (Hpricot and
libxml2 using XPath, RubyfulSoup using its own query API):

                               user    system      total        real
rubyful soup - simple     25.900000  0.710000  26.610000 (26.669350)
rubyful soup - trickier   26.220000  0.010000  26.230000 (26.252975)
hpricot - simple xpath     7.930000  0.000000   7.930000 ( 7.950092)
hpricot - trickier xpath   8.200000  0.010000   8.210000 ( 8.212230)
libxml2 - simple xpath     0.900000  0.000000   0.900000 ( 0.899329)
libxml2 - trickier xpath   0.940000  0.000000   0.940000 ( 1.217441)
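
The attached benchmark is not reproduced in this archive. A harness of the
same general shape (parse, query, repeat 100 times) can be sketched with
Ruby's standard Benchmark module, which prints the user/system/total/real
columns seen above. The two callables below are hypothetical stand-ins built
from core String methods only; they are not Hpricot, RubyfulSoup or libxml
calls, and would need to be replaced with real parser invocations. The
names STAND_INS, SAMPLE_HTML and run_benchmarks are likewise assumptions
made for illustration.

```ruby
require 'benchmark'

ITERATIONS = 100

# Hypothetical stand-ins: each callable takes an HTML string and returns
# the "node set" it found. Replace these with real parser + query calls.
STAND_INS = {
  'regex scan - links' => ->(html) { html.scan(/<a\b[^>]*>/i) },
  'split scan - links' => ->(html) { html.split('<').select { |t| t =~ /\Aa\b/i } }
}

# Runs every callable `iterations` times against the same input and
# returns a hash of label => Benchmark::Tms, printing the usual table.
def run_benchmarks(html, parsers, iterations)
  results = {}
  Benchmark.bm(24) do |bm|
    parsers.each do |label, parser|
      results[label] = bm.report(label) do
        iterations.times { parser.call(html) }
      end
    end
  end
  results
end

# Tiny inline sample so the sketch runs without any external file.
SAMPLE_HTML = '<p><a href="/one">one</a> and <A HREF="/two">two</a></p>'

timings = run_benchmarks(SAMPLE_HTML, STAND_INS, ITERATIONS)
timings.each do |label, tms|
  puts format('%-24s %.6f s real', label, tms.real)
end
```

Each iteration should include the parse step as well as the query, since
that is what the figures above measure.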

In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing a pretty good job of fixing up
broken HTML. There were minor differences in the XML produced, and from a
(biased, nitpicking) spec point of view I think libxml2's output is
slightly more 'proper' (self closing tags, etc). RubyfulSoup on the other
hand seemed to have a few inconsistencies - it would occasionally lose tag
attributes, and sometimes return varying results to the same query.

As for feature support, well, I don't want to rain on anyone's parade but
the libxml HTML parser outputs an XML::Document with which you can
transparently use all of libxml2's (many) features ... ;) I couldn't get
XPath functions to work with Hpricot, but then I'm not sure how complete
an XPath implementation it's aiming for, and apart from that it seems
pretty solid. OTOH RubyfulSoup has no XPath support at all :(

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Bob Hutchison

11/22/2006 1:42:00 PM

On 21-Nov-06, at 5:27 PM, Wes Gamble wrote:

> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster
> for an
> average sized HTML page and b) preserves the original HTML better.

I switched from Rubyful Soup to Hpricot a while ago. The reason was
performance on 1000-2000 character html chunks -- I didn't do a
benchmark because there just was no need to... Hpricot is *a lot*
faster.

I have no idea which preserves html better, I'm only using them to
find specific bits of the html (e.g. links, images, a few other
things). I do not use either to transform the input html, I *always*
keep the input as it was. In all cases I have html in a string that I
give to the parser. I do know that with Rubyful Soup it was
absolutely necessary to dup the string first, or changes were liable
to be made to the input string.
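
The dup precaution can be illustrated in a few lines. This is only a sketch:
mutating_parse is a hypothetical stand-in for any parser that modifies the
string it is handed, not Rubyful Soup's actual code, and safe_parse is an
assumed wrapper name.

```ruby
# Hypothetical parser that "fixes up" its argument in place, standing in
# for a library that mutates the caller's string as a side effect.
def mutating_parse(html)
  html.gsub!(/<br>/i, '<br />') # modifies the caller's string
  html.scan(/<[^>]+>/)          # returns the list of tags found
end

# Hand the parser a copy so the original input is left untouched.
def safe_parse(html)
  mutating_parse(html.dup)
end

original = String.new('<p>line one<br>line two</p>')
snapshot = original.dup

safe_parse(original)
puts original == snapshot # true: the input was not changed

mutating_parse(original)
puts original == snapshot # false: the input was rewritten in place
```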

Cheers,
Bob

>
> Thanks,
> Wes
>
> --
> Posted via http://www.ruby-....
>

----
Bob Hutchison -- blogs at <http://www.rec...
hutch/>
Recursive Design Inc. -- <http://www.rec...>
Raconteur -- <http://www.raconteur...
xampl for Ruby -- <http://rubyforge.org/projects/...

ramalho@gmail.com

11/22/2006 2:16:00 PM

On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:
> > Also, RubyfulSoup aims to be very resilient to malformed markup,
> So is HPricot. HPricot is not just an HTML parser which can parse
> (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
> whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
> it is a fact that HPricot is handling malformed pages very well.

Thanks for the input, Peter. From your opinion and others', it seems
HPricot is the best option. Coming from Python, I was used to
BeautifulSoup, from which RubyfulSoup derived, and I was very happy
with it. But if we can have the same benefits with better performance,
then it's a no-brainer!

Cheers,

Luciano

Wes Gamble

11/22/2006 3:29:00 PM

Thanks for all of the comments.

I was pretty sure that Hpricot was faster since it is partially written
in C, but it's nice to hear a resounding "YES" on that topic.

My concern about "preserving original markup" has to do with this
application I'm writing, which grabs a page and then tries to display
it. When RubyfulSoup encountered bad HTML, it could parse it ok,
but it always attempted to fix it when I went to write out the parse
tree, which can cause problems when you try to redisplay the HTML.

Some malformed HTML is handled fine by browsers, so I'd like to preserve
the original HTML regardless of its quality. If Hpricot will not only
parse my HTML quickly, but also not fix the HTML on the way out (dumping
the parse tree), that would be ideal.

Again, thanks for all of the discussion - it's quite helpful.

Wes

--
Posted via http://www.ruby-....

_why

11/22/2006 5:22:00 PM

On Thu, Nov 23, 2006 at 12:28:30AM +0900, Wes Gamble wrote:
> My concern about "preserving original markup" has to do with this
> application I'm writing, which grabs a page and then tries to display
> it. When RubyfulSoup encountered bad HTML, it could parse it ok,
> but it always attempted to fix it when I went to write out the parse
> tree, which can cause problems when you try to redisplay the HTML.

I totally agree with you regarding preserving the original markup. In fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end tags.
* `to_original_html` which outputs the original document (as close as it can)
with your modifications made.

So, for example, I use the `to_original_html` method in MouseHole, which is a
scriptable personal HTTP proxy (sort of like greasemonkey). Some pages (like
Boing Boing, for instance) completely break if you try to fix up the HTML. But
this new method can successfully remove stuff and alter stuff without turning the
whole page upside-down.

_why

comp.lang.ruby
