Ross Bamford
11/22/2006 1:07:00 PM
On Tue, 21 Nov 2006 22:27:15 -0000, Wes Gamble <weyus@att.net> wrote:
> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and b) preserves the original HTML better.
>
I recently did a small head-to-head with RubyfulSoup, Hpricot, and the
up-and-coming (now in CVS, release in a few weeks) libxml-ruby binding to
the libxml2 HTML parser. Running against the RubyfulSoup homepage (perhaps
ironically, it's pretty badly formed) over 100 iterations, the attached
benchmark gave out the following results. Each benchmark is parsing the
original HTML and then getting back a specific node set (Hpricot and
libxml2 using XPath, RubyfulSoup using its own query API):

                               user     system      total        real
rubyful soup - simple     25.900000   0.710000  26.610000 ( 26.669350)
rubyful soup - trickier   26.220000   0.010000  26.230000 ( 26.252975)
hpricot - simple xpath     7.930000   0.000000   7.930000 (  7.950092)
hpricot - trickier xpath   8.200000   0.010000   8.210000 (  8.212230)
libxml2 - simple xpath     0.900000   0.000000   0.900000 (  0.899329)
libxml2 - trickier xpath   0.940000   0.000000   0.940000 (  1.217441)

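For anyone wanting to try something similar, a harness of this shape can be sketched with Ruby's standard Benchmark module. This is only a sketch and not the attached script: `parse_and_query`, the `HTML` sample and the report label are hypothetical stand-ins, and a real run would call the parser under test (and its query API) where the placeholder regexp scan is.

```ruby
require 'benchmark'

# Hypothetical sample input; the real benchmark used a full web page.
HTML = '<html><body><p>one<p>two<a href="/x">link</a></body></html>'

# Hypothetical stand-in for "parse the HTML, then fetch a node set".
# Replace the body with calls to the parser library being measured.
def parse_and_query(html)
  # Placeholder query: collect href values with a plain regexp scan.
  html.scan(/href="([^"]*)"/).flatten
end

ITERATIONS = 100

# Benchmark.bm prints the "user system total real" header followed by one
# row per report, and returns the Benchmark::Tms objects for each row.
results = Benchmark.bm(28) do |bm|
  bm.report('placeholder - simple') do
    ITERATIONS.times { parse_and_query(HTML) }
  end
end
```

Each report block includes the parse as well as the query, which matches how the figures above were taken (parse the original HTML, then get back a node set), so parse cost is part of every timing.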
In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing a pretty good job of fixing up
broken HTML. There were minor differences in the XML produced, and from a
(biased, nitpicking) spec point of view I think libxml2's output is
slightly more 'proper' (self closing tags, etc). RubyfulSoup on the other
hand seemed to have a few inconsistencies - it would occasionally lose tag
attributes, and sometimes return varying results for the same query.
As for feature support, well, I don't want to rain on anyone's parade but
the libxml HTML parser outputs an XML::Document with which you can
transparently use all of libxml2's (many) features ... ;) I couldn't get
XPath functions to work with Hpricot, but then I'm not sure how complete
an XPath implementation it's aiming for, and apart from that it seems
pretty solid. OTOH RubyfulSoup has no XPath support at all :(
--
Ross Bamford - rosco@roscopeco.remove.co.uk