comp.lang.ruby

hpricot or nokogiri?

goodieboy

1/9/2009 7:16:00 PM

OK, I was completely sold on Hpricot and now I'm having my doubts. I
can't seem to get to any of the docs (the site is down). Is it still
being developed? Who are the developers? I love the API and am really
hoping to use it...

So then I tried out Nokogiri and it works well. The bug that Hpricot
had (renaming a node only renames the open tag) is not present in
Nokogiri. Great! But it's built on libxml, which I don't know much
about. It seems a little more heavyweight than Hpricot. I also heard
that the main developer for libxml doesn't have much time to devote to
the project.

Any advice for me, folks?

Matt

6 Answers

Ryan Davis

1/9/2009 10:54:00 PM


On Jan 9, 2009, at 11:16 , goodieboy wrote:

> OK, was completely sold on Hpricot and now am having my doubts. I
> can't seem to get to any of the docs (the site is down). Is it still
> being developed? Who are the developers? I love the API and really am
> hoping to use it...
>
> So then I tried out Nokogiri and it works well. The bug that Hpricot
> had (re-naming a node only names the open-tag) is not present in
> Nokogiri. Great! But it's built on libxml, which I don't know much
> about. It seems a little more heavy weight than Hpricot. I also heard
> that the main developer for libxml doesn't have much time to devote to
> the project.

Hpricot drops the ball in a lot of ways and is much more heavyweight
than Nokogiri. Parsing an 8 MB iTunes XML file takes over a gigabyte of
memory in Hpricot (according to my students), while Nokogiri zipped
right through it.

The libxml developer doesn't need to devote much time to the project
(assuming you mean libxml itself, not nokogiri). It is a very mature
library. On the other hand, hpricot has had a lot of open bugs for a
long time and they've not been touched one way or another. I find
Aaron Patterson very responsive to my bug reports (but I'm biased,
he's just down the street--look at the bug tracker on rubyforge for
less biased data).


Aaron Patterson

1/9/2009 11:01:00 PM

Hi Matt,

On Sat, Jan 10, 2009 at 04:16:22AM +0900, goodieboy wrote:
> OK, was completely sold on Hpricot and now am having my doubts. I
> can't seem to get to any of the docs (the site is down). Is it still
> being developed? Who are the developers? I love the API and really am
> hoping to use it...
>
> So then I tried out Nokogiri and it works well. The bug that Hpricot
> had (re-naming a node only names the open-tag) is not present in
> Nokogiri. Great! But it's built on libxml, which I don't know much
> about. It seems a little more heavy weight than Hpricot. I also heard
> that the main developer for libxml doesn't have much time to devote to
> the project.

Yes, Nokogiri is built on top of the libxml2 project from Gnome.
libxml2 is actively developed and well supported since it is the XML
parser used by the Gnome project:

http://xm...

If you find bugs, we have a

* mailing list: http://rubyforge.org/mailman/listinfo/nok...
* IRC Channel on freenode: #nokogiri
* Ticketing system:
http://nokogiri.lighthouseapp.com/projects/19607-nokogir...
* RDoc: http://nokogiri.rubyforge.org...

I've switched my projects from Hpricot to Nokogiri, and I'm quite happy.

--
Aaron Patterson
http://tenderlovem...

goodieboy

1/11/2009 7:52:00 PM

On Jan 9, 6:01 pm, Aaron Patterson <aa...@tenderlovemaking.com> wrote:
> Yes, Nokogiri is built on top of the libxml2 project from Gnome.
> libxml2 is actively developed and well supported since it is the XML
> parser used by the Gnome project:
>
>  http://xm...
>
> If you find bugs, we have a
>
> * mailing list: http://rubyforge.org/mailman/listinfo/nok...
> * IRC Channel on freenode: #nokogiri
> * Ticketing system:
>  http://nokogiri.lighthouseapp.com/projects/19607-nokogir...
> * RDoc: http://nokogiri.rubyforge.org...
>
> I've switched my projects from Hpricot to Nokogiri, and I'm quite happy.
>
> --
> Aaron Patterson
> http://tenderlovem...

This is great, thank you. Definitely helps clear things up a bit. So
it's not just me... Hpricot has a few bugs that have been around for a
while. That's too bad :(

OK, for a quick Nokogiri question... is it possible to ask a node if
it responds to a certain xpath? Something like:

matching = nodes.select{|n| n.is_findable_by('[@class=plant]') }

Thanks,
Matt

Aaron Patterson

1/11/2009 8:23:00 PM

On Mon, Jan 12, 2009 at 04:54:26AM +0900, matt mitchell wrote:
> This is great thank you. Definitely helps clear things up a bit. So
> it's not just me... Hpricot has a few bugs that have been around for a
> while. That's too bad :(
>
> OK, for a quick Nokogiri question... is it possible to ask a node if
> it responds to a certain xpath? Something like:
>
> matching = nodes.select{|n| n.is_findable_by('[@class=plant]') }

I can't think of a good xpathy way to do that from the current node.
You could do something like this:

matching = nodes.select { |n|
  n.parent.xpath('./*[@class="plant"]').include?(n)
}

That might get kind of slow though. If you know that "class" is the
attribute you're looking for, you could just do something like this:

matching = nodes.select { |n| n['class'] == "plant" }

Hope that helps.

--
Aaron Patterson
http://tenderlovem...

Lance Bradley

2/12/2009 2:23:00 AM

I've been going through a similar situation with my current project. I
was initially using Hpricot, and was very frustrated by the lack of
documentation and some of the lingering bugs. I've now switched to
nokogiri and have been very impressed with it.

I'm now running into some of the robustness issues that you face when
you process data from the open web, like Dan alluded to. I'm using
Nokogiri's SAX implementation, and I've run into some problems with
handling HTML entities, whether they are preserved or decoded into UTF-8.
In both cases, once an entity is seen, Nokogiri quits calling my start
and end element handlers but continues to call my character handler.
Specifically, I've noticed this behavior when it sees &nbsp; and
&#8230;. Has anyone else experienced this and have any advice to share?
I appreciate it!
-lance

(here's my code)

class Nokogiri::XML::SAX::Document
  attr_accessor :rhtml
  def initialize
    @rhtml = ""
    @keep_text = true
    @keep_elements = %w{ br p img ul ol title li div table head body
                         meta base blockquote }
  end

  def start_element name, attrs = []
    puts "start element called: " + name
    if @keep_elements.include?(name)
      puts "keeping: #{name}"
      @rhtml << "<#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = false
    end
  end

  def characters text
    #@rhtml << @coder.decode( text ) if @keep_text
    @rhtml << text if @keep_text
    puts text
  end

  def end_element name
    puts "end element called: " + name
    if @keep_elements.include?(name)
      @rhtml << "</#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = true
    end
  end

end

html = open(ARGV[0], 'r').collect { |l| l }.join

#coder = HTMLEntities.new
#html = coder.decode(html)

Tidy.path = '/usr/lib/libtidy-0.99.so.0'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  #tidy.options.char_encoding = 'utf8'
  tidy.options.preserve_entities = true
  xml = tidy.clean(html)
end

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::Parser.new(doc)

parser.parse(xml)

puts "doc:"
puts doc.rhtml.gsub(/\n+/, "\n")
--
Posted via http://www.ruby-....

Trans

2/12/2009 3:52:00 AM

On Feb 11, 9:22 pm, Lance Bradley <l...@ncebradley.org> wrote:
> I've been going through a similar situation with my current project. I
> was initially using Hpricot, and was very frustrated by the lack of
> documentation and some of the lingering bugs. I've now switched to
> nokogiri and have been very impressed with it.
>
> I'm now running into some of the robustness issues that are faced when
> you process data from the open web, like Dan alluded to. I'm using
> nokogiri's sax implementation, and I've ran into some problems with
> handling html entities, rather they are preserved or decoded into utf-8.
> In both cases, nokogiri will quit calling my start and end element
> handlers, but continue to call my character handler after an entity is
> seen. Specifically, I've noticed this behavior when it sees &nbsp; and
> &#8230;. Has anyone else experienced this and have any advice to share?
> I appreciate it!
> -lance

Note that there are also the libxml ruby bindings.

http://libxml.rub...

T.