comp.lang.ruby

General Nokogiri problem

Srijayanth Sridhar

5/7/2009 6:45:00 AM

[Note: parts of this message were removed to make it a legal post.]

Hello,

On several sites (probably malformed HTML/JavaScript/XML/general parsing
hell) I have the following problem.

For example:

moonwolf@trantor:~/ruby$ irb
irb(main):001:0> ['rubygems','nokogiri','hpricot','open-uri'].each { |r| require r }
=> ["rubygems", "nokogiri", "hpricot", "open-uri"]
irb(main):002:0> doc=Nokogiri(open("http://maps.google....))
=> <?xml version="1.0"?>
<!DOCTYPE html>
<html/>

irb(main):003:0> doc/"a"
=>

Same with Nokogiri.Hpricot:

irb(main):004:0> doc=Nokogiri.Hpricot(open("http://maps.google....))
=> <?xml version="1.0"?>
<!DOCTYPE html>
<html/>

However with regular Hpricot:

irb(main):009:0> (Hpricot(open("http://maps.google....))/"a").size
=> 53
(the full post of course is too long, so I just showed something simpler)


Hpricot by itself of course works. I tried looking, but there's not much in
the way of documentation or blog posts on something like this.

Any suggestions/explanations will be welcome as I like Nokogiri's speed very
much.

I am using:

moonwolf@trantor:~/ruby$ gem list --local | grep -i nokogiri
nokogiri (1.2.3)
moonwolf@trantor:~/ruby$ ruby --version
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]


Jayanth

3 Answers

Aaron Patterson

5/7/2009 7:03:00 AM

Nokogiri detects the XML header and parses the document as XML. If you force
it to use the HTML parser, you may be more successful:

>> (Nokogiri::HTML(open("http://maps.google....))/'a').length
=> 53
>>
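
The same idea can be tried offline, without fetching the page. This is a minimal sketch, assuming the response body starts with an XML declaration as the irb output earlier in the thread suggests. `xml_declared?` and `parse_forcing_html` are hypothetical helper names, `SAMPLE` is made-up markup, and the prefix check only illustrates the kind of header involved; it is not Nokogiri's actual detection logic.

```ruby
# Hypothetical helper: true when the markup starts with an XML declaration,
# the kind of header that can steer a generic parse toward XML mode.
def xml_declared?(markup)
  markup.lstrip.start_with?('<?xml')
end

# Hypothetical helper: always use the HTML parser, whatever the header says.
# Needs the nokogiri gem; Nokogiri::HTML accepts a String or an IO.
def parse_forcing_html(markup)
  require 'nokogiri'
  Nokogiri::HTML(markup)
end

# Made-up stand-in for a page that carries an XML declaration.
SAMPLE = <<~PAGE
  <?xml version="1.0"?>
  <!DOCTYPE html>
  <html><body><a href="/one">one</a> <a href="/two">two</a></body></html>
PAGE

puts "XML declaration present: #{xml_declared?(SAMPLE)}"

begin
  doc = parse_forcing_html(SAMPLE)
  link_count = (doc / 'a').length   # same search as in the thread
  puts "links found by the HTML parser: #{link_count}"
rescue LoadError
  warn 'nokogiri is not installed; skipping the parse'
end
```

Calling `Nokogiri::HTML` explicitly takes the choice of parser out of the library's hands, which is the point of the suggestion above.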

--
Aaron Patterson
http://tenderlovem...

Srijayanth Sridhar

5/7/2009 7:06:00 AM

Thanks Aaron.

Jayanth

Srijayanth Sridhar

5/7/2009 7:08:00 AM

Whoops,

irb(main):015:0> (Nokogiri::HTML(open("http://maps.g..."))/'a').length
=> 0

Not sure what the deal is.
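
One way to narrow this down is to look at what was actually fetched before handing it to a parser. The sketch below is not a diagnosis of the result above. `describe_response` is a hypothetical helper, and a `StringIO` stands in for the fetched page so that it runs without network access.

```ruby
require 'stringio'

# Hypothetical helper: summarise a fetched response before parsing it.
# `io` is anything that responds to #read, such as the object open-uri returns.
def describe_response(io, preview_bytes = 200)
  body = io.read
  {
    bytes: body.bytesize,
    # open-uri responses expose #content_type; a plain StringIO does not.
    content_type: io.respond_to?(:content_type) ? io.content_type : nil,
    preview: body[0, preview_bytes]
  }
end

# Stand-in for a fetched page (assumed content, modelled on the irb output).
fake = StringIO.new(%(<?xml version="1.0"?>\n<!DOCTYPE html>\n<html/>\n))
info = describe_response(fake)

puts "bytes: #{info[:bytes]}"
puts "content type: #{info[:content_type].inspect}"
puts info[:preview]
```

With a real fetch, the object returned by open-uri would be passed in place of `fake`. On current Ruby versions that call is `URI.open(url)` rather than the bare `open(url)` used in this thread.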

Jayanth
