comp.lang.ruby

hpricot and xpath doesn't work like they should ?!?

mrpink

7/29/2007 6:15:00 PM

hi,
I wanted to write a little console TV guide with Ruby and Hpricot. I
installed the Firefox XPath Checker plugin and went to
http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEv... . Then
I checked the XPath of one of the channel cells, such as ZDF, and got:

/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]

So I tried to search the page with this XPath and print the matches, but I
get no output at all. Here's the code:

#!/usr/bin/env ruby

$VERBOSE = true

require 'hpricot'
require 'net/http'

url = URI.parse('http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEv...')
# request_uri keeps the query string; url.path alone would drop it
req = Net::HTTP::Get.new(url.request_uri)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

tv = Hpricot(res.body)
# the block must start on the same line as .each, or Ruby will not attach it
tv.search("/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]").each { |a| puts a }

#eof


Am I using hpricot in the wrong way? I thought it could handle xpaths?


--
greets

one must still have chaos in oneself to be able to
give birth to a dancing star

Phlip

7/29/2007 7:35:00 PM


anansi wrote:

> Am I using hpricot in the wrong way? I thought it could handle xpaths?

Briefly, I suspect Hpricot uses an XPath subset invented on the fly to
permit querying into the HTML node space.

(This isn't a bad thing; the alternative, REXML::XPath, cannot handle some
well-formed XHTML [according to Tidy], and certainly can't handle
traditional HTML.)

(BTW: When I tried to install Hpricot 6 (ruby) on Kubuntu, the require
'hpricot' refused to find it. This might indicate a broken .so file, so I
switched to Windows.)

The best way to use XPath is to locate tags by a unique id=''. (The page you
used repeats the same id on several elements, as if ids were classes, so it's
ill-formed. But that's not your problem here.)
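
To illustrate the id-versus-position point, here is a minimal sketch. It
uses REXML on a tiny well-formed snippet only so that it runs without extra
gems; the markup and the ids (channel-zdf, channel-ard) are invented for the
example and are not taken from klack.de.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a listings page; the ids are invented.
SAMPLE = <<~HTML
  <html><body>
    <table>
      <tr><th id="channel-zdf">ZDF</th><th id="channel-ard">ARD</th></tr>
    </table>
  </body></html>
HTML

doc = REXML::Document.new(SAMPLE)

# Fragile: depends on the exact nesting and order of every ancestor.
by_position = REXML::XPath.first(doc, '/html/body/table/tr[1]/th[1]')

# Sturdier: depends only on the id staying the same.
by_id = REXML::XPath.first(doc, "//th[@id='channel-zdf']")

puts by_position.text
puts by_id.text
```

Both lookups find the same cell here, but only the second one would survive
the page gaining an extra wrapper table or a reordered column.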

Don't use long XPath chains (even if an XPath visualizer provides them),
because these locate things by incidental features that could change when
you hit the page again. Table elements could come and go on the fly.

When I installed that XPath Checker (thanks for pointing it out!) and hit
that page, your XPath selects ZDF, so this implicates Hpricot.
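
One possible cause worth ruling out, which nothing in this thread confirms
for this page: Firefox adds tbody elements to its DOM for tables whose markup
omits them, so a path copied from a browser tool can contain tbody steps that
do not exist in the raw HTML a scraper downloads. A rough sketch of stripping
those steps follows. strip_tbody is a made-up helper name, and I have not
verified how Hpricot treats the positional predicates that remain.

```ruby
# Remove every tbody step (with or without a positional predicate) from an XPath.
# Hypothetical helper; it only rewrites the string and knows nothing about the page.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

browser_path = '/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td'
puts strip_tbody(browser_path)
```

If the stripped path still matches nothing, the shorter attribute-based
queries are the safer route anyway.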

Let's find a workaround. If I want to hit, say, "Hotel Zack und Cody", I use
Firebug's Inspect Element context menu feature, and see that blurb has a <td
title="19:45 Hotel Zack und Cody">. So if I XPath for things like that, we
get:

//td[ @title ]

That sweeps for every td with a title attribute. (The View XPath feature
should have an option to find minimal and unique paths based on attributes,
not long obsessive paths based on indices.)

And that works in Hpricot, too, to select every cell with a title. Further
poking and parsing should get you the raw TV listings.

tv.search("//td[ @title ]").each{ |a| p a}
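
As a sketch of the "further poking and parsing": assuming every title looks
like the one example above ("19:45 Hotel Zack und Cody", a start time, then
the programme name), the strings can be split as below. parse_listing is a
made-up helper, and the format is a guess from that single sample. With
Hpricot, the title strings themselves would presumably come from a['title']
on each matched cell.

```ruby
# Split a title such as "19:45 Hotel Zack und Cody" into start time and name.
# Returns nil when the string does not begin with an HH:MM time.
def parse_listing(title)
  match = title.strip.match(/\A(\d{1,2}:\d{2})\s+(.+)\z/)
  return nil unless match

  { time: match[1], name: match[2] }
end

titles = ['19:45 Hotel Zack und Cody', 'no time here']
titles.each { |t| p parse_listing(t) }
```

Cells whose title does not start with a time come back as nil, so they are
easy to filter out before printing the guide.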

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual data
feeds available somewhere?

--
Phlip
http://www.oreilly.com/catalog/9780...
"Test Driven Ajax (on Rails)"
assert_xpath, assert_javascript, & assert_ajax


mrpink

7/29/2007 8:06:00 PM


Phlip wrote:
> BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
> data feeds available somewhere?
Thanks for the hint about the id attributes, but what do you mean by this?
RSS feeds? I'm not aware of any.



