Asp Forum - Re: hrpicot - cant extract what i want from page

Dan Diebolt

3/28/2008 10:42:00 AM


Firebug inserts tbody steps into XPaths that reach into tables, even if the <tbody> tag is not in the HTML source. Try removing the tbody step from the path, and debug with shorter XPaths that first address content further up in the hierarchy.
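A minimal sketch of that clean-up, assuming the path was copied from Firebug as a plain string. The helper names strip_tbody and xpath_prefixes are made up for illustration and are not part of Hpricot:

```ruby
# Remove the tbody steps that Firebug adds to a copied XPath.
# Handles both "/tbody" and indexed forms such as "/tbody[1]".
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

# Build progressively longer prefixes of a path so that each one can be
# tried in turn, starting near the top of the hierarchy.
def xpath_prefixes(xpath)
  steps = xpath.split('/').reject(&:empty?)
  (1..steps.length).map { |n| '/' + steps.first(n).join('/') }
end

# Example path of the kind Firebug produces (illustrative only).
firebug_path = '/html/body/div[2]/table/tbody/tr[3]/td[2]'
clean_path = strip_tbody(firebug_path)
puts clean_path
xpath_prefixes(clean_path).each { |prefix| puts prefix }
```

Each prefix could then be passed to doc.search in turn; the first one that comes back empty shows roughly where the path stops matching the real markup.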

You might have some success addressing text nodes combined with some subsequent regexp processing:

b = doc.search("//text()")
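A rough sketch of the regexp step, assuming the items returned by the search respond to to_s. The helper name matching_texts, the sample strings and the price pattern are placeholders for illustration, not data from the page in question:

```ruby
# texts: anything enumerable whose items respond to to_s, for example the
# result of doc.search("//text()") shown above.
def matching_texts(texts, pattern)
  texts.map { |node| node.to_s.strip } # normalise whitespace around each node
       .reject(&:empty?)               # drop whitespace-only text nodes
       .grep(pattern)                  # keep only the strings that match
end

# Stand-in strings for illustration only; real input would come from the page.
sample = ["\n  ", 'Widget A', ' $19.99 ', 'In stock', "$5.00\n"]
prices = matching_texts(sample, /\A\$\d+\.\d{2}\z/)
puts prices.inspect
```

Stripping before matching matters because text nodes pulled out of table markup are often padded with newlines and indentation.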

I think you might be more successful with a CSS selector than with an XPath selector. Hpricot does not support all XPath axes, but you can sometimes address the same elements with a clever CSS selector.

It can be a challenge to use hpricot with malformed HTML, or when there are no containers wrapping items that otherwise appear visually as a list or table. I haven't tried it yet, but running the HTML through something like tidy before parsing might create some of the missing structure.

4 Answers

Thomas Wieczorek

3/28/2008 11:10:00 AM

On Fri, Mar 28, 2008 at 11:42 AM, Dan Diebolt <dandiebolt@yahoo.com> wrote:
> Firebug puts in tbody's into xpath's that reach into tables even if the <tbody> tag is not in the html source. Try removing the tbody path and debug using shorter xpaths to initially address content further up in the hierarchy.
>

Yes, Firefox does this to make the markup more (X)HTML-conformant. It took me a
while to get the hang of it. You might download the page using
open-uri, open it with your favourite editor, search for the text and
work your way up through the tags.
Most sites don't use <tbody>, so just try it without it.
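The "search the text and work your way up" step can also be roughed out in code. This is only a sketch: enclosing_tags is a made-up helper, it is a crude tag scanner rather than an HTML parser, and the sample markup stands in for a page that would really be fetched with open-uri:

```ruby
# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = %w[br hr img input meta link area base col].freeze

# Return the names of the open tags that enclose the first occurrence of
# needle, outermost first. Returns [] when needle is not found.
def enclosing_tags(html, needle)
  position = html.index(needle)
  return [] if position.nil?

  stack = []
  tag_pattern = %r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>}
  html[0...position].scan(tag_pattern) do |closing, name, self_closing|
    name = name.downcase
    next if VOID_TAGS.include?(name) || self_closing == '/'

    if closing == '/'
      # Pop back to the matching open tag, if there is one.
      index = stack.rindex(name)
      stack.slice!(index..-1) if index
    else
      stack.push(name)
    end
  end
  stack
end

# Stand-in markup for illustration only.
page = '<html><body><table><tr><td>Name</td><td><b>Price: 42</b></td></tr></table></body></html>'
puts '/' + enclosing_tags(page, 'Price').join('/')
```

Because the path is built from the source text rather than from the browser's DOM, it only contains a tbody step when the page really has one.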

Adam Akhtar

3/28/2008 4:47:00 PM

OK, I have tried taking out the tbody tags completely and got some of the
text back. I'll experiment to see if I can get all of it.

Re: Tidy

I installed the gem and got the example code:

require 'tidy'
Tidy.path = '/usr/lib/libtidy.so'
html = '<html><title>title</title>Body</html>'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  puts tidy.options.show_warnings
  xml = tidy.clean(html)
  puts tidy.errors
  puts tidy.diagnostics
  xml
end
puts xml

Now I have to change the path to wherever the lib is... well, I found
tidy's folder in my lib directory and changed the above to this:

Tidy.path = 'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib'

and it's complaining, saying no such file... I tried

Tidy.path = 'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib.rb'

as that's the proper extension of the tidylib file, but again it won't
work.

I can't find any tidylib file with the extension .so

Banging my head even more now ;-)

--
Posted via http://www.ruby-....

Adam Akhtar

3/28/2008 4:55:00 PM

Just downloaded a dll, which was what I needed. Why doesn't that come with the
******* gem?
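For anyone hitting the same wall: Tidy.path has to point at the native Tidy library (a .so, .dll or .dylib file), not at a Ruby file inside the gem, which is why tidylib.rb was rejected and the downloaded dll helped. A small sketch for picking the path, where find_tidy_library is a made-up helper and every candidate location is a guess to be replaced with wherever the library really lives:

```ruby
# Hypothetical candidate locations; adjust to the actual machine.
TIDY_LIBRARY_CANDIDATES = [
  '/usr/lib/libtidy.so',
  '/usr/local/lib/libtidy.dylib',
  'C:\ruby\bin\tidy.dll'
].freeze

# Extensions that indicate a native shared library rather than Ruby source.
NATIVE_EXTENSIONS = %w[.so .dll .dylib].freeze

# Return the first candidate that exists and looks like a native library,
# or nil when none is found.
def find_tidy_library(candidates = TIDY_LIBRARY_CANDIDATES)
  candidates.find do |path|
    NATIVE_EXTENSIONS.include?(File.extname(path).downcase) && File.file?(path)
  end
end

library = find_tidy_library
if library
  puts "Tidy.path could be set to #{library}"
else
  puts 'No native Tidy library found in the candidate list'
end
```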

Phlip

3/28/2008 5:50:00 PM

Adam Akhtar wrote:

> ok i have tried taking out the tbody tags completely and got some of the
> text back. Ill experiment to see if i can get all of it.
>
> Re: Tidy

Here are indirect tips on scraping HTML with Hpricot or REXML:

http://www.oreillynet.com/onlamp/blog/2007/08/assert_hpri...
http://www.oreillynet.com/onlamp/blog/2007/08/xpath_checker_and_assert_...

"indirect" because they focus on unit testing, not product-side code!

I would not use the Tidy plugin for this situation. I would...

- write the HTML to a Tempfile
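- call `tidy -asxhtml -m #{filename}` (-m makes tidy rewrite the file in place, so read the XHTML back from the file rather than capturing the backtick output)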
- use Libxml or REXML to read the output document
- harvest the data with XPath.

That's a lot of work - and you certainly don't need it all if your code has
no need for absolute robustness - but it gets over these impediments:

- Hpricot can't do elaborate XPaths
- REXML and Libxml can't forgive ill-formed HTML*
- the Tidy plugin is poorly productized.

* actually, I have not yet researched Libxml enough to learn how to turn its
forgiveness knobs.
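The Tempfile-and-tidy steps above might be sketched like this. It assumes the tidy executable is on the PATH and uses REXML for the last two steps; the helper names are invented for illustration. Because -m rewrites the file in place, the cleaned XHTML is read back from the file instead of from the command's output:

```ruby
require 'tempfile'

# Write the raw HTML to a Tempfile and yield its path to the block.
def with_html_tempfile(html)
  Tempfile.create(['page', '.html']) do |file|
    file.write(html)
    file.flush
    yield file.path
  end
end

# Command line for the tidy executable, as an argument list.
def tidy_command(filename)
  ['tidy', '-asxhtml', '-quiet', '-m', filename]
end

# Run tidy over the HTML and return whatever is in the file afterwards.
# tidy's exit status is ignored here; it is non-zero even for warnings.
def tidy_to_xhtml(html)
  with_html_tempfile(html) do |path|
    system(*tidy_command(path), err: File::NULL)
    File.read(path)
  end
end

# Harvest the text of matching nodes with XPath. REXML is required lazily
# so the tidy steps can be used without it.
def harvest(xhtml, xpath)
  require 'rexml/document'
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end
```

A call might look like harvest(tidy_to_xhtml(raw_html), "//*[local-name()='td']"). Matching on local-name() is one way to avoid dealing with the XHTML namespace that tidy puts on the root element.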

> banging my head even more now ;-)

Welcome to Ruby!

--
Phlip

comp.lang.ruby
