comp.lang.ruby

Hpricot getting a table

lrlebron@gmail.com

4/18/2007 3:43:00 PM

I am currently trying to scrape some data from the following web page

I am using some Hpricot code that looks like this:

@doc = Hpricot(open(strLink))

@doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table") do |data|
  puts data
end

At this point data contains html that looks like this

<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>

More stuff continues ......

I want to capture each of these four tables individually for further
processing. I have tried a variety of methods but nothing seems to
work.

Thanks,
Luis

4 Answers

Drew Raines

4/18/2007 4:12:00 PM

lrlebron@gmail.com wrote:

> At this point data contains html that looks like this
>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
>
> More stuff continues ......
>
> I want capture each of these four tables individually for further
> processing. I have tried a variety of methods but nothing seems to
> work.

Would something as simple as this work? I'm not sure how complex
your tables get.

#!/usr/bin/env ruby

require "hpricot"

doc = Hpricot("<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>")

(doc/"table").map {|t| puts t.to_html}

This outputs:

<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>

Note that there's an Hpricot mailing list at
http://code.whytheluckystiff.ne... that might be a more
appropriate forum for these questions.

-Drew

lrlebron@gmail.com

4/18/2007 6:53:00 PM

On Apr 18, 11:11 am, Drew Raines <aarai...@gmail.com> wrote:
> lrleb...@gmail.com wrote:
> > At this point data contains html that looks like this
>
> > <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> > <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> > <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> > <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
>
> > More stuff continues ......
>
> > I want capture each of these four tables individually for further
> > processing. I have tried a variety of methods but nothing seems to
> > work.
>
> Would something as simple as this work? I'm not sure how complex
> your tables get.
>
> #!/usr/bin/env ruby
>
> require "hpricot"
>
> doc = Hpricot("<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>")
>
> (doc/"table").map {|t| puts t.to_html}
>
> This outputs:
>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
>
> Note that there's an Hpricot mailing list at
> http://code.whytheluckystiff.net/hp... that might be a more
> appropriate forum for these questions.
>
> -Drew

Thanks,

This gets me a lot closer to what I need.
I'm having some problems with the syntax. If I'm reading the docs
correctly, map returns an array, so I should be able to do something
like

arrTables = (doc/"tables").map

And then access each table individually. For example

arrTables[0]

Luis

Peter Szinek

4/18/2007 7:29:00 PM

lrlebron@gmail.com wrote:
> I am currently trying to scrape some data from the following web page
>
> I am using some hpricot code that looks like this
> @doc = Hpricot(open(strLink))
>
> @doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table") do |data|
> puts data
> end
>
> At this point data contains html that looks like this
>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
> <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
>
> More stuff continues ......
>
> I want capture each of these four tables individually for further
> processing. I have tried a variety of methods but nothing seems to
> work.

What exactly are you trying to do? What should the result be?
Could you please provide some real data? The 'Stuff' placeholders do not
make much sense :-)


Thanks,
Peter
__
http://www.rubyra... :: Ruby and Web2.0 blog
http://s... :: Ruby web scraping framework
http://rubykitch... :: The indexed archive of all things Ruby



Drew Raines

4/18/2007 7:55:00 PM

lrlebron@gmail.com wrote:

> This gets me a lot closer to what I need.
> I'm having some problems with syntax. If I'm reading the docs
> correctly map returns an array. So I should be able to do something
> like
>
> arrTables = (doc/"tables").map
>
> And then access each table individually. For example
>
> arrTables[0]

Array#map wasn't particularly relevant in my example; I just used it
to iterate #puts over the result of (doc/"table").
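
To make the map/each distinction concrete, here is a minimal sketch in plain Ruby. The `tables` array is a hypothetical stand-in for the result of (doc/"table"); it assumes that result can be indexed like an ordinary Array, which I believe holds for Hpricot's element collections but have not re-checked. Called without a block, as in the quoted attempt, map does not give you an array of tables (on current Ruby it returns an Enumerator).

```ruby
# Hypothetical stand-in for the result of (doc/"table"): a plain Array of
# HTML strings. Assumption: the real search result is indexable the same way.
tables = [
  "<table><tr><td>A</td></tr></table>",
  "<table><tr><td>B</td></tr></table>",
]

# each is the idiomatic choice when the block only has a side effect (puts).
tables.each { |t| puts t }

# map is for building a new array out of the old one.
lengths = tables.map { |t| t.length }

# Individual tables are available by index; no map call is needed for that.
first_table = tables[0]
last_table  = tables[-1]
```

In other words, keeping the search result in a variable and indexing it directly should already give access to each table.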

The real lesson to glean from my response is that:

(doc/"table")

is much nicer than:

doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table")

...which, as Peter alluded to, is fairly meaningless to us because
we don't know what the full original HTML looks like. You can grab
the <table>s from the snippet you provided with just a CSS-style
search[1].
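
The difference between the two styles can be sketched as follows. This is not Hpricot code: it swaps in REXML (shipped with Ruby) and XPath so that it runs without extra gems, and the markup is invented for illustration. The `//table` query plays the role of (doc/"table") here. Note that a descendant search also matches layout or nested tables, so on a real page it may return more than the tables you are after.

```ruby
require "rexml/document"

# Hypothetical page: the tables of interest sit a few levels deep.
html = <<~HTML
  <html><body>
    <div><div>
      <table><tr><td>First</td></tr></table>
      <table><tr><td>Second</td></tr></table>
    </div></div>
  </body></html>
HTML

doc = REXML::Document.new(html)

# Absolute path: spells out every ancestor, so it breaks if the layout changes.
by_path = REXML::XPath.match(doc, "/html/body/div/div/table")

# Descendant search: finds <table> elements at any depth.
by_name = REXML::XPath.match(doc, "//table")

puts by_path.size
puts by_name.size
```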

FWIW, if you have any control over the markup, you can simply add
some unique classes to the tables you want:

<table class="foo">...</table>
<table class="foo">...</table>
<table class="foo">...</table>

Then do:

(doc/"table.foo")

That'll work regardless of how many <table>s are on the page.
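
A rough illustration of the class-based approach, again using REXML and XPath as a stand-in rather than Hpricot itself. The markup, the wrapper <div> (REXML needs a single root element), and the variable names are all made up for the example, and the XPath test only matches when the class attribute is exactly "foo".

```ruby
require "rexml/document"

# Hypothetical markup: two tables tagged with class "foo" and one without.
html = <<~HTML
  <div>
    <table class="foo"><tr><td>One</td></tr></table>
    <table><tr><td>Skip me</td></tr></table>
    <table class="foo"><tr><td>Two</td></tr></table>
  </div>
HTML

doc = REXML::Document.new(html)

# XPath counterpart of the CSS-style selector "table.foo" (exact match only).
foo_tables = REXML::XPath.match(doc, "//table[@class='foo']")

# Each selected table is a separate element that can be processed on its own.
foo_tables.each { |t| puts t.to_s }
```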

-Drew

Footnotes:
[1] http://lnk.nu/code.whytheluckysti...