Asp Forum - hpricot parsing

Marc Farber

4/19/2009 4:12:00 PM

Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/sec... using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate thru this div, plucking an array or hash from
which I can feed a database, with each record being a funeral home and
person. I was thinking I could go thru each of the @grab1 elements,
process each according to tag type, and establish the "record" logic
simply by knowing that a new record starts with each new h3 tag.

Any help for a newbie with first Ruby script?

Thx
--
Posted via http://www.ruby-....

5 Answers

7stud --

4/19/2009 11:21:00 PM

Marc Farber wrote:
> Ruby newbie here
>
> Have successfully used hpricot to scrape correct <div> from desired page
> http://www.montgomeryadvertiser.com/sec... using
>
> doc = Hpricot(uri above)
> ...
> @grab1 = doc.search("//div[@class='article-bodytext']")
>
> target data is in following logical form
>
> <div>
> <h3>name of funeral home</h3>
> <p>deceased1</p>
> <div>advertising crap</div>
> <h3>funeral home 2</h3>
> <p>deceased 2</p>
> <p>deceased 3</p>
> </div>
>
> I'm struggling to iterate thru this div..
> I [want to insert a record into a table with each] record being a funeral home and person.
> I was thinking I could go thru each of the @grab1 elements and process
> according to tag type:

These methods seem like the ones you need:

elm.next_sibling (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require 'hpricot'

str =<<ENDOFSTRING
<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end

end

--output:--
name of funeral home
         deceased1
funeral home 2
         deceased 2
funeral home 2
         deceased 3
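
For feeding a database, the same idea can collect records instead of printing them. The following is only a sketch of the "new record starts with each h3" logic from the question. It runs on a hypothetical list of [tag name, text] pairs standing in for the div's child elements (with Hpricot these would come from each element's name and inner_text), and build_records is an illustrative name, not part of any library. Unlike the sibling walk above, which stops at the first non-p tag, this version keeps the current funeral home across interruptions such as the advertising div; whether that suits the real page is an assumption.

```ruby
# Hypothetical stand-in for the children of the scraped div:
# each pair is [tag name, inner text].
nodes = [
  ["h3",  "name of funeral home"],
  ["p",   "deceased1"],
  ["div", "advertising crap"],
  ["h3",  "funeral home 2"],
  ["p",   "deceased 2"],
  ["p",   "deceased 3"]
]

# Walk the pairs in document order. An h3 sets the current funeral home,
# a p adds one record for that home, and any other tag is ignored.
def build_records(nodes)
  records = []
  current_home = nil

  nodes.each do |name, text|
    case name
    when "h3"
      current_home = text
    when "p"
      # Skip paragraphs that appear before the first h3.
      records << { funeral_home: current_home, deceased: text } if current_home
    end
  end

  records
end

records = build_records(nodes)
records.each { |r| puts "#{r[:funeral_home]}: #{r[:deceased]}" }
```

Each hash in records is one funeral home and person pair, which maps directly onto one database row.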

--
Posted via http://www.ruby-....

7stud --

4/19/2009 11:41:00 PM

7stud -- wrote:
> h3_tags.each do |h3|
>   elm = h3
>
>   while elm = elm.next_sibling
>     break if elm.name != 'p'
>
>     puts h3.inner_text
>     puts "\t #{elm.inner_text}"
>   end
>
> end
>
>

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end

--
Posted via http://www.ruby-....

Marc Farber

4/19/2009 11:50:00 PM

Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the "p" tags. I really appreciate your thoughtfulness to provide a
working code snippet.

Marc
--
Posted via http://www.ruby-....

Wang Jian

4/20/2009 2:04:00 AM

[Note: parts of this message were removed to make it a legal post.]

Makes me wonder if REXML, Hpricot or Nokogiri have a to_hash method... I
haven't found one yet.
I'd also be glad to know.
2009/4/20 Marc Farber <mrcfab3@gmail.com>

> Ruby newbie here
>
> Have successfully used hpricot to scrape correct <div> from desired page
> http://www.montgomeryadvertiser.com/sec... using
>
> doc = Hpricot(uri above)
> ...
> @grab1 = doc.search("//div[@class='article-bodytext']")
>
> target data is in following logical form
>
> <div>
> <h3>name of funeral home</h3>
> <p>deceased1</p>
> <div>advertising crap</div>
> <h3>funeral home 2</h3>
> <p>deceased 2</p>
> <p>deceased 3</p>
> </div>
>
> I'm struggling to iterate thru this div, plucking an array or hash from
> which I can feed a database, with each record being a funeral home and
> person. I was thinking I could go thru each of the @grab1 elements,
> process each according to tag type, and establish the "record" logic
> simply by knowing that a new record starts with each new h3 tag.
>
> Any help for a newbie with first Ruby script?
>
>
> Thx
> --
> Posted via http://www.ruby-....
>
>

Phlip

4/20/2009 2:20:00 AM

Wang Jian wrote:

> Makes me wonder if REXML, Hpricot or Nokogiri have a to_hash method... I
> haven't found one yet.

Try to write it. I hope I'm wrong, but I suspect that starting will be easy, and
hitting your own target XML will be easy...

....but making it generic enough to publish will be another story!
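
To make that concrete, here is a minimal sketch of a naive conversion, written against REXML only because it ships with Ruby; element_to_hash is a made-up helper, not a method of REXML, Hpricot or Nokogiri. It assumes well-formed input, ignores attributes and mixed content, and folds repeated child names into arrays.

```ruby
require "rexml/document"

# Naive conversion: a leaf element becomes its stripped text, any other
# element becomes a hash keyed by child tag name. Repeated tag names are
# collected into an array. Attributes and mixed text are ignored.
def element_to_hash(element)
  children = element.elements.to_a
  return element.text.to_s.strip if children.empty?

  children.each_with_object({}) do |child, hash|
    value = element_to_hash(child)
    if hash.key?(child.name)
      existing = hash[child.name]
      # Values are only ever String or Hash until a name repeats,
      # so an Array here always means "already collecting".
      hash[child.name] = existing.is_a?(Array) ? existing << value : [existing, value]
    else
      hash[child.name] = value
    end
  end
end

xml = <<ENDOFSTRING
<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>
ENDOFSTRING

result = element_to_hash(REXML::Document.new(xml).root)
p result
```

On the sample above, all the h3 texts end up under one key and all the p texts under another, so the link between a funeral home and its deceased is lost. That is the sort of thing that makes a generic version hard.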

--
Phlip

comp.lang.ruby
