Michael Neumann
1/23/2005 5:06:00 PM
James Britt wrote:
> Michael Neumann wrote:
>
>> James Britt wrote:
>>
>>> I had to hack Mechanize to have it grab 'p' elements, but it is dead
>>> easy to do.
>>
>> What exactly did you have to hack? If it's worthwhile, I'll add it to the lib.
>
> At first, I just used the built-in 'links' property to get the search
> result links. That sort of worked; I could get an array of URLs, but
> they had no descriptive context. Looking at the HTML coming back from
> Google I saw I really needed the 'p' elements that held the search
> result URL + the description.
I've added a find_all_recursive method to REXML::Node.
This should just return all paragraph nodes:
root.find_all_recursive {|n| n.name == 'p'}
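
For readers without that version of the library at hand, here is a minimal sketch of what such a recursive search might look like. It is written as a standalone helper rather than a patch to REXML::Node, and it is only an assumption about the behaviour described above, not the code that was actually added to Mechanize. The sample document is made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the find_all_recursive method mentioned above.
# Walks every descendant element depth-first and collects those for which
# the block returns true.
def find_all_recursive(node, found = [], &block)
  node.each_element do |child|
    found << child if block.call(child)
    find_all_recursive(child, found, &block)
  end
  found
end

# Made-up sample document, only for demonstration.
doc = REXML::Document.new(
  '<html><body><p>one</p><div><p>two</p><a href="#">x</a></div></body></html>'
)

paragraphs = find_all_recursive(doc.root) { |n| n.name == 'p' }
paragraph_texts = paragraphs.map(&:text)
```

Here paragraphs holds the matching REXML::Element objects in document order, so nested 'p' elements are found as well as top-level ones.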
> As best I could tell, the Page object has only a few built-in arrays
> (links, forms, maybe another, I don't recall) that get populated when
> calling parse_html. Adding another array, and telling parse_html to
> populate this array, was super easy.
I'll add more if there's a need for it.
> In retrospect I think I could have done some sort of XPath query over
> the tree of nodes held by the Page object, but I just took what seemed to
> be the easiest route at the time. (besides, XPath over the full node
> set is going to be slower than simply assembling a set of particular
> nodes on the first pass over the document done by parse_html.)
>
> Where parse_html has:
>
>   when 'a'
>     @links << Link.new(node)
>
> I added in
>
>   when 'p'
>     @paragraphs << Para.new(node)
>
> The Para class is nothing more than a wrapper for a generic node.
You could just collect the nodes themselves. I see no need for a special
Para class...
> I then ask for page.paragraphs and grab the ones I want.
>
>
> BTW, while writing this post, I started thinking about my hackish
> implementation, and ended up replacing it with an arguably less hackish
> implementation, one that lets you do this:
>
>
> agent = WWW::Mechanize.new {|a|
>   a.log = Logger.new(STDERR)
> }
> agent.watch_for_set = { 'style' => Style, 'p' => Para }
> page = agent.get( url )
> page.body
> paragraphs = page.elements[ 'p' ]
> styles = page.elements[ 'style' ]
ah, that looks nice.
Here's my idea:
# nil === just return the node
agent.watch_for_set = { 'style' => nil, 'p' => Para }
agent.watches['p']
I'd expect #elements to behave like the #elements method of a
REXML::Node, so better use #watches.
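
To make the proposal concrete, here is a rough sketch of how a watch table with nil entries could be applied in a single pass over the tree. The collect_watches helper, the Para struct, and the sample document are all hypothetical names made up for illustration. This is not how Mechanize implements it, just one way the idea could work.

```ruby
require 'rexml/document'

# Hypothetical wrapper class, standing in for the Para class from the quote.
Para = Struct.new(:node)

# Hypothetical helper: walks the tree once and groups watched elements by
# tag name. A nil wrapper means "just keep the raw node".
def collect_watches(root, watch_for_set)
  watches = Hash.new { |hash, key| hash[key] = [] }
  visit = lambda do |element|
    if watch_for_set.key?(element.name)
      wrapper = watch_for_set[element.name]
      watches[element.name] << (wrapper ? wrapper.new(element) : element)
    end
    element.each_element { |child| visit.call(child) }
  end
  visit.call(root)
  watches
end

# Made-up sample document, only for demonstration.
doc = REXML::Document.new(
  '<html><head><style>p{}</style></head><body><p>hi</p><p>there</p></body></html>'
)

watches = collect_watches(doc.root, { 'style' => nil, 'p' => Para })
```

With this, watches['p'] holds Para wrappers and watches['style'] holds plain REXML elements, which matches the "nil means just return the node" convention.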
Regards,
Michael