comp.lang.ruby

Fun with WWW::Mechanize

James Britt

1/23/2005 4:45:00 AM

I thought I would see about adding some search function to ruby-doc, and
ended up taking Mike Neumann's WWW::Mechanize[0] for a test drive.

Sweet it be, as it took almost no time to get running code that takes
search words, queries Google, parses the results, and creates a new
page.

Try it here:

http://www.ruby-doc.org/gs.rb/REXML%20...
http://www.ruby-doc.org/gs....
http://www.ruby-doc.org/gs.rb/testu...

I had to hack Mechanize to have it grab 'p' elements, but it is dead
easy to do.

Nice work, Herr Neumann. And a tip of the hat to the folks behind Narf,
whose htmltools are needed for WWW::Mechanize.

James

[0] http://www.ntecs.de/blog/Blog/WWW-Mech...


7 Answers

Michael Neumann

1/23/2005 12:47:00 PM


James Britt wrote:
> I thought I would see about adding some search function to ruby-doc, and
> ended up taking Mike Neumann's WWW::Mechanize[0] for a test drive.

Nice to hear ;-)

> Sweet it be, as it took almost no time to get running code that takes
> search words, queries Google, parses the results, and creates a new
> page.
>
> Try it here:
>
> http://www.ruby-doc.org/gs.rb/REXML%20...
> http://www.ruby-doc.org/gs....
> http://www.ruby-doc.org/gs.rb/testu...
>
> I had to hack Mechanize to have it grab 'p' elements, but it is dead
> easy to do.

What exactly did you have to hack? If it's worthwhile, I'll add it to the lib.


Regards,

Michael


Austin Ziegler

1/23/2005 3:56:00 PM


On Sun, 23 Jan 2005 13:45:02 +0900, James Britt
<jamesUNDERBARb@neurogami.com> wrote:
> I thought I would see about adding some search function to
> ruby-doc, and ended up taking Mike Neumann's WWW::Mechanize[0] for
> a test drive.
>
> Sweet it be, as it took almost no time to get running code that
> takes search words, queries Google, parses the results, and
> creates a new page.
>
> Try it here:
>
> http://www.ruby-doc.org/gs.rb/REXML%20...
> http://www.ruby-doc.org/gs....
> http://www.ruby-doc.org/gs.rb/testu...
>
> I had to hack Mechanize to have it grab 'p' elements, but it is
> dead easy to do.
>
> Nice work, Herr Neumann. And a tip of the hat to the folks behind
> Narf, whose htmltools are needed for WWW::Mechanize.

Very cool, James. Be warned, though, that Google frowns on "screen
scraping", preferring people to use the SOAP API.

> [0] http://www.ntecs.de/blog/Blog/WWW-Mech...

-austin
--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca


James Britt

1/23/2005 4:38:00 PM


Michael Neumann wrote:
> James Britt wrote:
>> I had to hack Mechanize to have it grab 'p' elements, but it is dead
>> easy to do.
>
> What exactly did you have to hack? If it's worthwhile, I'll add it to the lib.


At first, I just used the built-in 'links' property to get the search
result links. That sort of worked; I could get an array of URLs, but
they had no descriptive context. Looking at the HTML coming back from
Google I saw I really needed the 'p' elements that held the search
result URL + the description.

As best I could tell, the Page object has only a few built-in arrays
(links, forms, maybe another, I don't recall) that get populated when
calling parse_html. Adding another array, and telling parse_html to
populate this array, was super easy.

In retrospect I think I could have done some sort of XPath-thing over
the tree of nodes held by the Page object, but I just took what seemed
to be the easiest route at the time. (Besides, XPath over the full node
set is going to be slower than simply assembling a set of particular
nodes on the first pass over the document done by parse_html.)

Where parse_html has:

  when 'a'
    @links << Link.new(node)

I added in:

  when 'p'
    @paragraphs << Para.new(node)

The Para class is nothing more than a wrapper for a generic node.

I then ask for page.paragraphs and grab the ones I want.
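
A minimal version would be something like this (a sketch only; all the
class really needs to do is hold the node):

# Sketch of the wrapper described above. It satisfies the
# initialize(node) contract and renders the node back out as HTML.
class Para
  def initialize(node)
    @node = node
  end

  def to_s
    @node.to_s
  end
end

The calling code can then do, e.g., page.paragraphs.each { |para| puts para }.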


BTW, while writing this post, I started thinking about my hackish
implementation, and ended up replacing it with an arguably less hackish
implementation, one that lets you do this:


agent = WWW::Mechanize.new { |a|
  a.log = Logger.new(STDERR)
}
agent.watch_for_set = { 'style' => Style, 'p' => Para }
page = agent.get(url)
page.body
paragraphs = page.elements['p']
styles = page.elements['style']

You just have to have the calling code define the classes passed in as
part of the 'watch_for_set' hash. Each of these classes then has to
implement this constructor:

def initialize(node); end

It's up to each class then to extract what data it wants from the node.

So one could write a Style class that grabs the text value of the node
and makes each CSS selector available for inspection.
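
For instance, something roughly like this (a sketch; the selector
parsing is deliberately naive, and it assumes REXML's node API):

# Hypothetical Style class: grab the stylesheet text held by the
# style node and pull out the selectors with a simple regexp.
class Style
  attr_reader :text, :selectors

  def initialize(node)
    # Concatenate the text children of the style element.
    @text = node.texts.map { |t| t.value }.join
    # Take everything before each '{', then split selector groups on commas.
    raw = @text.scan(/([^{}]+)\{/).flatten
    @selectors = raw.map { |s| s.split(',') }.flatten.map { |s| s.strip }
  end
end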

(page.elements of course only has an array for each of those element
names passed in.)


James


James Britt

1/23/2005 4:48:00 PM


Austin Ziegler wrote:
>
> Very cool, James. Be warned, though, that Google frowns on "screen
> scraping", preferring people to use the SOAP API.

Yes, well that's one reason it is not linked from the main ruby-doc page.

I started off thinking I would just add a way to do a straight-up Google
search, limited to the ruby-doc.org site, but was unhappy with the
resulting page being, well, another site.

Framing the search results page seemed undesirable, so I thought about
scraping the results. Curiously enough, while putting this page up on
ruby-doc.org, I came across some old code that did actually use the
Google API, but apparently some required files were lost in assorted
site moves and upgrades.

Anyway, I thought this was a neat enough demo of how easy it is to use
Mechanize that I should share it. How the actual search page ends up is
another matter. Time to go find my Google API key, perhaps.

(Regarding Google frowning on scraping, I wondered if this was because
of volume, which I expect would be low, or because it typically means
omitting the sponsored ads. I figured I would add code to *keep* the
sponsored ads in the resulting page, figuring that's part of Google's
revenue model. Maybe I can just keep the entire Google search results
page, and simply insert a set of links to get back to ruby-doc.org. So
many choices.)

James


Michael Neumann

1/23/2005 5:06:00 PM


James Britt wrote:
> Michael Neumann wrote:
>
>> James Britt wrote:
>>
>>> I had to hack Mechanize to have it grab 'p' elements, but it is dead
>>> easy to do.
>>
>> What exactly did you have to hack? If it's worthwhile, I'll add it to the lib.
>
> At first, I just used the built-in 'links' property to get the search
> result links. That sort of worked; I could get an array of URLs, but
> they had no descriptive context. Looking at the HTML coming back from
> Google I saw I really needed the 'p' elements that held the search
> result URL + the description.

I've added a find_all_recursive method to REXML::Node.

This should just return all paragraph nodes:

root.find_all_recursive {|n| n.name == 'p'}
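
A sketch of how such a method can be written (the shipped version may
differ; it assumes the each_recursive traversal from rexml/node.rb):

require 'rexml/document'

module REXML
  module Node
    # Collect every descendant element for which the block returns
    # true, reusing REXML's each_recursive traversal.
    def find_all_recursive(&block)
      found = []
      each_recursive { |node| found << node if block.call(node) }
      found
    end
  end
end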

> As best I could tell, the Page object has only a few built-in arrays
> (links, forms, maybe another, I don't recall) that get populated when
> calling parse_html. Adding another array, and telling parse_html to
> populate this array, was super easy.

I'll add more if there's need for it.

> In retrospect I think I could have done some sort of XPath-thing over
> the tree of nodes held by the Page object, but I just took what seemed
> to be the easiest route at the time. (Besides, XPath over the full node
> set is going to be slower than simply assembling a set of particular
> nodes on the first pass over the document done by parse_html.)
>
> Where parse_html has:
>
>   when 'a'
>     @links << Link.new(node)
>
> I added in:
>
>   when 'p'
>     @paragraphs << Para.new(node)
>
> The Para class is nothing more than a wrapper for a generic node.

You could just collect the nodes themselves. I see no need for a special
Para class...

> I then ask for page.paragraphs and grab the ones I want.
>
>
> BTW, while writing this post, I started thinking about my hackish
> implementation, and ended up replacing it with an arguably less hackish
> implementation, one that lets you do this:
>
>
> agent = WWW::Mechanize.new { |a|
>   a.log = Logger.new(STDERR)
> }
> agent.watch_for_set = { 'style' => Style, 'p' => Para }
> page = agent.get(url)
> page.body
> paragraphs = page.elements['p']
> styles = page.elements['style']

Ah, that looks nice.

Here's my idea:

# nil === just return the node
agent.watch_for_set = { 'style' => nil, 'p' => Para }
agent.watches['p']

I'd expect #elements to behave like the #elements method of a
REXML::Node, so better use #watches.
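
Internally, the dispatch could then look roughly like this (a sketch
with illustrative names, not the actual implementation):

# Inside the parsing loop, for each element node encountered:
if @agent.watch_for_set.key?(node.name)
  klass = @agent.watch_for_set[node.name]
  @watches[node.name] ||= []
  # nil means: store the bare node; otherwise wrap it in the class.
  @watches[node.name] << (klass ? klass.new(node) : node)
end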

Regards,

Michael


Michael Neumann

1/23/2005 5:07:00 PM


James Britt wrote:
> Austin Ziegler wrote:
>
>>
>> Very cool, James. Be warned, though, that Google frowns on "screen
>> scraping", preferring people to use the SOAP API.
>
>
> Yes, well that's one reason it is not linked from the main ruby-doc page.
>
> I started off thinking I would just add a way to do a straight-up
> Google search, limited to the ruby-doc.org site, but was unhappy with
> the resulting page being, well, another site.
>
> Framing the search results page seemed undesirable, so I thought about
> scraping the results. Curiously enough, while putting this page up on
> ruby-doc.org, I came across some old code that did actually use the
> Google API, but apparently some required files were lost in assorted
> site moves and upgrades.
>
> Anyway, I thought this was a neat enough demo of how easy it is to use
> Mechanize that I should share it. How the actual search page ends up is
> another matter. Time to go find my Google API key, perhaps.

Would you like to share the code with us? Should I include it as an
example into WWW::Mechanize?

Regards,

Michael


James Britt

1/23/2005 5:42:00 PM


Michael Neumann wrote:
> James Britt wrote:
>> ..
>> Anyway, I thought this was a neat enough demo of how easy it is to use
>> Mechanize that I should share it. How the actual search page ends up
>> is another matter. Time to go find my Google API key, perhaps.
>
>
> Would you like to share the code with us? Should I include it as an
> example into WWW::Mechanize?

Sure. The live version uses the first pass at the Mechanize hack; the
runs-at-home version uses the more flexible version I wrote while
replying to your earlier post. ("Ruby: Ain't it cool?")

But that code is different from your suggestion (and, I gather,
implementation) of how else to do this (though in practice it is quite
similar).

So, yes, if Mechanize adopts a way to pass in a 'watch_for' set, and
then makes the results available via 'watches', then the Google scrape
code might make a good example, even if it never goes 'live' on
ruby-doc.org.

I'd just need to clean it up to use the most current API.

Note that root.find_all_recursive {|n| n.name == 'p'} would work as
well as what I do now; my Para class does nothing more than call
node.to_s. The advantage, though, of having parse_html collect nodes
during the HTML stream parse is that it is faster than re-iterating
over the node tree every time you want a set of nodes.

My Google search code, then, is a somewhat gratuitous use of
agent.watch_for_set (it is a good example of "Gee, I wonder if ..."),
though I could perhaps add something that gives a more practical example
of collecting nodes as custom classes.

Maybe create a version of Para that exposes the element CSS class and id
as properties. Then replace the element CSS class value with one of my
own to better control the resulting page style. Or something.
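
Roughly along these lines (a sketch, assuming REXML's attribute API):

# Hypothetical Para variant exposing class/id, with a writable class
# attribute so the caller can restyle the element before rendering.
class Para
  def initialize(node)
    @node = node
  end

  def css_class
    @node.attributes['class']
  end

  def css_class=(value)
    @node.attributes['class'] = value
  end

  def id
    @node.attributes['id']
  end

  def to_s
    @node.to_s
  end
end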


Thanks,

James