comp.lang.ruby

Ruby screen scraping

Chris Gallagher

11/19/2006 4:17:00 PM

Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page,
basically telling us whether the build succeeded or failed.

Does anyone have any opinions on what might be the best way to approach
this task? I've been looking at a number of different packages, including
htree.

Thanks

--
Posted via http://www.ruby-....

26 Answers

Marcelo Alvim

11/19/2006 4:22:00 PM

Hi,

On 11/19/06, Chris Gallagher <cgallagher@gmail.com> wrote:
> I'm looking at creating a Ruby script that will first access our
> CruiseControl page on localhost and examine the values on the page,
> basically telling us whether the build succeeded or failed.

For screen scraping, I would suggest looking at why's excellent
Hpricot HTML parser. It's really simple to use and very effective.

http://code.whytheluckystiff.ne...

Cheers,
Alvim.

Daniel Lucraft

11/19/2006 4:28:00 PM

For HTML scraping I recommend scrAPI.

gem install scrapi

homepage:
http://blog.labnotes.org/catego...

Example scraper:

Scraper.define do
  attr_accessor :title, :author, :pub_date, :content

  process "div#GuardianArticle > h1", :title => :text
  process "div#GuardianArticle > font[size=2] > b" do |element|
    @author = element.children[0].content
    @pub_date = element.children[2].content.strip
  end
  process "div#GuardianArticleBody", :content => :text
end

--
Posted via http://www.ruby-....

Chris Gallagher

11/19/2006 4:41:00 PM

Thanks guys, I'll look into both of them.

Another question: how would I then insert this scraped info into a
MySQL database called, say, "build" and a table called "results"?

For now, could you base answers on the following htree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=...
open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    puts elem.attribute('href').value
  end
end

which is returning a result of:

C:\>ruby script2.rb
http://www.ruby...
http://www.ruby...en/20020101.html
http://www.rubyon...
http://www.rubyce...
http://www.rubyce...book/
http://en.wikipedia.org..._programmin...
http://en.wikipedia.org...
http://www.w3.or...
http://poignant...
http://www.zenspider.com/Languages/Ruby/Qui...

Cheers.

--
Posted via http://www.ruby-....

Peter Szinek

11/19/2006 4:44:00 PM

Chris Gallagher wrote:
> Hi,
>
> I'm looking at creating a Ruby script that will first access our
> CruiseControl page on localhost and examine the values on the page,
> basically telling us whether the build succeeded or failed.

Once you have the page (open-uri if you know the exact URL, or
WWW::Mechanize if you need to navigate there, i.e. fill in text fields,
click buttons, etc.), I recommend checking out these possibilities:

1) regular expressions
2) Hpricot
3) scrAPI
4) Rubyful Soup


Regular expressions are the most old-school solution; in some cases
such a wrapper is the most robust (but since, as I understand it, you
control the generated page, robustness is probably not an issue).

If you can't do it with regexps, Hpricot will most probably be adequate
(I would need to see the actual page).

Finally, if neither of the above works, you should try scrAPI. I don't
think you will get stuck at that point, but if you do, Rubyful Soup
is another possibility to check out.
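
To make the regular-expression option concrete, here is a minimal sketch. It assumes the status page contains a word such as "passed", "successful" or "failed" somewhere in its HTML; the word list, the sample markup and the commented-out URL are all hypothetical, so adjust them to whatever your CruiseControl page actually prints.

```ruby
# Hypothetical status words; check what your own status page really prints.
STATUS_PATTERN = /\b(passed|successful|succeeded|failed|broken)\b/i

# Returns :success, :failure, or :unknown for the given page HTML.
def build_status(html)
  match = html.match(STATUS_PATTERN)
  return :unknown unless match

  case match[1].downcase
  when "passed", "successful", "succeeded" then :success
  else :failure
  end
end

# In a real script the HTML would come from open-uri, for example:
#   require 'open-uri'
#   html = URI.open("http://localhost:8080/").read   # hypothetical URL
sample = '<td class="status">BUILD FAILED</td>'   # made-up markup
puts build_status(sample)
```

Only the first matching word is considered, which is enough for a page that reports a single build; a page listing several projects would need a scan per project row instead.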


Peter
__
http://www.rubyra...





Peter Szinek

11/19/2006 4:46:00 PM

Chris Gallagher wrote:
> Thanks guys, I'll look into both of them.
>
> Another question: how would I then insert this scraped info into a
> MySQL database called, say, "build" and a table called "results"?
>
> For now, could you base answers on the following htree code?
>
> require 'open-uri'
> require 'htree'
> require 'rexml/document'
>
> url = "http://www.google.com/search?q=...
> open(url) {
> |page| page_content = page.read()
> doc = HTree(page_content).to_rexml
> doc.root.each_element('//a[@class="l"]') {
> |elem| puts elem.attribute('href').value }
> }
Something along the lines of:

require "mysql"

dbh = Mysql.real_connect("localhost", "chris", "", "build")
# Replace `whatever` with the scraped values.
dbh.query("INSERT INTO results VALUES (whatever)")

Cheers,

Peter
__
http://www.rubyra...

Chris Gallagher

11/19/2006 5:01:00 PM

Thanks for the help.

I'll get on with it and see how it goes :)

--
Posted via http://www.ruby-....

Peter Szinek

11/19/2006 5:11:00 PM

OK, here is the full code:

require 'open-uri'
require 'htree'
require 'rexml/document'
require 'mysql'

url = "http://www.google.com/search?q=...
results = []

# Fetch the page and collect the href of every result link.
open(url) do |page|
  doc = HTree(page.read).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    results << elem.attribute('href').value
  end
end

dbh = Mysql.real_connect("localhost", "peter", "****", "build")

# Insert one row per scraped URL; escape each value before interpolating it.
results.each do |result|
  dbh.query("INSERT INTO results VALUES ('#{Mysql.escape_string(result)}')")
end
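
The extraction step in the code above can be tried in isolation without a network connection or database. The following is a rough sketch that uses only REXML, so it assumes the input is already well-formed (tidying real-world HTML is what htree does first). The helper name result_links, the sample markup and the example.com URLs are made up for illustration.

```ruby
require 'rexml/document'

# Returns the href of every <a class="l"> element in a well-formed document.
def result_links(xml_text)
  doc = REXML::Document.new(xml_text)
  REXML::XPath.match(doc, '//a[@class="l"]').map { |a| a.attributes['href'] }
end

# Made-up, well-formed sample standing in for the parsed search page.
sample = <<~HTML
  <html><body>
    <a class="l" href="http://example.com/one">One</a>
    <a class="nav" href="http://example.com/skip">Skip</a>
    <a class="l" href="http://example.com/two">Two</a>
  </body></html>
HTML

puts result_links(sample)
```

Keeping the extraction in its own method makes it easy to check the XPath against a saved copy of the page before wiring in the database insert.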

HTH,

Peter
__
http://www.rubyra...

Chris Gallagher

11/19/2006 5:20:00 PM

Wow, thanks for that code.

One question though: does the name of the field in the table into which
the scraped information is inserted need to be specified in the code?
Or is it already specified and I'm missing something here?

--
Posted via http://www.ruby-....

Peter Szinek

11/19/2006 5:29:00 PM

Chris Gallagher wrote:
> wow, thanks for that code.
Welcome :-)

> One question though: does the name of the field in the table into which
> the scraped information is inserted need to be specified in the code?
> Or is it already specified and I'm missing something here?
>

My code assumed that the table has one column (e.g. 'url' in this case)
and that the values are inserted into that column.

If you have more columns, you can do this:

INSERT INTO people (name, age) VALUES ('Peter Szinek', '23')

You can also do

INSERT INTO people VALUES ('Peter Szinek', '23')

but in this case you have to be sure that the columns in your
DB are in the same order as the values in your insert query. In the
first example you don't have to care about the column ordering in the
DB, as long as the mapping between the column names (first pair of
parens) and the values (second pair of parens) is correct.
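
As an illustration of the explicit-column form, here is a small sketch that builds such a statement from a hash, so the column names and values always stay paired. The helper build_insert is hypothetical and not part of any MySQL library; it emits "?" placeholders instead of quoting values itself, on the assumption that your driver supports prepared statements.

```ruby
# Builds a parameterised INSERT with explicit column names.
# `table` and the keys of `row` must be trusted identifiers, not user input.
def build_insert(table, row)
  columns = row.keys.join(", ")
  placeholders = (["?"] * row.size).join(", ")
  ["INSERT INTO #{table} (#{columns}) VALUES (#{placeholders})", row.values]
end

sql, values = build_insert("people", name: "Peter Szinek", age: 23)
puts sql
p values
```

The returned SQL string and value list would then be handed to the driver's prepare/execute calls; the exact method names depend on the library and version you use.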

HTH,
Peter

__
http://www.rubyra...



Chris Gallagher

11/19/2006 5:53:00 PM

Ah, that's great.

Thanks again for your help :)

--
Posted via http://www.ruby-....