comp.lang.ruby

Simple screen scraper using scrAPI

doog

11/28/2006 11:14:00 PM

I'm a Ruby novice. Does anyone have an example of a simple screen
scraper in Ruby that uses scrAPI (and works on Mac OS X)?

All I need it to do is:
1) Go to a specified web page
2) Use a CSS selector to grab and print out any section of the page

It does not need to find links on the page or crawl.

I tried the eBay example at
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit...
and have tried the recommended "require" and Tidy.path statements,
but couldn't find a combination that works.

-Doug


14 Answers

Paul Lutus

11/29/2006 1:42:00 AM

doog wrote:

> I'm a Ruby novice. Does anyone have an example of a simple screen
> scraper in Ruby that uses scrAPI (and works on Mac OS X)?
>
> All I need it to do is:
> 1) Go to a specified web page
> 2) Use a CSS selector to grab and print out any section of the page
>
> It does not need to find links on the page or crawl.
>
> I tried the eBay example at
> http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit...
> and have tried the recommended "require" and Tidy.path statements,
> but couldn't find a combination that works.

Please tell me something. Do you want to:

1. Parse a Web page using scrAPI, or

2. Parse a Web page.

If you are more concerned with parsing content from a Web page than using
scrAPI, then I can help you.

--
Paul Lutus
http://www.ara...

doog

11/29/2006 2:53:00 AM

Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.

-Doug

Paul Lutus wrote:
> doog wrote:
>
>> I'm a Ruby novice. Does anyone have an example of a simple screen
>> scraper in Ruby that uses scrAPI (and works on Mac OS X)?
>>
>> All I need it to do is:
>> 1) Go to a specified web page
>> 2) Use a CSS selector to grab and print out any section of the page
>>
>> It does not need to find links on the page or crawl.
>>
>> I tried the eBay example at
>> http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit...
>> and have tried the recommended "require" and Tidy.path statements,
>> but couldn't find a combination that works.
>
> Please tell me something. Do you want to:
>
> 1. Parse a Web page using scrAPI, or
>
> 2. Parse a Web page.
>
> If you are more concerned with parsing content from a Web page than using
> scrAPI, then I can help you.
>

Marcelo Alvim

11/29/2006 3:29:00 AM

On 11/28/06, doog <doog@google.com> wrote:
> Thanks so much. Parsing a web page is sufficient, and would
> be a great starting point.

If parsing a web page is sufficient, I definitely recommend Hpricot.
It's simple, easy, and does the job very well.

http://code.whytheluckystiff.ne...

Cheers,
Alvim.

Paul Lutus

11/29/2006 6:14:00 AM

doog wrote:

> Thanks so much. Parsing a web page is sufficient, and would
> be a great starting point.

Okay, here is a simple parser in ordinary Ruby. It will give you some ideas
about what is involved in parsing.

There are many libraries that do much more than this script does. Some of
them have steep learning curves, and many offer exotic ways to acquire
particular kinds of content.

This is a simple parser that returns an array containing all the table
content in the target Web page. I wrote it earlier today for someone who
wanted to scrape a yahoo.com financial page, which explains the target
page, something easy to change:

------------------------------------------------

#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
# Net::HTTP#get returns a response object; the page text is its body
resp = http.get('/q?s=IBM')
page = resp.body

# BEGIN processing HTML

# return the inner content of every <tag ...>...</tag> pair found in data
def parse_html(data, tag)
  data.scan(%r{<#{tag}\b[^>]*>(.*?)</#{tag}>}im).flatten
end

# build an array of tables, each an array of rows, each an array of cells
out_tables = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_rows = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    out_cells = parse_html(row, "td")
    out_cells.each do |cell|
      # strip any markup left inside the cell
      cell.gsub!(%r{<.*?>}, "")
    end
    out_rows << out_cells
  end
  out_tables << out_rows
end

# END processing HTML

# examine the result

# print an indexed, indented listing, skipping empty items
def parse_nested_array(array, tab = 0)
  array.each_with_index do |item, n|
    next if item.empty?
    puts "#{"\t" * tab}[#{n}] {"
    if item.is_a?(Array)
      parse_nested_array(item, tab + 1)
    else
      puts "#{"\t" * (tab + 1)}#{item}"
    end
    puts "#{"\t" * tab}}"
  end
end

parse_nested_array(out_tables)

------------------------------------------------

This program emits an indexed, indented listing of the table content that it
extracted, so you can then customize it by acquiring particular table cells
through use of the provided index numbers.
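As a minimal sketch of that index-based lookup (an illustration, not part of the script above): the sample HTML, the extract_tables helper and the chosen indexes are all made up, and parse_html is a helper along the lines of the one in the script.

```ruby
# Capture the inner content of each <tag ...>...</tag> pair.
def parse_html(data, tag)
  data.scan(%r{<#{tag}\b[^>]*>(.*?)</#{tag}>}im).flatten
end

# Hypothetical sample page, invented for this illustration.
SAMPLE_PAGE = <<~HTML
  <table>
    <tr><td>Symbol</td><td>Last Trade</td></tr>
    <tr><td><b>XYZ</b></td><td>12.34</td></tr>
  </table>
HTML

# Build tables -> rows -> cells, stripping markup inside each cell.
def extract_tables(page)
  parse_html(page, "table").map do |table|
    parse_html(table, "tr").map do |row|
      parse_html(row, "td").map { |cell| cell.gsub(%r{<.*?>}, "").strip }
    end
  end
end

tables = extract_tables(SAMPLE_PAGE)

# Index order is [table][row][cell], matching the nesting of the listing.
last_trade = tables[0][1][1]
puts last_trade
```

For a real page, the three index numbers would be read off the indented listing rather than guessed.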

It should work with any Web page that has the interesting content embedded
in tables, and whose syntax is reliable.
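
One concrete case where that caveat bites, sketched with a made-up snippet: a non-greedy pattern of this kind stops at the first closing tag it meets, so a table nested inside another table is cut short. The NESTED string is hypothetical, and parse_html is again a helper along the lines of the one in the script.

```ruby
# Capture the inner content of each <tag ...>...</tag> pair.
def parse_html(data, tag)
  data.scan(%r{<#{tag}\b[^>]*>(.*?)</#{tag}>}im).flatten
end

# Hypothetical markup: an inner table nested inside an outer one.
NESTED = "<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>"

chunks = parse_html(NESTED, "table")

# Only one chunk comes back, and it ends at the inner table's closing tag,
# so the outer table's trailing cells and rows are lost.
puts chunks.size
puts chunks.first
```

Pages that nest tables for layout are therefore better handled by a real HTML parser, such as the libraries mentioned elsewhere in this thread.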

The primary value of this program is to show you how easy it is to scrape
pages using Ruby, and give you a starting point you can customize to meet
your own requirements.

--
Paul Lutus
http://www.ara...

Peter Szinek

11/29/2006 7:54:00 AM

doog wrote:
> I'm a Ruby novice. Does anyone have an example of a simple screen
> scraper in Ruby that uses scrAPI (and works on Mac OS X)?

Though I don't quite understand the intensity of the holy war Paul is
leading against anything that is not hand-coded on the fly, this time I
have to agree with him: 'I would like to write a screen scraper in scrAPI
(or Hpricot, or xxx)' is not always the right starting point.
Screen scraping can be very tedious and complex, and the right tool really
depends on the input page, the type of actions you would like to
perform (is fetching the page trivial? do you need to navigate, i.e.
fill in forms and click links? how complex is the parsing?), the quality
you would like to achieve, robustness (i.e. if the underlying page changes,
the scraper should still perform well) and another 10k things. Some time ago
I wrote a small article on this:

http://www.rubyra.../data-extraction-for-web-20-screen-scraping-in-...

It is a bit outdated now (I am planning to beef it up with FireWatir,
Hpricot and other sections) but it can help you as a starting point.

Conclusion: what to use, and how, depends on the page and the task at hand.
I suggest that if you have a concrete problem, drop us a mail and
we will figure something out.

Cheers,
Peter

__
http://www.rubyra...


user@domain.invalid

11/29/2006 9:28:00 AM

Hi Paul! Instead of talking about scrAPI, would you tell us the magic
that was inside GraForth?


As a teen, I was a fan of the bass player of the metal band Iron Maiden and
YOU ;-)
Kidding a little, but not that much...


Sorry for going off-topic!

Paul Lutus

11/29/2006 4:56:00 PM

Peter Szinek wrote:

> doog wrote:
>> I'm a Ruby novice. Does anyone have an example of a simple screen
>> scraper in Ruby that uses scrAPI (and works on Mac OS X)?
>
> Though I don't seem to understand the intensity of the holy war Paul is
> leading against anything that is not hand-coded on the fly,

For purposes of clarification, I simply want newbies to see how easy it is
to write these things in ordinary Ruby code.

And, lest there be any confusion on this point, I always say that, or
something like it -- seemingly to no effect.

--
Paul Lutus
http://www.ara...

Paul Lutus

11/29/2006 5:01:00 PM

Zouplaz wrote:

> Hi Paul! Instead of talking about scrAPI, would you tell us the magic
> that was inside GraForth?

Ha! A reference to a different era, a distant voice. :)

For the other readers, GraForth was a Forth embodiment I cooked up about 25
years ago, at a time when most things were written in assembly. It
supported a kind of graphics that would be embarrassingly crude by modern
standards.

It was basically a way to get around the fact that there were almost no
high-level languages, and none that mere mortals could either afford or
support with the small HDD and RAM sizes of the era.

> As a teen, I was a fan of the bass player of the metal band Iron Maiden and
> YOU ;-)

I'm glad to see you had your priorities straight. :)

--
Paul Lutus
http://www.ara...

user@domain.invalid

11/30/2006 12:04:00 PM

On 29/11/2006 18:01, Paul Lutus wrote:
>
> Ha! A reference to a different era, a distant voice. :)
>

We should not forget those times (and I didn't live through the 70s, which
were certainly something else), even if there's no VW van anywhere.

Art is a performance and performance comes from constraints...

> For the other readers, GraForth was a Forth embodiment I cooked up about 25
> years ago, at a time when most things were written in assembly. It
> supported a kind of graphics that would be embarrassingly crude by modern
> standards.
>

Hey! I remember a demo of GraForth showing a rotating 3D cube (maybe
color filled). Not that bad for the single MHz of my Apple IIc (no, I
didn't have that wonderful II+, I was a little late).

> It was basically a way to get around the fact that there were almost no
> high-level languages, and none that mere mortals could either afford or
> support with the small HDD and RAM sizes of the era.

Did you use any cross-compilation systems to code GraForth or
Apple Writer? (You know, the kind of system that most early-80s teen
geeks dreamed of having access to.)

>
>> As a teen, I was a fan of the bass player of the metal band Iron Maiden and
>> YOU ;-)
>
> I'm glad to see you had your priorities straight. :)
>

:-))


Paul Lutus

11/30/2006 4:22:00 PM

Zouplaz wrote:

/ ...

>> For the other readers, GraForth was a Forth embodiment I cooked up about
>> 25 years ago, at a time when most things were written in assembly. It
>> supported a kind of graphics that would be embarrassingly crude by modern
>> standards.
>>
>
> Hey! I remember a demo of GraForth showing a rotating 3D cube (maybe
> color filled)

How ironic that you should mention that. I was recently deposed by a group
of lawyers defending all the big game-software players (Microsoft,
Nintendo, et al.) against a patent lawsuit whose plaintiffs claimed to have
patented the idea of using a joystick or keyboard to control an onscreen 3D
graphics display. If the plaintiffs had prevailed, it would have been a gold
mine for them.

Millions of dollars of royalties were at stake. Then a researcher discovered
I had written GraForth and an earlier program called Apple World that did
what the patent claimed, before the date of the patent. Basically my
testimony took the wind out of their sails.

> - Not that bad for the single MHz of my Apple IIc (no, I
> didn't have that wonderful II+, I was a little late)
>
>> It was basically a way to get around the fact that there were almost no
>> high-level languages, and none that mere mortals could either afford or
>> support with the small HDD and RAM sizes of the era.
>
> Did you use any cross-compilation systems to code GraForth or
> Apple Writer? (You know, the kind of system that most early-80s teen
> geeks dreamed of having access to.)

GraForth, yes, Apple Writer, no. I ported GraForth over to the early IBM PC,
but Apple Writer was mired in assembly language. I had to completely
rewrite Apple Writer for the PC (under a different name, of course) because
it was plain assembly, no abstractions. GraForth was, after all, Forth, so
it was more portable.

--
Paul Lutus
http://www.ara...