venkat
11/17/2007 4:36:00 PM
Phrogz wrote:
> On Nov 17, 8:03 am, venkat
> If you don't mean those, what do you mean?
I mean something that can be used after the page content is parsed.
Even with Hpricot/Mechanize/Scrubyt, one still has to write custom code
to extract the needed content, because every page or site is different.
I am looking for something that can be easily extended to do the
extraction. Such a framework could provide:
* templates, configuration files, or other ways of specifying which
parts of a page to extract (see the sketch after this list);
* generic extraction functions for pulling data out of tables,
ordered/unordered lists, etc.;
* a way to save the extracted content -- to a database, CSV, or files;
* default extraction code for common types of sites: blogs (wordpress,
textmate), search engine results, wikis, usenet, etc.
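To make it concrete, here is a rough sketch of the kind of recipe-driven
extraction I have in mind, using Hpricot and the standard csv library.
The recipe hash format, the URL, and the selectors are all made up for
illustration -- nothing here is an existing framework:

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  require 'csv'

  # A "recipe" declares what to extract, so no per-site code is needed.
  # (Hypothetical format; only Hpricot and csv are real libraries.)
  recipe = {
    :url    => 'http://example.com/listing.html',  # hypothetical page
    :rows   => 'table#results tr',                 # one element per record
    :fields => { 'title' => 'td.title', 'price' => 'td.price' },
    :output => 'results.csv'
  }

  fields = recipe[:fields].to_a   # fix a column order up front
  doc = Hpricot(open(recipe[:url]))

  # Generic extraction: works for any page that fits the recipe.
  records = (doc / recipe[:rows]).map do |row|
    fields.map { |name, selector| (row / selector).inner_text.strip }
  end

  CSV.open(recipe[:output], 'w') do |csv|
    csv << fields.map { |name, _| name }   # header row
    records.each { |rec| csv << rec }
  end

The generic table/list extractors and the per-site-type defaults would
then just be canned recipes (or recipe generators) shipped with the
framework.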
> You can always simply fetch the raw page source and run regexps on it.
> Is that more what you mean?
I already do the parsing and extraction with the parsers mentioned
above, but writing a scraper for each site is repetitive. I do
understand that, since the structure of web pages varies so much, such a
framework/library is difficult to write.
- Venkat