venkat
11/17/2007 4:36:00 PM
Phrogz wrote:
> On Nov 17, 8:03 am, venkat
> If you don't mean those, what do you mean?
I mean something that can be used after the page content is parsed.
Even with Hpricot/Mechanize/Scrubyt, one still has to write custom code
to extract the needed content, because every page or site is different.
I am looking for something that can be easily extended to do the
extraction. Such a framework could provide:
* templates, configuration files, or other ways of specifying which
parts of a page to extract (see the sketch after this list);
* generic extraction functions for pulling data out of tables,
ordered/unordered lists, etc.;
* a way to save the extracted content -- to a database, CSV, or files;
* default extraction code for common types of sites: blogs (wordpress,
textmate), search engine results, wikis, usenet, etc.
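To make it concrete, here is a rough sketch of the kind of recipe-driven
extraction I have in mind, using Hpricot and the standard csv library.
The recipe hash format, the URL, and the selectors are all made up for
illustration -- nothing here is an existing framework:

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  require 'csv'

  # A "recipe" declares what to extract, so no per-site code is needed.
  # (Hypothetical format; only Hpricot and csv are real libraries.)
  recipe = {
    :url    => 'http://example.com/listing.html',  # hypothetical page
    :rows   => 'table#results tr',                 # one element per record
    :fields => { 'title' => 'td.title', 'price' => 'td.price' },
    :output => 'results.csv'
  }

  fields = recipe[:fields].to_a   # fix a column order up front
  doc = Hpricot(open(recipe[:url]))

  # Generic extraction: works for any page that fits the recipe.
  records = (doc / recipe[:rows]).map do |row|
    fields.map { |name, selector| (row / selector).inner_text.strip }
  end

  CSV.open(recipe[:output], 'w') do |csv|
    csv << fields.map { |name, _| name }   # header row
    records.each { |rec| csv << rec }
  end

The generic table/list extractors and the per-site-type defaults would
then just be canned recipes (or recipe generators) shipped with the
framework.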
> You can always simply fetch the raw page source and run regexps on it.
> Is that more what you mean?
I already do the parsing and extraction with the parsers mentioned
above, but writing a scraper for each site is repetitive. I do
understand that, since the structure of web pages varies so much, such a
framework/library is difficult to write.
- Venkat