comp.lang.ruby

[ANN] scRUBYt! - Hpricot and WWW::Mechanize on even more steroids, 0.2.6 released

Peter Szinek

3/26/2007 8:02:00 AM

Hello all,

scRUBYt! version 0.2.6 has been released with some great new features,
tons of bugfixes and lot of changes overall which should greatly affect
the reliability of the system.

============
What's this?
============

scRUBYt! is a very easy to learn and use, yet powerful web scraping
framework based on Hpricot and WWW::Mechanize. Its purpose is to free you
from the drudgery of web page crawling - looking up HTML tags,
attributes, XPaths, form names and other typical low-level web scraping
woes - by figuring these out from examples you copy and paste from the
web page.
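As a toy illustration of the core idea (not scRUBYt's actual internals): given an example string copied off the page, locate the element that contains it and derive an XPath-like address that can be reused on future scrapes. The mini "DOM" here is just nested `[tag, children-or-text]` arrays.

```ruby
# Toy sketch: find the path to the node whose text contains the example.
# This only illustrates the example-driven lookup described above;
# scRUBYt itself works on real Hpricot documents.
def path_to(node, example, path = [])
  tag, content = node
  path = path + [tag]
  if content.is_a?(String)
    content.include?(example) ? path : nil
  else
    content.each do |child|
      found = path_to(child, example, path)
      return found if found
    end
    nil
  end
end

dom = ['html', [['body', [['div', [['span', 'Price: $19.99']]],
                          ['div', [['a', 'Next page']]]]]]]
path_to(dom, '$19.99')   # => ["html", "body", "div", "span"]
```

From such a path a real implementation can generalize an XPath and apply it to every similar record on the page.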

===========
What's new?
===========

A lot of long-awaited features have been added: most notably, automatic
crawling to detail pages, which was the most requested feature in
scRUBYt!'s history.

Another great addition is the improved example generation - you no
longer have to use the whole text of the element you would like to
match; it is enough to specify a substring, and the first element
that contains the string will be returned. Moreover, it is possible to
create compound examples like this:

flight :begins_with => 'Arrival', :contains => /\d{4}/, :ends_with => '20:00'
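A rough sketch of the matching semantics implied by such a spec (assumed behavior for illustration, not scRUBYt's actual implementation): a node matches when its text satisfies every clause in the hash.

```ruby
# Hypothetical matcher for a compound example spec: all present
# clauses (:begins_with, :contains, :ends_with) must hold.
def matches_compound?(text, spec)
  ok = true
  ok &&= text.start_with?(spec[:begins_with]) if spec[:begins_with]
  ok &&= !!(text =~ spec[:contains])          if spec[:contains]
  ok &&= text.end_with?(spec[:ends_with])     if spec[:ends_with]
  ok
end

spec = { :begins_with => 'Arrival', :contains => /\d{4}/, :ends_with => '20:00' }
candidates = ['Departure 12:00', 'Arrival 1530, gate 20:00']
candidates.find { |t| matches_compound?(t, spec) }
# => "Arrival 1530, gate 20:00"
```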

Crawling through next links has been greatly improved - it is now
possible to use images as next links and to generate URLs instead of
clicking on the next link, and a great deal of bugs (including the
infamous Google next-link problem) have been fixed.
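The URL-generation approach to paging can be sketched like this (an assumed pattern with a made-up `<PAGE>` placeholder, shown purely for illustration; it is not scRUBYt's actual API): instead of locating and clicking a "next" link, substitute the page index into a URL template.

```ruby
# Hypothetical paging-by-URL-generation sketch: expand a template
# containing a <PAGE> placeholder into a list of result-page URLs.
def page_urls(template, pages)
  (1..pages).map { |n| template.sub('<PAGE>', n.to_s) }
end

page_urls('http://example.com/results?page=<PAGE>', 3)
# => ["http://example.com/results?page=1",
#     "http://example.com/results?page=2",
#     "http://example.com/results?page=3"]
```

Generating URLs this way sidesteps brittle next-link detection entirely when the site's paging scheme is predictable.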

An enormous amount of bugs were fixed and the whole system was tested
thoroughly, so the overall reliability should be improved a lot as
opposed to the previous releases.

Something non-software related: 4 people have joined the development, so
I guess there is much, much more to come in the future!

=========
CHANGELOG
=========

* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: So far the example of a pattern
had to be a string. Now it can be a hash as well, like
{:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: Possible to use regexp
as well, and need not (but still possible of course) to specify the
whole content of the node - nodes that contain the string/match the
regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did
not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different
methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken google example): if the
next link text is not an <a>, traverse down until the <a> is found;
if it is still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed
out (or otherwise present but has no href attribute) (Credit: Robert
Au)
* [FIX] DRY-ed next link lookup - it should be much more robust now as
it uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and
stabilization
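The `_url` naming heuristic from the changelog could behave roughly like this (assumed behavior, reconstructed from the changelog entry for illustration; `expand_pattern` is a hypothetical helper, not a scRUBYt method):

```ruby
# Hypothetical sketch of the "_url" shortcut: a pattern name ending
# in _url, given no explicit arguments, is rewritten as an
# href-attribute extraction.
def expand_pattern(name, *args)
  if name.to_s.end_with?('_url') && args.empty?
    [name, 'href', { :type => :attribute }]
  else
    [name, *args]
  end
end

expand_pattern(:detail_url)
# => [:detail_url, "href", {:type=>:attribute}]
```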

============
Announcement
============

By popular demand, there is a new forum to discuss everything scRUBYt!
related:

http://agora.s...

You are welcome to sign up, tell your opinion, ask for features, report
bugs or discuss stuff - or just to look around at what others are saying.

================
Closing thoughts
================

Please keep the feedback coming - your contributions are a key factor in
scRUBYt!'s success. This is not an exaggeration or a feeble attempt at
flattery: since we obviously cannot test everything on every possible
page, we can make scRUBYt! truly powerful only if you send us all the
quirks and problems you encounter during scraping, as well as your
suggestions and ideas. Thanks, everyone!

Cheers,
Peter
__
http://www.rubyra... :: Ruby and Web2.0 blog
http://s... :: Ruby web scraping framework
http://rubykitch... :: The indexed archive of all things Ruby.


3 Answers

Glenn Gillen

3/28/2007 9:36:00 PM

0

Peter,

Apologies for the brevity, on a blackberry.

All but two of the unit tests are passing with FireWatir. Can you
confirm what the proxy and mechanize_doc params are used for in the
fetch method? I couldn't find them used anywhere. Mind if I rename
methods and variables away from being so mechanize-specific?

Hope to commit changes to my 3.0 tag tomorrow afternoon.

On 3/26/07, Peter Szinek <peter@rubyrailways.com> wrote:
> Hello all,
>
> scRUBYt! version 0.2.6 has been released with some great new features,
> tons of bugfixes and lot of changes overall which should greatly affect
> the reliability of the system.
>
> [snip - full announcement quoted above]


--
Glenn
