comp.lang.ruby

Performance comparison between screen scrapers

Conrad Chu

1/11/2007 8:37:00 AM

Does anyone know how the following screen scrapers perform against one
another?

* ScrAPI
* RubyfulSoup
* HTree
* Hpricot

I'm trying to write a tool where a person enters a URL and I use an
AJAX call to scrape the contents of that URL for the title, description,
etc., so speed is really important. (I suppose regular expressions would
be the fastest, but I need something that is tree-based and supports
HTML tidying.)

Thanks
Conrad
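
For reference, the regular-expression baseline mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: extract_summary and attribute are hypothetical helper names, the page is assumed to be already fetched into a string, and only quoted attribute values are handled. It does no tidying and builds no tree, which is exactly the limitation the question raises.

```ruby
# Regex baseline for pulling a page's title and meta description.
# extract_summary and attribute are hypothetical helpers, not library calls.
# Fetching is left out; the HTML is assumed to be in a String already.

TITLE_RE = /<title[^>]*>(.*?)<\/title>/im
META_RE  = /<meta\b[^>]*>/i

# Returns the value of a quoted attribute inside a single tag, or nil.
# Unquoted values and look-alike attribute names are not handled.
def attribute(tag, name)
  m = tag.match(/\b#{Regexp.escape(name)}\s*=\s*(["'])(.*?)\1/im)
  m && m[2]
end

def extract_summary(html)
  title = html[TITLE_RE, 1]
  title = title.strip.gsub(/\s+/, " ") if title   # collapse whitespace runs

  description = nil
  html.scan(META_RE) do |tag|
    if attribute(tag, "name").to_s.downcase == "description"
      description = attribute(tag, "content")
      break
    end
  end

  { title: title, description: description }
end

page = <<~HTML
  <html><head>
    <title> Example
      page </title>
    <meta name="Description" content="A short summary.">
  </head><body><p>unclosed paragraph</body></html>
HTML

p extract_summary(page)
```

A baseline like this is mainly useful as the lower bound when timing the tree-based libraries; it will misread pages whose markup strays from the patterns above.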

--
Posted via http://www.ruby-....

Jano Svitok

1/11/2007 9:51:00 AM

On 1/11/07, Conrad Chu <conradchu@conradchu.com> wrote:
> Does anyone know how the following screen scrapers perform against one
> another?
>
> * ScrAPI
> * RubyfulSoup
> * HTree
> * Hpricot
>
> I'm trying to write up a tool where a person enters in a URL, and I use
> an AJAX call to scrape the contents of that URL for title, description,
> etc. So speed is really important (I suppose, regular expressions would
> be the fastest, but I need something that is tree-based and supports
> HTML tidying)
>
> Thanks
> Conrad

There was a comparison done on this list some time ago. Search the
archives for the library names.

Ross Bamford

1/11/2007 10:18:00 AM

On Thu, 11 Jan 2007 08:36:44 -0000, Conrad Chu <conradchu@conradchu.com>
wrote:

> Does anyone know how the following screen scrapers perform against one
> another?
>
> * ScrAPI
> * RubyfulSoup
> * HTree
> * Hpricot
>
> I'm trying to write up a tool where a person enters in a URL, and I use
> an AJAX call to scrape the contents of that URL for title, description,
> etc. So speed is really important (I suppose, regular expressions would
> be the fastest, but I need something that is tree-based and supports
> HTML tidying)
>
> Thanks
> Conrad
>

I don't know about ScrAPI or HTree, but I recently blogged an informal
benchmark run between Rubyful Soup, Hpricot, and the (still developmental)
libxml2 HTML parser binding in Libxml-ruby. It's at:

http://cloverhead.blogspot.com/2006/12/bit-of-benchma...
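
An informal benchmark of this kind can be set up with Ruby's standard Benchmark module. The sketch below is not the code from the blog post; compare_parsers, SAMPLE, and CANDIDATES are hypothetical names, and the two candidates are stdlib stand-ins so the harness runs without any gems. To compare the real libraries, each lambda would be replaced by one that parses the HTML with that library and returns the title.

```ruby
require "benchmark"

# Hypothetical harness: each candidate is a lambda that takes an HTML string
# and returns the page title. Returns [name, seconds per call] pairs,
# fastest first.
def compare_parsers(candidates, html, runs: 200)
  candidates.map do |name, parser|
    parser.call(html)                      # warm-up call, not timed
    seconds = Benchmark.realtime { runs.times { parser.call(html) } }
    [name, seconds / runs]
  end.sort_by { |_, per_call| per_call }
end

# Synthetic page; real measurements should use pages like the ones the
# tool will actually scrape.
SAMPLE = "<html><head><title>Sample</title></head><body>" +
         "<p>filler</p>" * 500 + "</body></html>"

# Stand-in candidates using only core String methods.
CANDIDATES = {
  "regex"       => ->(html) { html[/<title[^>]*>(.*?)<\/title>/im, 1] },
  "index/slice" => ->(html) {
    start = html.index("<title>")
    stop  = start && html.index("</title>", start)
    stop && html[(start + "<title>".length)...stop]
  },
}

compare_parsers(CANDIDATES, SAMPLE).each do |name, per_call|
  puts format("%-12s %.6f s/call", name, per_call)
end
```

Timings from a harness like this vary between runs and machines, so they are best read as a rough ranking rather than exact figures.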

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Timothy Goddard

1/13/2007 1:13:00 PM

Conrad Chu wrote:
> Does anyone know how the following screen scrapers perform against one
> another?
>
> * ScrAPI
> * RubyfulSoup
> * HTree
> * Hpricot
>
> I'm trying to write up a tool where a person enters in a URL, and I use
> an AJAX call to scrape the contents of that URL for title, description,
> etc. So speed is really important (I suppose, regular expressions would
> be the fastest, but I need something that is tree-based and supports
> HTML tidying)
>
> Thanks
> Conrad
>
> --
> Posted via http://www.ruby-....

I haven't used them all, but Hpricot is fast (the parser is written in C
with Ragel), error-tolerant, and well suited to this task. Take a look at
its website for a guide on how to use it.
