Asp Forum - extract information from a large text

phoenix

3/4/2008 3:05:00 PM

I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I only get
the first match for this regex. How can I iterate over all the
matches?
Furthermore, is this kind of naive solution too slow for searching
a very large text? The performance requirements are high. Is there a
better way to do this?

2 Answers

elof

3/4/2008 3:59:00 PM

On Wed, 5 Mar 2008 00:04:52 +0900, phoenix <zht.phoenix@gmail.com> wrote:
> I'm new to Ruby. I want to extract some useful information from a web
> page to generate an RSS feed. My first instinct is to use a regular
> expression like /sometext(.+?)sometext/. The problem is that I only get
> the first match for this regex. How can I iterate over all the
> matches?
> Furthermore, is this kind of naive solution too slow for searching
> a very large text? The performance requirements are high. Is there a
> better way to do this?
--

I recommend the hpricot gem, which lets you use XPath
expressions on HTML.

Install by typing this into a command line:

gem install hpricot

Here's a little example that extracts the link texts from a Google search:

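require 'rubygems'
require 'open-uri'
require 'hpricot'

# The query URL is truncated ("...") in the original post.
g = Hpricot(open("http://www.google.com/search?q=hpricot%20x..."))
# Print the text of every link with class "l" (a search hit).
(g/"a[@class='l']").each { |hit|
  puts "#{(hit/"text()")}"
}
nil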

Kristian

Robert Klemme

3/4/2008 6:00:00 PM

On 04.03.2008 16:04, phoenix wrote:
> I'm new to Ruby. I want to extract some useful information from a web
> page to generate an RSS feed. My first instinct is to use a regular
> expression like /sometext(.+?)sometext/. The problem is that I only get
> the first match for this regex. How can I iterate over all the
> matches?

String#scan.
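
For readers unfamiliar with it, here is a minimal sketch of String#scan. The sample HTML and the <h2> delimiters are made up for illustration and stand in for the "sometext" markers in the question. With one capture group, scan returns an array of one-element arrays, and when given a block it yields each set of captures in turn.

```ruby
# Hypothetical sample input; real pages will differ.
html = "<h2>First post</h2><p>body</p><h2>Second post</h2>"

# Non-greedy capture between the two delimiters.
pattern = %r{<h2>(.+?)</h2>}

# Without a block: an array of capture arrays, flattened to plain strings.
titles = html.scan(pattern).flatten

# With a block: each match is yielded as an array of its captures.
html.scan(pattern) { |captures| puts captures.first }
```

The block form avoids building the intermediate array, which may matter on a very large text.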

> Furthermore, is this kind of naive solution too slow for searching
> a very large text? The performance requirements are high. Is there a
> better way to do this?

Try it out. Also look at Hpricot, as Kristian suggested.
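
One way to try it out is to time the regex against a synthetic input using the standard Benchmark module. This is only a sketch: the input below is an assumed stand-in (a small fragment repeated 100,000 times), and timings on a real page and a real pattern may look quite different.

```ruby
require 'benchmark'

# Hypothetical large input: one small fragment repeated many times.
big = "<h2>Title</h2><p>some body text</p>\n" * 100_000
pattern = %r{<h2>(.+?)</h2>}

count = 0
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  big.scan(pattern) { count += 1 }
end

puts "#{count} matches in #{elapsed.round(3)} s"
```

Running the same measurement on the actual page and pattern shows whether the regex approach is fast enough before looking for alternatives.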

Cheers

robert

comp.lang.ruby
