comp.lang.ruby

listing all the html links

Dado

5/3/2006 9:28:00 PM

How can I use Ruby to list all the HTML links on a site?

Thanks


13 Answers

Jeff Schwab

5/3/2006 9:59:00 PM

Dado wrote:
> how can I use ruby to list all the html links on a site, ?

require 'open-uri'

def scrape(url)
  open(url) do |uri|
    href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    m = href.match(uri.read)
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end

scrape('http://www.ruby-lang.or...)

Dado

5/3/2006 11:26:00 PM

after running this code I get


:~$ ruby list.rb
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression

Jeffrey Schwab wrote:

> Dado wrote:
>> how can I use ruby to list all the html links on a site, ?
>
> require 'open-uri'
>
> def scrape(url)
> open(url) do |uri|
> href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
> m = href.match(uri.read)
> while m
> puts m[1]
> m = href.match(m.post_match)
> end
> end
> end
>
> scrape('http://www.ruby-lang.or...)

Jeff Schwab

5/4/2006 12:16:00 AM

Dado wrote:
> after running this code I get
>
>
> :~$ ruby list.rb
> list.rb:5: Invalid char `\302' in expression
> list.rb:5: Invalid char `\240' in expression
> list.rb:5: Invalid char `\302' in expression
> list.rb:5: Invalid char `\240' in expression
> list.rb:5: Invalid char `\302' in expression

....

> Jeffrey Schwab wrote:
>
>> Dado wrote:
>>> how can I use ruby to list all the html links on a site, ?
>> require 'open-uri'
>>
>> def scrape(url)
>> open(url) do |uri|
>> href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
>> m = href.match(uri.read)
>> while m
>> puts m[1]
>> m = href.match(m.post_match)
>> end
>> end
>> end
>>
>> scrape('http://www.ruby-lang.or...)
>

Please don't:
- top-post
- post eighty lines of virtually identical error messages

Please do:
- post the exact content of your source file, list.rb

If your ruby code is exactly what I posted, then perhaps the file got
corrupted before the ruby interpreter saw it. Are you sitting at the
machine where ruby is running, or are you transferring the file contents
via (e.g.) telnet or ftp?
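
The octal bytes \302 \240 in those messages are 0xC2 0xA0, which is the UTF-8 encoding of U+00A0 (no-break space). One plausible cause, which is an assumption and not something confirmed in this thread, is that the indentation was copied from a web page as no-break spaces. A minimal sketch for spotting and replacing them; the helper names nbsp_lines and strip_nbsp and the sample string are made up for illustration:

```ruby
# U+00A0 NO-BREAK SPACE; in UTF-8 this is the byte pair \302\240 (octal).
NBSP = "\u00A0"

# Hypothetical helper: 1-based numbers of the lines that contain a no-break space.
def nbsp_lines(source)
  source.each_line.with_index(1)
        .select { |line, _| line.include?(NBSP) }
        .map { |_, number| number }
end

# Hypothetical helper: replace every no-break space with an ordinary space.
def strip_nbsp(source)
  source.gsub(NBSP, ' ')
end

# Made-up sample: two lines indented with no-break spaces.
sample = "def scrape(url)\n\u00A0\u00A0open(url) do |uri|\n\u00A0\u00A0end\nend\n"

p nbsp_lines(sample)
puts strip_nbsp(sample)
```

To clean a real file, the same helper could be applied to the result of File.read('list.rb', encoding: 'UTF-8') and the output written back with File.write.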

Jeff Schwab

5/4/2006 12:23:00 AM

Jeffrey Schwab wrote:
> Dado wrote:
> > after running this code I get
> >
> > :~$ ruby list.rb
> > list.rb:5: Invalid char `\302' in expression
> > list.rb:5: Invalid char `\240' in expression
> > list.rb:5: Invalid char `\302' in expression
> > list.rb:5: Invalid char `\240' in expression
> > list.rb:5: Invalid char `\302' in expression

....

> Please don't:
> - top-post
> - post eighty lines of virtually identical error messages

- give an apparently valid, but non-functional,
email address.

anne001

5/5/2006 1:15:00 PM

require 'open-uri'
def scrape(url)
open(url) do |uri|
href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
m = href.match(uri.read)
while m
puts m[1]
m = href.match(m.post_match)
end
end
end

scrape('http://www.ruby-lang.or...)
Works for me.

The regular expression: href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
What is it saying? \s is whitespace, () captures a group, [] identifies
a character set...

How does the loop work?
I found post_match in Programming Ruby, page 538.

I added some puts calls. The first time around, m and m[1] print:
href="mailto:webmaster@ruby-lang.org"
"mailto:webmaster@ruby-lang.org"
Why is the second line m[1]? Is it because of the set of
parentheses?

thanks for your help

Ross Bamford

5/5/2006 2:25:00 PM

On Wed, 03 May 2006 22:27:35 +0100, Dado <digi@lycos.com> wrote:

> how can I use ruby to list all the html links on a site, ?
>

An alternative to the regexp approach, if you don't mind using external
libraries:

require 'open-uri'
require 'rubyful_soup' # [1]
page = BeautifulSoup.new(URI('http://ruby-lan...).read)
page.find_all('a').each { |l| puts l['href'] }

require 'mechanize' # [2]
m = WWW::Mechanize.new
page = m.get('http://ruby-lan...)
page.links.each { |l| puts l.href }

--
[1] http://www.crummy.com/software/Ru...
[2] http://mechanize.ruby...

Ross Bamford - rosco@roscopeco.remove.co.uk

Vincent Foley

5/5/2006 6:16:00 PM

require 'open-uri'
URI.extract(open(<url>).read)

Ross Bamford

5/5/2006 6:39:00 PM

On Fri, 05 May 2006 19:16:05 +0100, Vincent Foley <vfoley@gmail.com> wrote:

> require 'open-uri'
> URI.extract(open(<url>).read)
>

Unfortunately, you pull in a lot of false positives, and it doesn't
differentiate between links and other URIs (e.g. link src elements, DTD
refs, etc.).

pp URI.extract(URI('http://www.googl...).read)
["font-family:arial,sans-serif;",
"font-size:",
"color:#0000cc;",
"http://www.google.co.uk/ig%3Fhl%...,
"https://www.google.com/accounts/Login?continue=http://www.google.co.uk/&h...,
"http://groups.google.co.uk/grphp?hl=en&tab=wg&ie=U...,
"http://news.google.co.uk/nwshp?hl=en&tab=wn&ie=U...,
"http://froogle.google.co.uk/frghp?hl=en&tab=wf&ie=U...,
"Search:",
"http://www.google.com...]
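
If URI.extract is still wanted, its optional second argument restricts the results to the given schemes, which drops entries like "font-size:" and "color:#0000cc;". A small sketch, assuming an inline HTML snippet invented for the example. Note that the standard library documentation marks URI.extract as obsolete, and that this filter still does not separate links from other http URIs such as image sources or DTD references:

```ruby
require 'uri'

# Made-up snippet standing in for a fetched page.
html = '<p style="color:#0000cc;"><a href="http://www.example.com/">x</a></p>'

# Keep only URIs whose scheme is http or https.
links = URI.extract(html, %w[http https])
p links
```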


--
Ross Bamford - rosco@roscopeco.remove.co.uk

Jeff Schwab

5/5/2006 8:27:00 PM

require 'open-uri'

def scrape(url)
  # URI.open is the open-uri entry point on current Rubies; the bare
  # open(url) form stopped accepting URLs in Ruby 3.0.
  URI.open(url) do |uri|
    href = /
      href          # The relevant attribute of each link tag.
      \s*=\s*       # Equal sign, with optional whitespace.
      (             # Capture
        "(.*?)"     #   arbitrary double-quoted text
        |           #   or
        [^>\s]+     #   everything up to next space or '>'.
      )             # Stop capturing.
    /x

    # Slurp in the whole text. See whether the pattern matches.
    m = href.match(uri.read)

    # While there's a match, print the captured text, and search
    # the remainder of the text.
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end


scrape('http://www.ruby-lang.or...)
__END__

> works for me
>
> regular expression: href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
> what is it saying? \s is space, () retrieves a group...[]identifies
> character sets...

You've got it. I simplified a little & added comments in the above
example.

> how does the loop work?

1. Find the first match (if any).
2. Print the captured text.
3. Search again, but only the post_match: the text that hasn't been
searched yet.
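
The three steps above can be tried without any network access. A minimal sketch, assuming a made-up HTML string and a hypothetical helper name (each_href) that collects the captures into an array instead of printing them:

```ruby
# Made-up input: one quoted and one unquoted href.
html = '<a href="a.html">A</a> <a href=b.html>B</a>'

href = /href\s*=\s*("(.*?)"|[^>\s]+)/

# Hypothetical helper: repeat the match on whatever text follows the
# previous match, collecting capture group 1 each time.
def each_href(text, pattern)
  found = []
  m = pattern.match(text)           # 1. find the first match, if any
  while m
    found << m[1]                   # 2. keep the captured text
    m = pattern.match(m.post_match) # 3. search only the unsearched remainder
  end
  found
end

p each_href(html, href)
```

With this pattern the quoted value comes back with its double quotes still attached, because group 1 wraps the whole quoted alternative.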

> I found post_match, programming ruby page 538

I found it using ri, which has been a real time-saver for me: ri
Regexp#match told me about MatchData, and ri MatchData told me about
post_match.

> I put some puts first time around m and m[1]
>
> href="mailto:webmaster@ruby-lang.org"
> "mailto:webmaster@ruby-lang.org"
>
> why is the second line m[1]...? Is it because of the set of
> parenthesis?

Yes. m[1] gives the first capture group, m[2] would give the second,
etc. m[0] gives the entire matched text, like $&.
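
A short sketch of that numbering, using the mailto value from earlier in the thread inside a made-up tag. The pattern here is a cut-down, quoted-only version written for this example:

```ruby
# Made-up tag wrapped around the href value quoted earlier in the thread.
tag = '<a href="mailto:webmaster@ruby-lang.org">'

# Group 1 is the quoted value including quotes; group 2 is the text inside them.
m = /href\s*=\s*("(.*?)")/.match(tag)

puts m[0]         # the entire matched text
puts m[1]         # first capture group
puts m[2]         # second capture group
puts m.pre_match  # text before the match
puts m.post_match # text after the match
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group is m[1] and the nested one is m[2].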