
comp.lang.ruby

how come this code doesn't work as designed?

dtown22

2/19/2008 1:12:00 AM

Hi,

I found this web crawler code online that uses Mechanize:

require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get('http://example...')   # start page (URL elided)

stack   = page.links   # seed the stack with the links on the start page
counter = 0

out = File.open("out.txt", "w")
while l = stack.pop
  begin
    # stay on the same host as the start page
    next unless l.uri.host == agent.history.first.uri.host
    if not agent.visited? l.href
      counter += 1
      out.puts l.href
      # fetch the page behind the link and push all of its links
      stack.push(*(agent.click(l).links))
    end
  rescue
    #puts "Error encountered"
  end
end

puts "Total unique links: " + counter.to_s

So I gave it a try, and although it seemed to work, the stack size quickly
ballooned. Examining the output, I found lots of duplicates: for example,
one output file had over 50k URLs, but after removing the duplicates only
a bit over 9k were left. So I modified the code to use a Hash to avoid
duplicates (even though this design means I am storing multiple copies of
all the URLs), but the same thing happened, so I was wondering if anyone
could figure out what I am doing wrong. Here is the modified code:

require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get('http://example...')   # start page (URL elided)

stack   = page.links
hash    = Hash.new   # meant to record links already pushed onto the stack
counter = 0

out = File.open("out.txt", "w")
while l = stack.pop
  begin
    # stay on the same host as the start page
    next unless l.uri.host == agent.history.first.uri.host
    if not agent.visited? l.href
      counter += 1
      out.puts "url:1 " + l.href
      agent.click(l).links.each do |link|
        if hash[link] == nil          # lookup keyed on the Link object itself
          hash.store(link, link)
          stack.push(link)
        end
      end
      #stack.push(*(agent.click(l).links))
    end
  rescue
    #puts "Error encountered"
  end
end

puts "Total unique links: " + counter.to_s

Note: I am aware that crawling sites at random is not acceptable, and this
script is not intended for that; I am only crawling personal sites.
--
Posted via http://www.ruby-....