Asp Forum - spidering a website to build a sitemap

Bill Guindon

6/22/2005 6:50:00 PM

I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?

--
Bill Guindon (aka aGorilla)


Ryan Leavengood

6/22/2005 7:19:00 PM

Bill Guindon said:
> I need to spider a site and build a sitemap for it. I've looked
> around on rubyforge, and RAA, and don't see an exact match. Has
> anybody done this, or is there a library out there that I missed?

You could do this with WWW::Mechanize fairly easily. There isn't a
built-in spider system yet, but it would be a nice addition and I'm sure
Michael would add it if it was general enough.

I'm pretty familiar with Mechanize now and could help you out if you have
a problem. The basic idea would be to recursively get pages and click links
until you run out of links (of course, I'm sure you know this). The cool
thing is that turning that idea into code with Mechanize is very easy, since
it collects links for you and allows you to "click" them.
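
A minimal sketch of that "get a page, follow its links, repeat until no new links turn up" idea follows. It is not Mechanize code: it uses only Ruby's standard uri library, pulls links out with a simple regex, and takes the page fetcher as a parameter. The names extract_hrefs, resolve, and spider are hypothetical, chosen for illustration.

```ruby
require 'uri'

# Pull href values out of an HTML string. A regex is a simplification;
# a real spider would use an HTML parser or a library's link collection.
def extract_hrefs(html)
  html.scan(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/i).flatten
end

# Resolve href against the page it appeared on. Returns an absolute URL
# string without its fragment, or nil when the link is unparsable or
# points at a different host.
def resolve(base, href, host)
  uri = URI.join(base, href)
  return nil unless uri.host == host
  uri.fragment = nil
  uri.to_s
rescue URI::Error
  nil
end

# Crawl breadth-first from start_url, staying on the starting host.
# `fetch` is any callable that takes a URL string and returns the HTML
# body (or nil). Returns a hash mapping each visited URL to the list of
# same-host URLs it links to.
def spider(start_url, fetch)
  host    = URI.parse(start_url).host
  sitemap = {}
  queue   = [start_url]

  until queue.empty?
    url = queue.shift
    next if sitemap.key?(url) # already visited

    html  = fetch.call(url).to_s
    links = extract_hrefs(html).map { |href| resolve(url, href, host) }
    links = links.compact.uniq
    sitemap[url] = links
    queue.concat(links)
  end

  sitemap
end
```

Because the fetcher is injected, the crawl logic can be tried against an in-memory hash of pages before pointing it at a real site. With Mechanize, the fetcher could presumably be a lambda wrapping agent.get(url).body.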

In case you can't tell, I really like Mechanize :)

Ryan

Bill Guindon

6/22/2005 8:08:00 PM

On 6/22/05, Ryan Leavengood <mrcode@netrox.net> wrote:
> Bill Guindon said:
> > I need to spider a site and build a sitemap for it. I've looked
> > around on rubyforge, and RAA, and don't see an exact match. Has
> > anybody done this, or is there a library out there that I missed?
>
> You could do this with WWW::Mechanize fairly easily. There isn't a
> built-in spider system yet, but it would be a nice addition and I'm sure
> Michael would add it if it was general enough.
>
> I'm pretty familiar with Mechanize now and could help you out if you have
> a problem. The basic idea would be to recursively get pages and click links
> until you run out of links (of course, I'm sure you know this). The cool
> thing is that turning that idea into code with Mechanize is very easy, since
> it collects links for you and allows you to "click" them.

Grabbed it as a gem, trying a simple test. Oddly enough, had to add
its lib path to the LOAD_PATH to get rid of an error (uninitialized
constant WWW (NameError)).

Any docs available on this, or any public examples? Does look like
it'll give me a good start.

thanks much for the pointer.

> In case you can't tell, I really like Mechanize :)
>
> Ryan
>

--
Bill Guindon (aka aGorilla)

Ryan Leavengood

6/22/2005 8:25:00 PM

Bill Guindon said:
>
> Grabbed it as a gem, trying a simple test. Oddly enough, had to add
> its lib path to the LOAD_PATH to get rid of an error (uninitialized
> constant WWW (NameError)).

Hmmm, I didn't have to do that. Do you have rubygems in your RUBYOPT?

Mechanize does mess around with the LOAD_PATH itself because it uses new
features from the Ruby v1.9 net libraries.

But for me it worked fine, as shown in the code below.

> Any docs available on this, or any public examples? Does look like
> it'll give me a good start.

Unfortunately the docs are a bit light at the moment. I learned a lot by
reading the source though, which is well written. Once I get my web-site
up I was going to write an article on Mechanize, but for now that doesn't
help you much :)

It needs to be heavily refactored, but here is the prototype code I wrote
to help me renew books at my city library's web-site:

require 'mechanize'
require 'logger'
require 'time'

class Book
  attr_accessor :title, :author, :due_date, :checkbox

  # True when the book is due within the next two days.
  def due?
    (@due_date - Time.now) < 172800.0 # 2 days
  end

  def to_s
    "#@title by #@author, due on #@due_date\nCheckbox: #{checkbox.name}"
  end
end

# Log in by following the links to the renewal form.
agent = WWW::Mechanize.new {|a| a.log = Logger.new(STDERR) }
page = agent.get('http://www.coala...')
link = page.links.find {|l| l.node.text =~ /BOYNTON/ }
page = agent.click(link)
link = page.links.find {|l| l.node.text =~ /My Account/ }
page = agent.click(link)
link = page.links.find {|l| l.node.text =~ /Renew My Materials/ }
page = agent.click(link)
form = page.forms[1]
form.fields.find {|f| f.name == 'user_id'}.value = 'my_id_removed'
form.fields.find {|f| f.name == 'password'}.value = 'my_password_removed'

# Ask the agent to collect <td> elements from the next page.
agent.watch_for_set = {}
agent.watch_for_set['td'] = nil
page = agent.submit(form, form.buttons.first)
form = page.forms[1]
books_html = page.watches['td'].find_all {|n| n.attributes['class'] =~ /itemlisting/}

# Build a Book for each listed item.
books = []
books_html.each do |element|
  element.each_element do |subelem|
    if subelem.name == 'input' and subelem.attributes['type'] == 'checkbox'
      # Checkbox for renewal
      books << Book.new
      books[-1].checkbox = form.checkboxes.find {|c| c.name == subelem.attributes['name']}
    elsif subelem.name == 'label'
      # Book title and author
      books[-1].title = subelem.texts[0]
      books[-1].author = subelem.texts[1]
    elsif subelem.name == 'strong'
      # Due date
      books[-1].due_date = Time.parse(subelem.text)
    end
  end
end

# Tick the renewal checkbox for every book that is nearly due.
books_due = false
books.each do |book|
  if book.due?
    books_due = true
    puts "#{book.title} is due, renewing!"
    book.checkbox.checked = true
  end
end

if books_due
  page = agent.submit(form, form.buttons.first)
  puts page.body
else
  puts 'Nothing was due, have a nice day!'
end
__END__

> thanks much for the pointer.

No problem. Hope the above code helps too.

Ryan

Bill Guindon

6/22/2005 10:06:00 PM

On 6/22/05, Ryan Leavengood <mrcode@netrox.net> wrote:
> Bill Guindon said:
> >
> > Grabbed it as a gem, trying a simple test. Oddly enough, had to add
> > its lib path to the LOAD_PATH to get rid of an error (uninitialized
> > constant WWW (NameError)).
>
> Hmmm, I didn't have to do that. Do you have rubygems in your RUBYOPT?

Nope, guess it's time to add it.

> Mechanize does mess around with the LOAD_PATH itself because it uses new
> features from the Ruby v1.9 net libraries.
>
> But for me it worked fine, as shown in the code below.
>
> > Any docs available on this, or any public examples? Does look like
> > it'll give me a good start.
>
> Unfortunately the docs are a bit light at the moment. I learned a lot by
> reading the source though, which is well written. Once I get my web-site
> up I was going to write an article on Mechanize, but for now that doesn't
> help you much :)
>
> It needs to be heavily refactored, but here is the prototype code I wrote
> to help me renew books at my city library's web-site:

[helpful code snipped]

Thanks, that gives me a better idea of what can be done with it.

Now comes the fun part of parsing through relative URLs, checking for
base hrefs, and munging similar URLs (i.e., /some/file.html vs.
some/file.html, both called from the root). Should be interesting.
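
One possible way to handle those cases is to resolve every link against its page (or against the page's base href, when there is one) with Ruby's standard uri library, then compare the resulting absolute URLs. The sketch below assumes that approach. The helper name canonical_url and the example.test URLs are made up for illustration.

```ruby
require 'uri'

# Resolve href against the page URL, or against an explicit base href if
# the page declares one, and return a canonical absolute URL string.
def canonical_url(page_url, href, base_href = nil)
  base = base_href ? URI.join(page_url, base_href) : URI.parse(page_url)
  uri  = URI.join(base, href).normalize
  uri.fragment = nil # "#name" anchors point at the same page
  uri.to_s
end

root = 'http://example.test/'

# Both spellings from the root end up as the same string, so they can be
# treated as one page in the sitemap.
puts canonical_url(root, '/some/file.html')
puts canonical_url(root, 'some/file.html')

# A base href changes what relative links resolve against.
puts canonical_url('http://example.test/docs/page.html', 'file.html', '/some/')
```

Comparing canonical strings rather than raw href values is what lets a spider recognise that it has already visited a page.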

> > thanks much for the pointer.
>
> No problem. Hope the above code helps too.
>
> Ryan

--
Bill Guindon (aka aGorilla)

Shad Sterling

6/23/2005 10:37:00 PM

I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.

http://sterfish.com/lab/s...

I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.

- Shad

On 6/22/05, Bill Guindon <agorilla@gmail.com> wrote:
> I need to spider a site and build a sitemap for it. I've looked
> around on rubyforge, and RAA, and don't see an exact match. Has
> anybody done this, or is there a library out there that I missed?
>
> --
> Bill Guindon (aka aGorilla)
>
>

--

----------

Please do not send personal (non-list-related) mail to this address.
Personal mail should be sent to polyergic@sterfish.com.

Bill Guindon

6/24/2005 12:07:00 AM

On 6/23/05, Shad Sterling <polyergic@gmail.com> wrote:
> I have a site mapping tool I'm working on which does not yet read
> remote files but does map links between local files.
>
> http://sterfish.com/lab/s...
>
> I've been putting off announcing it until I have an actual page there,
> but I guess I'm too slow.

Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.

> - Shad
>
>
>
> On 6/22/05, Bill Guindon <agorilla@gmail.com> wrote:
> > I need to spider a site and build a sitemap for it. I've looked
> > around on rubyforge, and RAA, and don't see an exact match. Has
> > anybody done this, or is there a library out there that I missed?
> >
> > --
> > Bill Guindon (aka aGorilla)
> >
> >
>
>
> --
>
> ----------
>
> Please do not send personal (non-list-related) mail to this address.
> Personal mail should be sent to polyergic@sterfish.com.
>
>

--
Bill Guindon (aka aGorilla)

Shad Sterling

6/24/2005 5:55:00 AM

On 6/23/05, Bill Guindon <agorilla@gmail.com> wrote:
> On 6/23/05, Shad Sterling <polyergic@gmail.com> wrote:
> > I have a site mapping tool I'm working on which does not yet read
> > remote files but does map links between local files.
> >
> > http://sterfish.com/lab/s...
> >
> > I've been putting off announcing it until I have an actual page there,
> > but I guess I'm too slow.
>
> Thanks much. I need one that works remotely, but I'll certainly poke
> around in there, and see what I can do with it.
>

Yeah. I made this to help me work on a site I'm now maintaining,
which was a hideous mess when I got to it. I do plan to make it map
remote pages as well, but it will probably be a while.

>
> > - Shad
> >
> >
> >
> > On 6/22/05, Bill Guindon <agorilla@gmail.com> wrote:
> > > I need to spider a site and build a sitemap for it. I've looked
> > > around on rubyforge, and RAA, and don't see an exact match. Has
> > > anybody done this, or is there a library out there that I missed?
> > >
> > > --
> > > Bill Guindon (aka aGorilla)
> > >
> > >
> >
> >
> > --
> >
> > ----------
> >
> > Please do not send personal (non-list-related) mail to this address.
> > Personal mail should be sent to polyergic@sterfish.com.
> >
> >
>
>
> --
> Bill Guindon (aka aGorilla)
>
>

--

----------

Please do not send personal (non-list-related) mail to this address.
Personal mail should be sent to polyergic@sterfish.com.

Belorion

6/29/2005 8:08:00 PM

I'll throw my little snippet in, in case anyone finds it useful.

I just wrote this up to spider my rails app to give me a list of all
the urls so I can use them later in a stress test.

Not terribly advanced, but gives you the format of:

http://www.blah.co...
{tab} http://www.blah.co...

where the tabbed-out children of foo.html are the pages foo.html points to.
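
As an illustration of that output format (not the linked snippet itself, whose contents are not shown here), a sitemap held as a hash of page URL to child URLs could be printed like this. The name format_sitemap and the example.test URLs are hypothetical.

```ruby
# Render a { page_url => [child_urls] } hash as text: each page on its
# own line, followed by the pages it links to, one per tab-indented line.
def format_sitemap(sitemap)
  lines = []
  sitemap.each do |page, children|
    lines << page
    children.each { |child| lines << "\t#{child}" }
  end
  lines.join("\n")
end

sitemap = {
  'http://example.test/foo.html' => [
    'http://example.test/bar.html',
    'http://example.test/baz.html'
  ]
}
puts format_sitemap(sitemap)
```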

http://snippets.textdrive.com/pos...

-Matt

Bill Guindon

6/29/2005 8:46:00 PM

On 6/29/05, Belorion <belorion@gmail.com> wrote:
> I'll throw my little snippet in, in case anyone finds it useful.
>
> I just wrote this up to spider my rails app to give me a list of all
> the urls so I can use them later in a stress test.
>
> Not terribly advanced, but gives you the format of:
>
> http://www.blah.co...
> {tab} http://www.blah.co...
>
> where the tabbed-out children of foo.html are the pages foo.html points to.
>
> http://snippets.textdrive.com/pos...

Good stuff! It's missing a couple of features for stock sites
(handling javascript:, mailto:, #name links etc.), but those can
easily be added.
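
A rough sketch of one way such a filter might look, assuming links arrive as raw href strings. The helper name crawlable? and the list of skipped schemes are illustrative assumptions, not part of the posted snippet.

```ruby
# Schemes that never lead to a fetchable page.
SKIPPED_SCHEMES = %w[javascript mailto].freeze

# True when href looks like something a spider should follow. Rejects
# empty links, same-page "#name" anchors, and javascript:/mailto: links.
def crawlable?(href)
  href = href.to_s.strip
  return false if href.empty? || href.start_with?('#')

  # Take the scheme, if any, from the front of the link.
  scheme = href[/\A([a-z][a-z0-9+.\-]*):/i, 1]
  !SKIPPED_SCHEMES.include?(scheme.to_s.downcase)
end

hrefs = ['page.html', '#top', 'mailto:someone@example.test',
         'javascript:void(0)', 'http://example.test/other.html']
puts hrefs.select { |h| crawlable?(h) }
```

A link such as page.html#top still passes, since it points at another page. Stripping the fragment is a separate normalization step.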

Thanks much for posting it.

> -Matt
>
>

--
Bill Guindon (aka aGorilla)

Gene Tani

6/30/2005 12:30:00 AM

I noticed webfetcher in RPAbase, but haven't had a chance to play with it:

http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/webfe...

comp.lang.ruby
