comp.lang.ruby

Page crawling and URL grabbing

Patrick L.

1/27/2009 12:55:00 AM

Hey guys,
I'm trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It's my first time doing something like this. I'm reading some
documentation right now...but a hand would be greatly appreciated. I'm
not really sure how to do regex on an html file...or even find the right
stuff within that file. I'm guessing it's..

open('http://www.istockphoto.com/file_browse...') do |f|
  f.find # dot something something
end

but I really have no idea. Any help would be great - thanks in advance!
--
Posted via http://www.ruby-....

4 Answers

Jesús Gabriel y Galán

1/27/2009 8:36:00 AM

On Tue, Jan 27, 2009 at 1:55 AM, Patrick L. <leahy16@gmail.com> wrote:
> Hey guys,
> I'm trying to write an application that goes onto a website (istockphoto
> specifically), opens up istockphoto.com/file_browse.php and grabs the
> URLs of the photos that appear there.
>
> It's my first time doing something like this. I'm reading some
> documentation right now...but a hand would be greatly appreciated. I'm
> not really sure how to do regex on an html file...or even find the right
> stuff within that file. I'm guessing its..

Generally speaking, regular expressions are not the best tool to extract
information from HTML. Take a look at these other tools:

Mechanize
Hpricot
Scrubyt
Nokogiri

This is an example that might get you started, although I recommend taking
a look at the above tools:

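require 'open-uri'
require 'hpricot'

# Fetch the page and parse it into an Hpricot document.
h = Hpricot(open("http://www.istockphoto.com/file_browse..."))
# Select every element whose class attribute is searchImg.
imgs = h.search("//[@class = searchImg]")
# Collect the src attribute of each matching element.
imgs.map {|img| img["src"]}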

# => ["http://www2.istockphoto.com/file_thumbview_approve/8137463/1/istockphoto_8137463-budapest-by-night...,
"http://www2.istockphoto.com/file_thumbview_approve/8139472/1/istockphoto_8139472-four-antique-wood-tennis-racquets...,
"http://www2.istockphoto.com/file_thumbview_approve/6731990/1/istockphoto_6731990-two-female-lovers...,
"http://www2.istockphoto.com/file_thumbview_approve/8308377/1/istockphoto_8308377-beauty...,
"http://www2.istockphoto.com/file_thumbview_approve/6349299/1/istockphoto_6349299-lovers-interested-in-smth...,
"http://www2.istockphoto.com/file_thumbview_approve/8322403/1/istockphoto_8322403-happy-piggy-bank...,
"http://www2.istockphoto.com/file_thumbview_approve/8138976/1/istockphoto_8138976-tower-guard-of-cetara-little-town-in-amalfi-coast-italy...,
"http://www2.istockphoto.com/file_thumbview_approve/8322394/1/istockphoto_8322394-yellow-red-paper...,
"http://www1.istockphoto.com/file_thumbview_approve/4660654/1/istockphoto_4660654-the-art-of-eye-shadows...,
"http://www1.istockphoto.com/file_thumbview_approve/8301075/1/istockphoto_8301075-3d-render-of-the-olive-tree...,
"http://www1.istockphoto.com/file_thumbview_approve/6921717/1/istockphoto_6921717-manicure...,
"http://www2.istockphoto.com/file_thumbview_approve/8322391/1/istockphoto_8322391-pomegranate...,
"http://www2.istockphoto.com/file_thumbview_approve/8138975/1/istockphoto_8138975-junger-mann-seitlich...,
"http://www2.istockphoto.com/file_thumbview_approve/8139815/1/istockphoto_8139815-winter...,
"http://www2.istockphoto.com/file_thumbview_approve/8137153/1/istockphoto_8137153-beadworkafrican_pictureframe_p3406-jpg...,
"http://www2.istockphoto.com/file_thumbview_approve/8139787/1/istockphoto_8139787-statue-of-liberty...,
"http://www2.istockphoto.com/file_thumbview_approve/8322388/1/istockphoto_8322388-cold-winter-day...,
"http://www2.istockphoto.com/file_thumbview_approve/8139602/1/istockphoto_8139602-statue-of-liberty...,
"http://www2.istockphoto.com/file_thumbview_approve/8137801/1/istockphoto_8137801-litchi...,
"http://www2.istockphoto.com/file_thumbview_approve/8139406/1/istockphoto_8139406-statue-of-liberty...,
"http://www1.istockphoto.com/file_thumbview_approve/6850893/1/istockphoto_6850893-polka-dot-wedding-cake...,
"http://www2.istockphoto.com/file_thumbview_approve/8139802/1/istockphoto_8139802-snow-woman...,
"http://www2.istockphoto.com/file_thumbview_approve/8322364/1/istockphoto_8322364-white-cherry-blossom...,
"http://www2.istockphoto.com/file_thumbview_approve/8139808/1/istockphoto_8139808-airport...,
"http://www2.istockphoto.com/file_thumbview_approve/8322357/1/istockphoto_8322357-ciruit...,
"http://www2.istockphoto.com/file_thumbview_approve/8139597/1/istockphoto_8139597-cheese-and-wine...,
"http://www2.istockphoto.com/file_thumbview_approve/8138075/1/istockphoto_8138075-employee-of-office...]

You should customize the criteria used to choose the images (in my little
example I selected all tags that had the class searchImg, which at a
quick glance seemed to be what you wanted, but double check).
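
One way to tighten the criteria after the search is to filter the src values themselves. This is only a rough sketch using the standard library: the sample paths, the "/thumbs/" pattern and the example.com page URL are all made up for illustration, so substitute whatever the real page gives you.

```ruby
require 'uri'

# Hypothetical page URL; relative src values are resolved against it.
PAGE_URL = "http://example.com/browse.php"

# Hypothetical src values, as a scrape might return them.
srcs = [
  "/thumbs/101/sunset.jpg",
  "http://example.com/static/logo.png",
  "/thumbs/102/harbour.jpg",
  nil # an element that had no src attribute
]

photo_urls = srcs.compact                                    # drop missing attributes
                 .select { |src| src.include?("/thumbs/") }  # keep thumbnails only
                 .map { |src| URI.join(PAGE_URL, src).to_s } # make the URLs absolute
                 .uniq

puts photo_urls
```

Filtering on the URL is handy when the markup gives you nothing better to select on, but a more specific selector is usually the cleaner fix.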

I recall reading somewhere that Nokogiri has better XPath support than
Hpricot, so check it out.

Jesus.

Miroslaw Niegowski

1/27/2009 8:39:00 AM

2009/1/27 Patrick L. <leahy16@gmail.com>:
> Hey guys,
> I'm trying to write an application that goes onto a website (istockphoto
> specifically), opens up istockphoto.com/file_browse.php and grabs the
> URLs of the photos that appear there.
>
> It's my first time doing something like this. I'm reading some
> documentation right now...but a hand would be greatly appreciated. I'm
> not really sure how to do regex on an html file...or even find the right
> stuff within that file. I'm guessing its..
>
> open('http://www.istockphoto.com/file_browse...) do |f|
> f.find # dot something something
> end

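Try Mechanize.
It's easy:

require 'mechanize'

# Older releases, as in 2009, name the class WWW::Mechanize.
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://www.istockphoto.com/file_brows...')
# Links whose text matches /jpg/ (older releases: page.links.text(/jpg/)).
page.links_with(:text => /jpg/)
...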

Patrick L.

1/27/2009 11:41:00 PM

Miroslaw Niegowski wrote:
> 2009/1/27 Patrick L. <leahy16@gmail.com>:
>> open('http://www.istockphoto.com/file_browse...) do |f|
>> f.find # dot something something
>> end
>
>
> Try Mechanize.
> It's easy :
>
> agent = WWW::Mechanize.new
> agent.user_agent_alias='Mac Safari'
> page = agent.get('http://www.istockphoto.com/file_brows...);
> page.links.text(/jpg/)
> ...

That's great, or it sounds great. Is there any documentation aside from
blog posts and this: http://mechanize.rubyforge.org/... ? What
did you use to learn it?

Tsunami Script

1/27/2009 11:45:00 PM

Mechanize is very easy and intuitive. You could basically learn to
use it just by playing with it in irb. Combine that with reading
the docs, and you're good to go.
