
comp.lang.ruby

screen scraping programaticprogrammer.com?

7stud 7stud

9/13/2007 12:40:00 PM

The following is from "Programming Ruby 2nd" p.133:

----
require "net/http"

h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")

if response.message == "OK"
  puts response.body.scan(/<img src="(.*?)"/m).uniq
end
----

It doesn't work: nothing is printed. So, I modified it a little:

-----
require "net/http"

h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")

puts response.message
puts response.code

if response.message == "OK"
  puts "*"
  puts response.body.scan(/<img src="(.*?)"/m).uniq
end
-----

and the output was:

Found
302


I clicked a link on their home page and tried to access the page that
was displayed, but I got the same result. What am I doing wrong?
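
(A quick way to see where that 302 is pointing, if anyone wants to check: the redirect target should be in the response's Location header. Rough sketch, same host as above:)

-----
require "net/http"

h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")

puts response.code          # prints "302"
puts response["location"]   # prints the URL the server is redirecting to
-----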
--
Posted via http://www.ruby-....

3 Answers

Ronald Fischer

9/13/2007 12:47:00 PM


> h = Net::HTTP.new("www.programaticprogrammer.com", 80)
> response = h.get("/index.html")
> [...]
> and the output was:
>
> Found
> 302
>
> I clicked a link on their home page and tried to access the page that
> was displayed, but I got the same result. What am I doing wrong?

Wrong URL. How about using www.pragmaticprogrammer.com instead?
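
For reference, the book's snippet with the host spelled correctly (untested sketch; it assumes the site still serves /index.html with a plain 200 OK):

-----
require "net/http"

h = Net::HTTP.new("www.pragmaticprogrammer.com", 80)
response = h.get("/index.html")

# "OK" means a plain 200 response; redirects and errors are skipped.
if response.message == "OK"
  puts response.body.scan(/<img src="(.*?)"/m).uniq
end
-----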

I think the prOGRamatic programmers are slowly dying out anyway in
favour of the pragmatic programmers.... ;-)

Ronald
--
Ronald Fischer <ronald.fischer@venyon.com>
Phone: +49-89-452133-162

7stud 7stud

9/13/2007 12:54:00 PM


Ronald Fischer wrote:
> Wrong URL. How about using www.pragmaticprogrammer.com instead?
>

Whoops. Thanks.
--
Posted via http://www.ruby-....

John Joyce

9/14/2007 3:58:00 PM


Just remember that with screen scraping you are relying on a file served
by a web server, and on top of that you are usually relying on a very
particular structure in that document. Web sites change frequently and
without notice, and even the smallest change can break your scraper. So
inspect the various pages of the sites you plan to scrape carefully, and
write your scraper to check for the elements it expects and to carry on
gracefully when one isn't found.

With some clever programming and a little knowledge of the site, you
can make a simple but smart scraper. Even so, it will still be pretty
fragile: HTML/XHTML is loose and human-language-like, full of ambiguity
and implicit meaning that a human reader picks up easily but a machine
has to work hard at, and often still gets wrong.
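
A rough sketch of that idea (the fetch helper, the single-redirect
handling, and the host are illustrative, not anything specific to this
thread's site): fetch a page, follow at most one redirect, and treat
missing markup as "nothing found" rather than an error.

-----
require "net/http"
require "uri"

# Hypothetical helper: fetch a page and follow at most one redirect.
def fetch(host, path, port = 80)
  response = Net::HTTP.new(host, port).get(path)
  if response.is_a?(Net::HTTPRedirection) && response["location"]
    # Assume the redirect stays on the same host; follow it once.
    response = Net::HTTP.new(host, port).get(URI.parse(response["location"]).path)
  end
  response
end

response = fetch("www.pragmaticprogrammer.com", "/index.html")
if response.is_a?(Net::HTTPSuccess)
  images = response.body.scan(/<img src="(.*?)"/m).uniq
  # Missing markup is treated as "nothing found", not a crash.
  puts(images.empty? ? "no <img> tags found" : images)
else
  puts "Got #{response.code} #{response.message}; not scraping this page."
end
-----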