Asp Forum - Re: Problem with getting info from several websites

George Malamidis

6/7/2007 5:12:00 PM

Hi,

Is something like this what you have in mind?

doc = Hpricot(open("http://www.securityfocus.com/bid...))
p (doc/'#vulnerability')

George

On 7 Jun 2007, at 16:28, Tom Bombadil wrote:

> Hi there,
>
> The code below gives me the HTML for a specific id on the site
> www.securityfocus.com. What I'm trying to do is get the contents of the
> div id="vulnerability" only, but for all the different ids available,
> currently around 25000. I think it is something like: next_page 'Next'>',
> :limit => 25000, but where do I need to put it, and how can I get the
> div contents only? I appreciate your help.
>
> require 'rubygems'
> require 'hpricot'
> require 'open-uri'
>
> # load the Securityfocus home page (id 715 to start)
> doc = Hpricot(open("http://www.securityfocus.com/bid...))
>
> # print the altered HTML
> puts doc
>
>
> -tom

6 Answers

Peter Szinek

6/8/2007 8:03:00 AM

Tom Bombadil wrote:
> George,
>
> Thanks, however p (doc/'#vulnerability') still gives me the whole site...
> I only need the content of the div id="vulnerability". How do I proceed
> with that?

Does this solve your problem?

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid...))
p doc/"div[@id='vulnerability']"

If you don't want to scrape the table any further, the above solution
should be enough. If you want to go on and drill down into the table,
you could check out scRUBYt! (http://s...), a Ruby web scraping
tool designed to handle such tasks.

Cheers,
Peter
__
http://www.rubyra... :: Ruby and Web2.0 blog
http://s... :: Ruby web scraping framework
http://rubykitch... :: The indexed archive of all things Ruby.

George Malamidis

6/8/2007 8:09:00 AM

Hi,

That's odd. If I do something like:

v = doc/'#vulnerability'

p v.to_html
p v.inner_html
p v.inner_text

I only see results relevant to the content of the 'vulnerability' div
(no header, banners, navigation, etc). Maybe if you save your results
to a file it would be easier to inspect by looking at them?

George

On 8 Jun 2007, at 08:32, Tom Bombadil wrote:

> George,
>
> Thanks, however p (doc/'#vulnerability') still gives me the whole site...
> I only need the content of the div id="vulnerability". How do I proceed
> with that?
>
> So long,
> Tom

Peter Szinek

6/8/2007 8:47:00 AM

Sorry, I meant:

p doc/"//div[@id='vulnerability']"

Cheers,
Peter

Peter Szinek

6/8/2007 11:33:00 AM

Tom.

Man, you are mixing pure Hpricot and scRUBYt! together - This syntax:

> securityfocus_data.to_xml.write($stdout, 1)

is from scRUBYt!, but you are gathering the data with Hpricot - how
would you like to pull this off?
Maybe I don't get something, but I am a bit confused...

Cheers,
Peter

George Malamidis

6/10/2007 12:37:00 PM

Hello,

Does this work for you?

%w(rubygems hpricot open-uri).each { |e| require e }

(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus...{id}"))
  p (doc/'#vulnerability').inner_html
end
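
For a run across thousands of ids, the loop above stops at the first page that fails to load, because open raises on an HTTP error. A minimal sketch of a more forgiving loop follows, under some assumptions: each_page, url_for, fetch and delay are hypothetical names, the example.invalid URL is a placeholder for the address that is cut off in this thread, and a stub fetcher stands in for the network so the sketch runs offline. On current Ruby versions open-uri is called as URI.open rather than the bare open used above.

```ruby
require 'open-uri'
require 'stringio'

# Hypothetical helper (not part of Hpricot or open-uri): yields [id, html]
# for every id that can be fetched and returns the ids that failed.
# url_for builds the URL for an id; fetch turns a URL into an HTML string.
def each_page(ids, url_for:, fetch: ->(url) { URI.open(url).read }, delay: 0)
  failed = []
  ids.each do |id|
    begin
      yield id, fetch.call(url_for.call(id))
    rescue OpenURI::HTTPError => e
      # A missing or broken page should not abort the whole run.
      warn "skipping #{id}: #{e.message}"
      failed << id
    end
    sleep delay if delay > 0   # optional pause between requests
  end
  failed
end

# Offline demonstration. The URL and the page content are made up.
pages = { "http://example.invalid/bid/1" => "<div id='vulnerability'>one</div>" }
stub = lambda do |url|
  pages.fetch(url) { raise OpenURI::HTTPError.new("404 Not Found", StringIO.new) }
end

failed = each_page(1..2,
                   url_for: ->(id) { "http://example.invalid/bid/#{id}" },
                   fetch: stub) do |id, html|
  puts "fetched id #{id}"
  # With Hpricot loaded, html could be passed on as Hpricot(html) here.
end
puts "failed ids: #{failed.inspect}"
```

Leaving out the fetch: argument makes the helper use URI.open, and a non-zero delay keeps the requests from hitting the site back to back.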

George

On 10 Jun 2007, at 10:26, Tom Bombadil wrote:

> Hi there,
>
> I'm trying to get data from a couple of websites at the same time
> by using:
>
> # should get the content from the pages with the ids 1 to 10...
> (1..10).each { |p| print p }
>
> # load the Securityfocus home page (id 1 to start)
> doc = Hpricot(open("http://www.securityfocus.com/bid/1&...
> www.securityfocus.com/bid/715>
> "))
>
> # get the content of the div id ="vulnerability"
> p =(doc/'#vulnerability').inner_html
>
> # prints div id = 'vulnerability'
> puts p
>
> Must I use an array?
>
> Thanks,
> Tom

George Malamidis

6/11/2007 3:41:00 PM

Hello,

(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus...{id}"))
  File.open("#{id}.txt", "w") do |f|
    f << (doc/'#vulnerability').inner_html
  end
end

This should write each page to a separate file.

You can use stuff like gsub to replace/remove any unwanted characters.
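
As a rough illustration of the gsub idea, the sketch below strips tags and collapses tabs and newlines so that each piece of text ends up on its own line. clean_lines is a hypothetical helper and the sample string is made up for the example. With Hpricot, calling inner_text on the element is probably a better way to drop tags than the crude regex used here.

```ruby
# Hypothetical helper: returns the non-empty lines of text with tags,
# tabs and surplus whitespace removed.
def clean_lines(text)
  text
    .gsub(/<[^>]*>/, "")                          # crude tag removal
    .gsub(/\t/, " ")                              # tabs become spaces
    .split(/\n/)                                  # one candidate per line
    .map { |line| line.gsub(/\s+/, " ").strip }   # collapse inner whitespace
    .reject(&:empty?)                             # drop blank lines
end

# Made-up sample resembling indented table markup.
sample = "<td>\n\t\tBugtraq ID:\n\t</td>\n<td>\n\t\t715\n\t</td>\n"
puts clean_lines(sample)
```

The same array can be written to a file with f.puts instead of being printed to the terminal.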

George

On 11 Jun 2007, at 13:44, Tom Bombadil wrote:

> Indeed, I appreciate it!! I actually have 2 more questions regarding
> this topic:
> 1) I'd like to create either a .txt or an .xml file for each id. Can I
> go ahead and use aFile = File.new("bid.txt", "w") and aFile.close?
> 2) How can I chomp the \n and \t characters in the terminal output?
> I only need the text w/o tags, ideally on separate lines...
> I owe you a pitcher :-)

comp.lang.ruby
