Asp Forum - scraping web pages for cisco products

Charles Pareto

9/19/2007 5:43:00 PM

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive that much input so I thought I would ask
again. Here are the requirements. I have a list of 2000 URLs that all
have Cisco in their domain names.
(ex. http://www.soldb...
http://www.cisc...
http://www.ciscobo...
http://www.cis...

and I want to scrape through them and determine which websites are
selling new Cisco products. I'm actually looking for around 20 or so
products (ex. WIC-1T, NM-4E, WS-G2950-24). One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that. Does
anyone have a different/better approach? Any help would be appreciated.
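
As a rough illustration of the product check described above, here is a minimal sketch that scans a page's text for the product codes. The method name `products_mentioned` and the sample text are hypothetical, and it assumes the page text has already been fetched and stripped of markup. It only detects a mention; it does not tell new from used, or for sale from merely listed.

```ruby
# Product codes from the post; extend with the full list of ~20.
PRODUCTS = %w[WIC-1T NM-4E WS-G2950-24].freeze

# Returns the product codes that appear in +text+, ignoring case.
# The lookarounds stop "WIC-1T" from matching inside a longer code
# such as "XWIC-1T" or "WIC-1T-XL".
def products_mentioned(text, products = PRODUCTS)
  products.select do |code|
    text =~ /(?<![A-Za-z0-9-])#{Regexp.escape(code)}(?![A-Za-z0-9-])/i
  end
end

page_text = "In stock: Cisco wic-1t serial card, ships today"
p products_mentioned(page_text)  # prints ["WIC-1T"]
```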
--
Posted via http://www.ruby-....

11 Answers

Konrad Meyer

9/19/2007 6:13:00 PM

Quoth Glen Holcomb:
> On 9/19/07, Glen Holcomb <damnbigman@gmail.com> wrote:
> >
> > On 9/19/07, Chuck Dawit <chuckdawit@gmail.com> wrote:
> > >
> > >
> > >
> > > I submitted a post a few days ago about scraping the web for Cisco
> > > products. I didn't receive that much input so I thought I would ask
> > > again. Here are the requirments. I have a list of 2000 urls that all
> > > have Cisco in its domain name.
> > > (ex. http://www.soldb...
> > > http://www.cisc...
> > > http://www.ciscobo...
> > > http://www.cis...
> > >
> > > and I want to scrape through them and determine which websites are
> > > selling new cisco products, I'm actually looking for around 20 or so
> > > products (ex. WIC-1T, NM-4E, WS-G2950-24). One idea I was given was to
> > > split the pages into ones with forms and those without forms. Those
> > > without forms probably wont have anything for sale so I can eliminate
> > > those. But then I really don't know how to handle after that. Does
> > > anyone have a different/better approach? Any help would be appreciated.
> > > --
> > > Posted via http://www.ruby-... .
> > >
> > >
> > Not to make your problem worse but you will need to differentiate between
> > new and used equipment too.
> >
> > --
> > "Hey brother Christian with your high and mighty errand, Your actions
> > speak so loud, I can't hear a word you're saying."
> >
> > -Greg Graffin (Bad Religion)
>
>
> I don't remember who but someone suggested using Froogle and parsing that
> output. Froogle and a few other sites like Pricewatch might be a far less
> complicated approach, you won't find all of them but then again I don't
> think you can possibly find everything anyway.
>
> --
> "Hey brother Christian with your high and mighty errand, Your actions speak
> so loud, I can't hear a word you're saying."
>
> -Greg Graffin (Bad Religion)

That was me. Seems to me you shouldn't parse froogle so much as just use it.
Writing a script is a lot more work and won't get you what you want; froogle
will.

--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

Charles Pareto

9/19/2007 6:50:00 PM

Konrad Meyer wrote:
> Quoth Glen Holcomb:
>> > > (ex. http://www.soldb...
>> > > anyone have a different/better approach? Any help would be appreciated.
>> >
>> so loud, I can't hear a word you're saying."
>>
>> -Greg Graffin (Bad Religion)
>
> That was me. Seems to me you shouldn't parse froogle so much as just use
> it.
> Writing a script is a lot more work and won't get you what you want;
> froogle
> will.

But see I need to use only the list that I have with Cisco in the domain
name. (ex. usedcisco.com, ciscoequipment.com) Can froogle look up
website names like the ones I have?
--
Posted via http://www.ruby-....

Konrad Meyer

9/19/2007 6:59:00 PM

Quoth Chuck Dawit:
> Konrad Meyer wrote:
> > Quoth Glen Holcomb:
> >> > > (ex. http://www.soldb...
> >> > > anyone have a different/better approach? Any help would be appreciated.
> >> >
> >> so loud, I can't hear a word you're saying."
> >>
> >> -Greg Graffin (Bad Religion)
> >
> > That was me. Seems to me you shouldn't parse froogle so much as just use
> > it.
> > Writing a script is a lot more work and won't get you what you want;
> > froogle
> > will.
>
> But see I need to use only the list that I have with Cisco in the domain
> name. (ex. usedcisco.com, ciscoequipment.com) Can froogle look up
> website names like the ones I have?

Assuming it uses a similar interface to google (I don't know much about it),
yes, "site:usedcisco.com" etc.
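
To make the site: suggestion concrete, the sketch below builds one query string per site and product. The method name `site_queries` is hypothetical, and whether Froogle honours the site: operator is not confirmed anywhere in this thread, so treat the output as candidate queries to try by hand first.

```ruby
# Builds one query string per (site, product) pair, for example:
#   site:usedcisco.com "WIC-1T"
# Assumes the search engine accepts a site: restriction and quoted terms.
def site_queries(sites, products)
  sites.flat_map do |site|
    products.map { |code| %(site:#{site} "#{code}") }
  end
end

sites    = %w[usedcisco.com ciscoequipment.com]
products = %w[WIC-1T NM-4E]
puts site_queries(sites, products)
```

With 2000 sites and 20 products this produces 40,000 queries, which is worth keeping in mind if the search engine limits request volume.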

Why do you need the list? Just search for anything below 60% MSRP, and ANY
website selling counterfeit cisco devices should come up.

--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

Charles Pareto

9/19/2007 7:18:00 PM

Glen Holcomb wrote:
> On 9/19/07, Chuck Dawit <chuckdawit@gmail.com> wrote:
>> >
>> Posted via http://www.ruby-....
>>
>>
> Why is the domain important if you are looking for fraudulent equipment
> based on selling price? I don't think you can search by url, I don't
> see
> why anyone looking for a specific product would need to do that.
>
> --
> "Hey brother Christian with your high and mighty errand, Your actions
> speak
> so loud, I can't hear a word you're saying."
>
> -Greg Graffin (Bad Religion)

I'm looking for copyright infringement on Cisco's name too. So I'm not only
looking for those companies that are selling counterfeit Cisco equipment
but also those who are infringing on Cisco's name.
--
Posted via http://www.ruby-....

brabuhr

9/19/2007 8:25:00 PM

On 9/19/07, Chuck Dawit <chuckdawit@gmail.com> wrote:
> One idea I was given was to
> split the pages into ones with forms and those without forms. Those
> without forms probably wont have anything for sale so I can eliminate
> those. But then I really don't know how to handle after that.

Here's a naive implementation of binning by forms:

> cat sites
www.cnn.com
www.usedcisco.com
www.rubyforge.org
slashdot.org
technocrat.net
bk.com

> cat firstbin.rb
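#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

# WWW::Mechanize is the class name in the mechanize release used here;
# later releases expose it as Mechanize.
agent = WWW::Mechanize.new

sites = File.readlines("sites")
bin1 = []  # sites with at least one search form
bin2 = []  # sites with forms, none of which looks like a search form
bin3 = []  # sites with no forms at all

sites.each do |site|
  site.chomp!

  page = agent.get "http://#{site}"
  forms = page.forms
  # A form counts as a search form if its name or action mentions "search".
  search_forms = forms.select { |f|
    (f.name and f.name.match(/search/i)) or
    (f.action and f.action.to_s.match(/search/i))
  }

  if search_forms.size > 0
    bin1 << site
  elsif forms.size > 0
    bin2 << site
  else
    bin3 << site
  end
end

p bin1
p bin2
p bin3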

> ruby firstbin.rb
["www.cnn.com", "www.rubyforge.org", "slashdot.org"]
["www.usedcisco.com", "technocrat.net"]
["bk.com"]

Charles Pareto

9/19/2007 11:32:00 PM

brabuhr

9/20/2007 1:24:00 AM

> With this method do I need to know the name of the form to use it? With
> mechanize I thought you had to look at the form name first before you
> could use it?

It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in the form and submitting them; but most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be edited). I don't see any way to
avoid customizing the code for each site (though if you get a good
framework built, the effort per site should decrease).

Charles Pareto

9/20/2007 5:37:00 AM

unknown wrote:
>> With this method do I need to know the name of the form to use it? With
>> mechanize I thought you had to look at the form name first before you
>> could use it?
>
> It helps to know someway to distinguish the form you're looking for
> from the other forms on the page. It would be possible to iterate
> through all the forms on a page, entering some text into the text
> fields in the form and submitting them; but, most of the time the
> script would probably be in either the wrong form or the wrong field
> in the right form (and, of course, there are other issues, e.g. forms
> that require multiple fields to be edited). I don't see anyway to
> avoid customizing the code for each site (though, if you get a good
> framework built the effort per site should decrease?).

I agree, but I have around 2000 sites to look at and I can't look at each
and every form; that would take way too long. Do you think a better
approach would be to use a search engine's API to search for the products
on each site? I've never used any search engine API. If I know the
website name, the product name, and a price I want, can I use those
parameters in the search to find results?
--
Posted via http://www.ruby-....

Brad Phelan

9/20/2007 7:29:00 AM

Chuck Dawit wrote:
> unknown wrote:
>>> With this method do I need to know the name of the form to use it? With
>>> mechanize I thought you had to look at the form name first before you
>>> could use it?
>> It helps to know someway to distinguish the form you're looking for
>> from the other forms on the page. It would be possible to iterate
>> through all the forms on a page, entering some text into the text
>> fields in the form and submitting them; but, most of the time the
>> script would probably be in either the wrong form or the wrong field
>> in the right form (and, of course, there are other issues, e.g. forms
>> that require multiple fields to be edited). I don't see anyway to
>> avoid customizing the code for each site (though, if you get a good
>> framework built the effort per site should decrease?).
>
> I agree but I have around 2000 sites to look at and I can't look at each
> and every form, that would take way to long. Do you think a better
> approach would be to use a search engines API to search for the products
> on each site? I've never used any search engine API, if I know the
> website name and the product name and a price I want can I use those
> parameters in the search to find results?

This query seems to work

site:solecentral.com.au OR site:xtargets.com AND crocs

I advertise my brother's e-commerce site on my site, and they both contain
the same keyword "crocs". Google returns all the pages from my site and
his site that contain the word "crocs". However, I am not sure how far
the query scales, as I think Google truncates the search string after
some length, so adding 2000 sites to the query string might break it.

Not sure if the same query trick also works in froogle as well as
vanilla google.
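
If the query length is the limiting factor, one option is to split the site list into several smaller queries. The sketch below does that; the method name `batched_queries` and the default of 10 sites per query are assumptions rather than known limits, and the parentheses are a guess at how the engine groups OR terms.

```ruby
# Groups sites into queries holding at most +per_query+ site: terms each.
# per_query is a guess; tune it to whatever the search engine accepts.
def batched_queries(sites, keyword, per_query = 10)
  sites.each_slice(per_query).map do |group|
    "(" + group.map { |s| "site:#{s}" }.join(" OR ") + ") #{keyword}"
  end
end

puts batched_queries(%w[solecentral.com.au xtargets.com], "crocs")
# prints: (site:solecentral.com.au OR site:xtargets.com) crocs
```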

Hope this is somewhat helpful.

--
Brad
http://xt...

Charles Pareto

9/20/2007 8:26:00 PM

unknown wrote:
> On 9/19/07, Chuck Dawit <chuckdawit@gmail.com> wrote:
>> One idea I was given was to
>> split the pages into ones with forms and those without forms. Those
>> without forms probably wont have anything for sale so I can eliminate
>> those. But then I really don't know how to handle after that.
>
> Here's a naive implementation of binning by forms:
>

>
> page = agent.get "http://#{site}"
> forms = page.forms
> search_forms = forms.select{|f|
> (f.name and f.name.match /search/i) or
> (f.action and f.action.to_s.match /search/i)
> }
>
> if search_forms.size > 0
> bin1 << site
> elsif forms.size > 0
> bin2 << site
> else
> bin3 << site
> end
> end
>

I'm checking the number of forms as in the code above, but when the
script gets to the 13th URL it just exits. Does anyone know
why? How can I run a check on this?
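
One plausible cause, though the thread does not confirm it, is an unhandled exception on that URL (for example an HTTP error, or a response that is not an HTML page and so has no forms). A minimal sketch of a loop that records the failure and carries on is shown below. The names `classify_all` and `fake_classify` are hypothetical, and the lambda is a stand-in for the Mechanize fetch so the sketch runs offline.

```ruby
# Classifies each site with +classify+ (any callable taking a site name),
# recording failures instead of letting one exception end the whole run.
def classify_all(sites, classify)
  results  = {}
  failures = {}
  sites.each_with_index do |site, i|
    begin
      results[site] = classify.call(site)
    rescue StandardError => e
      failures[site] = "#{e.class}: #{e.message}"
      warn "site #{i + 1} (#{site}) failed: #{failures[site]}"
    end
  end
  [results, failures]
end

# Stand-in for the real fetch-and-count-forms step (an assumption).
fake_classify = lambda do |site|
  raise ArgumentError, "not an HTML page" if site == "bad.example"
  :no_forms
end

results, failures = classify_all(%w[ok.example bad.example also-ok.example], fake_classify)
p results
p failures
```

In the real script the lambda body would hold the `agent.get` call and the binning logic, and the warning printed for site 13 should show which exception is ending the run.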
--
Posted via http://www.ruby-....

comp.lang.ruby
