comp.lang.ruby

Screen Scraping Advice

Charles Pareto

9/17/2007 5:25:00 PM

I work for Cisco Systems in San Jose, CA. I proposed a project to perform
a screen scrape/spider hack to go out and look for websites with the
Cisco name in their domain names (ex. usedcisco.com, ciscoequipment.com,
etc.) and see if those companies are selling Cisco equipment. I want to
look for specific products (ex. WIC-1T, NM-4E, WS-2950-24) on these
websites and see if they are being sold for under 60% of their MSRP. We
are trying to track down companies that are selling counterfeit
equipment. So I started by downloading the DNS list of all domain names
so I could read through it and extract all domain names with Cisco in
them. Once I've done that I want to go to each page and search/scrape
for these products, but I don't really know the best approach to take.
Can anyone give me advice? Should I just do keyword searches for those
20+ products? Or is there a better approach?
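
For the domain-filtering step, a minimal sketch would be something like
the following, assuming the downloaded list is a plain text file with one
domain per line (both file names here are placeholders):

# keep only the domains that contain "cisco" (case-insensitive);
# "domains.txt" and "cisco_domains.txt" are placeholder file names
File.open("cisco_domains.txt", "w") do |out|
  File.foreach("domains.txt") do |line|
    out.puts line if line =~ /cisco/i
  end
end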
--
Posted via http://www.ruby-....

11 Answers

John Joyce

9/17/2007 6:36:00 PM



On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:

> I work for Cisco Systems in San Jose Ca. I proposed a project to
> perform
> a screen scrape/spider hack to go out and look for websites with the
> Cisco name in its domain name (ex. usedcisco.com, ciscoequipment.com,
> etc.) and see if those companies are selling Cisco equipment. I
> want to
> look for specific products (ex. WIC-1T, NM-4E, WS-2950-24) on these
> websites and see if they are being sold for under 60% of their
> MSRP. We
> are trying to track down companies that are selling counterfeit
> equipment. So I started by downloading the DNS list of all domain
> names
> so I could read through that and extract all domain names with
> Cisco in
> it. Once I do that I want to go to each page and search/scrape for
> these
> products, but I don't really know the best approach to take. Can
> anyone
> give me advice? Should I just do keyword searches for those 20+
> products? Or is there a better approach?
> --
> Posted via http://www.ruby-....
>
Doesn't sound like much scraping, just searching text for a string.
You could even do a lot of that work with Google.
Otherwise, just download each page and search it for the string, and
write a data file of your own that records which line the string was
found on. Scraping proper is for pulling data out of another site's
DOM structure, e.g. to grab a weather report.
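
A rough sketch of that idea, assuming the pages have already been
downloaded to disk (the directory, output file, and product list are
placeholders for the real download directory and the 20+ part numbers):

# search already-downloaded pages for product codes and note where they appear
PRODUCTS = %w[WIC-1T NM-4E WS-2950-24]

File.open("hits.txt", "w") do |out|
  Dir.glob("pages/*.html") do |path|
    File.foreach(path).with_index(1) do |line, lineno|
      PRODUCTS.each do |code|
        out.puts "#{path}:#{lineno}: #{code}" if line.include?(code)
      end
    end
  end
end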


Charles Pareto

9/17/2007 6:53:00 PM


John Joyce wrote:
> On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:
>
>> equipment. So I started by downloading the DNS list of all domain
>> Posted via http://www.ruby-....
>>
> Doesn't sound like much scraping, just searching text for a string.
> You could even do a lot of that work with Google.
> but just download the file and search for a string. create a data
> file of your own that tells you what line you found the string.
> Scraping is really for getting data from other sites, using the DOM
> structure they have to get (for example) the weather report.


Well, I disagree. Once I have all the websites with Cisco in their
domain names and start looking through them, there are lots of pages
that won't show me any info unless I run a search within the site
itself (e.g. usedcisco.com). To find specific items on such a site I
would have to use its own search bar to look for, say, "WIC-1T", and
then check whether the price is below a specific amount for that item.
--
Posted via http://www.ruby-....

Konrad Meyer

9/17/2007 7:27:00 PM


Quoth Chuck Dawit:
> John Joyce wrote:
> > On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:
> >
> >> equipment. So I started by downloading the DNS list of all domain
> >> Posted via http://www.ruby-....
> >>
> > Doesn't sound like much scraping, just searching text for a string.
> > You could even do a lot of that work with Google.
> > but just download the file and search for a string. create a data
> > file of your own that tells you what line you found the string.
> > Scraping is really for getting data from other sites, using the DOM
> > structure they have to get (for example) the weather report.
>
>
> Well, I disagree. Once I have all the websites with Cisco in its domain
> name and I look through them, there are lots of pages that won't show me
> info unless I do a search within that page itself. (ex. usedcisco.com)
> To search for specific items on this website I would have to use the
> search bar located within its page to search for say "WIC-1T" and then
> search for a price below a specific amount for that item.

Do a search on Froogle for "cisco productname" with the max price set
to 60% of MSRP. That should turn up a few hits.
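
If you go that route, the max price to type in is just 60% of each
product's list price; e.g. (the MSRP figures below are made-up
placeholders, not real Cisco list prices):

# print the Froogle search term and price ceiling (60% of MSRP) per product;
# list prices here are placeholder values only
MSRP = {
  "WIC-1T"     => 400.00,
  "NM-4E"      => 1200.00,
  "WS-2950-24" => 900.00,
}

MSRP.each do |product, list_price|
  printf "search: \"cisco %s\"  max price: $%.2f\n", product, list_price * 0.60
end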

HTH,
--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

John Joyce

9/17/2007 7:44:00 PM



On Sep 17, 2007, at 1:52 PM, Chuck Dawit wrote:

> John Joyce wrote:
>> On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:
>>
>>> equipment. So I started by downloading the DNS list of all domain
>>> Posted via http://www.ruby-....
>>>
>> Doesn't sound like much scraping, just searching text for a string.
>> You could even do a lot of that work with Google.
>> but just download the file and search for a string. create a data
>> file of your own that tells you what line you found the string.
>> Scraping is really for getting data from other sites, using the DOM
>> structure they have to get (for example) the weather report.
>
>
> Well, I disagree. Once I have all the websites with Cisco in its
> domain
> name and I look through them, there are lots of pages that won't
> show me
> info unless I do a search within that page itself. (ex. usedcisco.com)
> To search for specific items on this website I would have to use the
> search bar located within its page to search for say "WIC-1T" and then
> search for a price below a specific amount for that item.
> --
> Posted via http://www.ruby-....
>
What I mean is, scraping usually relies on the document's structure
in some way. Without looking at the structure a given site uses
(or a given page, if it isn't a templated, dynamically generated page)
there is no way to know what corresponds to what. Page structure is
pretty arbitrary. Presentation and structure don't necessarily
correspond well, or in a way you could guess.
Ironically, the better their web designers, the easier it will be.

But if you are talking about searching a dynamically generated site,
you still have to find out whether it has a search mechanism, and what
it calls the form fields and submit buttons. The names in the HTML can
be arbitrary, especially if they use graphic buttons.

If you have a long list of products to search for, you will still save
yourself some work, but scraping involves some visual inspection of
pages and page source to get things going. Be aware that their
sysadmin may spot you doing a big blast of searches all at once and
block you from the site. If they check their logs and see that
somebody is searching for all Cisco stuff in an automated fashion,
they might just block you anyway, whether or not they are legit
themselves. Many sysadmins don't like bots searching their
databases! They might see it as probing for exploits.
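
To see what a given site actually calls its form fields and buttons, a
quick dump like this helps (a sketch using Mechanize; the URL is a
placeholder):

require "rubygems"
require "mechanize"

# print every form on a page along with its field and button names,
# so you can see what the site calls them before writing any scrape code
agent = Mechanize.new
page  = agent.get("http://www.example.com/")   # placeholder URL

page.forms.each_with_index do |form, i|
  puts "form #{i}: name=#{form.name.inspect} action=#{form.action.inspect}"
  form.fields.each  { |field|  puts "  field:  #{field.name}" }
  form.buttons.each { |button| puts "  button: #{button.name}" }
end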

brabuhr

9/17/2007 7:52:00 PM


On 9/17/07, Charles Pareto <chuckdawit@gmail.com> wrote:
> I work for Cisco Systems in San Jose Ca. I proposed a project to perform
> a screen scrape/spider hack to go out and look for websites with the
> Cisco name in its domain name (ex. usedcisco.com, ciscoequipment.com,
> etc.) and see if those companies are selling Cisco equipment. I want to
> look for specific products (ex. WIC-1T, NM-4E, WS-2950-24) on these
> websites and see if they are being sold for under 60% of their MSRP. We
> are trying to track down companies that are selling counterfeit
> equipment. So I started by downloading the DNS list of all domain names
> so I could read through that and extract all domain names with Cisco in
> it. Once I do that I want to go to each page and search/scrape for these
> products, but I don't really know the best approach to take. Can anyone
> give me advice? Should I just do keyword searches for those 20+
> products? Or is there a better approach?

If someone knows of a super library that can recognize and interact
with arbitrary search forms, I would love to see it :-)

My first suggestion would be to write a simple script using Mechanize
to connect to the homepage of each site in an input list and check for
any forms. Bin the sites into three groups: no forms at all, at least
one form matching the regex /search/i, and at least one form but none
that match. Then start by focusing on the ones which appear to have
some sort of search form (which may be a small or a large subset :-).
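
A sketch of that binning pass (again with Mechanize; the input file
name is a placeholder, one homepage URL per line). The sleep is there
because of the earlier point about sysadmins blocking bursts of
automated requests:

require "rubygems"
require "mechanize"

agent = Mechanize.new
bins  = { :no_forms => [], :search_form => [], :other_forms => [] }

File.foreach("cisco_sites.txt") do |line|    # placeholder list of homepage URLs
  url = line.strip
  next if url.empty?
  begin
    forms = agent.get(url).forms
    if forms.empty?
      bins[:no_forms] << url
    elsif forms.any? { |f| "#{f.name} #{f.action}" =~ /search/i }
      bins[:search_form] << url
    else
      bins[:other_forms] << url
    end
  rescue StandardError => e
    warn "#{url}: #{e.class}: #{e.message}"
  end
  sleep 2   # be polite; don't hammer anyone's server
end

bins.each { |group, urls| puts "#{group}: #{urls.size} sites" }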

flazzarino

9/18/2007 1:38:00 AM


On Sep 17, 1:25 pm, Charles Pareto <chuckda...@gmail.com> wrote:
> I work for Cisco Systems in San Jose Ca. I proposed a project to perform
> a screen scrape/spider hack to go out and look for websites with the
> Cisco name in its domain name (ex. usedcisco.com, ciscoequipment.com,
> etc.) and see if those companies are selling Cisco equipment. I want to
> look for specific products (ex. WIC-1T, NM-4E, WS-2950-24) on these
> websites and see if they are being sold for under 60% of their MSRP. We
> are trying to track down companies that are selling counterfeit
> equipment. So I started by downloading the DNS list of all domain names
> so I could read through that and extract all domain names with Cisco in
> it. Once I do that I want to go to each page and search/scrape for these
> products, but I don't really know the best approach to take. Can anyone
> give me advice? Should I just do keyword searches for those 20+
> products? Or is there a better approach?
> --
> Posted viahttp://www.ruby-....

Hpricot (http://code.whytheluckystiff.ne...) is a great screen-scraping
library for Ruby.

Scraping might not be the best approach here, though, because each
site/page uses a different layout, so the same scrape recipe probably
won't work on another page.

You could instead scrape Froogle (Google Products?) or some other
aggregate consumer sales site: it will have one interface and probably
a lot of data. You might also want to see if there are web services
for Froogle; those are usually better than scraping.
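
For the pages that do list products statically, an Hpricot pass can be
as simple as this sketch; it assumes listings sit in table cells, which
will vary per site, and the file name is a placeholder for a saved page:

require "rubygems"
require "hpricot"

PRODUCTS = %w[WIC-1T NM-4E WS-2950-24]

# parse a saved page and dump any table cell that mentions a product code
doc = Hpricot(File.read("some_page.html"))

(doc/"td").each do |cell|
  text = cell.inner_text.strip
  puts text if PRODUCTS.any? { |code| text.include?(code) }
end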

Glenn Gillen

9/20/2007 9:34:00 AM


> I work for Cisco Systems in San Jose Ca. I proposed a project to
> perform
> a screen scrape/spider hack to go out and look for websites with the
> Cisco name in its domain name (ex. usedcisco.com, ciscoequipment.com,
> etc.) and see if those companies are selling Cisco equipment. I
> want to
> look for specific products (ex. WIC-1T, NM-4E, WS-2950-24) on these
> websites and see if they are being sold for under 60% of their
> MSRP. We
> are trying to track down companies that are selling counterfeit
> equipment. So I started by downloading the DNS list of all domain
> names
> so I could read through that and extract all domain names with
> Cisco in
> it. Once I do that I want to go to each page and search/scrape for
> these
> products, but I don't really know the best approach to take. Can
> anyone
> give me advice? Should I just do keyword searches for those 20+
> products? Or is there a better approach?

I'm slightly biased, but scrubyt should be able to do most of the
remaining heavy lifting for you:

http://sc...

Glenn

brabuhr

9/20/2007 6:54:00 PM


> > give me advice? Should I just do keyword searches for those 20+
> > products? Or is there a better approach?

On 9/20/07, Glenn Gillen <glenn.gillen@gmail.com> wrote:
> I'm slightly biased, but scrubyt should be able to do most of the
> remaining heavy lifting for you
>
> http://sc...

On that note:

require "rubygems"
require "scrubyt"

froogle_data = Scrubyt::Extractor.define do
  # navigation: load Froogle, type one product code into the search box, submit
  fetch "http://www.google.com/prod...
  fill_textfield "q", "WIC-1T"
  submit

  # example values copied from the results page; scrubyt learns the
  # extraction pattern from these
  info do
    product "WIC-1T"
    vendor "NEW2U Hardware from ..."
    price "$40.00"
  end
  next_page "Next", :limit => 10
end

puts froogle_data.to_xml

(tons of improvement needed, but):

<root>
  <info>
    <product>WIC-1T</product>
    <vendor>NEW2U Hardware from ...</vendor>
    <price>$40.00</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>ATS Computer Systems...</vendor>
    <price>$353.95</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eBay</vendor>
    <price>$49.95</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eBay</vendor>
    <price>$149.99</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>PCsForEveryone.com</vendor>
    <price>$337.07</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>COL - Computer Onlin...</vendor>
    <price>$149.00</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eCOST.com</vendor>
    <price>$297.14</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eBay</vendor>
    <price>$45.00</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>ATACOM</vendor>
    <price>$291.95</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>Express IT Options</vendor>
    <price>$216.44</price>
  </info>
</root>

Glenn Gillen

9/21/2007 3:46:00 PM



On 20/09/2007, at 7:54 PM, brabuhr@gmail.com wrote:

>>> give me advice? Should I just do keyword searches for those 20+
>>> products? Or is there a better approach?
>
> On 9/20/07, Glenn Gillen <glenn.gillen@gmail.com> wrote:
>> I'm slightly biased, but scrubyt should be able to do most of the
>> remaining heavy lifting for you
>>
>> http://sc...
>
> On that note:
>
> <snip>
> (tons of improvement needed, but):
> <snip>

It's by no means a silver bullet, but it could very well get you 80%
of the way there. Set up a basic learning extractor that is fairly
generic, looking for terms you know will exist on the domains you want
(say, a model number and a dollar sign?), have it loop over the URLs
with products on them, export the learner to a production extractor,
and then tweak it for the sites that aren't giving you the exact
results you want.

Or, make life easier if you can and let Froogle put it all into a
single format for you.
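
And once everything is in that one format, the 60%-of-MSRP check is the
easy part; a sketch against XML shaped like brabuhr's output above (the
MSRP figure and the file name are made-up placeholders):

# flag vendors whose price is below 60% of an assumed MSRP, reading XML
# of the shape shown earlier, saved to a placeholder file name
MSRP = { "WIC-1T" => 400.00 }   # placeholder, not Cisco's real list price

xml = File.read("froogle_output.xml")
xml.scan(%r{<vendor>(.*?)</vendor>\s*<price>\$([\d,.]+)</price>}m) do |vendor, price|
  if price.delete(",").to_f < MSRP["WIC-1T"] * 0.60
    puts "suspiciously cheap: #{vendor} at $#{price}"
  end
end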

Best of luck,

Glenn
