brabuhr
9/17/2007 7:52:00 PM
On 9/17/07, Charles Pareto <chuckdawit@gmail.com> wrote:
> I work for Cisco Systems in San Jose, CA. I proposed a project to perform
> a screen scrape/spider hack to go out and look for websites with the
> Cisco name in their domain names (ex. usedcisco.com, ciscoequipment.com,
> etc.) and see if those companies are selling Cisco equipment. I want to
> look for specific products (ex. WIC-1T, NM-4E, WS-2950-24) on these
> websites and see if they are being sold for under 60% of their MSRP. We
> are trying to track down companies that are selling counterfeit
> equipment. So I started by downloading the DNS list of all domain names
> so I could read through that and extract all domain names with Cisco in
> them. Once I've done that, I want to go to each page and search/scrape for these
> products, but I don't really know the best approach to take. Can anyone
> give me advice? Should I just do keyword searches for those 20+
> products? Or is there a better approach?
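For the first step (pulling the Cisco-containing names out of the DNS dump), a plain line filter should be enough. A minimal sketch, assuming the dump is a text file with one domain per line (the filename and layout are my assumptions about your data):

```ruby
# Hypothetical sketch: select the Cisco-containing names from a zone
# dump, assumed to be one domain per line.
def cisco_domains(lines)
  lines.map(&:strip).reject(&:empty?).select { |d| d =~ /cisco/i }
end

# e.g.: domains = cisco_domains(File.readlines("domains.txt"))
```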
If someone knows of a super library that can recognize and interact
with arbitrary search forms, I would love to see it :-)
My first suggestion would be to write a simple script using Mechanize
to connect to the homepage of each site in an input list and check for
any forms. Bin the sites into three groups: no forms, at least one
form matching the regex /search/i, and forms but none matching. Then start
by just focusing on the ones which appear to have some sort of search
form (which may be a small or a large subset :-).
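The binning step could be sketched as a small classifier. Checking the form's name and action against /search/i is my assumption about what "looks like a search form"; with Mechanize you would feed it page.forms:

```ruby
# Sketch of the three-way binning; works on anything that responds to
# #name and #action (Mechanize's form objects do).
FormStub = Struct.new(:name, :action)  # stand-in for Mechanize forms

def classify_forms(forms)
  return :no_forms if forms.empty?
  if forms.any? { |f| "#{f.name} #{f.action}" =~ /search/i }
    :search_form
  else
    :other_forms
  end
end

# With Mechanize (circa 2007: require 'rubygems'; require 'mechanize'),
# the fetch-and-bin loop would look roughly like:
#   agent = WWW::Mechanize.new
#   bins  = Hash.new { |h, k| h[k] = [] }
#   sites.each do |site|
#     page = agent.get("http://#{site}/") rescue next
#     bins[classify_forms(page.forms)] << site
#   end
```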