James Britt
8/19/2007 12:29:00 AM
Elliot Temple wrote:
> Elliot Temple wrote:
>> James Britt wrote:
>>
>>> No, first consider the people hosting the content you're snarfing.
>>>
>>> They're footing the bill for bandwidth and hosting.
>>>
>>>> ... FYI the images total
>>>> about 112 megs. There's 3691 of them.
>>>>
>>> And not a single "sleep" in the script. Nice.
>> Hi James,
>>
>> It's a good thing I posted. I will remember to put a sleep next time.
>> Thank you.
>
> Oh. How much sleep is best?
60*60*24 might work.
> One second per image would add an hour to
> the script run time.
Gosh! Imagine having to wait a *whole hour* to glom someone else's
content!
> I don't have a sense of how much is needed. 5
> seconds? .5 seconds? Is requests per time or volume of data per time
> more important to limit?
You're encouraging people to download 112 MB via 3691 requests from
someone else's Web site.
Right now, the only thing I see being limited is courtesy.
If you abuse a Web site you may have your IP address banned.
Sadly, most people running sites do not have the technical chops to
catch such behavior and cut people off before too much damage is done.
More likely, the target site will either go off-line for excessive
bandwidth, or the owner will get a surprise bill for overages.
There are often very good reasons to spider a site and grab content.
When needed, it must be done in a responsible way. Your example fails
that, both in motivation and technique.
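
On the question of how much to sleep, and whether to limit requests or data volume: one option is to limit both and wait for whichever is longer. Below is a minimal sketch in Python, not a recommendation of specific numbers. The names polite_delay and fetch_politely are hypothetical, and the one-second minimum gap and 50,000 bytes-per-second ceiling are assumed values chosen only for illustration. The fetch argument is whatever function actually retrieves a URL and returns its body as bytes. It is passed in so the pacing logic can be read and tested without touching anyone's server.

```python
import time
from typing import Callable, Iterable, List


def polite_delay(num_bytes: int,
                 min_interval: float = 1.0,
                 max_bytes_per_sec: float = 50_000.0) -> float:
    """Seconds to wait after receiving a response of num_bytes.

    Both defaults are assumptions for illustration: min_interval caps the
    request rate, max_bytes_per_sec caps the data rate. The longer of the
    two waits wins, so small files are paced by request count and large
    files by volume.
    """
    return max(min_interval, num_bytes / max_bytes_per_sec)


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], bytes],
                   sleep: Callable[[float], None] = time.sleep,
                   min_interval: float = 1.0,
                   max_bytes_per_sec: float = 50_000.0) -> List[bytes]:
    """Fetch each URL in order, pausing between consecutive requests."""
    bodies: List[bytes] = []
    delay = 0.0
    for index, url in enumerate(urls):
        if index > 0:
            # Wait based on the size of the previous response.
            sleep(delay)
        body = fetch(url)
        bodies.append(body)
        delay = polite_delay(len(body), min_interval, max_bytes_per_sec)
    return bodies
```

With these assumed limits, images averaging about 30 KB (112 megs over 3691 files) are paced by the one-second minimum, which gives the roughly one extra hour mentioned above. Only larger files would wait longer. If the site owner publishes limits or asks for a slower rate, those take precedence over any numbers picked here.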
--
James Britt
"Simplicity of the language is not what matters, but
simplicity of use."
- Richard A. O'Keefe in squeak-dev mailing list