
comp.lang.ruby

Thread and HTTP troubles

Keegan Dunn

12/13/2004 6:53:00 PM

I'm trying to write a threaded program that will run through a list of
web sites and download/process a set number of them at a time
(maintaining a pool of threads that handle the page downloads and
processing). I have something simple working, but I'm unsure how to
approach the "pool" of threads idea. Is that even the right way to
process multiple pages simultaneously? Is there a better way?

Also, how can I deal with a "socket read timeout" error? I have the
HTTP get call wrapped in a begin...rescue...end block, but it doesn't
seem to be catching it. Here is the code in question:

def getHTTP(site)
  siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
  begin
    masterSite = Net::HTTP.new(siteHost, 80)
    siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
    resp, data = masterSite.get2(siteURL, nil)
    return data
  rescue
    return "-999"
  end
end


Sorry about the two for one question :-P

Thanks!


5 Answers

Robert Klemme

12/13/2004 7:56:00 PM



"Keegan Dunn" <theweeg@gmail.com> wrote in message
news:65e6c89204121310527b234a7b@mail.gmail.com...
> I'm trying to write a threaded program that will run through a list of
> web sites and download/process a set number of them at a
> time(maintaining a pool of threads that can process page
> downloads/processing). I have something simple working, but I am
> unsure how to approach the "pool" of threads idea. Is that even the
> way to go about processing multiple pages simultaneously? Is there a
> better way?

A thread pool is probably the most efficient approach. You need these
ingredients:

- a thread safe queue
- a pool of worker threads
- a main thread that distributes the work

You also likely want a class or method that deals with the details of
fetching the data and analysing/storing it, to keep the thread body
blocks small.

# untested but you'll get the picture
require 'thread'

THREADS = 10
TERM = Object.new
queue = Queue.new
threads = []

THREADS.times do
  threads << Thread.new( queue ) do |q|
    until ( TERM == ( url = q.deq ) )
      begin
        # get data from url
      rescue
        # in case of timeout try again by putting
        # it back
      end
    end
  end
end

# now read urls and distribute work
while ( line = gets )
  line.chomp!
  queue.enq line
end

# write terminators
THREADS.times { queue.enq TERM }

# ... and wait for threads to terminate properly
threads.each {|t| t.join}

# exiting
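If it helps, here is a fleshed out version of the worker loop with a
stub fetch method standing in for the real Net::HTTP call (still
untested territory; fetch, the sample URLs and the results queue are
all made up for illustration):

```ruby
require 'thread'
require 'timeout'

THREADS = 4
TERM = Object.new

# stub standing in for the real Net::HTTP fetch
def fetch(url)
  "<html>#{url}</html>"
end

queue   = Queue.new
results = Queue.new   # Queue is thread safe, so it works as a collector too

threads = Array.new(THREADS) do
  Thread.new(queue) do |q|
    until TERM == (url = q.deq)
      begin
        results.enq(fetch(url))
      rescue Timeout::Error
        q.enq(url)   # put it back so it gets retried
      end
    end
  end
end

%w{http://a.example/ http://b.example/}.each { |u| queue.enq(u) }
THREADS.times { queue.enq(TERM) }
threads.each { |t| t.join }
```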

> Also, how can I deal with a "socket read timeout" error? I have the
> http get call wrapped in a begin...rescue...end block, but it doesn't
> seem to be catching it. Here is the code in question:
>
> def getHTTP(site)
> siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
> begin
> masterSite = Net::HTTP.new(siteHost,80)
> siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
> resp, data = masterSite.get2(siteURL, nil)
> return data
> rescue
> return "-999"
> end
> end

You'll likely need to catch another exception. Try "rescue Exception => e"
and then print e's class.
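For example (untested; the raise here is just a stand-in for whatever
Net::HTTP really throws, so you can see the diagnostic pattern):

```ruby
require 'timeout'

def fetch(site)
  # stand-in for the real Net::HTTP call
  raise Timeout::Error, "execution expired"
rescue Exception => e
  puts e.class   # prints the actual class, here Timeout::Error
  "-999"
end

fetch("http://example.com/")
```

Once you know the class, narrow the rescue to just that exception.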

> Sorry about the two for one question :-P

You get one answer for free. :-)

Kind regards

robert

Leslie Hensley

12/13/2004 8:27:00 PM


You'll also want to require 'resolv-replace'. Otherwise all of your
threads will block whenever any thread does a name lookup. Hopefully
this won't be needed once Rite gets here...

Leslie Hensley
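Something like this at the top of the script (a sketch; resolv-replace
patches the socket classes to use the pure Ruby resolver, so the lookup
no longer holds the interpreter lock):

```ruby
require 'resolv-replace'   # must come before the threads start doing lookups
require 'net/http'
```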


Keegan Dunn

12/13/2004 9:04:00 PM


I noticed the threads were doing that. I meant to ask about that as
well. Thank you for the help, Leslie and Robert.



Jim Weirich

12/13/2004 10:21:00 PM



Robert Klemme said:


> You'll likely need to catch another exception. Try "rescue Exception =>
> e" and then print e's class.

The error in question is Timeout::Error which inherits from Interrupt
which in turn inherits from SignalException. Since a plain vanilla rescue
clause will only rescue exceptions deriving from StandardError (and
SignalException is not derived from StandardError), it won't pick up this
exception.

If you use

begin
  # stuff
rescue Timeout::Error => ex
  # handle timeout
end

you should be ok.
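So a version of the original method with an explicit rescue might look
like this (an untested sketch; I've also let URI do the host/path
splitting instead of the gsub calls):

```ruby
require 'net/http'
require 'uri'
require 'timeout'

def getHTTP(site)
  uri = URI.parse(site)
  http = Net::HTTP.new(uri.host, uri.port)
  resp = http.get(uri.path.empty? ? "/" : uri.path)
  resp.body
rescue Timeout::Error, SocketError, SystemCallError
  "-999"   # same sentinel the original used
end
```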

--
-- Jim Weirich jim@weirichhouse.org http://onest...
-----------------------------------------------------------------
"Beware of bugs in the above code; I have only proved it correct,
not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)



Keegan Dunn

12/13/2004 11:03:00 PM


Thank you for the elaboration.

