Alex Young
12/13/2007 9:32:00 PM
ara.t.howard wrote:
>
> On Dec 13, 2007, at 10:36 AM, Alex Young wrote:
>
>> My word. I think you've just saved me a ton of work. Yet again.
>>
>> A quick question, though. How difficult is it to set up parallel job
>> queues, so that a cluster node can pick up jobs from one queue,
>> process them, and submit them to the next in a chain? Take a search
>> engine's spider as an example - from 20,000 feet you've got a job that
>> fetches a page, a job to parse the contents, followed by a third to
>> index the parsed structure. Chances are that you want different types
>> of cluster node to work on each type of job, and there's different
>> data that you might want to attach at each stage. Is that easy to set
>> up?
>
> not exactly but this would be quite close:
>
> 1) i'd forget about having specialized nodes unless you have a very good
> reason - the death of one node will halt the entire processing chain
> otherwise.
I should have been a little clearer - I'm not thinking of one node per
task, but one *class* of nodes per task. I might want 21 processes
across 3 machines (for example) all working on the first stage in the chain.
> it's nice if nodes are dumb from the perspective of
> robustness. that said i'll add a feature where you can say
>
> Bj.submit 'job.exe', :runner => 'some.hostname'
>
> to specify which host to run on. this'll be two lines of code so i
> don't mind adding it.
I can see how that'd be a handy thing to have anyway :-)
> 2) bj supports priorities so here is what i would do. say you've got a
> three stage job: a, b, c and 1000 initial 'a' tasks. furthermore let's
> say you make a ./scripts/ directory in your rails_root (bj runs all jobs
> from the rails_root). so you'll have something like
>
> ./scripts/task_a
> ./scripts/task_b
> ./scripts/task_c
>
> then you'd do something like this in your rails app
(I'm not using Rails for the app I'm thinking of using this in, but
that's not important)
> jobs = inputs.map{|input| "./scripts/task_a #{ input }" }
> Bj.submit jobs, :priority => 10
>
> now task_a is going to do this
>
> #! /usr/bin/env ruby
> input = ARGV.shift
> output = process_for_task_a input
> system "./script/bj submit ./scripts/task_b #{ output } --priority=20"
>
> (of course, if your processing needs to be run through ./script/runner
> you'll just be able to use the api directly instead of the cli... i'll
> be adding a feature shortly to allow for running ruby code through
> script runner directly)
>
> task_b, for its part, runs and submits task_c at priority=30.
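(That promotion step can be sketched in plain Ruby - the stage table and
helper below are illustrative assumptions, not part of Bj's API; each
stage just maps to the next script plus a higher priority.)

```ruby
# Hypothetical stage table: a finished stage resubmits its output as
# the next script at a higher priority (names/numbers are examples).
NEXT_STAGE = {
  'task_a' => ['task_b', 20],
  'task_b' => ['task_c', 30],
}

# Build the bj CLI line a finished stage would shell out to, or nil
# for the last stage in the chain.
def next_submit(stage, output)
  nxt, priority = NEXT_STAGE[stage] || return
  "./script/bj submit ./scripts/#{ nxt } #{ output } --priority=#{ priority }"
end
```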
>
> so think about that for a minute and imagine you have three processing
> nodes - each will consume a task_a, run it, and then submit a
> priority=20 job. therefore each node will probably then get one of
> those higher priority jobs, run that, and then find the priority=30
> task_c job in the queue. when those are done there will be nothing left
> except priority=10 task_a jobs and another batch will start.
>
>
> so this will give you parallel processing of a host of tasks.
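(That drain-the-chain behaviour is easy to check with a toy model - plain
Ruby, no Bj involved, a single node for simplicity: a loop that always
pops the highest-priority job and submits the next stage, the way the
scripts above do.)

```ruby
# Single-node toy model of the priority queue: 'a' jobs submit a 'b'
# at priority 20, 'b' jobs submit a 'c' at priority 30.
queue = [['a1', 10], ['a2', 10], ['a3', 10]]
order = []

until queue.empty?
  job = queue.max_by { |_, priority| priority }
  queue.delete(job)
  name, = job
  order << name
  case name[0]
  when 'a' then queue << [name.sub('a', 'b'), 20]
  when 'b' then queue << [name.sub('b', 'c'), 30]
  end
end
# each a->b->c chain finishes before the next 'a' job is picked up
```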
>
> make sense?
It does, but it's not *quite* what I'm after. I've got a few other
requirements that this strains against - the most pertinent being that
I'd like to be able to use priority independently within each task
queue. I've got a bit of spare time coming up in the next couple of
weeks (Holiday! What a concept! :-), so I'll try hacking something
together based on your code.
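One way I might approximate independent per-queue priorities on top of a
single global priority column is to reserve a band per queue and order
within the band - entirely a sketch, the bands, width, and helper below
are made up:

```ruby
# Hypothetical priority bands: one band per task queue, with room for
# local priorities 0..99 inside each. Later stages sit in higher bands
# so the chain still drains before a new batch starts.
BANDS = { :task_a => 100, :task_b => 200, :task_c => 300 }
WIDTH = 100

def banded_priority(queue, local)
  unless (0...WIDTH).include?(local)
    raise ArgumentError, "local priority must be 0...#{ WIDTH }"
  end
  BANDS[queue] + local
end
```

Within a band the local priority orders jobs against each other, while
any :task_c job still outranks every :task_b job.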
--
Alex