Today @jdhuntington suggested writing a queuing system. This got me to thinking: What would I want in an ideal work queue? Here are some of my thoughts. I am eager to hear yours.
Life without these has been difficult.
If a job craps out, don't just drop it on the floor. Keep re-trying until it succeeds. Bonus points for a configurable retry limit. Über bonus points for sending a notification when a job fails.
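Here is a minimal sketch of the retry behavior I have in mind. The function name `run_with_retries`, the `max_attempts` parameter, and the `notify` callback are all made up for illustration; a real queue would persist the attempt count rather than loop in memory.

```python
from typing import Callable, Optional


def run_with_retries(
    job: Callable[[], object],
    max_attempts: int = 3,  # hypothetical configurable retry limit
    notify: Optional[Callable[[Exception, int], None]] = None,
) -> object:
    """Run `job` until it succeeds or `max_attempts` is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[Exception] = None
    for _attempt in range(1, max_attempts + 1):
        try:
            return job()  # success: stop retrying
        except Exception as exc:  # any failure counts as a failed attempt
            last_error = exc

    # Every attempt failed: send the notification, then surface the error.
    if notify is not None:
        notify(last_error, max_attempts)
    raise last_error
```

The job is only reported as failed once the limit is exhausted, so a job that succeeds on its third try never triggers a notification.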
What output did that particular job generate? How about that one that is still running? Let me see the backtrace for that failed job. How many attempts were made before it succeeded? How long did each attempt take? In aggregate, how many of job Foo have I run? How many of those failed? How many of type Foo are in the queue right now?
How many of my workers are busy right now? What are the longest running jobs? How long is each queue? How long have jobs been waiting in the queue?
I should be able to answer these questions in a web interface or programmatically.
Sometimes code gets stuck. The ability to override the timeouts is important since different tasks will have different definitions of "long." Perhaps one global setting with a per-job override. Alternatively, different queues could have different timeouts, but that's less flexible.
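A rough sketch of the "one global setting with a per-job override" idea, assuming each job is an external command. The names `GLOBAL_TIMEOUT_SECONDS`, `effective_timeout`, and `run_job` are hypothetical; only `subprocess.run` and its `timeout` argument come from the Python standard library.

```python
import subprocess
import sys
from typing import Optional, Sequence

GLOBAL_TIMEOUT_SECONDS = 300.0  # hypothetical cluster-wide default


def effective_timeout(job_timeout: Optional[float]) -> float:
    """A per-job timeout wins; otherwise fall back to the global setting."""
    return GLOBAL_TIMEOUT_SECONDS if job_timeout is None else job_timeout


def run_job(command: Sequence[str], job_timeout: Optional[float] = None) -> str:
    """Run a command, killing it if it exceeds its effective timeout."""
    try:
        subprocess.run(command, timeout=effective_timeout(job_timeout), check=True)
    except subprocess.TimeoutExpired:
        return "timed_out"  # the child process was killed by subprocess.run
    except subprocess.CalledProcessError:
        return "failed"  # the command exited with a non-zero status
    return "succeeded"


if __name__ == "__main__":
    # A job that sleeps for 5 seconds but is only allowed 0.2 seconds.
    sleepy = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_job(sleepy, job_timeout=0.2))
```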
If I lose a server or two, my cluster should keep going. Any jobs running on those servers should count as failed and be requeued like any other failed job.
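One way this could work, purely as an assumption on my part, is with heartbeats: each server checks in periodically, and any job running on a server that has been silent for too long is treated as failed. The function `collect_lost_jobs` and its arguments are invented for this sketch.

```python
from typing import Dict, List


def collect_lost_jobs(
    running: Dict[str, str],            # job id -> server it is running on
    last_heartbeat: Dict[str, float],   # server -> time of last check-in
    now: float,
    max_silence: float,                 # hypothetical allowed gap, in seconds
) -> List[str]:
    """Remove and return jobs whose server has stopped sending heartbeats."""
    lost = [
        job_id
        for job_id, server in running.items()
        # A server that never checked in is treated as silent forever.
        if now - last_heartbeat.get(server, float("-inf")) > max_silence
    ]
    for job_id in lost:
        del running[job_id]
    return lost  # the caller requeues these like any other failed job
```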
'Nuff said.
I need the ability to add capacity to a busy cluster without incurring downtime.
These vary quite a bit in importance.
Wait 10 minutes/hours/days before running this job. If you are busy or shut down when the time arrives, run the job the first chance you get.
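A small sketch of how I picture delayed jobs: store the earliest time a job may run, and release everything that is due whenever a worker next asks. Because the check is "due time has passed" rather than "due time is now," a job whose time arrived during a busy spell or an outage still runs at the first opportunity. The class `DelayedQueue` and its methods are hypothetical, and a real implementation would keep this on disk, not in memory.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class DelayedQueue:
    """Holds jobs until their earliest run time has passed."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def enqueue_in(self, job: Any, delay_seconds: float, now: float) -> None:
        """Schedule `job` to become runnable `delay_seconds` after `now`."""
        heapq.heappush(self._heap, (now + delay_seconds, next(self._counter), job))

    def pop_due(self, now: float) -> List[Any]:
        """Return every job whose run time is at or before `now`, oldest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```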
Let me provide a predicate which will determine where a job goes in the queue. Perhaps I am most interested in the most recent commit to my master branch, so I want to see those builds cut in line ahead of previous builds of master. Perhaps jobs which have failed should defer to jobs with no (or fewer) failures.
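To make that concrete, here is a sketch that expresses the ordering as a user-supplied sort key instead of a true predicate. The `Job` fields and the order in which the criteria are applied (fewest failures first, then master ahead of other branches, then newest commit first) are my own assumptions for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    branch: str
    commit_time: int  # larger means a more recent commit
    failures: int = 0


def priority(job: Job) -> Tuple[int, bool, int]:
    """Smaller tuples run first."""
    return (
        job.failures,            # jobs with fewer failures go first
        job.branch != "master",  # False sorts before True, so master goes first
        -job.commit_time,        # newer commits cut ahead of older ones
    )


def order_queue(jobs: List[Job]) -> List[Job]:
    """Return the jobs in the order the queue should run them."""
    return sorted(jobs, key=priority)
```

Swapping in a different `priority` function changes the policy without touching the queue itself, which is the flexibility I am after.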
I want to reduce the size of my cluster without downtime or interrupting any jobs. Let me stop accepting new jobs on a particular server while allowing existing jobs to complete.
I'd love the ability to write my own logic to decide how to act in the event of failure. Do I need to send a notification? Should the job go into a lower-priority queue? Should I wait before enqueuing it again?
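Something like the following is what I imagine: the queue calls my handler on failure and acts on whatever decision it returns. `FailureDecision`, its fields, and the example policy `back_off_then_give_up` are all hypothetical names for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureDecision:
    requeue: bool = True         # should the job be enqueued again?
    queue: str = "default"       # which queue it should go into
    delay_seconds: float = 0.0   # how long to wait before enqueuing it
    notify: bool = False         # should a notification be sent?


# A handler receives the job name, the error, and the attempts made so far.
FailureHandler = Callable[[str, Exception, int], FailureDecision]


def back_off_then_give_up(job_name: str, error: Exception, attempts: int) -> FailureDecision:
    """Example policy: exponential back-off in a low-priority queue, then give up."""
    if attempts >= 5:
        return FailureDecision(requeue=False, notify=True)
    return FailureDecision(queue="low_priority", delay_seconds=2.0 ** attempts)
```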
Disallow more than one job of a particular type from running at the same time. Disallow more than one of them in the queue at a time.
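A sketch of both constraints together, kept in memory for simplicity. `UniqueQueue` and its methods are invented for illustration; a real queue would need to enforce this atomically across workers.

```python
from collections import deque
from typing import Any, Deque, Optional, Set, Tuple


class UniqueQueue:
    """At most one queued and one running job per job type."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[str, Any]] = deque()
        self._queued_types: Set[str] = set()
        self._running_types: Set[str] = set()

    def enqueue(self, job_type: str, payload: Any) -> bool:
        """Reject the job if one of the same type is already waiting."""
        if job_type in self._queued_types:
            return False
        self._pending.append((job_type, payload))
        self._queued_types.add(job_type)
        return True

    def start_next(self) -> Optional[Tuple[str, Any]]:
        """Start the oldest job whose type is not already running."""
        for index, (job_type, payload) in enumerate(self._pending):
            if job_type not in self._running_types:
                del self._pending[index]
                self._queued_types.discard(job_type)
                self._running_types.add(job_type)
                return job_type, payload
        return None  # nothing is eligible to run right now

    def finish(self, job_type: str) -> None:
        """Mark the running job of this type as done."""
        self._running_types.discard(job_type)
```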