Today @jdhuntington suggested writing a queuing system. This got me to thinking: What would I want in an ideal work queue? Here are some of my thoughts. I am eager to hear yours.

Key

Life without these has been difficult.

Reliable requeuing of failed jobs.

If a job craps out, don't just drop it on the floor. Keep re-trying until it succeeds. Bonus points for a configurable retry limit. Über bonus points for sending a notification when a job fails.
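Here is a minimal sketch of what I mean, assuming an in-memory queue in a single process. `RetryQueue`, `max_attempts`, `notify`, and `dead` are names I made up for illustration; they don't come from any existing library.

```python
from collections import deque


class RetryQueue:
    """Toy in-memory queue that requeues failed jobs up to a limit."""

    def __init__(self, max_attempts=3, notify=None):
        self.max_attempts = max_attempts  # configurable retry limit
        self.notify = notify              # optional callback(job, error, attempts)
        self.pending = deque()            # (job, attempts_so_far)
        self.dead = []                    # jobs that exhausted their retries

    def enqueue(self, job):
        self.pending.append((job, 0))

    def work_one(self):
        """Run the next job; return True on success, False on failure."""
        job, attempts = self.pending.popleft()
        try:
            job()
            return True
        except Exception as error:
            attempts += 1
            if self.notify is not None:
                self.notify(job, error, attempts)
            if attempts < self.max_attempts:
                # Put the job back at the end of the queue instead of dropping it.
                self.pending.append((job, attempts))
            else:
                self.dead.append((job, error))
            return False
```

A real system would persist `pending` and `dead` somewhere durable; the point here is only that a failure leads to a requeue or a recorded give-up, never a silent drop.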

Easily auditable and monitorable

What output did that particular job generate? How about that one that is still running? Let me see the backtrace for that failed job. How many attempts were made before it succeeded? How long did each attempt take? In aggregate, how many of job Foo have I run? How many of those failed? How many of type Foo are in the queue right now?

How many of my workers are busy right now? What are the longest running jobs? How long is each queue? How long have jobs been waiting in the queue?

I should be able to answer these questions in a web interface or programmatically.
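To make the programmatic side concrete, here is a rough sketch that assumes the queue records one row per attempt. The `Attempt` record and the `summarize` function are hypothetical names of mine, not a real API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Attempt:
    """One attempt at running a job (hypothetical record layout)."""
    job_type: str
    job_id: int
    succeeded: bool
    seconds: float


def summarize(attempts):
    """Per job type: how many attempts were made and how many failed."""
    runs = Counter(a.job_type for a in attempts)
    failures = Counter(a.job_type for a in attempts if not a.succeeded)
    return {t: {"attempts": runs[t], "failed": failures[t]} for t in runs}
```

With rows like these, questions such as "how long did each attempt take?" or "how many attempts before it succeeded?" become simple filters over the same data.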

Timeout for long-running jobs.

Sometimes code gets stuck. The ability to override timeouts is important, since different tasks will have different definitions of "long." Perhaps one global setting with a per-job override. Alternatively, different queues could have different timeouts, but that's less flexible.
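A sketch of the global-setting-plus-override idea, assuming each job runs as a child process. `DEFAULT_TIMEOUT` and `run_job` are names I invented; `subprocess.run` with `timeout` is standard library and kills the child when the limit is reached.

```python
import subprocess
import sys

DEFAULT_TIMEOUT = 60.0  # assumed global setting, in seconds


def run_job(argv, timeout=None):
    """Run a command and return 'ok', 'failed', or 'timed_out'."""
    # The per-job override wins; otherwise fall back to the global default.
    limit = DEFAULT_TIMEOUT if timeout is None else timeout
    try:
        result = subprocess.run(argv, timeout=limit)
    except subprocess.TimeoutExpired:
        return "timed_out"
    return "ok" if result.returncode == 0 else "failed"
```

For example, `run_job([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.2)` gives up on the sleeping child well before the five seconds are over.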

Reliable

If I lose a server or two, my cluster should keep going. Any jobs running on those servers should count as failed and be requeued like any other failed job.
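One way this could work, assuming workers send periodic heartbeats and the queue knows which worker holds which job. The function name, the `max_silence` threshold, and both dictionaries are assumptions made for this sketch.

```python
def jobs_to_requeue(heartbeats, running, now, max_silence=30.0):
    """Return ids of jobs held by workers that have gone quiet.

    heartbeats: worker id -> time of last check-in (seconds)
    running:    job id -> worker id currently holding the job
    """
    dead = {w for w, seen in heartbeats.items() if now - seen > max_silence}
    return sorted(job for job, worker in running.items() if worker in dead)
```

Whatever this returns would then go through the same failure path as any other failed job.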

Easy job killing via web interface

'Nuff said.

Expandable

I need the ability to add capacity to a busy cluster without incurring downtime.

Nice to have

These vary quite a bit in importance.

Deferred jobs

Wait 10 minutes/hours/days before running this job. If you are busy or shut down when the time arrives, run the job the first chance you get.
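A small sketch of that behavior, assuming jobs are kept in a heap ordered by their due time. `DeferredQueue`, `enqueue_in`, and `pop_due` are illustrative names; the caller passes in the current time so the example stays deterministic.

```python
import heapq
import itertools


class DeferredQueue:
    """Jobs become runnable once their due time has passed."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def enqueue_in(self, delay_seconds, job, now):
        due_at = now + delay_seconds
        heapq.heappush(self._heap, (due_at, next(self._counter), job))

    def pop_due(self, now):
        """Return every job whose due time is at or before now, oldest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

Because `pop_due` compares against "now" rather than waiting for an exact moment, a job that came due while the system was down simply shows up on the next poll.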

Priority queues

Let me provide a predicate which will determine where a job goes in the queue. Perhaps I am most interested in the most recent commit to my master branch, so I want to see those builds cut in line ahead of previous builds of master. Perhaps jobs which have failed should defer to jobs with no (or fewer) failures.
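As a sketch, the caller-supplied logic could be a key function rather than a strict predicate: the queue pops whichever job has the smallest key. `PriorityQueue`, `fewest_failures_then_newest`, and the `failures` and `committed_at` fields are all assumptions for this example.

```python
import heapq
import itertools


class PriorityQueue:
    """Queue ordered by a caller-supplied key function (smallest key first)."""

    def __init__(self, key):
        self._key = key
        self._heap = []
        self._counter = itertools.count()  # breaks ties in insertion order

    def push(self, job):
        heapq.heappush(self._heap, (self._key(job), next(self._counter), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]


def fewest_failures_then_newest(job):
    # Jobs with fewer failures go first; among those, the newest commit wins.
    return (job["failures"], -job["committed_at"])
```

Swapping in a different key function changes the ordering policy without touching the queue itself.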

Graceful termination

I want to reduce the size of my cluster without downtime or interrupting any jobs. Let me stop accepting new jobs on a particular server while allowing existing jobs to complete.
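Roughly, I picture each worker having a "draining" switch. This is a sketch under that assumption; `Worker`, `drain`, and `safe_to_stop` are hypothetical names.

```python
class Worker:
    """Worker that can stop taking new jobs while finishing current ones."""

    def __init__(self):
        self.draining = False
        self.active = set()  # ids of jobs currently running here

    def accept(self, job_id):
        """Take a job unless this worker is draining."""
        if self.draining:
            return False
        self.active.add(job_id)
        return True

    def complete(self, job_id):
        self.active.discard(job_id)

    def drain(self):
        self.draining = True

    def safe_to_stop(self):
        # Only safe once we are draining and nothing is still running.
        return self.draining and not self.active
```

An operator would call `drain()`, wait for `safe_to_stop()` to become true, and then shut the server down.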

Failure logic

I'd love the ability to write my own logic to decide how to act in the event of failure. Do I need to send a notification? Should the job go into a lower-priority queue? Should I wait before enqueuing it again?
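One shape this could take, assuming the queue calls a user-supplied function after each failure and acts on whatever it returns. `FailureDecision`, `on_failure`, the `"low"` queue name, and the specific policy are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureDecision:
    """What the queue should do with a failed job (hypothetical shape)."""
    requeue: bool
    queue: str = "default"
    delay_seconds: float = 0.0
    notify: bool = False


def on_failure(job_name, error, attempts):
    """Example policy: answers the three questions above for one failure."""
    if isinstance(error, ValueError):
        # Bad input will not fix itself, so give up and tell someone.
        return FailureDecision(requeue=False, notify=True)
    if attempts >= 5:
        return FailureDecision(requeue=False, notify=True)
    # Otherwise demote the job and wait longer after each failure.
    return FailureDecision(requeue=True, queue="low",
                           delay_seconds=2.0 ** attempts)
```

The queue stays dumb and the policy lives in one small function that I can change per job type.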

Locking

Disallow more than one job of a particular type from running at the same time. Disallow more than one of them in the queue at a time.
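A sketch of both rules for a single process, assuming jobs are identified only by their type. `UniqueJobs` and its methods are illustrative names; a real cluster would need a shared lock rather than in-memory sets.

```python
class UniqueJobs:
    """At most one queued and one running job per job type."""

    def __init__(self):
        self.queued = []      # job types waiting, in arrival order
        self.running = set()  # job types currently executing

    def enqueue(self, job_type):
        """Refuse a second queued job of the same type."""
        if job_type in self.queued:
            return False
        self.queued.append(job_type)
        return True

    def start_next(self):
        """Start the first queued job whose type is not already running."""
        for i, job_type in enumerate(self.queued):
            if job_type not in self.running:
                del self.queued[i]
                self.running.add(job_type)
                return job_type
        return None

    def finish(self, job_type):
        self.running.discard(job_type)
```

Under this reading, one job of a type may run while another of the same type waits, but never two in either state.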

mkb/gist:4760712
