While Nomad has the ability to restart individual allocations and tasks, it lacks the ability to restart entire jobs. A new `nomad job restart` command will fill this gap by adding the ability to restart all allocations for a job.
Functionally, `job restart` will be as if the user scripted a rolling `nomad alloc restart` for all allocs of a given job. Importantly, `job restart` will not reschedule any allocations. It can be thought of as an in-place update with nothing being updated.
The closest thing to `nomad job restart` today is to update a non-functional `meta` variable on a job and cause a deployment. The problem with this is that the `update` stanza may gate the deployment much more than is desired. If a user is merely trying to quickly fix a memory leak or break out of some other pathological state the allocations are in, a deployment may be much slower than is desired.
While the `update` stanza could be changed to speed up the deployment, the operator must remember to change the `update` stanza back when next deploying the job. There is no way to temporarily override `update` parameters for a single deployment. This may fit poorly with the operator's workflow and lead to problems in the high-pressure situations where `nomad job restart` may be used.
Another workaround for the lack of `nomad job restart` is to have a `template` stanza which restarts allocations when a key in Consul changes. If operators need to quickly restart all allocations for a job, they can change the Consul key to trigger a restart.
There are 2 problems with this approach:

- The `template.splay` parameter is difficult to set properly to ensure restarts happen quickly but without causing downtime.
- The `template.change_mode` parameter must be `restart` (or an unignorable signal) to get the desired behavior. This means operators must remember, ahead of any issues, to include such a template in every job if one does not already exist.
The most common workaround employed today is manually scripting `nomad alloc restart` commands. This is non-trivial to do correctly, especially in the face of transient API errors/timeouts or concurrent job updates.
Even if all error and concurrent-mutation conditions are handled properly, the restart is still controlled by a single script on a single machine (either an operator's machine or a random node running a web app that can perform restarts). Unless the script also stores the restart's state (e.g. which allocs have been restarted) in durable networked storage, no other users have visibility into the restart process or the ability to continue a restart that gets interrupted (whether manually or by the script crashing).
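For illustration, a rough sketch of such a script using the `github.com/hashicorp/nomad/api` Go client might look like the following; retry handling for transient API errors and detection of concurrent job updates, the hard parts described above, are omitted:

```go
// Rough sketch of the manual workaround: a rolling restart scripted against
// the Go API client. A real script must retry on transient errors and detect
// concurrent job updates; this one simply gives up.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	jobID := "example"
	allocs, _, err := client.Jobs().Allocations(jobID, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, stub := range allocs {
		if stub.ClientStatus != "running" {
			continue // skip terminal allocations
		}
		alloc, _, err := client.Allocations().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		// An empty task name restarts all tasks in the allocation.
		if err := client.Allocations().Restart(alloc, "", nil); err != nil {
			log.Fatal(err)
		}
		fmt.Println("restarted", stub.ID)
		time.Sleep(10 * time.Second) // crude stand-in for a batch wait
	}
}
```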
If all of these cases and features are implemented, they should be shared with all Nomad users, which is entirely the point of this effort!
`nomad job restart -batch-size n -batch-wait d <job id>`

- where `batch-size` is the number of allocs to restart at once and `batch-wait` is the duration to wait between restarts.
- Outputs the JobRestart ID.
- By default blocks and outputs progress of restarts.
- `-detach` can be added to exit after outputting the restart ID.

- `nomad job restart -attach <restart id>` attaches to a running restart.
- `nomad job restart -pause <restart id>` pauses a running restart.
- `nomad job restart -resume <restart id>` resumes a paused restart.
- `nomad job restart -cancel <restart id>` cancels a restart (terminal, cannot be resumed).
Similar to deployments, restarts will be a first class object:
```go
type JobRestart struct {
	ID              string        // uuid of restart
	JobID           string        // job being restarted
	BatchSize       int           // number of allocations to restart at once
	BatchWait       time.Duration // amount of time to wait between restarting each batch
	Status          string        // running, paused, cancelled, complete
	RestartedAllocs []string      // allocation IDs which have been restarted
	StartedAt       time.Time     // time restart was started
	UpdatedAt       time.Time     // time restart was last modified
	CreateIndex     uint64        // raft index when restart was started
	ModifyIndex     uint64        // raft index when restart was last modified
}
```
A `JobRestarter`, similar to the `DeploymentWatcher` and `Drainer`, will run on the leader server to coordinate restarts. It will be cancelled when a leader election occurs and started on the newly elected leader.
`RestartedAllocs` has restarted allocations appended after the restart has been signalled. This means a leader election (or leader crash) after a restart has been signalled but before the `JobRestart` object has been updated in raft will cause some allocations to be restarted more than once. If this proves problematic, more logic and/or state can be added in the future to further improve restart state handling across leader elections.
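A minimal sketch of the restart loop, using the `JobRestart` struct above; `RestartFunc` and `PersistFunc` are illustrative stand-ins for the real alloc-restart RPC and raft apply, and pause/cancel handling and error retries are omitted (requires the `context` and `time` packages):

```go
// Illustrative sketch only; not the actual JobRestarter implementation.
type RestartFunc func(allocID string) error
type PersistFunc func(*JobRestart) error

func runRestart(ctx context.Context, restart *JobRestart, allocIDs []string,
	restartAlloc RestartFunc, persist PersistFunc) error {

	for start := 0; start < len(allocIDs); start += restart.BatchSize {
		end := start + restart.BatchSize
		if end > len(allocIDs) {
			end = len(allocIDs)
		}
		batch := allocIDs[start:end]

		for _, id := range batch {
			if err := restartAlloc(id); err != nil {
				return err
			}
		}

		// Progress is recorded *after* the batch has been signalled, so a
		// leader election in between may restart some allocations twice.
		restart.RestartedAllocs = append(restart.RestartedAllocs, batch...)
		restart.UpdatedAt = time.Now()
		if err := persist(restart); err != nil {
			return err
		}

		select {
		case <-time.After(restart.BatchWait):
		case <-ctx.Done(): // leadership lost, or restart paused/cancelled
			return ctx.Err()
		}
	}

	restart.Status = "complete"
	return persist(restart)
}
```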
- `POST /v1/job/:job_id/restart` - Body should accept CLI parameters as JSON
- `GET /v1/job/:job_id/restarts` - list restarts for a job
- `DELETE /v1/job/:job_id/restart/:restart_id` - cancel a restart
- `POST /v1/job/:job_id/restart/:restart_id/pause` - pause a restart
- `POST /v1/job/:job_id/restart/:restart_id/resume` - resume a restart
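For example, starting a restart over the proposed HTTP API might look like the sketch below; the JSON field names mirror the CLI flags but are assumptions, not a finalized schema:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Field names are assumed to mirror the CLI flags; not a finalized schema.
	body := strings.NewReader(`{"BatchSize": 2, "BatchWait": "1m"}`)

	resp, err := http.Post("http://127.0.0.1:4646/v1/job/example/restart",
		"application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Expected to return the JobRestart ID used by the attach/pause/resume/
	// cancel endpoints above.
	io.Copy(os.Stdout, resp.Body)
}
```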
- `nomad job restart` will refuse to run if there is an ongoing deployment.
- Deployments will cancel any non-terminal restarts for their job.
- Restarts will not restart allocations whose `Alloc.CreateIndex > JobRestart.CreateIndex` (see the sketch below). This prevents restarts from taking an unbounded amount of time if allocations are also failing and being rescheduled. Allocations created after a restart has been created are treated as already restarted.
- `nomad job restart` will fail if there is an existing non-terminal restart.
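A sketch of that create-index guard (the function name is illustrative):

```go
// Allocations created after the restart itself are treated as already
// restarted and skipped.
func needsRestart(allocCreateIndex, restartCreateIndex uint64) bool {
	return allocCreateIndex <= restartCreateIndex
}
```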
To improve observability, restarts will not be immediately deleted upon completion. Instead they will be garbage collected like deployments. The amount of time a completed restart will remain until it is garbage collected will be configurable on server agents:
`server.gc_restart_threshold = 4h`
- As restarts are relatively small, infrequently used, and often used in scenarios where observability is critical, they have a higher default GC threshold than deployments.
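A sketch of the corresponding GC eligibility check, modeled on how deployments are garbage collected; the helper name and the choice of terminal statuses are assumptions:

```go
// A restart is eligible for GC once it is terminal and older than the
// proposed server.gc_restart_threshold setting.
func restartGCEligible(r *JobRestart, threshold time.Duration, now time.Time) bool {
	terminal := r.Status == "cancelled" || r.Status == "complete"
	return terminal && now.Sub(r.UpdatedAt) >= threshold
}
```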
Job restarts may want to gate restarts on the health of restarted allocations, similar to deployments. This feature could be added in the future without hurting backward compatibility of the proposed job restart functionality.
The vast majority of the complexity of implementing `nomad job restart` as described above is its stateful nature: there is a restart object and users can interact with it.

A much simpler MVP would be to simply embed the "script over `nomad alloc restart`" mentioned as a workaround above in the Nomad CLI or HTTP API. If the controlling process (the CLI, or the agent running the HTTP API handler) crashes, the restart and its progress are simply gone. There would also be no way for the Nomad UI to represent restarts or for more than one user to interact with a given restart.
- Issue #698