nomad job restart RFC

While Nomad has the ability to restart individual allocations and tasks, it lacks the ability to restart entire jobs.

A new nomad job restart command will fill this gap by adding the ability to restart all allocations for a job.

Behavior

Functionally, job restart will behave as if the user scripted a rolling nomad alloc restart for all allocs of a given job. Importantly, job restart will not reschedule any allocations. It can be thought of as an in-place update in which nothing is actually updated.

Workarounds

Deployments

The closest thing to nomad job restart today is to update a non-functional meta variable on a job and cause a deployment.

The problem with this approach is that the update stanza may gate the deployment far more than is desired. If a user is merely trying to quickly recover from a memory leak or break the allocations out of some other pathological state, a full deployment may be much slower than the situation calls for.

While the update stanza could be changed to speed up the deployment, the operator must remember to change it back before the next deployment of the job. There is no way to temporarily override update parameters for a single deployment. This may fit poorly with the operator's workflow and lead to problems in the high-pressure situations in which nomad job restart is most likely to be used.

Template Restarts

Another workaround for the lack of nomad job restart is to have a template stanza which restarts allocations when a key in Consul changes. If operators need to quickly restart all allocations for a job, they can change the Consul key to trigger a restart.

There are 2 problems with this approach:

  • The template.splay parameter is difficult to set properly to ensure restarts happen quickly without causing downtime.
  • The template.change_mode parameter must be restart (or an unignorable signal) to get the desired behavior. This means operators must remember ahead of time to include such a template in every job if one does not already exist.

Scripting alloc restarts

The most common workaround employed today is manually scripting nomad alloc restart commands. This is non-trivial to do correctly, especially in the face of transient API errors/timeouts or concurrent job updates.
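
For illustration, here is a minimal sketch of such a script using the official Go API client (github.com/hashicorp/nomad/api). The job ID and sleep interval are placeholders, and the retry, concurrency, and durable-state handling discussed below are deliberately omitted:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Reads NOMAD_ADDR, NOMAD_TOKEN, etc. from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "example" is a placeholder job ID.
	allocs, _, err := client.Jobs().Allocations("example", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, stub := range allocs {
		if stub.ClientStatus != "running" {
			continue // skip pending and terminal allocations
		}
		// An empty task name restarts all tasks in the allocation.
		if err := client.Allocations().Restart(&api.Allocation{ID: stub.ID}, "", nil); err != nil {
			// A robust script needs retry/backoff here; this one does not.
			log.Printf("failed to restart %s: %v", stub.ID, err)
			continue
		}
		log.Printf("restarted %s", stub.ID)
		time.Sleep(10 * time.Second) // crude stand-in for a rolling batch wait
	}
}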

Even if all error and concurrent mutation conditions are handled properly, the restart is still controlled by a single script on a single machine (either an operator's machine or a random node running a web app that can perform restarts). Unless the script also stores the restart's state (e.g. which allocs have been restarted) in durable networked storage, no other users have visibility into the restart process or the ability to continue a restart that gets interrupted (whether manually or because the script crashed).

If all of these cases and features are implemented, they should be shared with all Nomad users, which is entirely the point of this effort!

Implementation

CLI

  • nomad job restart -batch-size n -batch-wait d <job id> - where -batch-size is the number of allocs to restart at once and -batch-wait is the duration to wait between batches.
  • Outputs the JobRestart ID
  • By default blocks and outputs progress of restarts
  • -detach can be added to exit after outputting the restart ID
  • nomad job restart -attach <restart id> attaches to a running restart.
  • nomad job restart -pause <restart id> pauses a running restart.
  • nomad job restart -resume <restart id> resumes a paused restart.
  • nomad job restart -cancel <restart id> cancels a restart. (terminal, cannot be resumed)

State

Similar to deployments, restarts will be a first class object:

type JobRestart struct {
  ID    string // uuid of restart
  JobID string // job being restarted
  
  BatchSize int           // number of allocations to restart at once
  BatchWait time.Duration // amount of time to wait between restarting each batch
  
  Status string // running, paused, cancelled, complete
  
  RestartedAllocs []string // Allocation IDs which have been restarted
  
  StartedAt time.Time // time restart was started
  UpdatedAt time.Time // time restart was last modified
  
  CreateIndex uint64  // raft index when restart was started
  ModifyIndex uint64  // raft index when restart was last modified
}

A JobRestarter similar to DeploymentWatcher and Drainer will run on the leader server to coordinate restarts. It will be cancelled when a leader election occurs, and started on the newly elected leader.

Allocations are appended to RestartedAllocs only after their restart has been signalled. This means a leader election (or leader crash) after a restart has been signalled but before the JobRestart object has been updated in raft may cause some allocations to be restarted more than once. If this proves problematic, more logic and/or state can be added in the future to further improve restart state handling across leader elections.
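
As a rough sketch of the intended semantics (not actual Nomad internals), the leader-side batching loop might look like the following, where restartAlloc and persist are hypothetical stand-ins for the allocation-restart RPC and the raft update of the JobRestart object:

// Hypothetical sketch only: restartAlloc and persist stand in for the real
// allocation-restart RPC and the raft update of the JobRestart object, and
// BatchSize is assumed to be at least 1.
func runRestart(r *JobRestart, pending []string,
	restartAlloc func(allocID string) error,
	persist func(*JobRestart) error) error {

	for start := 0; start < len(pending); start += r.BatchSize {
		end := start + r.BatchSize
		if end > len(pending) {
			end = len(pending)
		}

		for _, id := range pending[start:end] {
			if err := restartAlloc(id); err != nil {
				return err
			}
			// Progress is recorded only after the restart is signalled, so a
			// leader failover at this point may re-restart part of the batch.
			r.RestartedAllocs = append(r.RestartedAllocs, id)
		}
		r.UpdatedAt = time.Now()
		if err := persist(r); err != nil {
			return err
		}

		if end < len(pending) {
			time.Sleep(r.BatchWait)
		}
	}

	r.Status = "complete"
	return persist(r)
}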

API

  • POST /v1/job/:job_id/restart
  • Body should accept CLI parameters as JSON
  • GET /v1/job/:job_id/restarts - list restarts for a job
  • DELETE /v1/job/:job_id/restart/:restart_id - cancel restart
  • POST /v1/job/:job_id/restart/:restart_id/pause - pause restart
  • POST /v1/job/:job_id/restart/:restart_id/resume - resume restart
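
For example, starting a restart through the proposed endpoint might look like the sketch below. The JSON field names are an assumption, since this RFC only states that the body should mirror the CLI parameters, and "example" is a placeholder job ID:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Assumed field names mirroring the -batch-size and -batch-wait CLI flags.
	body, _ := json.Marshal(map[string]interface{}{
		"BatchSize": 2,
		"BatchWait": "30s",
	})

	// 4646 is the default Nomad HTTP port.
	resp, err := http.Post("http://127.0.0.1:4646/v1/job/example/restart",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The response is expected to include the new JobRestart ID.
	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}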

Concurrent Job Updates

  • nomad job restart will refuse to run if there is an ongoing deployment.
  • Deployments will cancel any non-terminal restarts for their job.
  • Restarts will not restart allocations whose Alloc.CreateIndex > JobRestart.CreateIndex. This prevents restarts taking an unbounded amount of time if allocations are also failing and being rescheduled. Allocations created after a restart has been created are treated as already restarted (see the sketch after this list).
  • nomad job restart will fail if there is an existing non-terminal restart.
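
A small sketch of that filtering rule, using the JobRestart fields defined above (the helper name and signature are illustrative only):

// pendingAllocs sketches the filtering rule: allocations created after the
// JobRestart, or already listed in RestartedAllocs, are excluded. The allocs
// argument maps allocation ID to Alloc.CreateIndex.
func pendingAllocs(restart *JobRestart, allocs map[string]uint64) []string {
	done := make(map[string]bool, len(restart.RestartedAllocs))
	for _, id := range restart.RestartedAllocs {
		done[id] = true
	}

	var pending []string
	for id, createIndex := range allocs {
		if createIndex > restart.CreateIndex || done[id] {
			continue
		}
		pending = append(pending, id)
	}
	return pending
}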

GC

To improve observability, restarts will not be deleted immediately upon completion. Instead they will be garbage collected like deployments. How long a completed restart is retained before being garbage collected will be configurable on server agents:

  • server.gc_restart_threshold = 4h - Because restarts are relatively small, created infrequently, and often used in scenarios where observability is critical, they have a higher default GC threshold than deployments.

Future Work: Health Checks

Job restarts may want to gate restarts on the health of restarted allocations, similar to deployments.

This feature could be added in the future without breaking backward compatibility of the proposed job restart functionality.

Alternative: Stateless Restarts

The vast majority of the complexity of implementing nomad job restart as described above comes from its stateful nature: there is a restart object and users can interact with it.

A much simpler MVP would be to embed the "script over nomad alloc restart" workaround described above directly in the Nomad CLI or HTTP API. If the controlling process (the CLI, or the agent running the HTTP API handler) crashes, the restart and its progress are simply gone.

There would also be no way for the Nomad UI to represent restarts or for more than one user to interact with a given restart.
