Unfinished draft; do not use until this notice is removed.
We were seeing some unexpected behavior in the processes that Jenkins launches when a Jenkins user clicks "cancel" on their job, including:
- apparently stale lockfiles and pidfiles
- overlapping processes
- jobs apparently ending without performing cleanup tasks
- jobs continuing to run after being reported "aborted"
This is an investigation into what exactly happens when Jenkins cancels a job, and a set of best practices for writing Jenkins jobs that behave the way you expect.
First, recall the name and purpose of some Unix process signals:
- 1, HUP, "hangup"; the controlling terminal disconnected, or the controlling process died.
- 2, INT, "interrupt"; the user typed CTRL+C.
- 9, KILL, "kill"; terminate immediately, no cleanup. Can't be trapped.
- 15, TERM, "terminate," but cleanup first. May be trapped. The default when using the kill command.
When a Jenkins job is cancelled, Jenkins sends TERM to the process group of the process it spawned, and immediately disconnects, reporting "Finished: ABORTED," regardless of the state of the job.
This causes the spawned process and all its subprocesses (unless they are spawned in new process groups) to receive TERM.
This has the same effect as running the job in your terminal and pressing CTRL+C, except that in the latter situation INT is sent, not TERM.
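As a rough illustration of what signalling a process group means, the sketch below starts a child in its own process group, lets it spawn a grandchild, and then sends TERM to the whole group with os.killpg. It only loosely stands in for Jenkins: the helper name signal_group_demo, the CHILD_PROGRAM string, and the 60-second sleeps are made up for the demonstration, and it does not prove anything about how Jenkins itself delivers the signal.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical child program: starts one grandchild, reports its pid, then idles.
CHILD_PROGRAM = """
import subprocess, sys, time
grandchild = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(grandchild.pid, flush=True)
time.sleep(60)
"""


def signal_group_demo():
    """Send TERM to a child's process group; return (child status, seconds until all exited)."""
    # start_new_session=True gives the child its own session and process group,
    # so signalling that group cannot hit this script.
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_PROGRAM],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,
    )
    child.stdout.readline()  # the grandchild's pid: it now exists in the same group
    start = time.monotonic()
    os.killpg(os.getpgid(child.pid), signal.SIGTERM)
    # The grandchild inherited the stdout pipe, so EOF arrives only after both
    # the child and the grandchild have exited.
    child.stdout.read()
    child.stdout.close()
    status = child.wait()
    return status, time.monotonic() - start


if __name__ == "__main__":
    status, elapsed = signal_group_demo()
    # A negative status means "killed by that signal number".
    print("child status:", status, "- group gone after %.2fs" % elapsed)
```

Neither process traps TERM here, so both die from the default action. A subprocess started in a new process group or session would not be reached by the os.killpg call.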
Since Jenkins disconnects immediately from the child process and reports no further output:
- it can misleadingly appear that signal handlers are not being invoked. (This has misled many to think that Jenkins has KILLed the process.)
- it can misleadingly appear that spawned processes have completed or exited, while they continue to run.
When a bash script has trapped a signal, it waits for the command it is currently running in the foreground to complete before handling the signal. If a subprocess is stuck or ignoring signals, the controlling script may take a very long time to exit, despite the existence of a signal handler. If one of your subprocesses does not properly exit upon being signaled, it's not enough to create a signal handler in the calling shell script that more forcefully KILLs it, because bash will wait for the process to complete before running that handler.
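The deferral can be observed with a small Python harness. This is a sketch under assumptions: it uses /bin/sh rather than bash (POSIX specifies the same deferral rule, but your shell may differ), the two script strings and the helper seconds_until_exit are made up for the demonstration, and only the shell is signalled, not its process group, so the sleep itself never receives TERM.

```python
import signal
import subprocess
import time

# The trailing ":" keeps the shell from exec-ing sleep as its last command.
FOREGROUND = "trap 'echo trapped; exit 0' TERM; echo ready; sleep 3; :"
# Same work, but sleep runs in the background and the shell blocks in wait.
BACKGROUND = (
    "trap 'kill \"$child\"; echo trapped; exit 0' TERM; echo ready; "
    "sleep 3 & child=$!; wait \"$child\"; :"
)


def seconds_until_exit(script):
    """Start script, send TERM to the shell only, and time how long it takes to exit."""
    shell = subprocess.Popen(["/bin/sh", "-c", script], stdout=subprocess.PIPE, text=True)
    shell.stdout.readline()  # "ready": the trap is installed by now
    time.sleep(0.5)          # give the shell time to start sleep
    start = time.monotonic()
    shell.send_signal(signal.SIGTERM)
    shell.wait()
    shell.stdout.close()
    return time.monotonic() - start


if __name__ == "__main__":
    print("foreground child: %.1fs" % seconds_until_exit(FOREGROUND))
    print("background child + wait: %.1fs" % seconds_until_exit(BACKGROUND))
```

With the foreground sleep, the trap should only run once sleep has finished, so the shell takes a couple of seconds to exit. With the background sleep and wait, the trap should run almost immediately and can then deal with the child itself.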
If no hashbang is given on the first line of Jenkins' shell command specification, Jenkins defaults to /bin/sh -xe to interpret the script. When -e is active, any spawned process that exits with a nonzero status causes the script to abort. So sh can misleadingly appear to signal its subprocesses and exit automatically when signaled, even if a trap is set. The same job behaves differently when the script is written in another language or when someone pastes in a #!/bin/sh line with no -e.
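To see the effect of -e in isolation, the sketch below runs the same two-line script under /bin/sh with and without it. The first line of the script is a subprocess that dies from TERM; the second line only runs if the script survives that. Assumptions: the script is passed with -c rather than as a file (Jenkins' exact invocation is not reproduced here), and SCRIPT and run_sh are names made up for the demonstration.

```python
import subprocess

# Line 1: an inner shell that sends TERM to itself, so it exits with a nonzero status.
# Line 2: only reached if the outer shell keeps going after that failure.
SCRIPT = "sh -c 'kill -TERM $$'\necho survived\n"


def run_sh(flags):
    """Run SCRIPT under /bin/sh with the given flags; return (exit status, stdout)."""
    result = subprocess.run(
        ["/bin/sh", *flags, "-c", SCRIPT],
        capture_output=True,  # the -x trace goes to stderr, not stdout
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    print("with -xe:   ", run_sh(["-xe"]))
    print("without -e: ", run_sh([]))
```

With -xe the script stops at the first line and exits nonzero without printing anything to stdout. Without -e it carries on and prints "survived", even though no trap is involved in either run.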
In some Jenkins jobs, we have a process (script) that invokes another on a different machine through ssh.
OpenSSH, unlike RSH, does not pass received signals to the remote process, nor does it provide a mechanism to manually send a signal to the remote process, even though this capability is specified in RFC4254. It will pass a CTRL+C character, but only when a pseudoterminal is allocated (i.e. only when invoked with -tt). It will send HUP to the remote process group when it disconnects, but only when a pseudoterminal is allocated.
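For jobs that launch ssh from Python, forcing the pseudoterminal comes down to putting -tt in the argument list. A minimal sketch, assuming a hypothetical host name and remote command; ssh_command is a helper made up here, not part of any library.

```python
import shlex


def ssh_command(host, remote_argv, force_tty=True):
    """Build an ssh argument list; -tt forces pseudoterminal allocation."""
    argv = ["ssh"]
    if force_tty:
        argv.append("-tt")
    # ssh hands the remote command to the remote shell as one string,
    # so quote each word here before joining.
    argv += [host, shlex.join(remote_argv)]
    return argv


if __name__ == "__main__":
    # "build-host.example" and the script path are placeholders.
    print(ssh_command("build-host.example", ["/opt/jobs/deploy.sh", "--env", "staging"]))
    # To actually run it, pass the list to subprocess.run(..., check=True).
```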
Recall that scripts may receive signals TERM (from Jenkins), HUP (from ssh), or INT (from you typing CTRL+C while testing).
The default action for shell scripts upon receiving HUP, INT, and TERM is to abort, which triggers the EXIT handler just before execution ends. So for shell scripts, the EXIT trap is a good place for cleanup code. (EXIT is not really a signal; it is a POSIX shell mechanism to trigger a handler just before execution ends, regardless of how it ends.)
In contrast, the default action for Python upon receiving INT is to raise a KeyboardInterrupt exception, which if unhandled causes Python to abort, which triggers any handlers registered with atexit.register() just before execution ends. However, that handler is by default not invoked upon Python receiving HUP or TERM; instead, Python aborts immediately. So, for your atexit handler to fire you must also explicitly trap those signals with signal.signal() and cause execution to end.
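The difference is easy to check from a parent process. The sketch below runs two small child programs, sends each one TERM, and reports whether the atexit handler printed anything. The program strings and the helper terminate_and_collect are made up for this demonstration.

```python
import signal
import subprocess
import sys

# Child that registers a cleanup handler but does not trap any signals.
CLEANUP_ONLY = """
import atexit, time
atexit.register(lambda: print("cleanup ran", flush=True))
print("ready", flush=True)
time.sleep(30)
"""

# Same child, but HUP and TERM are turned into a normal interpreter exit.
WITH_HANDLERS = """
import atexit, signal, sys, time
atexit.register(lambda: print("cleanup ran", flush=True))
for s in (signal.SIGHUP, signal.SIGTERM):
    signal.signal(s, lambda n, _: sys.exit(1))
print("ready", flush=True)
time.sleep(30)
"""


def terminate_and_collect(program):
    """Run program, send it TERM once it is ready, return (exit status, later stdout)."""
    child = subprocess.Popen(
        [sys.executable, "-c", program], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # "ready": handlers are registered by now
    child.send_signal(signal.SIGTERM)
    output = child.stdout.read()
    child.stdout.close()
    return child.wait(), output


if __name__ == "__main__":
    print("atexit only:      ", terminate_and_collect(CLEANUP_ONLY))
    print("atexit + handlers:", terminate_and_collect(WITH_HANDLERS))
```

The first child is killed by the signal (negative exit status) and prints nothing further. The second exits through sys.exit(), so its cleanup handler runs.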
- If you need to write a wrapper for a process that does not exit when signalled, write it in Python, or use a shell trick (run the subprocess as a background task and call wait).
- If you're doing remote orchestration through ssh, always invoke it with -tt so that the remote processes will receive HUP if the ssh connection goes away. If a pseudoterminal causes problems, there are other workarounds like an 'EOF to SIGHUP' wrapper.
- For shell scripts, trap EXIT and place cleanup code there. For Python, use the atexit module and register a cleanup handler, and also use the signal module to either raise an exception or call sys.exit() upon HUP and TERM, like this:

      import signal
      import sys

      for s in [signal.SIGHUP, signal.SIGTERM]:
          signal.signal(s, lambda n, _: sys.exit("Received signal %d" % n))
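For the first recommendation above, a Python wrapper might look roughly like the sketch below: it runs a command, and if the wrapper itself receives HUP or TERM it passes TERM on to the child, waits a grace period, and only then falls back to KILL. The names StopRequested, stop_child, and run_wrapped and the 10-second default are assumptions made for illustration, not an established recipe.

```python
import signal
import subprocess
import sys


class StopRequested(Exception):
    """Raised in the main thread when HUP or TERM arrives."""


def _request_stop(signum, _frame):
    raise StopRequested(signum)


def stop_child(child, grace_seconds):
    """Send TERM, wait up to grace_seconds, then fall back to KILL; return the exit status."""
    child.terminate()  # SIGTERM: the child may trap it and clean up
    try:
        return child.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        child.kill()  # SIGKILL: cannot be trapped or ignored
        return child.wait()


def run_wrapped(argv, grace_seconds=10.0):
    """Run argv; if this wrapper receives HUP or TERM, stop the child with escalation."""
    child = subprocess.Popen(argv)
    for signum in (signal.SIGHUP, signal.SIGTERM):
        signal.signal(signum, _request_stop)
    try:
        return child.wait()
    except StopRequested:
        # Ignore repeated stop requests while the child is being shut down.
        for signum in (signal.SIGHUP, signal.SIGTERM):
            signal.signal(signum, signal.SIG_IGN)
        return stop_child(child, grace_seconds)


if __name__ == "__main__" and len(sys.argv) > 1:
    status = run_wrapped(sys.argv[1:])
    # Report "killed by signal N" the way shells do: 128 + N.
    sys.exit(status if status >= 0 else 128 - status)
```

Unlike a shell trap, the handler here interrupts the wait right away, so the escalation to KILL is not held up by a child that ignores TERM. If the child shares the wrapper's process group it may receive TERM twice, once from Jenkins and once from the wrapper, which should be harmless.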
- Open OpenSSH bug #396: sshd orphans processes when no pty allocated (2002)
- Open OpenSSH bug #1424: Cannot signal a process over a channel (rfc 4254, section 6.9) (2008)
- Overview of standard signals: man 7 signal
- Jenkins bug #JENKINS-17116 gracefull job termination (Incorrectly states that Jenkins uses KILL.)
- Python atexit module documentation
- Python signal module documentation
- Jenkins wiki "Aborting a build"
- RFC4254 SSH Connection Protocol (Signals specified in section 6.9.)
- Greg's Wiki: Sending and Trapping Signals: When is the signal handled?
- GNU Bash Manual: Signals
#FIXME
- Verify that Jenkins signals the process group. It doesn't appear so from Jenkins source code. Maybe the shell is reissuing the signal to its own process group?
- Identify Bash-isms; will a system with a different /bin/sh (like dash) behave differently than I have described here?
- Clean up and include my small scripts which demonstrate each of the assertions above.
- What negative side effects could forcing ssh to perform pty allocation have on a non-interactive script?
- Will signals propagate through sudo?