@datagrok
Last active August 16, 2024 10:32
What happens when you cancel a Jenkins job

When you cancel a Jenkins job

Unfinished draft; do not use until this notice is removed.

We were seeing some unexpected behavior in the processes that Jenkins launches when the Jenkins user clicks "cancel" on their job. Unexpected behaviors like:

  • apparently stale lockfiles and pidfiles
  • overlapping processes
  • jobs apparently ending without performing cleanup tasks
  • jobs continuing to run after being reported "aborted"

This is an investigation into what exactly happens when Jenkins cancels a job, and a set of best practices for writing Jenkins jobs that behave the way you expect.

First, recall the name and purpose of some Unix process signals:

  • 1, HUP, "hangup"; the controlling terminal disconnected, or the controlling process died.
  • 2, INT, "interrupt"; the user typed CTRL+C.
  • 9, KILL, "kill"; terminate immediately, no cleanup. Can't be trapped.
  • 15, TERM, "terminate," but cleanup first. May be trapped. The default when using the kill command.

Jenkins

When a Jenkins job is cancelled, Jenkins sends TERM to the process group of the process it spawned and immediately disconnects, reporting "Finished: ABORTED" regardless of the state of the job.

This causes the spawned process and all its subprocesses (unless they are spawned in new process groups) to receive TERM.

This has the same effect as running the job in your terminal and pressing CTRL+C, except that in the latter situation INT is sent, not TERM.

Since Jenkins disconnects immediately from the child process and reports no further output:

  • it can misleadingly appear that signal handlers are not being invoked. (This has misled many into thinking that Jenkins has KILLed the process.)

  • it can misleadingly appear that spawned processes have completed or exited, while they continue to run.
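
The difference between signaling one process and signaling its whole process group can be sketched in Python. This is only an illustration of the mechanism described above, not Jenkins' actual code; the helper name terminate_group is made up for this example, and it assumes a Unix system with a sleep command.

```python
import os
import signal
import subprocess


def terminate_group(argv):
    """Start argv in its own process group, then send TERM to the whole group."""
    # start_new_session=True makes the child a session and process group
    # leader, so signaling its group does not also signal this script.
    proc = subprocess.Popen(argv, start_new_session=True)
    pgid = os.getpgid(proc.pid)
    # Every process still in this group (the child and any descendants that
    # did not move to a new group) receives TERM.
    os.killpg(pgid, signal.SIGTERM)
    # A negative return code means "terminated by that signal number".
    return proc.wait()
```

Calling terminate_group(["sleep", "30"]) returns as soon as sleep has been terminated, rather than after 30 seconds.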

Bash

When a bash script has trapped a signal and the signal arrives while bash is waiting for a foreground command, bash does not run the handler until that command completes. If a subprocess is stuck or ignoring signals, the controlling script may therefore take a very long time to exit, despite the existence of a signal handler. For the same reason, if one of your subprocesses does not properly exit upon being signaled, it's not enough to create a signal handler in the calling shell script that more forcefully KILLs it: bash will wait for the process to complete before running that handler.
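
The deferred-trap behavior can be observed with a small timing experiment. The sketch below drives bash from Python and assumes bash and sleep are on the PATH; the script bodies and the helper name seconds_until_exit are invented for this illustration. It sends TERM to bash alone (not to the process group), once with the child in the foreground and once with the child in the background followed by wait.

```python
import signal
import subprocess
import time

# Foreground child: bash defers the trap until `sleep` has finished.
FOREGROUND = """
trap 'echo trapped; exit 0' TERM
echo ready
sleep 3
echo done
"""

# Background child plus `wait`: the signal interrupts `wait`, so the trap
# runs immediately and can deal with the child itself.
BACKGROUND = """
trap 'echo trapped; kill "$pid"; exit 0' TERM
sleep 3 &
pid=$!
echo ready
wait "$pid"
echo done
"""


def seconds_until_exit(script, settle=0.3):
    """Send TERM to bash only; return (seconds until it exits, words printed)."""
    proc = subprocess.Popen(["bash", "-c", script],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()            # block until the script prints "ready"
    time.sleep(settle)                # let bash start sleep / reach wait
    start = time.monotonic()
    proc.send_signal(signal.SIGTERM)  # signals bash itself, not its children
    output, _ = proc.communicate()
    return time.monotonic() - start, output.split()
```

With the foreground script, bash should take most of the remaining sleep time to exit; with the background script it should exit almost immediately. In both cases the handler does run.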

If no hashbang is given on the first line of Jenkins' shell command specification, Jenkins defaults to /bin/sh -xe to interpret the script. When -e is active, any spawned process that exits with a nonzero status causes the script to abort, and a process terminated by a signal counts as a nonzero exit. So it can misleadingly appear that sh signals its subprocesses and exits automatically when signaled, even if a trap is set. The same job may behave differently when written in another language, or when someone pastes in a #!/bin/sh line with no -e.
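
The effect of -e on a signaled child can be reproduced in isolation. In this sketch (the SCRIPT text and the helper name run_sh are assumptions made for the example), an inner shell terminates itself with TERM, and the outer shell either aborts or carries on depending on whether -e is set.

```python
import subprocess

# The inner shell kills itself with TERM; the outer shell then decides
# whether to keep going. $$ is expanded by the inner shell because the
# inner command is single-quoted.
SCRIPT = "sh -c 'kill -TERM $$'; echo continued"


def run_sh(flags):
    """Run SCRIPT under sh with the given flags; return (exit code, stdout)."""
    result = subprocess.run(["sh", *flags, "-c", SCRIPT],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL,  # hide "Terminated" noise
                            text=True)
    return result.returncode, result.stdout.strip()
```

run_sh(["-e"]) stops at the signaled child and prints nothing, while run_sh([]) goes on to print "continued".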

SSH

In some Jenkins jobs, we have a process (script) that invokes another on a different machine through ssh.

OpenSSH, unlike RSH, does not pass received signals to the remote process, nor does it provide a mechanism to manually send a signal to the remote process, even though this capability is specified in RFC4254. It will pass a CTRL+C character, but only when a pseudoterminal is allocated (i.e. only when invoked with -tt.) It will send HUP to the remote process group when it disconnects, but only when a pseudoterminal is allocated.
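
A wrapper that launches the remote command might build its ssh invocation as follows. This is a minimal sketch: the helper names ssh_argv and run_remote are hypothetical, as are any host name and remote command you pass in; the only detail taken from the text above is the -tt option.

```python
import subprocess


def ssh_argv(host, remote_command):
    """Build an ssh command line that forces pseudoterminal allocation."""
    # -tt forces a pty even when ssh itself has no local terminal,
    # which is the situation inside a Jenkins job.
    return ["ssh", "-tt", host, remote_command]


def run_remote(host, remote_command):
    """Run remote_command on host and return ssh's exit status."""
    return subprocess.run(ssh_argv(host, remote_command)).returncode
```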

Cleanup actions in Bash vs. Python

Recall that scripts may receive signals TERM (from Jenkins), HUP (from ssh), or INT (from you typing CTRL+C while testing).

The default action for shell scripts upon receiving HUP, INT, and TERM is to abort, which triggers the EXIT handler just before execution ends. So for shell scripts, the EXIT trap is a good place for cleanup code. (EXIT is not really a signal; it is a POSIX shell mechanism to trigger a handler just before execution ends, regardless of how it ends.)

In contrast, the default action for Python upon receiving INT is to raise a KeyboardInterrupt exception, which if unhandled causes Python to abort, which triggers any handlers registered with atexit.register() just before execution ends. However, that handler is by default not invoked upon Python receiving HUP or TERM; instead, Python aborts immediately. So, for your atexit handler to fire you must also explicitly trap those signals with signal.signal() and cause execution to end.
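
The Python behavior can be checked with a small experiment. The sketch below starts a child Python process that registers an atexit handler, optionally traps HUP and TERM, and is then sent TERM. The CHILD source and the helper name cleanup_ran are invented for this illustration.

```python
import signal
import subprocess
import sys

# Child program: always registers an atexit handler; traps HUP and TERM
# only when given an extra command-line argument.
CHILD = r"""
import atexit, signal, sys, time
atexit.register(lambda: print("cleanup ran", flush=True))
if len(sys.argv) > 1:
    for s in (signal.SIGHUP, signal.SIGTERM):
        signal.signal(s, lambda n, _: sys.exit(1))
print("ready", flush=True)
time.sleep(30)
"""


def cleanup_ran(trap_signals):
    """Send TERM to the child; report whether its atexit handler fired."""
    args = [sys.executable, "-c", CHILD] + (["trap"] if trap_signals else [])
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()            # wait until handlers are installed
    proc.send_signal(signal.SIGTERM)
    output, _ = proc.communicate()
    return "cleanup ran" in output
```

cleanup_ran(False) reports that the handler did not fire, because the default TERM action ends the interpreter immediately; cleanup_ran(True) reports that it did, because the signal handler turns TERM into a normal sys.exit().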

Takeaways

  1. If you need to write a wrapper for a process that does not exit when signalled, write it in Python, or use a shell trick (run the subprocess as a background task and call wait).

  2. If you're doing remote orchestration through ssh, always invoke it with -tt so that the remote processes will receive HUP if the ssh connection goes away. If a pseudoterminal causes problems, there are other workarounds like an 'EOF to SIGHUP' wrapper.

  3. For shell scripts, trap EXIT and place cleanup code there. For Python, use the atexit module and register a cleanup handler, and also use the signal module to either raise an exception or call sys.exit() upon HUP and TERM, like this:

     import signal
     import sys
     for s in [signal.SIGHUP, signal.SIGTERM]:
         signal.signal(s, lambda n, _: sys.exit("Received signal %d" % n))
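
For the first takeaway, a Python wrapper can ask politely and then escalate, because waiting on a child with a timeout does not block indefinitely. The following is a minimal sketch under stated assumptions: the helper names terminate_then_kill and demo, the STUBBORN child program, and the grace period are all made up for this example.

```python
import signal
import subprocess
import sys


def terminate_then_kill(proc, grace_seconds=5.0):
    """Send TERM, wait up to grace_seconds, then send KILL if still running."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()      # KILL cannot be trapped or ignored
        return proc.wait()


# A child that ignores TERM, standing in for a process that will not exit
# when signaled.
STUBBORN = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def demo():
    """Start the stubborn child and stop it; return its exit code."""
    proc = subprocess.Popen([sys.executable, "-c", STUBBORN],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()   # wait until the child is ignoring TERM
    code = terminate_then_kill(proc, grace_seconds=0.5)
    proc.stdout.close()
    return code
```

Here demo() returns a negative exit code naming KILL, showing that the grace period expired and the escalation took effect.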
    

References

#FIXME

  • Verify that Jenkins signals the process group. It doesn't appear so from Jenkins source code. Maybe the shell is reissuing the signal to its own process group?
  • Identify Bash-isms; will a system with a different /bin/sh (like dash) behave differently than I have described here?
  • Clean up and include my small scripts which demonstrate each of the assertions above.
  • What negative side effects could forcing ssh to perform pty allocation have on a non-interactive script?
  • Will signals propagate through sudo?
@chadseippel

I've used Jenkins to call a java command line application, and found different results. The java shutdown hook gets called if I press Ctrl+C on the command line, or when a build ends normally (I'm assuming because of a SIGTERM). But if I abort the job, the shutdown hook doesn't get called. Is there something I could be missing? If the abort sends a SIGTERM, the shutdown hook would catch it.

@disq

disq commented Feb 19, 2016

Just came up with this:

#!/bin/bash
set -euf -o pipefail
echo Process group id is $$

OK=0
trap 'if [[ "$OK" != "1" ]]; then echo "---"; echo TRAP terminating, kill all processes with parent $$; trap - SIGTERM && pkill -P $$; fi' SIGINT SIGTERM EXIT

SECONDS=0
echo "---"
"$@" &
PID=$!
set +e
wait $PID
CODE=$?
set -e
OK=1

echo "---"
echo Process exited with code $CODE, took $SECONDS seconds

exit $CODE

@elyzov

elyzov commented Dec 8, 2016

@datagrok But Jenkins really does send a KILL signal on job termination. How would you explain this behavior:

  1. Create simple bash script:
#!/bin/bash

getAbort()
{
 echo "$(date) - Abort detected" > ~/sig.log
 # other commands here if any...
}


trap 'getAbort; exit' SIGHUP SIGINT SIGTERM
echo "Sig handler setted"

rm -f ~/sig.log
sleep 100
  2. Create job with shell command
/bin/bash -xe /home/deployer/sigtest.sh
  3. Watch for the log in another console
watch -n1 cat sig.log
  4. Start the Jenkins job and terminate it immediately
Started by user xxx
[EnvInject] - Loading node environment variables.
Building remotely on XXX (xxx) in workspace /home/deployer/workspace/sigtest
[sigtest] $ /bin/sh -xe /tmp/hudson7096756864029913873.sh
+ /bin/bash -xe /home/deployer/sigtest.sh
+ trap 'getAbort; exit' SIGHUP SIGINT SIGTERM SIGKILL
+ echo 'Sig handler setted'
Sig handler setted
+ rm -f /home/deployer/sig.log
+ sleep 100
Terminated
Build was aborted
Aborted by xxx
Finished: ABORTED
  5. In the other console we see that the sig.log file was not created during termination, so the script was simply killed.

@stephencroberts

@datagrok Thanks for documenting! I know this thread is old, but I'm running into issues executing terraform with Jenkins. In my testing, it appears that the process group is NOT receiving SIGTERM. It looks like SIGTERM may be sent to the job executor which then disconnects and reports "Aborted", leaving the job process group still running. Did you find more information about this, or any workarounds? I need the foreground process (terraform) to receive SIGTERM to gracefully shut down. Thanks!

@alexandarZ

This also happens on Windows for MSBuild or Dotnet test process. Well, one not so clean solution would be to track process id (PID) that pipeline spawned and at the end of pipeline call task kill command to be sure that process is really stopped when ABORT is called.

@reschenburgIDBS

for those coming here because of terraform, the below is working reasonably well for me. It allows to see terraform stdout live in the console, whilst also receiving the full output in an output file for post processing:

def terraformApplyCommand = "set -o pipefail; terraform apply -no-color -auto-approve ${config.planFile} 2>&1 | (trap 'kill -INT \$(pidof terraform)' TERM; tee ${config.planFile}_apply.txt)"

terraformApplyReturnCode = sh(label: 'Terraform apply', script: terraformApplyCommand, returnStatus: true)

@jimboca

jimboca commented Jul 7, 2022

Looks like this all changed with current Jenkins versions:
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/util/ProcessTree.java#L565
Calls killSoftly, which sends TERM and waits for softKillWaitSeconds, then gives up and sends kill in the killByKiller method.

Increasing softKillWaitSeconds does not work reliably due to other known race conditions in Jenkins.
