John Belmonte, 2022-Oct
In the previous installment, we demonstrated how structured concurrency enables magical tracebacks, spanning both stack frames and task hierarchy. Not only do we regain the ability to diagnose errors, but since exceptions propagate through the nursery of each task, we also have Lua's full protected-call facility at our disposal—just as if the code was not using coroutines.
Nurseries are powerful because they let us orchestrate a set of tasks, ensuring that their lifetime is well-understood, and allowing every task to properly finalize on cancellation of the group. Cancellation can happen if one of the child tasks has an exception, or is otherwise explicitly requested. Explicit cancellation turns out to be fairly common—done just as often as exiting imperative code loops.
Here, we'll try to gain a better understanding of how cancellation happens, and introduce the relevant API.
The previous installment concluded by uncovering a few holes our concurrency library regarding cancellation. Unexpectedly, a nursery's child task ran to completion, despite its sibling raising an error. The task's lifetime even leaked beyond the nursery's scope. Let's run the same example after some library corrections:
trio = require('trio')
function await_cancellation_example()
-- among two parallel tasks, one fails
pcall(function()
local nursery <close> = trio.open_nursery()
nursery.start_soon(function()
print('child 1 start')
trio.await_sleep(1)
error('oops')
end)
nursery.start_soon(function()
print('child 2 start')
trio.await_sleep(2)
print('child 2 end')
end)
print('waiting for child tasks')
end)
print('done')
end
trio.run(await_cancellation_example)
$ time lua example_6.lua
waiting for child tasks
child 2 start
child 1 start
done
real 0m1.093s
user 0m0.031s
sys 0m0.050s
Now, as expected, the program exits after one second, prompted by the task raising an error. Its sibling is cancelled. Though the error is raised up through the nursery, it's finally suppressed by the wrapping pcall
.
We'll focus on the cancelled task. How do things look from its point of view?
In the concurrency framework's design, task cancellation is signaled by an exception, called Cancelled
. Exceptions are well-suited for this job because:
- we've already worked hard to propagate exceptions up the task hierarchy
- in the simple case, a task doesn't need to do anything special to support being cancelled
- cancelled tasks can perform finalization, utilizing all of Lua's associated features
Here, we instrument "child 2" of the previous example, to see what's going on:
function await_cancellation_example()
pcall(function()
local nursery <close> = trio.open_nursery()
nursery.start_soon(function()
print('child 1 start')
trio.await_sleep(1)
error('oops')
end)
nursery.start_soon(function()
print('child 2 start')
local status, err = pcall(function() trio.await_sleep(2) end)
assert(not status)
print('child 2 did not finish:', err)
error(err) -- important: re-raise
end)
print('waiting for child tasks')
end)
print('done')
end
$ lua example_7.lua
waiting for child tasks
child 2 start
child 1 start
child 2 did not finish: Cancelled
done
This confirms "child 2" received the Cancelled
exception. Upon receiving an error from one of its children, the nursery enters a cancelled state, and arranges for Cancelled
to be injected into the remaining active children. The coroutine of each such child is soon resumed, encounters the injected exception, and has its stack unwound. The final destination of any Cancelled
exception is the nursery that triggered it. The nursery treats this as a normal event, and doesn't raise the exception to its parent.
As shown in the example, if your code happens to be catching errors, it's required to always re-raise Cancelled
. In Lua, this kind of operation is prone to mistakes, so care is needed. (Hint: make utilities like finally(f)
, on_error(f)
, or on_success(f)
that use to-be-closed variables to call the given finalizer, automatically propagating errors.)
Nurseries represent a set of spawned tasks that can be cancelled as a unit. You may be expecting that tasks would explicitly cancel their parent nursery by raising Cancelled
, but actually, this is not allowed. Rather, the nursery object has an idempotent cancel()
method for this purpose. The rationale is as follows:
- it's useful for actors outside of a nursery to cancel it
- if the body of a nursery itself raised
Cancelled
, it would cancel not only that nursery's children, but any parent nursery as well (of which it is a child) nursery.cancel()
allows an entire tree of the task hierarchy to be queued for cancellation atomically. Otherwise, propagation of the cancellation may take several scheduler passes, with some children continuing for a brief time, even though an ancestor is cancelled. (NOTE: not yet implemented)
So given cancel()
, let's implement await_any()
, complementing the await_all()
utility introduced in part 1 of this series:
function await_any(...)
local nursery <close> = trio.open_nursery()
for _, await_f in ipairs({...}) do
nursery.start_soon(function()
await_f()
nursery.cancel()
end)
end
end
Like await_all()
, this function is intended for a set of heterogeneous tasks, run for their side effects. It has short-circuit behavior, returning as soon as one of the given awaitables returns. This is implemented by a child task calling cancel()
on the nursery once its corresponding awaitable completes.
Synopsis:
event = trio.Event()
-- (then hand the event to some other task ...)
await_any(
event.await,
await_serve_foo,
)
Note that cancel()
is a synchronous function, only queueing a request to cancel the nursery. It won't propagate until the code calling cancel()
yields execution. Even then, there is no guarantee that each nursery task will receive a Cancelled
exception. For example, some tasks may have already exited normally (or by error) in the same scheduler pass as the task requesting cancel.
On the subject of exceptions, there's an elephant in the room: if you have concurrency, then errors can be concurrent, and it's feasible to have multiple, unrelated exceptions in flight.
There are two ways this might happen. The first is where the exceptions are truly simultaneous, happening within the same scheduler pass. Since tasks within a pass are resumed in arbitrary order, it follows that the exception initially encountered will also be arbitrary. The second way is related to allowing tasks to finalize when there is an error or cancellation. The nursery will wait for all children to finalize, and may be presented with a set of exceptions: the error that induced cancellation, Cancelled
from the corresponding sibling tasks, and other errors that may arise during finalization of each task.
How serious is this? It depends on the application. If errors tend to go unhandled, then it probably doesn't matter which one surfaced. In contrast, if a program includes the catching of specific errors, then such handling may be significant for maintaining invariants, and some errors may be more important than others. In such applications, choosing the winning exception arbitrarily may cause problems. Python has acknowledged concurrent exceptions as worthy of investment, recently introducing ExceptionGroup
and concurrency-aware catch syntax.
As for our Lua concurrency library, we'll keep the implementation simple, with a model of "propagate the first exception". (Internally, Cancelled
will be given lower priority than other exceptions.) But this is an area to keep in mind for improvement.
Various changes to the implementation were needed to fill out task cancellation support. There are several cases to get right: an error or cancel()
call can arrive during the nursery's body, or after it's already blocked in the finalizer.
Use of the Cancelled
exception exposed correctness issues in the scheduler. Rather than keeping only a list of active tasks sorted by wake time, it's also necessary to have a collection of tasks that are pending error injection, so that the errors can be promptly delivered on the next scheduler pass.
Proper unwinding of coroutines ending in error is now performed, via coroutine.close()
, so that to-be-closed variables within tasks will be finalized. I learned that debug.traceback()
must be called before such unwinding. Unfortunately, this traceback overhead is necessary even if the error is ultimately caught and suppressed.
See code changes relative to the previous article installment.
Managing timeouts is something that is surprisingly hard in a program. It requires forethought by the author of every API you use. Even when timeout control is available, composing timeouts correctly across a sequence of blocking calls presents a challenge. There's a way to solve this generally, without explicitly plumbing timeout and cancel support down the call stack, or intricate bookkeeping. And while we're at it, it would be nice to cancel such operations on any condition, not only elapsed time. For this we'll introduce cancel scopes—only possible thanks to structured concurrency—in the next installment of this series.
article © 2022 John Belmonte, all rights reserved
For anyone interested in using this - I continued @belm0's work here and packaged it into a library