John Belmonte, 2022-Sep
In the previous installment, we introduced some basics of structured concurrency, and considered how this paradigm might fit within the Lua programming language.
An important point was encapsulation: the caller of a function shouldn't be concerned whether the implementation employs concurrency or not. We shouldn't have to give up features of the language or runtime just because our program contains concurrency.
A primary example is exception handling. In a basic Lua program without concurrency, we have the following expectations about error propagation:
- "Protected" function calls are possible, using pcall(). If an exception happens within the function scope, even transitively, we can catch it at runtime and decide what to do.
- If the program has an unhandled exception, a "traceback" will be available so that we can diagnose the problem. It contains not only the origin of the error, but the exact code path travelled to reach it.
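As a reminder of what these two expectations look like in practice, here is a minimal sketch in plain Lua with no concurrency. The function names inner() and outer() are made up for illustration.

```lua
-- Plain Lua, no concurrency: errors propagate up the call stack.
local function inner()
    error('oops')
end

local function outer()
    inner()  -- the error passes through this frame
end

-- pcall() catches the error even though it was raised transitively
local ok, err = pcall(outer)
print(ok, err)

-- xpcall() with debug.traceback as the message handler also captures
-- the code path that led to the error, at the point of failure
local ok2, trace = xpcall(outer, debug.traceback)
print(trace)
```

The second call is roughly what the standalone interpreter does for us when an error goes unhandled.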
However, once concurrency is added to a program, the problems begin. Any concurrency framework allowing coroutines to outlive the function that spawned them will violate these expectations. pcall() becomes useless. Moreover, since each coroutine has its own call stack, and the relation among them is unknown (e.g., a spawned coroutine may in turn spawn another), the code path taken to reach the error is obscured.
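To make the problem concrete, the following sketch uses a deliberately naive, unstructured scheduler. It is not the trio module from this series, and spawn(), run_all() and start_work() are hypothetical names. The spawned coroutine outlives the function that created it, so the pcall() around that function has nothing to catch.

```lua
-- Hypothetical, minimal unstructured scheduler (not the article's trio module)
local tasks = {}

local function spawn(fn)
    tasks[#tasks + 1] = coroutine.create(fn)
end

-- Advance every task round-robin until all have finished,
-- collecting any errors reported by coroutine.resume()
local function run_all()
    local failures = {}
    while #tasks > 0 do
        local co = table.remove(tasks, 1)
        local ok, err = coroutine.resume(co)
        if not ok then
            failures[#failures + 1] = err
        elseif coroutine.status(co) ~= 'dead' then
            tasks[#tasks + 1] = co
        end
    end
    return failures
end

local function start_work()
    spawn(function()
        coroutine.yield()  -- simulate waiting on something
        error('oops')
    end)
    -- returns immediately; the spawned coroutine outlives this call
end

local caught = not pcall(start_work)  -- pcall() sees no error
local failures = run_all()            -- the error surfaces here instead
print('caught by pcall:', caught)
print('errors seen by the scheduler:', #failures)
```

By the time the error is raised, the protected call has long since returned, and the scheduler has no record of who spawned the failing coroutine.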
The previous installment concluded with an example where a nursery's child task was altered to intentionally raise an error. The traceback exhibited the problems mentioned above, since our fledgling concurrency library was not taking advantage of the hierarchical structure of the tasks. Every task has a known parent—all the way up through the stack frame enclosing the nursery scope, and ultimately to the program's root task. The implementation should use this to propagate errors, and piece together a complete traceback.
Now that such an improvement has been applied, let's run the error example again:
trio = require('trio')

function await_error_example()
    -- child task raises an error
    do
        local nursery <close> = trio.open_nursery()
        nursery.start_soon(function()
            trio.await_sleep(1)
            error('oops')
        end)
    end
    print('done')
end

trio.run(await_error_example)
$ lua example_3.lua
lua: example_3.lua:9: oops
task traceback:
    [C]: in function 'error'
    example_3.lua:9: in function <example_3.lua:7>
    example_3.lua:10: in function 'await_error_example'
stack traceback:
    [C]: in function 'error'
    ./trio.lua:194: in function 'trio.run'
    example_3.lua:15: in main chunk
    [C]: in ?
Ahead of the normal "stack traceback" section, the unhandled exception output now includes a "task traceback", covering the path of the error within the concurrency framework. In this example, it includes the culprit lambda function, passed to start_soon(), as well as the exit location of the nursery scope, correctly attributed to await_error_example().
It may seem that any coroutine scheduler could do this easily: just add the traceback of the coroutine that failed resume() to the error string. But that assumes coroutines are only one level deep. Let's try a more complicated example, with nested tasks:
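For reference, the "easy" approach mentioned above might look like the following sketch (child() is a made-up name). It relies on the fact that a coroutine's stack is not unwound when it fails, so debug.traceback() can still inspect it afterwards.

```lua
local function child()
    error('oops')
end

local co = coroutine.create(function()
    child()
end)

local ok, err = coroutine.resume(co)
local naive
if not ok then
    -- the failed coroutine's stack is still intact, so its traceback
    -- can be appended to the error message
    naive = debug.traceback(co, err)
    print(naive)
end
```

The resulting traceback ends at the coroutine's entry function. It says nothing about which code spawned the coroutine, which is exactly the information that is lost once tasks are nested.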
trio = require('trio')

function await_deeply_nested_error()
    do
        local nursery <close> = trio.open_nursery()
        nursery.start_soon(function()
            trio.await_sleep(1)
            error('oops')
        end)
    end
end

function await_error_example()
    do
        local nursery <close> = trio.open_nursery()
        nursery.start_soon(await_deeply_nested_error)
    end
    print('done')
end

trio.run(await_error_example)
$ lua example_4.lua
lua: example_4.lua:8: oops
task traceback:
    [C]: in function 'error'
    example_4.lua:8: in function <example_4.lua:6>
    example_4.lua:9: in function 'await_deeply_nested_error'
    example_4.lua:16: in function 'await_error_example'
stack traceback:
    [C]: in function 'error'
    ./trio.lua:194: in function 'trio.run'
    example_4.lua:21: in main chunk
    [C]: in ?
The traceback above spans code executed from multiple tasks (coroutines), showing the precise path that resulted in the error. Moreover, since the error propagates through every intermediate task between the error and root of the program, pcall() can intervene at any point:
function await_error_example()
    -- locals avoid leaking the pcall() results into the global environment
    local ok, err = pcall(function()
        local nursery <close> = trio.open_nursery()
        nursery.start_soon(await_deeply_nested_error)
        -- ...
    end)
    if not ok then
        print('ignoring an error:', err)
    end
    print('done')
end
$ lua example_5.lua
ignoring an error: example_5.lua:8: oops
done
As far as error handling goes, we've demonstrated that structured concurrency lets concurrent programs match the ease of non-concurrent programs.
Building on the toy structured concurrency module of the first installment, the implementation now propagates errors through the task hierarchy, and appends a "task traceback" to errors raised out of trio.run(). (TODO: such error objects should be a table with __tostring, so that the traceback is generated lazily, only if needed.)
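One possible shape for the lazy error object described in the TODO is sketched below. This is an assumption about how it could be done, not the module's actual code: TaskError and new_task_error() are hypothetical names, and the sketch covers only a single coroutine rather than a full chain of tasks.

```lua
-- Hypothetical error object: holds the failed coroutine and renders
-- the traceback only when it is converted to a string.
local TaskError = {}
TaskError.__index = TaskError
TaskError.__tostring = function(self)
    if not self.rendered then
        self.rendered = debug.traceback(self.co, tostring(self.message))
    end
    return self.rendered
end

local function new_task_error(co, message)
    return setmetatable({co = co, message = message}, TaskError)
end

local co = coroutine.create(function() error('oops') end)
local ok, msg = coroutine.resume(co)
local err = new_task_error(co, msg)

-- no traceback has been formatted at this point
local rendered_before = err.rendered ~= nil

-- tostring() triggers traceback generation, and the result is cached
local text = tostring(err)
print(text)
```

If such an error is caught and ignored by pcall(), the traceback is never built. The trade-off is that the error object keeps the failed coroutine (and its stack) alive for as long as the error itself is referenced.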
The exception propagation works by deferring any error reported by a task's coroutine.resume(), injecting it into the parent task when the parent is next advanced.
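The defer-and-inject idea can be sketched in a few lines. This is a simplified illustration, assuming Lua 5.4, and not the module's real code: new_task(), await(), advance() and the pending_error field are all hypothetical names.

```lua
-- Hypothetical sketch of deferring a child's error and injecting it
-- into the parent the next time the parent is advanced.
local function new_task(fn, parent)
    return {co = coroutine.create(fn), parent = parent}
end

-- Tasks yield through this wrapper so the scheduler can inject an error
local function await()
    local injected = coroutine.yield()
    if injected ~= nil then
        error(injected, 0)  -- re-raise the child's error inside the parent
    end
end

-- Advance a task by one step
local function advance(task)
    local pending = task.pending_error
    task.pending_error = nil
    -- note: a pending error on a task that has not started yet would be
    -- passed as a plain argument and lost; a real scheduler must handle that
    local ok, err = coroutine.resume(task.co, pending)
    if not ok then
        if task.parent then
            task.parent.pending_error = err  -- defer until parent advances
        else
            return false, err  -- root task: nowhere left to propagate
        end
    end
    return true
end

local log = {}
local parent = new_task(function()
    local ok, err = pcall(await)  -- the child's error arrives here
    log[#log + 1] = 'parent caught: ' .. tostring(err)
end)
local child = new_task(function() error('oops', 0) end, parent)

advance(parent)  -- parent runs until it awaits
advance(child)   -- child fails; its error is deferred on the parent
advance(parent)  -- error is injected where the parent was waiting
print(log[1])
```

Because the error is re-raised from within the parent's own coroutine, an ordinary pcall() in the parent can catch it, which is what makes the earlier pcall() example work.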
When assembling task tracebacks, care is taken to omit stack frames belonging to the concurrency implementation. This eliminates distracting clutter, especially when an error is traversing multiple tasks.
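As a rough illustration of the frame filtering, the sketch below drops traceback lines that originate from the concurrency module's source file. The helper strip_internal_frames() is hypothetical, the sample traceback is invented input for demonstration (including the trio.lua line number), and the real implementation may well filter frames differently.

```lua
-- Hypothetical helper: drop traceback lines whose source is the
-- concurrency module itself (assumed here to live in trio.lua).
local function strip_internal_frames(traceback, module_file)
    local kept = {}
    for line in traceback:gmatch('[^\n]+') do
        -- plain find (4th argument true) so '.' in the name is literal
        if not line:find(module_file .. ':', 1, true) then
            kept[#kept + 1] = line
        end
    end
    return table.concat(kept, '\n')
end

-- Invented sample input, for illustration only
local sample = table.concat({
    'stack traceback:',
    "\t[C]: in function 'error'",
    '\texample.lua:9: in function <example.lua:7>',
    "\t./trio.lua:120: in function 'trio.await_sleep'",
    "\texample.lua:10: in function 'await_error_example'",
}, '\n')

local cleaned = strip_internal_frames(sample, 'trio.lua')
print(cleaned)
```

Filtering on the formatted text keeps the sketch short. Walking the stack with debug.getinfo() and checking each frame's source would be a sturdier alternative.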
See code changes relative to the previous article installment.
When a task has an exception (or is otherwise cancelled), the children and siblings of that task should, in turn, be cancelled, and allowed to finalize. Currently, that isn't the case. For example:
trio = require('trio')

function await_cancellation_example()
    -- among two parallel tasks, one fails
    pcall(function()
        local nursery <close> = trio.open_nursery()
        nursery.start_soon(function()
            print('child 1 start')
            trio.await_sleep(1)
            error('oops')
        end)
        nursery.start_soon(function()
            print('child 2 start')
            trio.await_sleep(2)
            print('child 2 end')
        end)
        print('waiting for child tasks')
    end)
    print('done')
end

trio.run(await_cancellation_example)
$ lua example_6.lua
waiting for child tasks
child 2 start
child 1 start
done
child 2 end
One of the child tasks was allowed to run to completion, despite its sibling having raised an error into the nursery (which happens to be subsequently caught). Even worse, the task lifetime has leaked outside the nursery scope, with "child 2 end" appearing after "done".
We'll address these deficiencies and expand on cancellation—another area that benefits significantly from structured concurrency—in the next installment of this series.
continue reading: Structured concurrency and Lua (part 3)
article © 2022 John Belmonte, all rights reserved