k-CAS for sweat-free concurrent programming
Vesa Karvonen
(Press 's' for speaker notes, which are for self-study.)
Note: Welcome! This talk is based on my work polishing the OCaml multicore kcas library. It started when Bartosz asked me to review a couple of his PRs on the library and I got really interested in it, as I have previously implemented prototypes of similar libraries. So, I basically took the liberty to work on kcas. I don't recall having this much fun programming for a long time, so I hope you'll forgive me for pushing some of my other tasks aside for a moment!
- Lock-free programming with CAS and friends
- Sweat-free programming with k-CAS Txs
- The lock-free k-CAS algorithm
- New obstruction-free k-CAS-n-CMP algorithm
- Blocking transactions
Note: Here is a quick overview of what I plan to talk about. First I want to give a super quick introduction to lock-free programming with CAS, or compare-and-set, as a motivation for k-CAS. Then I'll talk about the new transactional API for programming with k-CAS. After that I'll describe the k-CAS algorithm that the library is based on. Next I'll motivate and describe a brand-new algorithm for k-CAS-n-CMP based on the earlier algorithm. For the last part we'll transport to the future: I'll sketch out an idea of how kcas could be extended to support blocking transactions and implement proper STM or Composable Memory Transactions.
Cowboy coder's dream or professional programmer's nightmare?
Note: First I want to talk a little bit about lock-free programming with CAS. Is it the cowboy coder's dream and the professional programmer's nightmare?
An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress.
A panacea?
Note: But first, what does lock-free actually mean? Is it a cure for all diseases? Well, no. Starvation can be a real issue, for example. Also, lock-free algorithms do not necessarily perform better than algorithms using locks. They can cause just as much or even more contention, as they tend to make more, not fewer, accesses to shared memory. I should also mention that so-called wait-free algorithms additionally guarantee freedom from starvation, but tend to come at a heavier price.
type 'a t
val create : unit -> 'a t
val push : 'a t -> 'a -> unit
val try_pop : 'a t -> 'a option
type 'a t = 'a list Atomic.t
let create () = Atomic.make []
Note: Enough with the theory. Let's implement a lock-free stack called the Treiber stack. Here is a completely standard signature for such a stack. You could implement that signature in many ways. We'll implement it using a single atomic location and CAS.
let rec push backoff stk x =
  let xs = Atomic.get stk in
  if not (Atomic.compare_and_set stk xs (x::xs)) then
    push (Backoff.once backoff) stk x
let push stk x = push Backoff.default stk x
let rec try_pop backoff stk =
  match Atomic.get stk with
  | [] -> None
  | x::xs as xxs ->
    if Atomic.compare_and_set stk xxs xs then Some x else
      try_pop (Backoff.once backoff) stk
let try_pop stk = try_pop Backoff.default stk
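A quick usage sketch (hypothetical values) of the stack above:

let s = create ()
let () = push s 1; push s 2
let () = assert (try_pop s = Some 2)
let () = assert (try_pop s = Some 1)
let () = assert (try_pop s = None)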
Why is this lock-free? Is this faster than using a lock?
Note: The push and try_pop demonstrate typical programming patterns using CAS. * First of all, an Atomic.get is performed to read the current value of the atomic location. * Then an Atomic.compare_and_set is performed to update the location. * If that fails, a backoff is applied to avoid contention and the operation is retried. * Can you tell what makes this lock-free? Well, assuming that the CPU guarantees that at least one correct CAS succeeds when multiple such operations are attempted simultaneously, at least one domain will make progress. As simple as the Treiber stack is, every retry causes costly shared memory traffic. A scalable spin-lock, such as the CLH spin-lock (Craig, Landin, and Hagersten), may actually generate less shared memory traffic and perform better.
- Lack of composability
  - Challenge: Move an element between stacks atomically
- Makes easy things possible and hard things research problems
Note: The Treiber stack is perhaps not the best example to demonstrate how subtle it is to program using CAS. As a challenge, you could try to implement an operation to move an element from one stack to another atomically. Good luck! The truth is that lock-free algorithms using CAS are extremely hard to reason about. That is part of why we also have tools to help with that. I should also mention that there are some known gotchas with CAS, such as the so-called ABA problem, which can bite you e.g. when reusing objects. For lack of time I'm going to skip talking about that.
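As an aside, here is a minimal sequential sketch (hypothetical code, not from the talk) of the ABA pattern: a CAS succeeds even though the location was changed and changed back in between, which becomes dangerous when the equal-looking values are reused objects whose contents differ.

let loc = Atomic.make `A

(* Domain 1 reads the current value... *)
let seen = Atomic.get loc

(* ...meanwhile domain 2 changes the location from `A to `B and back. *)
let () =
  assert (Atomic.compare_and_set loc `A `B);
  assert (Atomic.compare_and_set loc `B `A)

(* Domain 1 resumes: its CAS succeeds despite the A -> B -> A churn. *)
let () = assert (Atomic.compare_and_set loc seen `A)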
- DCAS, RDCSS, DCSS, 1-CAS-n-CMP, ...
  - Special cases, e.g. linked list operations
- k-CAS: [CAS(x,0,1); CAS(y,1,0)]
- k-CAS-n-CMP
  - Can be simulated with k-CAS, as CAS(x,a,a) is equivalent to CMP(x,a) aside from memory traffic.

All of these can be implemented using just plain CAS!
Note: There have been various proposals for extended primitives. For example, 1-CAS-n-CMP can be used to implement operations on linked data structures such as lists and trees. Perhaps the ultimate primitive is k-CAS-n-CMP, which allows any finite number of locations to be updated atomically, while also requiring any finite number of other locations to remain unchanged during the operation.
let rec try_move backoff source target =
  match Loc.get source with
  | [] -> false
  | (elem::rest) as old_source ->
    let old_target = Loc.get target in
    let ops = [
      Op.make_cas source old_source rest;
      Op.make_cas target old_target (elem::old_target)
    ] in
    Op.atomically ops
    || try_move (Backoff.once backoff) source target
let try_move source target =
  try_move Backoff.default source target
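A usage sketch (hypothetical values): atomically move the top element of one stack location onto another.

let s1 = Loc.make [1; 2] and s2 = Loc.make ([] : int list)
let () = assert (try_move s1 s2)
(* Now s1 holds [2] and s2 holds [1]. *)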
Note: Recall the challenge about moving an element from one stack to another atomically. Well, with 2-CAS, or double-compare-and-swap, that problem becomes easier. * First we read the values of the two stack locations. * Then we construct a list of the CAS operations. * And attempt to perform them atomically. * Finally we retry with backoff as usual.
But seriously speaking...
The standard Computer Science solution for this kind of problem is to create an abstraction mechanism.
Note: The quote is from the paper Parallel Concurrent ML by Reppy, Russo, and Xiao. Of course, you still need the low-level ability to update locations atomically, but, indeed, is CAS the best programming interface for concurrent programming and, if not, what would be better?
Note: I will next introduce one potential abstraction on top of k-CAS. The idea being that instead of writing loops with backoffs creating and attempting lists of CAS operations, we instead use a transactional API, that allows us to read and write shared memory locations and that constructs the lists of operations and retries with backoffs for us automatically.
(** Shared memory locations. *)
module Loc : sig
  type 'a t
  val make : 'a -> 'a t
  val get : 'a t -> 'a
  (* ... *)
end

(** Transactions on shared memory locations. *)
module Tx : sig
  type 'a t
  val get : 'a Loc.t -> 'a t
  val set : 'a Loc.t -> 'a -> unit t
  val return : 'a -> 'a t
  val ( let* ) : 'a t -> ('a -> 'b t) -> 'b t
  val attempt : 'a t -> 'a
  val commit : 'a t -> 'a
  (* ... *)
end
Note: But what is a Tx? Let's first take a peek at the signatures of the core operations on transactions. The snippets on this slide show only the key introduction, elimination, and composition forms for shared memory locations and transactions. **
let a_loc = Loc.make 10
and b_loc = Loc.make 52
and x_loc = Loc.make 0

Tx.(
  commit (
    let* a_val = get a_loc in
    let* b_val = get b_loc in
    set x_loc (b_val - a_val)
  )
)

assert ( Loc.get x_loc == 42 )
Note: Logically speaking, a transaction 'a Tx.t is a composable operation that may read and write shared memory locations and that can be performed atomically by calling commit. Another way to think about transactions is that they are specifications for generating lists of CAS operations against shared memory locations. **
val get : 'a Loc.t -> 'a t
val get_as : ('a -> 'b) -> 'a Loc.t -> 'b t
val set : 'a Loc.t -> 'a -> unit t
val update : 'a Loc.t -> ('a -> 'a) -> 'a t
val update_as : ('a -> 'b) -> 'a Loc.t -> ('a -> 'a) -> 'b t
val modify : 'a Loc.t -> ('a -> 'a) -> unit t
val exchange : 'a Loc.t -> 'a -> 'a t
val exchange_as : ('a -> 'b) -> 'a Loc.t -> 'a -> 'b t
val return : 'a -> 'a t
val ( let* ) : 'a t -> ('a -> 'b t) -> 'b t
val ( and* ) : 'a t -> 'b t -> ('a * 'b) t
val ( let+ ) : 'a t -> ('a -> 'b) -> 'b t
val ( and+ ) : 'a t -> 'b t -> ('a * 'b) t
val ( >> ) : 'ignore t -> 'a t -> 'a t
val ( >>. ) : 'ignore t -> 'a -> 'a t
val ( >>= ) : 'a t -> ('a -> 'b t) -> 'b t
val map : ('a -> 'b) -> 'a t -> 'b t
val delay : (unit -> 'a t) -> 'a t
Note: To make programming more convenient and to also allow some optimizations, the library actually provides more than just the minimum number of primitives and combinators. Most of these should be easy to understand in terms of the types and functional programming tradition. * In update the result value is the previous value from the location. * The _as variants take an additional function that is given the previous value from the location to compute the result. * All of the sequencing combinators can be expressed in terms of return and let*. * The delay operation is important when side-effects, such as allocation of fresh objects, are needed at the start of a transaction. I wish OCaml had special syntax for that.
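As an illustration of the point about derived combinators, here is a minimal sketch (not the library's implementation) of how modify and exchange could be expressed in terms of get, set, and the sequencing combinators:

let modify loc fn = Tx.(
  let* x = get loc in
  set loc (fn x) )

let exchange loc y = Tx.(
  let* x = get loc in
  set loc y >>. x )

The real library can do better, e.g. by touching the internal transaction log only once per location.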
module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit
  val try_pop : 'a t -> 'a option
end

module Stack : Stack = struct
  type 'a t = 'a list ref
  let make () = ref []
  let push stack elem = stack := elem :: !stack
  let try_pop stack =
    let elems = !stack in
    match elems with
    | [] -> None
    | elem :: rest -> stack := rest; Some elem
end
Note: Let's build something real. Here is an entirely traditional, sequential stack data structure. * It uses a ref to hold a list. * push conses an element to the list. * And try_pop removes an element, taking care to handle the case of an empty stack.
module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit Tx.t
  val try_pop : 'a t -> 'a option Tx.t
end

module Stack : Stack = struct
  type 'a t = 'a list Loc.t
  let make () = Loc.make []
  let push stack elem = Tx.modify stack (List.cons elem)
  let try_pop stack = Tx.(
    let* elems = get stack in
    match elems with
    | [] -> return None
    | elem :: rest -> set stack rest >>. Some elem )
end
Note: And here is a transactional stack. First take note of the signature. The * push and * try_pop operations return transactions. The implementation uses * shared memory locations. * push is pretty trivial using modify. * try_pop is more complex, as it needs to handle the case of an empty stack. Note that there are no retry loops or backoffs. The push and try_pop transactions are not significantly more complex than similar code using refs.
module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit Tx.t
  val try_pop : 'a t -> 'a option Tx.t
end

let hd_opt = function [] -> None | elem :: _ -> Some elem
let tl_safe = function [] -> [] | _ :: rest -> rest

module Stack : Stack = struct
  type 'a t = 'a list Loc.t
  let make () = Loc.make []
  let push stack elem = Tx.modify stack (List.cons elem)
  let try_pop stack = Tx.update_as hd_opt stack tl_safe
end
Note: Here is a more concise implementation of try_pop using * update_as. It makes use of two useful list manipulation functions, * hd_opt and * tl_safe. Aside from being more concise, this is also likely a bit faster, as it only makes a single access to the internal transaction log.
let a_stack = Stack.make ()
Tx.commit ( Stack.push a_stack 101 )
assert ( Tx.commit ( Stack.try_pop a_stack ) = Some 101 )
assert ( Tx.commit ( Stack.try_pop a_stack ) = None )
Why not bake in commit to push and try_pop?
Note: To use the transactional stack, one just needs to call the stack to prepare transactions and then commit them. ** The commit function takes care of retrying transactions with randomized exponential backoff to avoid contention. You might ask why we didn't just call commit inside push and try_pop. We'll come back to that soon.
module type Queue = sig
  type 'a t
  val make : unit -> 'a t
  val enqueue : 'a t -> 'a -> unit Tx.t
  val try_dequeue : 'a t -> 'a option Tx.t
end

module Queue : Queue = struct
  type 'a t = { front: 'a list Loc.t; back: 'a list Loc.t }
  let make () = { front = Loc.make []; back = Loc.make [] }
  let enqueue q x = Tx.modify q.back (List.cons x)
  let try_dequeue q = Tx.(
    update q.front tl_safe >>= function
    | x :: _ -> return (Some x)
    | [] ->
      exchange_as List.rev q.back [] >>= function
      | [] -> return None
      | x :: xs -> set q.front xs >>. Some x
  )
end
Note: So, here we have a lock-free queue that fits on a single slide and is pretty much guaranteed to be free of concurrency issues. * It uses two shared memory locations for the front and back of the queue. * The enqueue operation is just a push to the back. * The try_dequeue operation is more complicated, as it needs to handle several cases. ** But is this queue any good? It can actually perform quite well, as we'll see shortly, but it does have one major weakness. Can you see it? * The problem is that reversal of the back might take a long time, and that could cause consumers to starve. As an exercise you could implement a transactional queue where both operations are O(1) and not just in the amortized sense.
let a_queue = Queue.make ()
Tx.commit ( Queue.enqueue a_queue 101 )
assert (Tx.commit ( Queue.try_dequeue a_queue ) = Some 101)
assert (Tx.commit ( Queue.try_dequeue a_queue ) = None)
Again, why not bake in commit?
Note: There is nothing new to learn on this slide. It just shows again how to use transactions.
Tx.(
  commit (
    Stack.push a_stack 3 >>
    Stack.push a_stack 1 >>
    Stack.push a_stack 4
  )
)
Note: The reason we didn't bake in commit is that composition of transactions is not limited to primitive transactions. Here is an example of pushing three elements ** to a transactional stack atomically.
Tx.(
  commit (
    Stack.try_pop a_stack >>= function
    | Some element ->
      Queue.enqueue a_queue element
    | None ->
      return ()
  )
)
Note: Composition also isn't limited to module abstraction boundaries. Here is an example of moving an element * from a transactional stack to a * transactional queue * atomically. The transaction mechanism makes it trivial to compose independently developed transactions and allows transactional data structures and algorithms to be used for a wider variety of purposes.
The quick brown fox jumps over the lazy dog
Note: The Tx abstraction that kcas offers seems really nice, but surely it is going to perform poorly. Indeed! What about performance? Is the transaction mechanism a lazy dog and CAS the quick brown fox? Well, the mechanism definitely adds overhead, but perhaps not as much as one might expect.
algorithm | mean | sd |
---|---|---|
Reagent MS queue | 0.045950 | 0.000023 |
1-lock queue | 0.027613 | 0.000086 |
Tx node queue | 0.022827 | 0.000014 |
Lockfree MS queue | 0.011395 | 0.000023 |
2-lock queue | 0.010299 | 0.000005 |
Tx 2-stack queue | 0.009895 | 0.000000 |
Always take benchmarks with plenty of salt!
Note: This table is based on running a queue benchmark that is part of the Reagents library. The timings vary between runs, but what is interesting is that the k-CAS Tx based queues perform quite well. It is actually pretty interesting that the simple 2-stack Tx queue is competitive with the hand-written lock-free Michael-Scott queue. One factor in this is likely that the 2-stack queue probably needs fewer GC write-barriers. Also, a lot of time has been spent tweaking the k-CAS implementation to minimize overheads. Finally, I should note that the transactional node based queue is just a very straightforward implementation. I would not be surprised if someone came up with a faster Tx queue with constant time operations.
Note: Next I'll describe the algorithms used by the kcas library internally. While I do not expect you to understand every detail of the algorithms after this talk, I hope to convey a few key ideas and details.
Efficient Multi-word Compare and Swap
Guerraoui, Kogan, Marathe, Zablotchi

k+1 CASes per k-CAS!
Note: kcas is based on the GKMZ algorithm published in 2020, which uses only k+1 single-word CASes per k-CAS, which is nearly optimal. How is that possible? It is a cool trick that is worth understanding. Let me explain.
type 'a loc = 'a state Atomic.t
and 'a state = { before : 'a; after : 'a; casn : casn }
and casn = status Atomic.t
and status = Undetermined of cass list | After | Before
and cass = CASS : 'a loc * 'a state -> cass

let make after = Atomic.make {
  before = after;
  after;
  casn = Atomic.make After
}
internal CASS-descriptor vs logical CAS-operation

Why does this only need k+1 CASes?
Note: Fred Brooks is said to have remarked: "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious." So, if you take a moment to think about the data structures, or tables, used in GKMZ, it should be obvious why the algorithm achieves k+1 CASes. ** I actually stopped reading the GKMZ paper as soon as I got the trick and just implemented it following my intuition.
Consider the traditional approach:
type 'a loc = [ `Value of 'a | `Descriptor of _ ] Atomic.t
How many CASes are needed at minimum?
(* 2nd hint: *) `Value x => `Descriptor _ => `Value y
Take home: Can you use the GKMZ trick elsewhere?
Note: The traditional approach does more work. Can you tell how many CASes are needed at minimum? The answer is 2*k + 1. You will need at least one CAS to switch a location to hold a Descriptor and another CAS to switch it back to holding a Value. And then one more CAS is needed to complete the k-CAS operation. GKMZ avoids the second CAS simply by not requiring locations to hold just plain values.
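To make the count concrete (my arithmetic, for k = 2): the traditional approach needs at least 2*2 + 1 = 5 CASes, two per location plus the final CAS that decides the operation, while GKMZ needs only 2 + 1 = 3.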
Given
let a = make 10 and b = make 52 and x = make 0
after
atomically [CAS(a, 10, 10); CAS(b, 52, 52); CAS(x, 0, 42)]
the objects could look like this:
let casn = Atomic.make After
let x = Atomic.make {before = 0; after = 42; casn}
let a = Atomic.make {before = 10; after = 10; casn}
let b = Atomic.make {before = 52; after = 52; casn}
Note: With GKMZ the locations (may) always hold state descriptors. When reading a location one then needs to check which of the values to read. Note that one of the before or after values is essentially garbage. A real implementation will also need to perform a release operation to keep the state descriptors from pointing to inaccessible objects. It is a simple detail to take care of after the operation has completed, but we'll skip that for lack of time.
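For illustration, a minimal sketch of such a release step, assuming the before and after fields were made mutable (the state type on these slides is immutable): once the outcome is known, the dead value is overwritten so it can be collected.

let release casn cass =
  let success = Atomic.get casn == After in
  cass
  |> List.iter @@ fun (CASS (_loc, state)) ->
     (* Overwrite the losing value with the surviving one. *)
     if success then state.before <- state.after
     else state.after <- state.before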
let get loc =
  let state = Atomic.get loc in
  if is_after state.casn then
    state.after
  else
    state.before

let atomically logical_cas_list =
  let casn = Atomic.make After in
  let rec run cass = function
    | [] ->
      Atomic.set casn (Undetermined cass);
      gkmz casn cass
    | CAS (loc, before, after) :: rest ->
      run (CASS (loc, {before; after; casn}) :: cass) rest
  in
  run [] logical_cas_list
Note: As mentioned, to read a location, as in get, the operation needs to determine * which one of the values, before or after, to use. * The is_after function, which we'll soon see, does that. The atomically function * converts logical CAS operations into internal CASS descriptors. An important detail is that it * creates a cyclic data structure. The cycle is what makes it possible for other domains to get hold of the entire data structure and help to complete an operation initiated by some domain. This is a common pattern in non-blocking algorithms; it might make sense to support let rec with atomics.
let rec gkmz casn = function
  | [] -> finish casn After
  | CASS (loc, now) :: next as this ->
    let old = Atomic.get loc in
    if now == old then gkmz casn next else
    let old_value =
      if is_after old.casn then old.after else old.before in
    if old_value != now.before then finish casn Before else
    match Atomic.get casn with
    | Undetermined _ ->
      gkmz casn (if Atomic.compare_and_set loc old now
                 then next else this)
    | determined -> determined == After

and is_after casn =
  match Atomic.get casn with
  | Undetermined cass -> gkmz casn cass
  | determined -> determined == After
Note: This is the core of the GKMZ algorithm. It basically just goes through the list of internal descriptors, * compares the current value in each location, * checks that the operation is still undetermined, * and then CASes the new state to the location. After all the locations have been handled or an inconsistency is detected, * it calls finish to attempt to complete the operation. We'll look at finish shortly. Note that is_after and gkmz are mutually recursive. When is_after is called, it may encounter an undetermined operation. In such a case it calls gkmz in order to complete the operation. This is basically what allows the algorithm to be lock-free. No domain is forced to indefinitely wait for another. BTW, in the beginning I mentioned that writing lock-free algorithms is subtle business. * The three Atomic operations in the core algorithm are done in a specific order. What could happen if you changed that order? I'll leave that as an exercise.
let finish casn desired =
  match Atomic.get casn with
  | Undetermined _ as current ->
    Atomic.compare_and_set casn current desired |> ignore;
    Atomic.get casn == After
  | determined -> determined == After
Note: The last part of the algorithm is finish. * It attempts the final CAS to complete an operation * and then it does one more check to see whether the operation completed successfully or failed. This is also what makes the algorithm atomic. Before the final CAS is performed, no domain may actually read any of the CASed locations.
- Uses a splay tree to order CASS descriptors
  - GKMZ consumes internal descriptors anyway
  - O(k) construction when CASes are ordered
  - Also used as the transaction log
  - Optimizes for recently accessed
- Performs a release step after the final CAS
  - Removes the reference to the unused before or after
  - Also frees stable casn descriptors
- Has many small tweaks to improve performance
Note: The implementation in these slides is, of course, a simplified version. The real implementation...
Could we do even better?
Note: The GKMZ algorithm is nearly optimal for k-CAS. What more could one want? Well, one thing that k-CAS lacks is efficient read-only CMP operations. Next I'll describe a brand-new obstruction-free k-CAS-n-CMP algorithm that extends GKMZ.
let a = Loc.make 10 and b = Loc.make 52
and x = Loc.make 0 and y = Loc.make 0

let ping () =
  commit (let* a' = get a
          and* b' = get b in
          set x (b' - a'))
let pong () =
  commit (let* a' = get a
          and* b' = get b in
          set y (a' + b'))

ping => [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (x, 0, 42) ]
pong => [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (y, 0, 62) ]

ping ) a b ( pong
Note: But before we go into the details, here is a motivating example using the Tx API. We have four disjoint locations a, b, x, and y. We have two transactions that both read the two locations a and b. One of the transactions then writes to location x and the other writes to location y. What happens when the two transactions are attempted in parallel? Well, they aren't run in parallel: GKMZ will serialize the operations. To add insult to injury, a CAS also always writes to memory. So, even though the a and b locations are not logically modified, they are written to, which causes contention. The way shared memory coherence protocols work is that after a write to memory only the writer has a copy of that memory location. Assume the above two transactions were done sequentially. The first transaction writes to a and b, which means that a and b are evicted from all the caches except the writer's cache. The second transaction then has to fetch a and b more expensively, writes to them, and again evicts them from all the other caches. This is called cache-line ping-pong.
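With read-only compares, the same two transactions could instead translate to operation lists like the following (a sketch in the earlier notation), so that neither transaction writes to a or b:

ping => [ CMP (a, 10); CMP (b, 52); CAS (x, 0, 42) ]
pong => [ CMP (a, 10); CMP (b, 52); CAS (y, 0, 62) ]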
Verify CMPs after CASes and before final CAS.
But that is only obstruction-free...
Note: How could we implement read-only compares? A simple idea would be to just verify that the compares still hold just before performing the final CAS. This sort of post hoc validation is actually a pretty common pattern in non-blocking algorithms. But it only gives us obstruction-freedom, not lock-freedom.
An algorithm is obstruction-free if at any point, a single thread executed in isolation for a bounded number of steps will complete its operation.
Why is a verification step not lock-free?
[CMP (x, 0); CAS (y, 0, 1)] and [CAS (x, 0, 1); CMP (y, 0)]
Note: Recall that lock-freedom requires that at least one domain makes progress. Consider a domain performing a CMP on x and a CAS on y and another domain doing a CAS on x and a CMP on y. Both may be able to enter the verification step (can you see how?) and both would fail during the verification step. This could repeat indefinitely and neither domain could make progress.
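Here is one possible interleaving (a sketch):
- Domain 1 CASes its descriptor into y and domain 2 CASes its descriptor into x; all CAS operations of both have now succeeded.
- Both domains enter verification: domain 1 finds that x no longer holds the state it recorded, and domain 2 finds the same for y.
- Both fail, back off, and retry, and nothing prevents the same interleaving from repeating indefinitely.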
let atomically logical_cas_list =
  let casn = Atomic.make After in
  let rec run cass = function
    | [] ->
      Atomic.set casn (Undetermined cass);
      gkmz casn cass
    | CAS (loc, before, after) :: rest ->
      run (CASS (loc, {before; after; casn}) :: cass) rest
+   | CMP (loc, exp) :: rest ->
+     let old = Atomic.get loc in
+     get loc == exp && Atomic.get loc == old &&
+     run (CASS (loc, old) :: cass) rest
  in
  run [] logical_cas_list
Note: Let's just see how to change GKMZ to give k-CAS-n-CMP. The changes are shown as diffs against the reference GKMZ implementation. We modify atomically such that it also accepts CMP operations. It then translates those to CASS descriptors without creating a new state.
+let is_cmp casn state = state.casn != casn

let rec gkmz casn = function
  | [] -> finish casn After
  | CASS (loc, now) :: next as this ->
    let old = Atomic.get loc in
    if now == old then gkmz casn next else
+   if is_cmp casn now then finish casn Before else
    let old_value =
      if is_after old.casn then old.after else old.before in
    if old_value != now.before then finish casn Before else
    match Atomic.get casn with
    | Undetermined _ ->
      gkmz casn (if Atomic.compare_and_set loc old now
                 then next else this)
    | determined -> determined == After
Note: So, how can one tell whether a CASS is a comparison? We just check whether it refers to the casn of our operation or not. The casn is always freshly allocated by atomically. If the new and old state differ, we check whether we were attempting a read-only compare. If so, then we conclude that the operation failed.
let finish casn desired =
  match Atomic.get casn with
- | Undetermined _ as current ->
+ | Undetermined cass as current ->
+   let desired =
+     if desired == After
+        && cass
+           |> List.exists @@ fun (CASS (loc, state)) ->
+              is_cmp casn state && Atomic.get loc != state then
+       Before
+     else
+       desired in
    Atomic.compare_and_set casn current desired |> ignore;
    Atomic.get casn == After
  | determined -> determined == After
Note: The biggest change to the GKMZ algorithm is the addition of the post hoc verification step. Just before attempting to complete the transaction, we go through all the CASS descriptors and check once more that the read-only comparisons are still valid.
let rec commit backoff mode xt =
  match attempt mode xt with
  | result -> result
  | exception Interference ->
    let backoff = Backoff.once backoff in
    commit backoff Mode.lock_free xt
  | exception Exit ->
    let backoff = Backoff.once backoff in
    commit backoff mode xt
Note: The actual implementation distinguishes between failure before and during verification. If a transaction attempt fails during verification, * then the Interference exception is raised. This allows the commit loop to * switch to lock-free mode, which only uses CAS operations and requires no verification step. This guarantees that with commit transactions are lock-free. So, we don't actually lose anything!
module Queue : Queue = struct
  type 'a t = { front: 'a list Loc.t; back: 'a list Loc.t }
  let make () = { front = Loc.make []; back = Loc.make [] }
  let enqueue q x = Tx.modify q.back (List.cons x)
  let try_dequeue q = Tx.(
    update q.front tl_safe >>= function
    | x :: _ -> return (Some x)
    | [] ->
      exchange_as List.rev q.back [] >>= function
      | [] -> return None
      | x :: xs -> set q.front xs >>. Some x
  )
end
Does try_dequeue perform read-only CMPs?
Note: Recall the queue implementation. If the queue is empty, the try_dequeue transaction may only perform read-only comparisons. Also, if the front is empty and the back contains only a single element, the front is not necessarily written. That is because * in those cases the try_dequeue transaction never writes a new value to those locations. The transaction mechanism is smart enough to optimize that to a read-only compare. This means that multiple domains can try to dequeue an element from an empty queue in parallel without interference. The qualifiers "may" and "not necessarily" are there to account for the possibility of interference from writers.
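To illustrate in the earlier operation-list notation (a sketch): an attempt to dequeue from an empty queue could translate to read-only compares only, e.g.

try_dequeue (empty q) => [ CMP (q.front, []); CMP (q.back, []) ]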
Note: What we have so far is probably about as far as kcas can be stretched without adding significant dependencies or overheads, but perhaps we could have a back door for an additional feature: blocking transactions. What is that?
stm @@ let* x = Queue.dequeue qa in
       Queue.enqueue qb x

What if qa is empty?
Note: Consider a transaction that dequeues a value x from qa and enqueues it to qb. Seems straightforward. But what if qa is empty? Should we just keep on retrying in the hope that some other domain pushes an element into qa? That sort of busy-wait is usually a poor idea.
let dequeue q =
  Queue.try_dequeue q >>= function
  | None ->
    retry ()
  | Some x ->
    return x
Note: How might one even implement a dequeue operation that always returns a (non-optional) value? The idea here is that * retry suspends the transaction, waits until some location accessed during the transaction is mutated, and then restarts the transaction. This kind of retry is a key idea put forth in the Composable Memory Transactions paper.
module Loc : sig
  type ('tag, 'a) tagged
  val make_tagged : 'tag -> 'a -> ('tag, 'a) tagged
  val get_tag : ('tag, 'a) tagged -> 'tag
  type 'a t = (unit, 'a) tagged
  (* ... *)
end
To associate waiters with a location.
Note: To support blocking, one essentially needs a way to signal waiters. After mutating some locations the mutator signals waiters. For a scalable mechanism that signal needs to be selective and only wake up those waiters that are interested in the mutated locations. To associate waiters with locations in a truly low-overhead fashion, one possibility would be to allow locations to be "tagged". In a blocking transaction mechanism that 'tag could be a bag of the waiters of changes to the location.
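For example (a hypothetical instantiation, not part of the proposed kcas API), a blocking transaction mechanism could tag each location with a bag of waiters, here crudely represented as a list held in another location:

(* A waiter just knows how to wake up its suspended domain. *)
type waiter = { wake : unit -> unit }

type 'a blocking_loc = (waiter list Loc.t, 'a) Loc.tagged

let make_blocking (x : 'a) : 'a blocking_loc =
  Loc.make_tagged (Loc.make []) x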
module Tx : sig
  (* ... *)
  module Log : sig
    type t
    type 'r reduce = { one : 't 'a. ('t, 'a) Loc.tagged -> 'r;
                       zero : 'r; plus : 'r -> 'r -> 'r }
    val reduce : 'r reduce -> t -> 'r
  end
  exception Retry of unit t
  val reset_and_retry : (Log.t -> unit t) -> 'a t
  val with_written : 'a t -> (Log.t -> unit t) -> 'a t
end
Note: To make it easy for a blocking transaction mechanism to figure out which locations need to be waited for, limited access to the transaction log could be granted. * reset_and_retry returns a transaction that resets the current transaction such that it only reads from the accessed locations. The given function is then called with * the internal transaction log to * construct a transaction that is then composed after the current transaction. The composed transaction is then * raised as a Retry exception. * with_written returns a transaction that executes as the given transaction and then calls the given function with a view of the internal transaction log restricted to the written locations. The returned transaction is then composed after the whole transaction.
type 'a t = 'a Tx.t

let retry () = Tx.reset_and_retry @@ fun log ->
  (* add waiters to all (tagged) locations in log *)

let rec stm waiters tx =
  match Tx.attempt @@ Tx.with_written tx @@ fun log ->
          (* move waiters from (tagged) locations to waiters list *)
  with
  | result -> (* signal all waiters and then *) result
  | exception Exit -> stm all_waiters tx
  | exception Tx.Retry add_waiters_tx -> (
    match Tx.attempt add_waiters_tx with
    | () -> (* do blocking wait and retry *) stm waiters tx
    | exception Exit ->
      (* locations were mutated before waiters were added *)
      stm waiters tx)

let stm tx = stm (Loc.make []) tx
Note: Here is a quick sketch, with some blanks left as an exercise, of how one could implement STM with blocking transactions with the help of tagged locations, reset_and_retry, and with_written. The key point here is the type of blocking STM transactions on the first line: the blocking STM would not need to add a layer on top of the lower level k-CAS transactions. This should basically mean that blocking transactions could be implemented with roughly half the performance of plain non-blocking transactions, as every access potentially also requires touching the waiters list. I'm curious to find out how well this would perform against existing STM implementations!
- Lock-free programming with CASes is hard
- Programming with Tx is easy by comparison
- Thanks to
  - careful low-level optimizations (splay tree),
  - the new k-CAS-n-CMP algorithm, and
  - interference detection
  Tx can give ok performance and lock-freedom
- Tx is sweat-free concurrency!
Note: Thank you for listening! I'm out of slides.