k-CAS for sweat-free concurrent programming
Vesa Karvonen
(Press 's' for speaker notes, which are for self-study.)
Note: Welcome! This talk is based on my work polishing the OCaml multicore kcas library. It started when Bartosz asked me to review a couple of his PRs on the library and I got really interested in it, as I have previously implemented prototypes of similar libraries. So, I basically took the liberty to work on kcas. I don't recall having this much fun programming for a long time, so I hope you'll forgive me for pushing some of my other tasks aside for a moment!
- Lock-free programming with CAS and friends
- Sweat-free programming with k-CAS Txs
- The lock-free k-CAS algorithm
- New obstruction-free k-CAS-n-CMP algorithm
- Blocking transactions
Note: Here is a quick overview of what I plan to talk about. First I want to give a super quick introduction to lock-free programming with CAS, or compare-and-set, as a motivation for k-CAS. Then I'll talk about the new transactional API for programming with k-CAS. After that I'll describe the k-CAS algorithm that the library is based on. Next I'll motivate and describe a brand-new algorithm for k-CAS-n-CMP based on the earlier algorithm. For the last part we'll transport to the future: I'll sketch out an idea of how kcas could be extended to support blocking transactions and implement proper STM or Composable Memory Transactions.
Cowboy coder's dream or professional programmer's nightmare?
Note: First I want to talk a little bit about lock-free programming with CAS. Is it the cowboy coder's dream and the professional programmer's nightmare?
An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress.
A panacea?
Note: But first, what does lock-free actually mean? Is it a cure for all diseases? Well, no. Starvation can be a real issue, for example. Also, lock-free algorithms do not necessarily perform better than algorithms using locks. They can cause just as much or even more contention, as they tend to make more, not fewer, accesses to shared memory. I should also mention that so-called wait-free algorithms additionally guarantee freedom from starvation, but tend to come at a heavier price.
type 'a t
val create : unit -> 'a t
val push : 'a t -> 'a -> unit
val try_pop : 'a t -> 'a option
type 'a t = 'a list Atomic.t
let create () = Atomic.make []
Note: Enough with the theory. Let's implement a lock-free stack called the Treiber stack. Here is a completely standard signature for such a stack. You could implement that signature in many ways. We'll implement it using a single atomic location and CAS.
let rec push backoff stk x =
  let xs = Atomic.get stk in
  if not (Atomic.compare_and_set stk xs (x::xs)) then
    push (Backoff.once backoff) stk x
let push stk x = push Backoff.default stk x
let rec try_pop backoff stk =
  match Atomic.get stk with
  | [] -> None
  | x::xs as xxs ->
    if Atomic.compare_and_set stk xxs xs then Some x else
      try_pop (Backoff.once backoff) stk
let try_pop stk = try_pop Backoff.default stk
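A quick usage sketch (hypothetical values) of the stack above:

let s = create ()
let () = push s 1; push s 2
let () = assert (try_pop s = Some 2)
let () = assert (try_pop s = Some 1)
let () = assert (try_pop s = None)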
Why is this lock-free? Is this faster than using a lock?
Note: The push and try_pop demonstrate typical programming patterns using CAS. * First of all, an Atomic.get is performed to read the current value of the atomic location. * Then an Atomic.compare_and_set is performed to update the location. * If that fails, a backoff is applied to avoid contention and the operation is retried. * Can you tell what makes this lock-free? Well, assuming that the CPU guarantees that at least one correct CAS succeeds when multiple such operations are attempted simultaneously, at least one domain will make progress. As simple as the Treiber stack is, every retry causes costly shared memory traffic. A scalable spin-lock, such as the CLH spin-lock (Craig, Landin, and Hagersten), may actually generate less shared memory traffic and perform better.
- Lack of composability
  - Challenge: Move an element between stacks atomically
- Makes easy things possible and hard things research problems
Note: The Treiber stack is perhaps not the best example to demonstrate how subtle it is to program using CAS. As a challenge, you could try to implement an operation to move an element from one stack to another atomically. Good luck! The truth is that lock-free algorithms using CAS are extremely hard to reason about. That is part of why we also have tools to help with that. I should also mention that there are some known gotchas with CAS, such as the so-called ABA problem, which can bite you e.g. when reusing objects. For lack of time I'm going to skip talking about that.
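As an aside, here is a minimal sequential sketch (hypothetical code, not from the talk) of the ABA pattern: a CAS succeeds even though the location was changed and changed back in between, which becomes dangerous when the equal-looking values are reused objects whose contents differ.

let loc = Atomic.make `A

(* Domain 1 reads the current value... *)
let seen = Atomic.get loc

(* ...meanwhile domain 2 changes the location from `A to `B and back. *)
let () =
  assert (Atomic.compare_and_set loc `A `B);
  assert (Atomic.compare_and_set loc `B `A)

(* Domain 1 resumes: its CAS succeeds despite the A -> B -> A churn. *)
let () = assert (Atomic.compare_and_set loc seen `A)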
- DCAS, RDCSS, DCSS, 1-CAS-n-CMP, ...
  - Special cases, e.g. linked list operations
- k-CAS: [CAS(x,0,1); CAS(y,1,0)]
- k-CAS-n-CMP
  - Can be simulated with k-CAS, as CAS(x,a,a) is equivalent to CMP(x,a) aside from memory traffic.

All of these can be implemented using just plain CAS!
Note: There have been various proposals for extended primitives. For example, 1-CAS-n-CMP can be used to implement operations on linked data structures such as lists and trees. Perhaps the ultimate primitive is k-CAS-n-CMP, which allows any finite number of locations to be updated atomically, while also requiring any finite number of other locations to remain unchanged during the operation.
let rec try_move backoff source target =
  match Loc.get source with
  | [] -> false
  | (elem::rest) as old_source ->
    let old_target = Loc.get target in
    let ops = [
      Op.make_cas source old_source rest;
      Op.make_cas target old_target (elem::old_target)
    ] in
    Op.atomically ops
    || try_move (Backoff.once backoff) source target
let try_move source target =
  try_move Backoff.default source target
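A usage sketch (hypothetical values): atomically move the top element of one stack location onto another.

let s1 = Loc.make [1; 2] and s2 = Loc.make ([] : int list)
let () = assert (try_move s1 s2)
(* Now s1 holds [2] and s2 holds [1]. *)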
Note: Recall the challenge about moving an element from one stack to another atomically. Well, with 2-CAS, or double-compare-and-swap, that problem becomes easier. * First we read the values of the two stack locations. * Then we construct a list of the CAS operations. * And attempt to perform them atomically. * Finally we retry with backoff as usual.
But seriously speaking...
The standard Computer Science solution for this kind of problem is to create an abstraction mechanism.
Note: The quote is from the paper Parallel Concurrent ML by Reppy, Russo, and Xiao. Of course, you still need the low-level ability to update locations atomically, but, indeed, is CAS the best programming interface for concurrent programming and, if not, what would be better?
Note: I will next introduce one potential abstraction on top of k-CAS. The idea being that instead of writing loops with backoffs creating and attempting lists of CAS operations, we instead use a transactional API, that allows us to read and write shared memory locations and that constructs the lists of operations and retries with backoffs for us automatically.
(** Shared memory locations. *)
module Loc : sig
  type 'a t
  val make : 'a -> 'a t
  val get : 'a t -> 'a
  (* ... *)
end

(** Transactions on shared memory locations. *)
module Tx : sig
  type 'a t
  val get : 'a Loc.t -> 'a t
  val set : 'a Loc.t -> 'a -> unit t
  val return : 'a -> 'a t
  val ( let* ) : 'a t -> ('a -> 'b t) -> 'b t
  val attempt : 'a t -> 'a
  val commit : 'a t -> 'a
  (* ... *)
end
Note: But what is a Tx? Let's first take a peek at the signatures of the core operations on transactions. The snippets on this slide show only the key introduction, elimination, and composition forms for shared memory locations and transactions. **
let a_loc = Loc.make 10
and b_loc = Loc.make 52
and x_loc = Loc.make 0

Tx.(
  commit (
    let* a_val = get a_loc in
    let* b_val = get b_loc in
    set x_loc (b_val - a_val)
  )
)

assert ( Loc.get x_loc == 42 )
Note: Logically speaking, a transaction 'a Tx.t is a composable operation that may read and write shared memory locations and that can be performed atomically by calling commit. Another way to think about transactions is that they are specifications for generating lists of CAS operations against shared memory locations. **
val get : 'a Loc.t -> 'a t
val get_as : ('a -> 'b) -> 'a Loc.t -> 'b t
val set : 'a Loc.t -> 'a -> unit t
val update : 'a Loc.t -> ('a -> 'a) -> 'a t
val update_as : ('a -> 'b) -> 'a Loc.t -> ('a -> 'a) -> 'b t
val modify : 'a Loc.t -> ('a -> 'a) -> unit t
val exchange : 'a Loc.t -> 'a -> 'a t
val exchange_as : ('a -> 'b) -> 'a Loc.t -> 'a -> 'b t
val return : 'a -> 'a t
val ( let* ) : 'a t -> ('a -> 'b t) -> 'b t
val ( and* ) : 'a t -> 'b t -> ('a * 'b) t
val ( let+ ) : 'a t -> ('a -> 'b) -> 'b t
val ( and+ ) : 'a t -> 'b t -> ('a * 'b) t
val ( >> ) : 'ignore t -> 'a t -> 'a t
val ( >>. ) : 'ignore t -> 'a -> 'a t
val ( >>= ) : 'a t -> ('a -> 'b t) -> 'b t
val map : ('a -> 'b) -> 'a t -> 'b t
val delay : (unit -> 'a t) -> 'a t
Note: To make programming more convenient and to also allow some optimizations, the library actually provides more than just the minimum number of primitives and combinators. Most of these should be easy to understand in terms of the types and functional programming tradition. * In update the result value is the previous value from the location. * The _as variants take an additional function that is given the previous value from the location to compute the result. * All of the sequencing combinators can be expressed in terms of return and let*. * The delay operation is important when side-effects, such as allocation of fresh objects, are needed at the start of a transaction. I wish OCaml had special syntax for that.
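As an illustration of the point about derived combinators, here is a minimal sketch (not the library's implementation) of how modify and exchange could be expressed in terms of get, set, and the sequencing combinators:

let modify loc fn = Tx.(
  let* x = get loc in
  set loc (fn x) )

let exchange loc y = Tx.(
  let* x = get loc in
  set loc y >>. x )

The real library can do better, e.g. by touching the internal transaction log only once per location.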
module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit
  val try_pop : 'a t -> 'a option
end

module Stack : Stack = struct
  type 'a t = 'a list ref
  let make () = ref []
  let push stack elem = stack := elem :: !stack
  let try_pop stack =
    let elems = !stack in
    match elems with
    | [] -> None
    | elem :: rest -> stack := rest; Some elem
end
Note: Let's build something real. Here is an entirely traditional, sequential stack data structure. * It uses a ref to hold a list. * push conses an element to the list. * And try_pop removes an element, taking care to handle the case of an empty stack.
module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit Tx.t
  val try_pop : 'a t -> 'a option Tx.t
end

module Stack : Stack = struct
  type 'a t = 'a list Loc.t
  let make () = Loc.make []
  let push stack elem = Tx.modify stack (List.cons elem)
  let try_pop stack = Tx.(
    let* elems = get stack in
    match elems with
    | [] -> return None
    | elem :: rest -> set stack rest >>. Some elem )
end
Note: And here is a transactional stack. First take note of the signature. The * push and * try_pop operations return transactions. The implementation uses * shared memory locations. * push is pretty trivial using modify. * try_pop is more complex, as it needs to handle the case of an empty stack. Note that there are no retry loops or backoffs. The push and try_pop transactions are not significantly more complex than similar code using refs.
module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit Tx.t
  val try_pop : 'a t -> 'a option Tx.t
end

let hd_opt = function [] -> None | elem :: _ -> Some elem
let tl_safe = function [] -> [] | _ :: rest -> rest

module Stack : Stack = struct
  type 'a t = 'a list Loc.t
  let make () = Loc.make []
  let push stack elem = Tx.modify stack (List.cons elem)
  let try_pop stack = Tx.update_as hd_opt stack tl_safe
end
Note: Here is a more concise implementation of try_pop using * update_as. It makes use of two useful list manipulation functions, * hd_opt and * tl_safe. Aside from being more concise, this is also likely a bit faster, as it only makes a single access to the internal transaction log.
let a_stack = Stack.make ()
Tx.commit ( Stack.push a_stack 101 )
assert ( Tx.commit ( Stack.try_pop a_stack ) = Some 101 )
assert ( Tx.commit ( Stack.try_pop a_stack ) = None )
Why not bake in commit to push and try_pop?
Note: To use the transactional stack, one just needs to call the stack to prepare transactions and then commit them. ** The commit function takes care of retrying transactions with randomized exponential backoff to avoid contention. You might ask why we didn't just call commit inside push and try_pop. We'll come back to that soon.
module type Queue = sig
  type 'a t
  val make : unit -> 'a t
  val enqueue : 'a t -> 'a -> unit Tx.t
  val try_dequeue : 'a t -> 'a option Tx.t
end

module Queue : Queue = struct
  type 'a t = { front: 'a list Loc.t; back: 'a list Loc.t }
  let make () = { front = Loc.make []; back = Loc.make [] }
  let enqueue q x = Tx.modify q.back (List.cons x)
  let try_dequeue q = Tx.(
    update q.front tl_safe >>= function
    | x :: _ -> return (Some x)
    | [] ->
      exchange_as List.rev q.back [] >>= function
      | [] -> return None
      | x :: xs -> set q.front xs >>. Some x
  )
end
Note: So, here we have a lock-free queue that fits on a single slide and is pretty much guaranteed to be free of concurrency issues. * It uses two shared memory locations for the front and back of the queue. * The enqueue operation is just a push to the back. * The try_dequeue operation is more complicated, as it needs to handle several cases. ** But is this queue any good? It can actually perform quite well, as we'll see shortly, but it does have one major weakness. Can you see it? * The problem is that reversal of the back might take a long time, and that could cause consumers to starve. As an exercise you could implement a transactional queue where both operations are O(1) and not just in the amortized sense.
let a_queue = Queue.make ()
Tx.commit ( Queue.enqueue a_queue 101 )
assert (Tx.commit ( Queue.try_dequeue a_queue ) = Some 101)
assert (Tx.commit ( Queue.try_dequeue a_queue ) = None)
Again, why not bake in commit?
Note: There is nothing new to learn on this slide. It just shows again how to use transactions.
Tx.(
  commit (
    Stack.push a_stack 3 >>
    Stack.push a_stack 1 >>
    Stack.push a_stack 4
  )
)
Note: The reason we didn't bake in commit is that composition of transactions is not limited to primitive transactions. Here is an example of pushing three elements ** to a transactional stack atomically.
Tx.(
  commit (
    Stack.try_pop a_stack >>= function
    | Some element ->
      Queue.enqueue a_queue element
    | None ->
      return ()
  )
)
Note: Composition also isn't limited to module abstraction boundaries. Here is an example of moving an element * from a transactional stack to a * transactional queue * atomically. The transaction mechanism makes it trivial to compose independently developed transactions and allows transactional data structures and algorithms to be used for a wider variety of purposes.
The quick brown fox jumps over the lazy dog
Note: The Tx abstraction that kcas offers seems really nice, but surely it is going to perform poorly. Indeed! What about performance? Is the transaction mechanism a lazy dog and CAS the quick brown fox? Well, the mechanism definitely adds overhead, but perhaps not as much as one might expect.
algorithm | mean | sd |
---|---|---|
Reagent MS queue | 0.045950 | 0.000023 |
1-lock queue | 0.027613 | 0.000086 |
Tx node queue | 0.022827 | 0.000014 |
Lockfree MS queue | 0.011395 | 0.000023 |
2-lock queue | 0.010299 | 0.000005 |
Tx 2-stack queue | 0.009895 | 0.000000 |
Always take benchmarks with plenty of salt!
Note: This table is based on running a queue benchmark that is part of the Reagents library. The timings vary between runs, but what is interesting is that the k-CAS Tx based queues perform quite well. It is actually pretty interesting that the simple 2-stack Tx queue is competitive with the hand-written lock-free Michael-Scott queue. One factor in this is likely that the 2-stack queue probably needs fewer GC write-barriers. Also, a lot of time has been spent tweaking the k-CAS implementation to minimize overheads. Finally, I should note that the transactional node based queue is just a very straightforward implementation. I would not be surprised if someone came up with a faster Tx queue with constant time operations.
Note: Next I'll describe the algorithms used by the kcas library internally. While I do not expect you to understand every detail of the algorithms after this talk, I hope to convey a few key ideas and details.
Efficient Multi-word Compare and Swap
Guerraoui, Kogan, Marathe, Zablotchi

k+1 CASes per k-CAS!
Note: kcas is based on the GKMZ algorithm published in 2020, which uses only k+1 single-word CASes per k-CAS, which is nearly optimal. How is that possible? It is a cool trick that is worth understanding. Let me explain.
type 'a loc = 'a state Atomic.t
and 'a state = { before : 'a; after : 'a; casn : casn }
and casn = status Atomic.t
and status = Undetermined of cass list | After | Before
and cass = CASS : 'a loc * 'a state -> cass

let make after = Atomic.make {
  before = after;
  after;
  casn = Atomic.make After
}
internal CASS-descriptor vs logical CAS-operation

Why does this only need k+1 CASes?
Note: Fred Brooks is said to have remarked: "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious." So, if you take a moment to think about the data structures, or tables, used in GKMZ, it should be obvious why the algorithm achieves k+1 CASes. ** I actually stopped reading the GKMZ paper as soon as I got the trick and just implemented it following my intuition.
Consider the traditional approach:
type 'a loc = [ `Value of 'a | `Descriptor of _ ] Atomic.t
How many CASes are needed at minimum?
(* 2nd hint: *) `Value x => `Descriptor _ => `Value y
Take home: Can you use the GKMZ trick elsewhere?
Note: The traditional approach does more work. Can you tell how many CASes are needed at minimum? The answer is 2*k + 1. You will need at least one CAS to switch a location to hold a Descriptor and another CAS to switch it back to holding a Value. And then one more CAS is needed to complete the k-CAS operation. GKMZ avoids the second CAS simply by not requiring locations to hold just plain values.
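To make the count concrete (my arithmetic, for k = 2): the traditional approach needs at least 2*2 + 1 = 5 CASes, two per location plus the final CAS that decides the operation, while GKMZ needs only 2 + 1 = 3.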
Given
let a = make 10 and b = make 52 and x = make 0
after
atomically [CAS(a, 10, 10); CAS(b, 52, 52); CAS(x, 0, 42)]
the objects could look like this:
let casn = Atomic.make After
let x = Atomic.make {before = 0; after = 42; casn}
let a = Atomic.make {before = 10; after = 10; casn}
let b = Atomic.make {before = 52; after = 52; casn}
Note: With GKMZ the locations (may) always hold state descriptors. When reading a location one then needs to check which of the values to read. Note that one of the before or after values is essentially garbage. A real implementation will also need to perform a release operation to keep the state descriptors from pointing to inaccessible objects. It is a simple detail to take care of after the operation has completed, but we'll skip that for lack of time.
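For illustration, a minimal sketch of such a release step, assuming the before and after fields were made mutable (the state type on these slides is immutable): once the outcome is known, the dead value is overwritten so it can be collected.

let release casn cass =
  let success = Atomic.get casn == After in
  cass
  |> List.iter @@ fun (CASS (_loc, state)) ->
     (* Overwrite the losing value with the surviving one. *)
     if success then state.before <- state.after
     else state.after <- state.before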
let get loc =
  let state = Atomic.get loc in
  if is_after state.casn then
    state.after
  else
    state.before

let atomically logical_cas_list =
  let casn = Atomic.make After in
  let rec run cass = function
    | [] ->
      Atomic.set casn (Undetermined cass);
      gkmz casn cass
    | CAS (loc, before, after) :: rest ->
      run (CASS (loc, {before; after; casn}) :: cass) rest
  in
  run [] logical_cas_list
Note: As mentioned, to read a location, as in get, the operation needs to determine * which one of the values, before or after, to use. * The is_after function, which we'll soon see, does that. The atomically function * converts logical CAS operations into internal CASS descriptors. An important detail is that it * creates a cyclic data structure. The cycle is what makes it possible for other domains to get hold of the entire data structure and help to complete an operation initiated by some domain. This is a common pattern in non-blocking algorithms; it might make sense to support let rec with atomics.
let rec gkmz casn = function
  | [] -> finish casn After
  | CASS (loc, now) :: next as this ->
    let old = Atomic.get loc in
    if now == old then gkmz casn next else
    let old_value =
      if is_after old.casn then old.after else old.before in
    if old_value != now.before then finish casn Before else
    match Atomic.get casn with
    | Undetermined _ ->
      gkmz casn (if Atomic.compare_and_set loc old now
                 then next else this)
    | determined -> determined == After

and is_after casn =
  match Atomic.get casn with
  | Undetermined cass -> gkmz casn cass
  | determined -> determined == After
Note: This is the core of the GKMZ algorithm. It basically just goes through the list of internal descriptors, * compares the current value in each location, * checks that the operation is still undetermined, * and then CASes the new state to the location. After all the locations have been handled or an inconsistency is detected, * it calls finish to attempt to complete the operation. We'll look at finish shortly. Note that is_after and gkmz are mutually recursive. When is_after is called, it may encounter an undetermined operation. In such a case it calls gkmz in order to complete the operation. This is basically what allows the algorithm to be lock-free. No domain is forced to indefinitely wait for another. BTW, in the beginning I mentioned that writing lock-free algorithms is subtle business. * The three Atomic operations in the core algorithm are done in a specific order. What could happen if you changed that order? I'll leave that as an exercise.
let finish casn desired =
  match Atomic.get casn with
  | Undetermined _ as current ->
    Atomic.compare_and_set casn current desired |> ignore;
    Atomic.get casn == After
  | determined -> determined == After
Note: The last part of the algorithm is finish. * It attempts the final CAS to complete an operation * and then it does one more check to see whether the operation completed successfully or failed. This is also what makes the algorithm atomic. Before the final CAS is performed, no domain may actually read any of the CASed locations.
- Uses a splay tree to order CASS descriptors
  - GKMZ consumes internal descriptors anyway
  - O(k) construction when CASes are ordered
  - Also used as the transaction log
  - Optimizes for recently accessed
- Performs a release step after the final CAS
  - Removes the reference to the unused before or after
  - Also frees stable casn descriptors
- Has many small tweaks to improve performance
Note: The implementation in these slides is, of course, a simplified version. The real implementation...
Could we do even better?
Note: The GKMZ algorithm is nearly optimal for k-CAS. What more could one want? Well, one thing that k-CAS lacks is efficient read-only CMP operations. Next I'll describe a brand-new obstruction-free k-CAS-n-CMP algorithm that extends GKMZ.
let a = Loc.make 10 and b = Loc.make 52
and x = Loc.make 0 and y = Loc.make 0

let ping () =
  commit (let* a' = get a
          and* b' = get b in
          set x (b' - a'))
let pong () =
  commit (let* a' = get a
          and* b' = get b in
          set y (a' + b'))

ping => [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (x, 0, 42) ]
pong => [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (y, 0, 62) ]

ping ) a b ( pong
Note: But before we go into the details, here is a motivating example using the Tx API. We have four disjoint locations a, b, x, and y. We have two transactions that both read the two locations a and b. One of the transactions then writes to location x and the other writes to location y. What happens when the two transactions are attempted in parallel? Well, they aren't run in parallel: GKMZ will serialize the operations. To add insult to injury, a CAS also always writes to memory. So, even though the a and b locations are not logically modified, they are written to, which causes contention. The way shared memory coherence protocols work is that after a write to memory only the writer has a copy of that memory location. Assume the above two transactions were done sequentially. The first transaction writes to a and b, which means that a and b are evicted from all the caches except the writer's cache. The second transaction then has to fetch a and b more expensively, writes to them, and again evicts them from all the other caches. This is called cache-line ping-pong.
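With read-only compares, the same two transactions could instead translate to operation lists like the following (a sketch in the earlier notation), so that neither transaction writes to a or b:

ping => [ CMP (a, 10); CMP (b, 52); CAS (x, 0, 42) ]
pong => [ CMP (a, 10); CMP (b, 52); CAS (y, 0, 62) ]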
Verify CMPs after CASes and before final CAS.
But that is only obstruction-free...
Note: How could we implement read-only compares? A simple idea would be to just verify that the compares still hold just before performing the final CAS. This sort of post hoc validation is actually a pretty common pattern in non-blocking algorithms. But it only gives us obstruction-freedom, not lock-freedom.
An algorithm is obstruction-free if at any point, a single thread executed in isolation for a bounded number of steps will complete its operation.
Why is a verification step not lock-free?
[CMP (x, 0); CAS (y, 0, 1)] and [CAS (x, 0, 1); CMP (y, 0)]
Note: Recall that lock-freedom requires that at least one domain makes progress. Consider a domain performing a CMP on x and a CAS on y and another domain doing a CAS on x and a CMP on y. Both may be able to enter the verification step (can you see how?) and both would fail during the verification step. This could repeat indefinitely and neither domain could make progress.
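Here is one possible interleaving (a sketch):
- Domain 1 CASes its descriptor into y and domain 2 CASes its descriptor into x; all CAS operations of both have now succeeded.
- Both domains enter verification: domain 1 finds that x no longer holds the state it recorded, and domain 2 finds the same for y.
- Both fail, back off, and retry, and nothing prevents the same interleaving from repeating indefinitely.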
let atomically logical_cas_list =
  let casn = Atomic.make After in
  let rec run cass = function
    | [] ->
      Atomic.set casn (Undetermined cass);
      gkmz casn cass
    | CAS (loc, before, after) :: rest ->
      run (CASS (loc, {before; after; casn}) :: cass) rest
+   | CMP (loc, exp) :: rest ->
+     let old = Atomic.get loc in
+     get loc == exp && Atomic.get loc == old &&
+     run (CASS (loc, old) :: cass) rest
  in
  run [] logical_cas_list
Note: Let's just see how to change GKMZ to give k-CAS-n-CMP. The changes are shown as diffs against the reference GKMZ implementation. We modify atomically such that it also accepts CMP operations. It then translates those to CASS descriptors without creating a new state.
+let is_cmp casn state = state.casn != casn

let rec gkmz casn = function
  | [] -> finish casn After
  | CASS (loc, now) :: next as this ->
    let old = Atomic.get loc in
    if now == old then gkmz casn next else
+   if is_cmp casn now then finish casn Before else
    let old_value =
      if is_after old.casn then old.after else old.before in
    if old_value != now.before then finish casn Before else
    match Atomic.get casn with
    | Undetermined _ ->
      gkmz casn (if Atomic.compare_and_set loc old now
                 then next else this)
    | determined -> determined == After
Note: So, how can one tell whether a CASS is a comparison? We just check whether it refers to the casn of our operation or not. The casn is always freshly allocated by atomically. If the new and old state differ, we check whether we were attempting a read-only compare. If so, then we conclude that the operation failed.
let finish casn desired =
  match Atomic.get casn with
- | Undetermined _ as current ->
+ | Undetermined cass as current ->
+   let desired =
+     if desired == After
+        && cass
+           |> List.exists @@ fun (CASS (loc, state)) ->
+              is_cmp casn state && Atomic.get loc != state then
+       Before
+     else
+       desired in
    Atomic.compare_and_set casn current desired |> ignore;
    Atomic.get casn == After
  | determined -> determined == After
Note: The biggest change to the GKMZ algorithm is the addition of the post hoc verification step. Just before attempting to complete the transaction, we go through all the CASS descriptors and check once more that the read-only comparisons are still valid.
let rec commit backoff mode xt =
  match attempt mode xt with
  | result -> result
  | exception Interference ->
    let backoff = Backoff.once backoff in
    commit backoff Mode.lock_free xt
  | exception Exit ->
    let backoff = Backoff.once backoff in
    commit backoff mode xt
Note: The actual implementation distinguishes between failure before and during verification. If a transaction attempt fails during verification, * then the Interference exception is raised. This allows the commit loop to * switch to lock-free mode, which only uses CAS operations and requires no verification step. This guarantees that with commit transactions are lock-free. So, we don't actually lose anything!
module Queue : Queue = struct
  type 'a t = { front: 'a list Loc.t; back: 'a list Loc.t }
  let make () = { front = Loc.make []; back = Loc.make [] }
  let enqueue q x = Tx.modify q.back (List.cons x)
  let try_dequeue q = Tx.(
    update q.front tl_safe >>= function
    | x :: _ -> return (Some x)
    | [] ->
      exchange_as List.rev q.back [] >>= function
      | [] -> return None
      | x :: xs -> set q.front xs >>. Some x
  )
end
Does try_dequeue perform read-only CMPs?
Note: Recall the queue implementation. If the queue is empty, the try_dequeue transaction may only perform read-only comparisons. Also, if the front is empty and the back contains only a single element, the front is not necessarily written. That is because * in those cases the try_dequeue transaction never writes a new value to those locations. The transaction mechanism is smart enough to optimize that to a read-only compare. This means that multiple domains can try to dequeue an element from an empty queue in parallel without interference. The qualifiers "may" and "not necessarily" are there to account for the possibility of interference from writers.
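To illustrate in the earlier operation-list notation (a sketch): an attempt to dequeue from an empty queue could translate to read-only compares only, e.g.

try_dequeue (empty q) => [ CMP (q.front, []); CMP (q.back, []) ]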
Note: What we have so far is probably about as far as kcas can be stretched without adding significant dependencies or overheads, but perhaps we could have a back door for an additional feature: blocking transactions. What is that?
stm @@ let* x = Queue.dequeue qa in
       Queue.enqueue qb x

What if qa is empty?
Note: Consider a transaction that dequeues a value x from qa and enqueues it to qb. Seems straightforward. But what if qa is empty? Should we just keep on retrying in the hope that some other domain pushes an element into qa? That sort of busy-wait is usually a poor idea.
let dequeue q =
  Queue.try_dequeue q >>= function
  | None ->
    retry ()
  | Some x ->
    return x
Note: How might one even implement a dequeue operation that always returns a (non-optional) value? The idea here is that * retry suspends the transaction, waits until some location accessed during the transaction is mutated, and then restarts the transaction. This kind of retry is a key idea put forth in the Composable Memory Transactions paper.
module Loc : sig
  type ('tag, 'a) tagged
  val make_tagged : 'tag -> 'a -> ('tag, 'a) tagged
  val get_tag : ('tag, 'a) tagged -> 'tag
  type 'a t = (unit, 'a) tagged
  (* ... *)
end
To associate waiters with a location.
Note: To support blocking, one essentially needs a way to signal waiters. After mutating some locations the mutator signals waiters. For a scalable mechanism that signal needs to be selective and only wake up those waiters that are interested in the mutated locations. To associate waiters with locations in a truly low-overhead fashion, one possibility would be to allow locations to be "tagged". In a blocking transaction mechanism that 'tag could be a bag of the waiters of changes to the location.
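For example (a hypothetical instantiation, not part of the proposed kcas API), a blocking transaction mechanism could tag each location with a bag of waiters, here crudely represented as a list held in another location:

(* A waiter just knows how to wake up its suspended domain. *)
type waiter = { wake : unit -> unit }

type 'a blocking_loc = (waiter list Loc.t, 'a) Loc.tagged

let make_blocking (x : 'a) : 'a blocking_loc =
  Loc.make_tagged (Loc.make []) x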
module Tx : sig
  (* ... *)
  module Log : sig
    type t
    type 'r reduce = { one : 't 'a. ('t, 'a) Loc.tagged -> 'r;
                       zero : 'r; plus : 'r -> 'r -> 'r }
    val reduce : 'r reduce -> t -> 'r
  end
  exception Retry of unit t
  val reset_and_retry : (Log.t -> unit t) -> 'a t
  val with_written : 'a t -> (Log.t -> unit t) -> 'a t
end
Note: To make it easy for a blocking transaction mechanism to figure out which locations need to be waited for, limited access to the transaction log could be granted. * reset_and_retry returns a transaction that resets the current transaction such that it only reads from the accessed locations. The given function is then called with * the internal transaction log to * construct a transaction that is then composed after the current transaction. The composed transaction is then * raised as a Retry exception. * with_written returns a transaction that executes as the given transaction and then calls the given function with a view of the internal transaction log restricted to the written locations. The returned transaction is then composed after the whole transaction.
type 'a t = 'a Tx.t

let retry () = Tx.reset_and_retry @@ fun log ->
  (* add waiters to all (tagged) locations in log *)

let rec stm waiters tx =
  match Tx.attempt @@ Tx.with_written tx @@ fun log ->
          (* move waiters from (tagged) locations to waiters list *)
  with
  | result -> (* signal all waiters and then *) result
  | exception Exit -> stm all_waiters tx
  | exception Tx.Retry add_waiters_tx -> (
    match Tx.attempt add_waiters_tx with
    | () -> (* do blocking wait and retry *) stm waiters tx
    | exception Exit ->
      (* locations were mutated before waiters were added *)
      stm waiters tx)

let stm tx = stm (Loc.make []) tx
Note: Here is a quick sketch, with some blanks left as an exercise, of how one could implement STM with blocking transactions with the help of tagged locations, reset_and_retry, and with_written. The key point here is the type of blocking STM transactions on the first line: the blocking STM would not need to add a layer on top of the lower level k-CAS transactions. This should basically mean that blocking transactions could be implemented with roughly half the performance of plain non-blocking transactions, as every access potentially also requires touching the waiters list. I'm curious to find out how well this would perform against existing STM implementations!
- Lock-free programming with CASes is hard
- Programming with Tx is easy by comparison
- Thanks to
  - careful low-level optimizations (splay tree),
  - the new k-CAS-n-CMP algorithm, and
  - interference detection
  Tx can give ok performance and lock-freedom
- Tx is sweat-free concurrency!
Note: Thank you for listening! I'm out of slides.