- I work for GDS
- Cabinet Office
- we started by building GOV.UK
- replaced older sites like direct gov, business link
- we’re not just fixing websites
- we also build and run digital services
- working with depts across the country
- eg: register to vote
- saw about 200k people register at the same time
- visibility on govt services
- I’m not here to talk about GDS
- go to Phil’s talk
- it’s been hard to go to a conf in the last two years without hearing about docker (docker docker docker)
- I want to demystify schedulers
- some people feel this is too much too soon
- mesos, kubernetes, etc
- when you should use schedulers
- and when you shouldn’t
- I see schedulers everywhere
- the job of a scheduler is to fairly allocate work
- fairness is a term of art
- schedulers act fairly based on their defined objectives
- example: schedule the expression E = AB + CD
- E = AB + CD
- E = (1*2) + (3*4) – explicit ordering
- tree of expressions
- addition is commutative, so order doesn’t matter
- unconstrained
- time-constrained
- all ops completed in a time-bound
- at least two steps
- resource-constrained
- if we only have one multiplication unit
- ASAP (as soon as possible)
- ALAP (as late as possible)
- list scheduling
- force-directed scheduling
- weighting (commonly time)
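- a minimal list-scheduling sketch for E = AB + CD (my illustration, not from the talk; hypothetical ops/units tables, one multiplier, one adder; ready ops issue in list order as units free up):

```python
# ASAP list scheduling for E = (A*B) + (C*D)
ops = {"m1": [], "m2": [], "add": ["m1", "m2"]}   # m1 = A*B, m2 = C*D
unit = {"m1": "mul", "m2": "mul", "add": "alu"}
capacity = {"mul": 1, "alu": 1}

done, cycle = set(), 0
while len(done) < len(ops):
    ready = [o for o in ops if o not in done and all(d in done for d in ops[o])]
    free = dict(capacity)
    issued = []
    for o in ready:                  # greedy: first ready op wins the unit
        if free[unit[o]] > 0:
            free[unit[o]] -= 1
            issued.append(o)
    done.update(issued)
    print(f"cycle {cycle}: {issued}")
    cycle += 1
# with one multiplier: cycle 0: ['m1'], cycle 1: ['m2'], cycle 2: ['add']
```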
- we don’t make the best use of our infrastructure resources
- decides which ops get CPU utilization right now
- challenge: make sure applications get a fair share of CPU time
- (called a “time slice”)
- let’s look at linux 2.4 cpu scheduler
- task list containing all processes, and work to be completed
- guarded by a r/w lock
- each process has a priority
- it had a number of shortcomings
- wasn’t efficient
- had to iterate through whole task list every time
- bad in SMP environments
- could easily pin a single CPU and leave others unused
- was rewritten
- remove global task list lock
- each process gets its own queue and its own lock
- rewritten again in 2.6
- called the Completely Fair Scheduler
- uses red/black tree instead of queues
- scheduling algorithms:
- FF SCHED_FIFO
- RR SCHED_RR
- TS SCHED_OTHER (_NORMAL)
nice
- runs a program with modified priority
- eg: nice -n 19 echo “hello”
chrt
- or “change real time”
- eg: chrt -p 16
- can change priority but also scheduling algorithm
ionice
- set IO scheduling priority
- eg: ionice -c3 -p89
- sets process with pid 89 as an idle io process
- many queues, locks and a scheduling algorithm over the top
- not just CPUs!
- eg in networking (see www.lartc.org)
tc
- traffic control
- eg:
tc qdisc add dev lo root netem delay 500ms
- sets queuing discipline
- network delay of 500ms (netem == network emulation)
- eg in postgres: has something called autovacuum
- automates execution of VACUUM and ANALYZE
- Kubernetes:
type Scheduler interface {
    Schedule(*api.Pod, ...)
}
- Docker swarm
- primitives
- CPU
- instances
- add your own: eg response time
- ephemeral node
- with a service on it
- different nodes have different properties in the cloud
- bad io
- noisy neighbours
- see @kelseyhightower’s talks too on container schedulers
- baggage reclaim
- a bunch of queues
- a lock
- you are the scheduling algorithm - you know what you care about: your bag
- baggage reclaim is controlled by a software scheduler
- premier league football matches
- domain-specific rules
- man city and man u can’t play at home on the same day
- (infrastructure can’t cope)
- nottingham, around trent bridge
- three stadiums next to each other
- can’t all operate at the same time
- people
- pomodoro
- work planning
- you can choose to hand over knowledge work to schedulers (but they’re not a silver bullet)
- can probably do better capacity planning
- who’s heard of unikernels?
- who’s actually used them? which ones?
- halvm
- mirage
- rump kernel
- talk resources
- software today
- is built locally, but deployed somewhere else
- is complex! (even though most apps are single-purpose)
- complexity is the enemy
- OCaml
- it has a good module system
- they’re simple, right?
- producer -> queue -> consumer
- they’re everywhere
- java.util.concurrent.Executor
- thread pool
- BlockingQueue on requests
- java.util.concurrent.BlockingQueue
- producer -> [notFull] [buffer] [notEmpty] -> consumer
- java.util.concurrent.locks.Condition
- software threads
- hardware threads
- more queues to schedule the threads
- queues are everywhere!
- queues separate the “what” from the “when”
- queueing theory: a branch of statistics, not software!
- Performance Modeling and Design of Computer Systems
- one-at-a-time
- shell prompt
- command
- output
- command
- output
- …
- single person browsing a website
- click link
- wait for page to load
- click link
- wait for page to load
- …
- requests come in
- while we’re serving one, another can come in at any point
- producers: exponential distribution
- it’s memoryless – the amount of time passed doesn’t change likelihood of producers arriving
- consumers: pareto distribution
- simulation examples
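- a toy simulation along these lines (my sketch, not the talk’s code; exponential inter-arrival times, Pareto service times, single consumer, unbounded queue, made-up numbers):

```python
import random

random.seed(42)
alpha, scale = 1.5, 0.3          # mean service = scale * alpha/(alpha-1) = 0.9
t = next_free = 0.0
waits = []
for _ in range(100_000):
    t += random.expovariate(1.0)                   # next producer arrives
    start = max(t, next_free)                      # queues if consumer is busy
    waits.append(start - t)                        # time spent waiting
    next_free = start + random.paretovariate(alpha) * scale
print("mean wait:", sum(waits) / len(waits))
print("max wait: ", max(waits))                    # heavy tail => huge outliers
```

- push the mean service time toward the arrival rate and the waits blow up, with rare huge outliers from the heavy tail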
- failure mode: load increases until queueing time dominates
- the more consumers you add, the longer it takes to fall over
- …but the more surprising it is when it does
- echoes of “why complex systems fail” here
- lesson: unbounded queues are fundamentally broken
- we’ve made assumptions about how users behave, but haven’t told them, or held them to it
- options:
- drop the data
- reject the data
- pause the data
- Drop
- only valid if newer data obsoletes older data, or if we just don’t care
- eg statsd uses UDP by default
- problem: when we’re most in need of high fidelity data, we’re least likely to have it, because UDP is best-effort and will degrade under load
- Reject
- eg twitter fail whale
- often, this is the only choice available
- Pause (aka backpressure)
- often the only choice for a closed system, library, or subsystem
- backpressure swims upstream
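- the three options, sketched with a bounded queue (my Python sketch; queue.Queue stands in for any bounded buffer):

```python
import queue

q = queue.Queue(maxsize=10)        # bounded buffer

def send_reject(item):
    try:
        q.put_nowait(item)         # reject: tell the producer "no"
        return True
    except queue.Full:
        return False

def send_drop(item):
    if not q.full():               # drop: silently discard when full
        q.put_nowait(item)

def send_pause(item):
    q.put(item)                    # pause: block the producer (backpressure)
```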
- back to java BlockingQueue:
- producer -> [notFull] [buffer] [notEmpty] -> consumer
- the extra pieces are for backpressure reasons
- you don’t even need the buffer here!
- producer -> [puts] [takes] -> consumer
- unbuffered queues are closed systems
- why have buffers if we don’t need them?
- because backpressure isn’t free
- restarting threads
- restarting TCP traffic
- reading from disk
- reading from dbs
- a buffer allows higher, more consistent throughput
- at a cost of potentially higher latency, because things sit on the buffer waiting to be dealt with
- we should always plan for too much data
- backpressure
- TCP bufferbloat - example of problem of too many composed buffers
- eg: a data system that ingests data and takes it to a safe at-rest place
- arbitrary volume of data -> ELB -> [thing] -> S3
- what goes in the middle?
- must horizontally scale
- must be low-maintenance
- not real-time
- loss of data ∝ loss of utility
- used:
- aleph -> hadoop s3 client -> s3
- but then:
- S3 starts rejecting writes
- hadoop starts accumulating writes waiting for S3 to come back
- replaced hadoop s3 client with durable-queue
- still had problems:
- still accumulated data in durable-queue without bound
- no visibility when this happened; no metrics
- had to start writing metrics for this
- netty’s threading model
- our typical metrics measure from when we start actually processing the request
- but this ignores the queueing delay!
- threadpools are not well instrumented
- dirigiste for better understanding of behaviour of thread pools
- distinguish task latency from queue latency
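- a sketch of how to measure that split (mine, not dirigiste’s API): timestamp at submit, not at processing start:

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def submit(fn, *args):
    enqueued = time.monotonic()                    # clock starts at submit...
    def wrapped():
        started = time.monotonic()                 # ...not at processing start
        result = fn(*args)
        done = time.monotonic()
        print(f"queue: {started - enqueued:.6f}s, task: {done - started:.6f}s")
        return result
    return pool.submit(wrapped)

submit(sum, range(1000)).result()
```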
- better metrics
- not complete – they can’t be
- whole system includes client
- be clear what “completion” is
- be picky about your tools
- watch for sneaky unbounded queues!
- prefer metrics to raw performance
- you’re never measuring the complete system
- eg chat server
- one-to-many queue
- timeouts
- unbounded queues aren’t. they’re bounded by something we’ve failed to define.
- how do you deal with pain from low defaults (eg in haproxy?)
- you need some documentation on what the knobs are and why you should care
- a one-size-fits-all parameter is bound to disappoint
- with the log-latency graphs you showed with the buffered queue example, is that a useful visibility tool for day-to-day systems we use?
- that’s my general intuition
- most people use graphite
- doing distributions in graphite is hard
- if we follow your example of using buffered queues, were there any metrics you feel we should definitely be tracking?
- utilization & contention
- once you get above 60-70% utilization, exhaustion gets quite likely
- queueing theory often assumes a fixed number of consumers; can it work in an autoscaling world?
- yes
- maybe cloud services resemble infinite consumers?
- can you do formal analysis of a dynamic system like this and predict exactly how it will behave? probably not
- better to test, benchmark, see what it does under load
- works at Prezi on the Elm language
- I care a lot about how proglang research from the last 30 yrs can be useful day-to-day on the web
- eg: why is immutability good?
- I take this from concurrent ML
- lightweight threads
- basic primitives:
channel: () -> Channel a
send: Channel a -> a -> ()
recv: Channel a -> a
spawn: (a -> ()) -> a -> ThreadId
- what does mutation look like when you model it as a concurrent system with these primitives?
type Msg a = Get | Set a
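- a sketch of that model in Python (mine; queues stand in for CML channels, with a daemon thread per variable):

```python
import threading, queue

# a "variable" is a thread owning the value; Get/Set arrive on its channel
def cell(initial):
    inbox = queue.Queue()
    def loop():
        value = initial
        while True:
            msg, payload = inbox.get()
            if msg == "get":
                payload.put(value)     # payload is the reply channel
            else:                      # "set"
                value = payload
    threading.Thread(target=loop, daemon=True).start()
    return inbox

x = cell(0)
x.put(("set", 42))
reply = queue.Queue()
x.put(("get", reply))
print(reply.get())                     # -> 42
```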
- “two spies in the park” approach to communication
- a new thread per variable
- becomes unmanageable
function BadGuy() {
var x = 0;
var y = 0;
var health = 100;
this.follow = function(hero) { ... };
this.damage = function(n) { ... };
}
- hides multiple mutable cells behind a single interface
- introduces hierarchy
- but: hierarchy ≠ modularity
- reference can be passed from one code unit to another
- introduces extra comm lines between subtrees of hierarchy
- getters and setters on objects
- setters mean readers have ability to bash on objects
- different kinds
- cooperative
- runs a thread until it’s done or it yields
- used by golang, …
- preemptive
- you can bail on something half-way through
- used by erlang, concurrent ML, …
- that old thing concurrency ≠ parallelism
- event loop
- “a poor man’s concurrency”
- javascript
setTimeout(callback, 0);
is the same as spawn callback()
- “Reactive” Programming
- “everything is a stream!”
type alias Stream a = Channel a
- build things up with combinators
map: (a->b) -> Stream a -> Stream b
- glitches
- diamond communication shape
diamond : Stream Float -> Stream Bool
diamond stream =
merge
(==)
(map sqrt stream)
(map sqrt stream)
- glitches cont’d
- looked at as a concurrent system, it’s clear that this will cause glitches
- there’s no reason to expect synchronization
- if you thought of your streams as a concurrent system, you probably wouldn’t write it this way in the first place
- FlatMap is weird!
flatMap : (a -> Stream b) -> Stream a -> Stream b
- it is kind of like a threadpool, but really bizarrely done
- is it even a reasonable use of flatmap?
- dependencies between streams become a rat’s nest
- hero depends on keyboard
- hero’s health depends on location, bad guys, etc
- bad guys depend on hero
- want to chase him
- pause screen depends on
- hero’s health
- keyboard, mouse
- hero depends on pause screen
- maybe I can apply a potion to increase health
- this is crazy
- writing high-quality concurrent programs is hard
- everyone is writing accidentally concurrent programs
- immutability
- means no accidental communication channels
- revealing information doesn’t change your architecture
- what about Rust’s approach to mutability where it just emphasizes ownership?
- ownership and encapsulation are separate concerns
- @astrieanna
- http://blog.leahhanson.us
- julia is a programming language
- “as fast as C” to “2x slower than C”
- julia is for scientists
- engineers, mathematicians, finance people
- non-professional programmers who use programming as a tool
- least time programming for maximum benefit
- what do they need?
- easy to learn
- easy to use
- good for scripts
- good for libraries
- fast enough for medium to large data sets
- fast, extensible math
- linear algebra
- new numeric types
- easy and fast
- how is it better than what they already use? (eg numpy)
- need to learn python
- need to add your own function that doesn’t yet exist
- implement in python
- too slow
- (numpy’s things are written in C with python wrappers)
- learning C is not a good use of our users’ time
- fast julia code is written in julia
- dynamic, interpreted, easy
- JIT compilation
- at run time, you alternate between compiling and running code
- compiler needs to be fast
- static compilers can be slow because they don’t contribute to runtime
- this is no longer true for JITs
- compiler has access to runtime information
- julia is designed up front to be JIT compiled
- type system
- abstract or concrete types
- concrete types:
- can be instantiated
- determine mem layout
- types can’t be modified after creation
- can’t have subtypes
- single supertype
- abstract types:
- can’t instantiate
- new subtypes can be added at any time
- single supertype
- zero or more subtypes
- multiple dispatch
- convenience for extensible math
- single dispatch is method dispatch we’re familiar with
x.foo(y) – code impl depends on type of x
- multiple dispatch can depend on any param:
foo(x,y) – code impl depends on types of x and y
- all named fns are generic
- each fn has one or more methods
x = ModInt(3,5)
x + 5 # ModInt + Int
5 + x # Int + ModInt
function Base.+(m::ModInt, i::Int64)
return m + ModInt(..)
end
- this just can’t be done nicely with single dispatch
- dispatch happens a lot, so needs to be really fast
foo(5) # dispatch, compile, run
foo(6) # dispatch, run
foo(7) # dispatch, run
- possible signatures:
- foo(Int64)
- foo(Number)
- foo(Float64)
- foo(Int64,Int64)
- do we need to keep redispatching? aren’t we always going to pick foo(Int64)?
- add a cache
foo(5) # cache, dispatch, compile, run
foo(6) # cache, run
foo(7) # cache, run
- call into hashmap rather than going through a list
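- roughly this shape (my Python sketch, not Julia’s actual implementation — real dispatch walks a type lattice; the cache is keyed on concrete argument types):

```python
methods = {}    # (name, concrete arg types) -> implementation
cache = {}      # filled on first call, hit on every later call

def defmethod(name, types, impl):
    methods[(name, types)] = impl

def call(name, *args):
    key = (name, tuple(type(a) for a in args))
    impl = cache.get(key)
    if impl is None:                   # "dispatch" (hypothetical: exact match only)
        impl = methods[key]
        cache[key] = impl              # later calls skip dispatch entirely
    return impl(*args)

defmethod("add", (int, int), lambda x, y: x + y)
defmethod("add", (int, float), lambda x, y: x + y)
print(call("add", 1, 2))               # dispatch, fill cache, run
print(call("add", 3, 4))               # cache hit, run
```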
- define Base.* in terms of +(Number,Number) and -(Number,Number)
- C++: templates vs dynamic dispatch
- dynamic dispatch is generic
- single Base.* impl to cover all Number instances
- templates do aggressive specialization
- individual Base.* impl for each Number instance
- tradeoff: code size vs speed
- dynamic dispatch is generic
- inlining
- makes more optimizations available
- not reusable
- devirtualization
- write down the IP to avoid DNS
- ie hardcode the method implementation rather than doing dispatch
- why do we normally dispatch?
- we’re avoiding dispatch as much as possible for performance
- worth knowing why we have it in the first place
- repl work
- redefine a function to fix a bug, want callers to reflect changes
- issue 265
- non-concrete types have to be boxed
- immutable types
- kind of concrete type
- once created, value can’t change
- boxed immutable values are bad for perf
- immutable types can’t appear to change
- we have to copy them each time we want to change them
- new heap allocation
- julia has lisp-style macros
- macros are evaluated at compile time
- macros should be used sparingly
- Horner’s rule
- a*x*x + b*x + c
- skip a multiply with (a*x + b)*x + c
- for a cubic: 6 multiplies down to 3
- for a quartic: 10 multiplies down to 4
- etc (O(n^2) down to O(n))
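- Horner evaluation is just a fold over the coefficients (my sketch; the macro version unrolls this with the coefficients inlined):

```python
from functools import reduce

def horner(coeffs, x):
    # coeffs highest-degree first: [a, b, c] means a*x^2 + b*x + c
    return reduce(lambda acc, c: acc * x + c, coeffs)

print(horner([1, 2, 3], 10))           # (1*10 + 2)*10 + 3 = 123
```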
- Horner’s rule can be implemented as a macro
- taken from PR 2987
- 4x faster than matlab
- 3x faster than scipy
- both of which call C/Fortran libs
- Julia is faster than C!
- compiled julia methods will have inlined constants, which are very optimizable
- C would involve a run-time loop over the array of coeffs
- LLVM can optimize the julia better than the C
- an abstraction which makes things faster
- scientists like easy & fast
- dynamic for easy (but with limits)
- compiled for fast
- multiple dispatch for making fast code concise
- what about R?
- R is slow for things that need loops (like monte carlo)
- does julia have support for openmp?
- I think so?
- you can call libs that use OpenMP
- there’s experimental support for OpenMP within julia
- who needs booleans?
- a boolean just makes a choice! they can just be functions!
true x y = x
false x y = y
- we can then define if-then-else
ifte bool t e = bool t e
- who needs integers?
- a (positive) integer just counts loop iterations
two f x = f (f x)
one f x = f x
zero f x = x
- to recover a “normal” integer…
two (+1) 0
- gives 2!
add m n f x = m f (n f x)
mul m n f x = m (n f) x
- factorial!
fact n =
ifte (iszero n)
one
(mul n (fact (decr n)))
This means: booleans, integers, (and other data structures) can be entirely replaced by functions!
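- the whole encoding runs fine in any language with closures — eg this Python transcription (mine):

```python
true  = lambda x: lambda y: x          # a boolean makes a choice
false = lambda x: lambda y: y
ifte  = lambda b: lambda t: lambda e: b(t)(e)

zero = lambda f: lambda x: x           # an integer counts loop iterations
one  = lambda f: lambda x: f(x)
two  = lambda f: lambda x: f(f(x))
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
mul  = lambda m: lambda n: lambda f: lambda x: m(n(f))(x)

to_int = lambda n: n(lambda i: i + 1)(0)   # "two (+1) 0" gives 2!
print(ifte(true)("yes")("no"))             # -> yes
print(to_int(add(two)(two)))               # -> 4
print(to_int(mul(two)(add(one)(two))))     # -> 6
```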
- “cannot construct the infinite type…”
- fact would need a type like (forall a. (a->a) -> a -> a) -> …
- LISP!
- This got lots of people excited!
- esp Peter Landin (see The Next 700 Programming Languages and ISWIM)
(maplist f (reverse l)) ≡ (reverse (maplist f l))
- …only if f is side-effect free
- John Backus, “Can Programming Be Liberated from the von Neumann
Style? A Functional Style and Its Algebra of Programs”
- Turing award 1977; paper 1978
- “Inherent defects at the most basic level cause them [conventional languages] to be both fat and weak”
- defects such as:
- inherited from the von Neumann machine
- the von Neumann way of designing machines has a terrible effect on writing programs
- don’t focus on the word-at-a-time bottleneck; focus on whole values
- what we would today call “higher order functions”
- he called map f “αf”
- [f1,f2,..,fn] ∘ g ≡ [f1 ∘ g, f2 ∘ g,…, fn ∘ g]
- g ∘ [f1,f2,..,fn] ≡ [g ∘ f1, g ∘ f2,…, g ∘ fn]
Got a bit APL-y:
Def IP = (/ +) ∘ (α ×) ∘ Trans
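- transcribed into Python for the curious (my sketch): Trans is zip, α× maps multiply over the pairs, /+ folds with addition:

```python
from functools import reduce

def inner_product(xs, ys):
    pairs = zip(xs, ys)                           # Trans
    products = map(lambda p: p[0] * p[1], pairs)  # α ×
    return reduce(lambda a, b: a + b, products)   # / +

print(inner_product([1, 2, 3], [4, 5, 6]))        # -> 32
```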
- functional geometry
- escher’s square limit fish
- devised an algorithmic definition of escher’s picture
- operators on pictures such as rot, over, above
- higher-order operators: cycle, quartet
- then: define pictures as functions
- p(a,b,c) is the picture at position a, with height along vector b and width along vector c
- other combinators become easy to describe here:
- eg over (p,q) (a,b,c) becomes p(a,b,c) ∪ q(a,b,c)
- this functional geometry avoids Backus’s defects:
- combining forms
- operations on whole values
- algebra with laws
- and as a bonus: functions as representations
- taking it further: Conal Elliot’s Pan
- Haskell: Paul Hudak got DARPA money to work on Haskell as a prototyping language
- functions as data
type Region = Point -> Bool
- predicate representation of set of points
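- eg (my sketch of the idea): regions as predicates, combinators as set operations:

```python
def disc(cx, cy, r):                   # a Region is just a predicate
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r * r

def union(r1, r2):
    return lambda p: r1(p) or r2(p)    # combinators are set operations

def intersect(r1, r2):
    return lambda p: r1(p) and r2(p)

shape = union(disc(0, 0, 1), disc(2, 0, 1))
print(shape((2, 0)))                   # -> True
print(shape((1, 1)))                   # -> False
```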
- reactions…
- “too cute for its own good”
- “higher-order functions just a trick, probably not useful in other contexts”
- Henderson and Morris, A lazy evaluator
- Friedman and Wise, CONS should not evaluate its arguments
- “The Whole Value” can be infinite!
- the infinite list of natural numbers [0, 1, 2, 3, …]
- all the iterations of a function
iterate f x = [x, f x, f (f x), ...]
- some things can be expressed very nicely with lazy eval
- eg newton-raphson, numerical derivative
- John Hughes, Why Functional Programming Matters
- lazy producer-consumer pattern
- producer: successive approximations of a numerical algorithm
(eg derivative, integral)
- consumer: limit detection
- producer: search space
- consumer: search strategy
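- Python generators give the same lazy producer-consumer split (my sketch of the Hughes examples):

```python
def iterate(f, x):                     # producer: x, f x, f (f x), ...
    while True:
        yield x
        x = f(x)

def within(eps, seq):                  # consumer: detect the limit
    prev = next(seq)
    for cur in seq:
        if abs(cur - prev) <= eps:
            return cur
        prev = cur

def sqrt_newton(n):                    # newton-raphson as producer + consumer
    return within(1e-12, iterate(lambda x: (x + n / x) / 2, 1.0))

print(sqrt_newton(2.0))                # -> 1.4142135623730951
```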
- pretty printing
- make choices between horizontal and vertical layout
- vertical: preserve indentation
- horizontal: discard indentation
- combinators:
- a $$ b puts a above b; a <> b puts a beside b
- combinators obey laws:
- <> and $$ are both associative
- once we understand the laws, the implementation can use the laws to make the best layout
- Koen Claessen, John Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”
- randomly generate test cases from properties (properties based on laws)
- then shrink failing test cases to minimal counterexamples
- This is another lazy producer-consumer!
- producer: space of all possible tests
- consumer: QuickCheck search strategy
- Introduction to VLSI systems
- was intended to revolutionize programming
- did revolutionize hardware design
- A VLSI circuit is not a million miles away from Escher’s square
limit!
- we used functional geometry ideas to design VLSI circuits
- we need to be certain we haven’t made a mistake as the feedback cycle from the fab was too slow (9 months)
- use Backus’s FP to create a language that describes streaming
circuits
- then use algebraic laws of the language to optimize the circuits
- symbolic simulator to test
- muFP
- hardware sorting based on doubly-recursive definition:
- having a language of combining forms makes it much easier to
reason about sorting networks
- though there are still outstanding unsolved problems
- Satnam Singh: Lava language http://blog.raintown.org/p/lava.html
- Pentium float division bug
- very costly mistake
- they suddenly got interested in formal verification
- forte
- bluespec bsv
- bluecheck
- QuickCheck in hardware design!
- iterative deepening and shrinking on FPGA designs!
- functions as representations
- whole values
- simple laws
- combining forms
- subtitle: Adventures in Java Concurrency
- @mjpt777
- latency: responding in a timely manner
- the more things you try to juggle, the harder it becomes to predict how responsive you can be
- if you don’t respond in a timely manner, you’re effectively unavailable
- concurrent algorithms
- notifying completion
- mutex
- synchronization/rendezvous
- systemic pauses
- JVM Safepoints (GC etc)
- Transparent Huge Pages (linux)
- Hardware
- two approaches: locks and CAS
- both are hard to get right!
- queueing theory
- response time is sum of:
- enqueue time
- latent time (time in queue)
- dequeue time
- service time (time being processed)
- queueing theory focuses on latent time and service time, but sometimes enqueue and dequeue dominate!
- relationship between utilization and increased response time
- evils of blocking
- condition variables
- condition variable RTT (echo service)
- request: ping; response: pong
- measure histogram of response times
- don’t use averages! they lie!
- 90th percentile: 8 µs
- 8 µs is loads of time to talk between two threads!
- 2 µs is enough to go between two machines on the same network!
- but it gets worse at higher percentiles
- max is 5525 µs!
- bad news: this is the best case scenario!
- context for benchmarks:
- https://github.com/real-logic/benchmarks
- java 8u60
- quad core machine
- producers: 1, 2 or 3 (can’t have more on a quad-core machine!)
- measuring: mean and 99th percentile
- baseline (wait-free) vs ArrayBlockingQueue vs LinkedBlockingQueue vs ConcurrentLinkedQueue
- burst length: 1 then 100
- real world traffic is bursty
- send a burst of 100; wait for 1 ack
- other considerations:
- Backpressure (queues should have bounded size)
- size methods
- flow rates
- garbage
- fan out
- in the real world, I cannot use java’s built-in queues, partly because they lack features I need
- if you call .size() on ConcurrentLinkedQueue, it walks the whole list
- get over the garbage problem
- preallocate a ring buffer and reuse it
- producers claim a slot using CAS
- once a slot is claimed, can write to the slot
- once written, can update the cursor
- consumer updates gating
- no need for locks!
- wrinkle: how do I deal with setting the cursor?
- I can’t update cursor to n before it has reached n-1
- thread claims slot; gets swapped out before writing to slot
- blocks whole queue until thread swaps back in (many ms!)
- cursor is updated in a CAS loop
- add an “available” array to mark when things are written
- made a huge difference
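- the claim-then-publish shape, sketched in Python (mine; Python has no CAS, so a lock stands in for the atomic counter bump, and wrap/gating is ignored):

```python
import threading

SIZE = 1024
buffer = [None] * SIZE
available = [False] * SIZE             # per-slot "written" markers
cursor = 0
claim_lock = threading.Lock()          # stands in for CAS/XADD on the counter

def publish(msg):
    global cursor
    with claim_lock:                   # claim a slot...
        slot = cursor
        cursor += 1
    buffer[slot % SIZE] = msg          # ...write to it...
    available[slot % SIZE] = True      # ...then mark it published

def consume(slot):
    while not available[slot % SIZE]:  # a claimed-but-unwritten slot no longer
        pass                           # blocks the whole queue; just spin on it
    msg = buffer[slot % SIZE]
    available[slot % SIZE] = False
    return msg

publish("hello")
print(consume(0))                      # -> hello
```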
- disruptor isn’t best used as a queue (better for coordinating graph of dependencies)
- ManyToOneConcurrentArrayQueue (influences: Lamport + Fast Flow)
- link?
- tail: CAS counter
- head: counter
- reference ring buffer
- Disruptor and ManyToOneConcurrentArrayQueue behave better in low contention
- gets worse under the bursty case
- each over 50 µs at this stage
- spinning CAS loops are the great equalizer
- ring buffer
- spinning CAS on tail counter
- write message once space claimed
- write header to indicate completion
- (zero out memory on consumption to make this work)
- MPSC Ring Buffer
- latency is good but throughput takes a hit from zeroing memory
- http://highscalability.com/blog/2014/11/17/aeron-do-we-really-need-another-messaging-system.html
- we wanted a CRDT
- replicated to another machine
- may be reordered or duplicated
- want to reassemble deterministically
- big shared-memory file
- tail pointer (but no head this time)
- advance tail, write message, write header (just as above)
- does the file really just go on forever?
- problems with this: page faults; page cache churn; VM pressure
- wash one, wear one, dry one
- active file, dirty file, clean file
- XADD instruction
- available in JVM now
- ring buffers can’t use this because the tail can overtake the head
- if you move the tail beyond the end of the buffer, you can detect this
- your responsibility is to rotate the buffer now
- messages persist while active & dirty buffers are around
- append-only data structures can be safe to read without locks
- they can detect when they’ve reached the end
- zeroing problem from before: how do we avoid the throughput
problem?
- do it in a background thread!
- the old “dirty” becomes the new “clean”
- aeron IPC is slower than ManyToOneConcurrentArrayQueue in the happy case (no burst, one producer, average time)
- for bursty, 3 producers, 99th percentile:
- ManyToOneConcurrentArrayQueue breaks below 50µs to 34µs
- less “False Sharing”
- inlined data vs Reference Array
- less “Card Marking”
- Aeron gets even further down to 13µs (ish?)
- also mean is 10µs so this is phenomenal predictability
- avoids the spinning CAS
- the abominations we have in Java for logging are disgusting!
- the APIs, the design, etc
- what design should we use?
- C vs Java
- Spin loops
- modern processors do out-of-order speculation
- with spin loops, they speculate wrong
- Proposed: Thread.spinYieldHint()
- data dependent loads
- heap aggregates: objectlayout
- stack allocation - value types
- memory copying
- baseline against memcpy() for differing alignment
- queue interface
- break conflated concerns to reduce blocking actions
- API design mistakes
- conflates “is something available?” with “is it empty?”
- should not have used normal collections API for concurrent stuff
- AWS recently announced x1 instances
- 2TB RAM + 100+ vcores!
- lots of spinning wheels
- can you tell us more about XADD in java?
- AtomicInteger, AtomicLong, etc (getAndAdd maps to XADD)
- “Consistency guarantees? hahaha” (cue manic laughter)
- I’m writing a book!
- @martinkl
- 1975: IBM System R
- 2008? NoSQL
- 2012? “NewSQL”
- eg SQL on top of HBase
- really a movement of transactions-or-not
- Härder & Reuter, 1983
- “More mnemonic than precise” - Brewer, 2012
- archive tape?
- fsync to disk?
- replication?
- not the same as C in CAP Theorem!
- aside: critique of cap theorem
- “tossed in to make the acronym work”
- not the same as in “atomic compare-and-swap”
- not about concurrency
- it’s about fault tolerance
- rolls back writes on abort
- better word might have been Abortability
- aborts are super helpful because they collapse a whole bunch
of error classes:
- crashes, power failures
- constraint violation
- network fault
- deadlock
- this is where we start talking about concurrency
- = SERIALIZABLE?
- in practice, multiple isolation levels
- Repeatable Read
- Snapshot Isolation
- Read Committed
- Read Uncommitted
- can anyone here describe the difference off the top of their head? (one hand goes up)
- this is a configurable property of many databases
- though many don’t go as high as Serializable
- and some call it Serializable but are actually weaker than this
- read committed prevents these anomalies
- default in postgres, oracle, …
- that’s enough for transactions, right?
- what are the other weird isolation levels?
- this can happen under read committed!
- concurrent transactions:
- txn 1 (writer): add 100 to x; subtract 100 from y; commit
- txn 2 (reader): read y; read x
- reader can see old value of y, but new value of x
- Repeatable Read and Snapshot Isolation were invented to prevent this
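- the interleaving as a runnable toy (my sketch; a dict stands in for the database, no real transactions):

```python
db = {"x": 500, "y": 500}              # invariant: x + y == 1000

def writer():                          # the concurrent transaction
    db["x"] += 100
    db["y"] -= 100

y_seen = db["y"]                       # reader: old y (500)
writer()                               # writer commits in between the reads
x_seen = db["x"]                       # reader: new x (600)
print(x_seen + y_seen)                 # -> 1100: a total that never existed
```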
- Traditional RR implementation is based on locking
- SI is used more often now to avoid locks
- poorly named in most DBs
- invariant: at least one doctor on call at all times
- doctors can trade shifts
- transaction:
- count how many doctors are on call
- if >=2:
- update
- commit
- but: two txns operate concurrently for two different doctors:
- both see n >= 2
- both remove themselves from rota
- invariant violated!
- pattern:
- measure
- decide
- write
- commit
- by the time the write is committed, the premise of the decision is no longer true!
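- the write-skew pattern as a runnable toy (my sketch; sets stand in for snapshots, and both commits succeed because they touch different rows):

```python
on_call = {"alice", "bob"}             # invariant: at least one doctor on call

def trade_shift(doctor, snapshot):
    if len(snapshot) >= 2:             # measure + decide against the snapshot
        return ("remove", doctor)      # write derived from a stale premise
    return None

snap_a = set(on_call)                  # both txns read the same snapshot
snap_b = set(on_call)
for write in (trade_shift("alice", snap_a), trade_shift("bob", snap_b)):
    if write:                          # both commit: no write-write conflict
        on_call.discard(write[1])

print(on_call)                         # -> set(): invariant violated
```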
- solutions
- two-phase locking (pessimistic)
- shared lock on measure step
- block entire table for writes, whenever you make a large read
- poor for perf
- motivation for weaker isolation levels
- System R team just didn’t hold locks for so long
- this is what created their crazy isolation levels
- H-Store (Stonebraker et al)
- VoltDB, datomic
- literally execute your txns in a serial order
- trivially correct
- as long as you make your txns fast enough!
- otherwise your throughput will be abysmal
- SSI (Cahill, Röhm & Fekete)
- Postgres
- detect conflicts & abort
- optimistic counterpart to pessimistic 2PL
- Idea: locks like in 2PL
- but locks don’t block, just gather information
- microservices?
- stream processing?
- serializability across services?
- strong connection between all of these things
- coordination
- failure amplification
- geographic distribution makes this worse
- distributed txns don’t work particularly well
- so where are we now?
- compensating transactions
- ≅ abort/rollback at app level
- looks a bit like ACID A
- apologies
- detect & fix constraint violations after the fact, rather than preventing
- overselling beyond stock level; issue voucher instead
- looks a bit like ACID C
every sufficiently large deployment of microservices contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of transactions
- ordering of events
- unfriend
- post nastygram to remaining friends
- friends service and posts service are separate services
- notifications services pushing out emails etc, reading from posts and friends services
- reordering due to delays
- notification goes to old set of friends, not new set of friends. oops!
- implicit causal relationships leading to violated expectations
- things between strict serializability and eventual consistency
- causality:
- partial order of reads/writes
- can be maintained without global coordination
- “consistent snapshot” in SI
- consistent with causality
- but also efficient?
- isolation levels are divided into those that require coordination and those that don’t
- four slides’ worth!
- how did we end up with this mess? naming etc
- on the academic side, the database community and distributed systems community have historically been completely separate
- evolved separate language for similar concepts
- wouldn’t have many of current problems if they had talked to each other in 1980s
- SQL standard had problems with isolation levels
- failed to describe them in an implementation-neutral way
- this is why Postgres uses SI but calls it REPEATABLE READ
- also see hermitage, a tool to test what kind of anomalies different isolation levels of dbs allow