- I work for GDS
- Cabinet Office
- we started by building GOV.UK
- replaced older sites like direct gov, business link
- we’re not just fixing websites
- we also build and run digital services
- working with depts across the country
- eg: register to vote
- saw about 200k people register at the same time
- visibility on govt services
- I’m not here to talk about GDS
- go to Phil’s talk
- it’s been hard to go to a conf in the last two years without hearing about docker (docker docker docker)
- I want to demystify schedulers
- some people feel this is too much too soon
- mesos, kubernetes, etc
- when you should use schedulers
- and when you shouldn’t
- I see schedulers everywhere
- the job of a scheduler is to fairly allocate work
- fairness is a term of art
- schedulers act fairly based on their defined objectives
- example: schedule the expression E = AB + CD
- E = AB + CD
- E = (1*2) + (3*4) – explicit ordering
- tree of expressions
- addition is commutative, so order doesn’t matter
- unconstrained
- time-constrained
- all ops completed in a time-bound
- at least two steps
- resource-constrained
- if we only have one multiplication unit
- ASAP (as soon as possible)
- ALAP (as late as possible)
- list scheduling
- force-directed scheduling
- weighting (commonly time)
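- a minimal list-scheduling sketch for E = AB + CD (my illustration, not from the talk; hypothetical ops/units tables, one multiplier, one adder; ready ops issue in list order as units free up):

```python
# ASAP list scheduling for E = (A*B) + (C*D)
ops = {"m1": [], "m2": [], "add": ["m1", "m2"]}   # m1 = A*B, m2 = C*D
unit = {"m1": "mul", "m2": "mul", "add": "alu"}
capacity = {"mul": 1, "alu": 1}

done, cycle = set(), 0
while len(done) < len(ops):
    ready = [o for o in ops if o not in done and all(d in done for d in ops[o])]
    free = dict(capacity)
    issued = []
    for o in ready:                  # greedy: first ready op wins the unit
        if free[unit[o]] > 0:
            free[unit[o]] -= 1
            issued.append(o)
    done.update(issued)
    print(f"cycle {cycle}: {issued}")
    cycle += 1
# with one multiplier: cycle 0: ['m1'], cycle 1: ['m2'], cycle 2: ['add']
```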
- we don’t make the best use of our infrastructure resources
- decides which ops get CPU utilization right now
- challenge: make sure applications get a fair share of CPU time
- (called a “time slice”)
- let’s look at linux 2.4 cpu scheduler
- task list containing all processes, and work to be completed
- guarded by a r/w lock
- each process has a priority
- it had a number of shortcomings
- wasn’t efficient
- had to iterate through whole task list every time
- bad in SMP environments
- could easily pin a single CPU and leave others unused
- was rewritten
- remove global task list lock
- each process gets its own queue and its own lock
- rewritten again in 2.6
- called the Completely Fair Scheduler
- uses red/black tree instead of queues
- scheduling algorithms:
- FF SCHED_FIFO
- RR SCHED_RR
- TS SCHED_OTHER (_NORMAL)
nice
- runs a program with modified priority
- eg: nice -n 19 echo “hello”
chrt
- or “change real time”
- eg: chrt -p 16
- can change priority but also scheduling algorithm
ionice
- set IO scheduling priority
- eg: ionice -c3 -p89
- sets process with pid 89 as an idle io process
- many queues, locks and a scheduling algorithm over the top
- not just CPUs!
- eg in networking (see www.lartc.org)
tc
- traffic control
- eg:
tc qdisc add dev lo root netem delay 500ms
- sets queuing discipline
- network delay of 500ms (netem == network emulation)
- eg in postgres: has something called autovacuum
- automates execution of VACUUM and ANALYZE
- Kubernetes:
type Scheduler interface {
    Schedule(*api.Pod, ...)
}
- Docker swarm
- primitives
- CPU
- instances
- add your own: eg response time
- ephemeral node
- with a service on it
- different nodes have different properties in the cloud
- bad io
- noisy neighbours
- see @kelseyhightower’s talks too on container schedulers
- baggage reclaim
- a bunch of queues
- a lock
- you are the scheduling algorithm - you know what you care about: your bag
- baggage reclaim is controlled by a software scheduler
- premier league football matches
- domain-specific rules
- man city and man u can’t play at home on the same day
- (infrastructure can’t cope)
- nottingham, around trent bridge
- three stadiums next to each other
- can’t all operate at the same time
- people
- pomodoro
- work planning
- you can choose to hand over knowledge work to schedulers (but they’re not a silver bullet)
- can probably do better capacity planning
- who’s heard of unikernels?
- who’s actually used them? which ones?
- halvm
- mirage
- rump kernel
- talk resources
- software today
- is built locally, but deployed somewhere else
- is complex! (even though most apps are single-purpose)
- complexity is the enemy
- OCaml
- it has a good module system
- they’re simple, right?
- producer -> queue -> consumer
- they’re everywhere
- java.util.concurrent.Executor
- thread pool
- BlockingQueue on requests
- java.util.concurrent.BlockingQueue
- producer -> [notFull] [buffer] [notEmpty] -> consumer
- java.util.concurrent.locks.Condition
- software threads
- hardware threads
- more queues to schedule the threads
- queues are everywhere!
- queues separate the “what” from the “when”
- queueing theory: a branch of statistics, not software!
- Performance Modeling and Design of Computer Systems
- one-at-a-time
- shell prompt
- command
- output
- command
- output
- …
- single person browsing a website
- click link
- wait for page to load
- click link
- wait for page to load
- …
- requests come in
- while we’re serving one, another can come in at any point
- producers: exponential distribution
- it’s memoryless – the amount of time passed doesn’t change likelihood of producers arriving
- consumers: pareto distribution
- simulation examples
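- a toy simulation along these lines (my sketch, not the talk’s code; exponential inter-arrival times, Pareto service times, single consumer, unbounded queue, made-up numbers):

```python
import random

random.seed(42)
alpha, scale = 1.5, 0.3          # mean service = scale * alpha/(alpha-1) = 0.9
t = next_free = 0.0
waits = []
for _ in range(100_000):
    t += random.expovariate(1.0)                   # next producer arrives
    start = max(t, next_free)                      # queues if consumer is busy
    waits.append(start - t)                        # time spent waiting
    next_free = start + random.paretovariate(alpha) * scale
print("mean wait:", sum(waits) / len(waits))
print("max wait: ", max(waits))                    # heavy tail => huge outliers
```

- push the mean service time toward the arrival rate and the waits blow up, with rare huge outliers from the heavy tail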
- failure mode: load increases until queueing time dominates
- the more consumers you add, the longer it takes to fall over
- …but the more surprising it is when it does
- echoes of “why complex systems fail” here
- lesson: unbounded queues are fundamentally broken
- we’ve made assumptions about how users behave, but haven’t told them, or held them to it
- options:
- drop the data
- reject the data
- pause the data
- Drop
- only valid if newer data obsoletes older data, or if we just don’t care
- eg statsd uses UDP by default
- problem: when we’re most in need of high fidelity data, we’re least likely to have it, because UDP is best-effort and will degrade under load
- Reject
- eg twitter fail whale
- often, this is the only choice available
- Pause (aka backpressure)
- often the only choice for a closed system, library, or subsystem
- backpressure swims upstream
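- the three options, sketched with a bounded queue (my Python sketch; queue.Queue stands in for any bounded buffer):

```python
import queue

q = queue.Queue(maxsize=10)        # bounded buffer

def send_reject(item):
    try:
        q.put_nowait(item)         # reject: tell the producer "no"
        return True
    except queue.Full:
        return False

def send_drop(item):
    if not q.full():               # drop: silently discard when full
        q.put_nowait(item)

def send_pause(item):
    q.put(item)                    # pause: block the producer (backpressure)
```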
- back to java BlockingQueue:
- producer -> [notFull] [buffer] [notEmpty] -> consumer
- the extra pieces are for backpressure reasons
- you don’t even need the buffer here!
- producer -> [puts] [takes] -> consumer
- unbuffered queues are closed systems
- why have buffers if we don’t need them?
- because backpressure isn’t free
- restarting threads
- restarting TCP traffic
- reading from disk
- reading from dbs
- a buffer allows higher, more consistent throughput
- at a cost of potentially higher latency, because things sit on the buffer waiting to be dealt with
- we should always plan for too much data
- backpressure
- TCP bufferbloat - example of problem of too many composed buffers
- eg: a data system that ingests data and takes it to a safe at-rest place
- arbitrary volume of data -> ELB -> [thing] -> S3
- what goes in the middle?
- must horizontally scale
- must be low-maintenance
- not real-time
- loss of data ∝ loss of utility
- used:
- aleph -> hadoop s3 client -> s3
- but then:
- S3 starts rejecting writes
- hadoop starts accumulating writes waiting for S3 to come back
- replaced hadoop s3 client with durable-queue
- still had problems:
- still accumulated data in durable-queue without bound
- no visibility when this happened; no metrics
- had to start writing metrics for this
- netty’s threading model
- our typical metrics measure from when we start actually processing the request
- but this ignores the queueing delay!
- threadpools are not well instrumented
- dirigiste for better understanding of behaviour of thread pools
- distinguish task latency from queue latency
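- a sketch of how to measure that split (mine, not dirigiste’s API): timestamp at submit, not at processing start:

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def submit(fn, *args):
    enqueued = time.monotonic()                    # clock starts at submit...
    def wrapped():
        started = time.monotonic()                 # ...not at processing start
        result = fn(*args)
        done = time.monotonic()
        print(f"queue: {started - enqueued:.6f}s, task: {done - started:.6f}s")
        return result
    return pool.submit(wrapped)

submit(sum, range(1000)).result()
```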
- better metrics
- not complete – they can’t be
- whole system includes client
- be clear what “completion” is
- be picky about your tools
- watch for sneaky unbounded queues!
- prefer metrics to raw performance
- you’re never measuring the complete system
- eg chat server
- one-to-many queue
- timeouts
- unbounded queues aren’t. they’re bounded by something we’ve failed to define.
- how do you deal with pain from low defaults (eg in haproxy?)
- you need some documentation on what the knobs are and why you should care
- a one-size-fits-all parameter is bound to disappoint
- with the log-latency graphs you showed with the buffered queue example, is that a useful visibility tool for day-to-day systems we use?
- that’s my general intuition
- most people use graphite
- doing distributions in graphite is hard
- if we follow your example of using buffered queues, were there any metrics you feel we should definitely be tracking?
- utilization & contention
- once you get above 60-70% utilization, exhaustion gets quite likely
- queueing theory often assumes a fixed number of consumers; can it work in an autoscaling world?
- yes
- maybe cloud services resemble infinite consumers?
- can you do formal analysis of a dynamic system like this and predict exactly how it will behave? probably not
- better to test, benchmark, see what it does under load
- works at Prezi on the Elm language
- I care a lot about how proglang research from the last 30 yrs can be useful day-to-day on the web
- eg: why is immutability good?
- I take this from concurrent ML
- lightweight threads
- basic primitives:
channel: () -> Channel a
send: Channel a -> a -> ()
recv: Channel a -> a
spawn: (a -> ()) -> a -> ThreadId
- what does mutation look like when you model it as a concurrent system with these primitives?
type Msg a = Get | Set a
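- a sketch of that model in Python (mine; queues stand in for CML channels, with a daemon thread per variable):

```python
import threading, queue

# a "variable" is a thread owning the value; Get/Set arrive on its channel
def cell(initial):
    inbox = queue.Queue()
    def loop():
        value = initial
        while True:
            msg, payload = inbox.get()
            if msg == "get":
                payload.put(value)     # payload is the reply channel
            else:                      # "set"
                value = payload
    threading.Thread(target=loop, daemon=True).start()
    return inbox

x = cell(0)
x.put(("set", 42))
reply = queue.Queue()
x.put(("get", reply))
print(reply.get())                     # -> 42
```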
- “two spies in the park” approach to communication
- a new thread per variable
- becomes unmanageable
function BadGuy() {
var x = 0;
var y = 0;
var health = 100;
this.follow = function(hero) { ... };
this.damage = function(n) { ... };
}
- hides multiple mutable cells behind a single interface
- introduces hierarchy
- but: hierarchy ≠ modularity
- reference can be passed from one code unit to another
- introduces extra comm lines between subtrees of hierarchy
- getters and setters on objects
- setters mean readers have ability to bash on objects
- different kinds
- cooperative
- runs a thread until it’s done or it yields
- used by golang, …
- preemptive
- you can bail on something half-way through
- used by erlang, concurrent ML, …
- that old thing concurrency ≠ parallelism
- event loop
- “a poor man’s concurrency”
- javascript
setTimeout(callback, 0);
is the same as spawn callback()
- “Reactive” Programming
- “everything is a stream!”
type alias Stream a = Channel a
- build things up with combinators
map: (a->b) -> Stream a -> Stream b
- glitches
- diamond communication shape
diamond : Stream Float -> Stream Bool
diamond stream =
merge
(==)
(map sqrt stream)
(map sqrt stream)
- glitches cont’d
- looked at as a concurrent system, it’s clear that this will cause glitches
- there’s no reason to expect synchronization
- if you thought of your streams as a concurrent system, you probably wouldn’t write it this way in the first place
- FlatMap is weird!
flatMap : (a -> Stream b) -> Stream a -> Stream b
- it is kind of like a threadpool, but really bizarrely done
- is it even a reasonable use of flatmap?
- dependencies between streams become a rat’s nest
- hero depends on keyboard
- hero’s health depends on location, bad guys, etc
- bad guys depend on hero
- want to chase him
- pause screen depends on
- hero’s health
- keyboard, mouse
- hero depends on pause screen
- maybe I can apply a potion to increase health
- this is crazy
- writing high-quality concurrent programs is hard
- everyone is writing accidentally concurrent programs
- immutability
- means no accidental communication channels
- revealing information doesn’t change your architecture
- what about Rust’s approach to mutability where it just emphasizes ownership?
- ownership and encapsulation are separate concerns
- @astrieanna
- http://blog.leahhanson.us
- julia is a programming language
- “as fast as C” to “2x slower than C”
- julia is for scientists
- engineers, mathematicians, finance people
- non-professional programmers who use programming as a tool
- least time programming for maximum benefit
- what do they need?
- easy to learn
- easy to use
- good for scripts
- good for libraries
- fast enough for medium to large data sets
- fast, extensible math
- linear algebra
- new numeric types
- easy and fast
- how is it better than what they already use? (eg numpy)
- need to learn python
- need to add your own function that doesn’t yet exist
- implement in python
- too slow
- (numpy’s things are written in C with python wrappers)
- learning C is not a good use of our users’ time
- fast julia code is written in julia
- dynamic, interpreted, easy
- JIT compilation
- at run time, you alternate between compiling and running code
- compiler needs to be fast
- static compilers can be slow because they don’t contribute to runtime
- this is no longer true for JITs
- compiler has access to runtime information
- julia is designed up front to be JIT compiled
- type system
- abstract or concrete types
- concrete types:
- can be instantiated
- determine mem layout
- types can’t be modified after creation
- can’t have subtypes
- single supertype
- abstract types:
- can’t instantiate
- new subtypes can be added at any time
- single supertype
- zero or more subtypes
- multiple dispatch
- convenience for extensible math
- single dispatch is method dispatch we’re familiar with
x.foo(y) – code impl depends on type of x
- multiple dispatch can depend on any param:
foo(x,y) – code impl depends on types of x and y
- all named fns are generic
- each fn has one or more methods
x = ModInt(3,5)
x + 5 # ModInt + Int
5 + x # Int + ModInt
function Base.+(m::ModInt, i::Int64)
return m + ModInt(..)
end
- this just can’t be done nicely with single dispatch
- dispatch happens a lot, so needs to be really fast
foo(5) # dispatch, compile, run
foo(6) # dispatch, run
foo(7) # dispatch, run
- possible signatures:
- foo(Int64)
- foo(Number)
- foo(Float64)
- foo(Int64,Int64)
- do we need to keep redispatching? aren’t we always going to pick foo(Int64)?
- add a cache
foo(5) # cache, dispatch, compile, run
foo(6) # cache, run
foo(7) # cache, run
- call into hashmap rather than going through a list
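- roughly this shape (my Python sketch, not Julia’s actual implementation — real dispatch walks a type lattice; the cache is keyed on concrete argument types):

```python
methods = {}    # (name, concrete arg types) -> implementation
cache = {}      # filled on first call, hit on every later call

def defmethod(name, types, impl):
    methods[(name, types)] = impl

def call(name, *args):
    key = (name, tuple(type(a) for a in args))
    impl = cache.get(key)
    if impl is None:                   # "dispatch" (hypothetical: exact match only)
        impl = methods[key]
        cache[key] = impl              # later calls skip dispatch entirely
    return impl(*args)

defmethod("add", (int, int), lambda x, y: x + y)
defmethod("add", (int, float), lambda x, y: x + y)
print(call("add", 1, 2))               # dispatch, fill cache, run
print(call("add", 3, 4))               # cache hit, run
```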
- define Base.* in terms of +(Number,Number) and -(Number,Number)
- C++: templates vs dynamic dispatch
- dynamic dispatch is generic
- single Base.* impl to cover all Number instances
- templates do aggressive specialization
- individual Base.* impl for each Number instance
- tradeoff: code size vs speed
- dynamic dispatch is generic
- inlining
- makes more optimizations available
- not reusable
- devirtualization
- write down the IP to avoid DNS
- ie hardcode the method implementation rather than doing dispatch
- why do we normally dispatch?
- we’re avoiding dispatch as much as possible for performance
- worth knowing why we have it in the first place
- repl work
- redefine a function to fix a bug, want callers to reflect changes
- issue 265
- non-concrete types have to be boxed
- immutable types
- kind of concrete type
- once created, value can’t change
- boxed immutable values are bad for perf
- immutable types can’t appear to change
- we have to copy them each time we want to change them
- new heap allocation
- julia has lisp-style macros
- macros are evaluated at compile time
- macros should be used sparingly
- Horner’s rule
- a*x*x + b*x + c
- skip a multiply with (a*x + b)*x + c
- for a cubic: 6 multiplies down to 3
- for a quartic: 10 multiplies down to 4
- etc (O(n^2) down to O(n))
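- Horner evaluation is just a fold over the coefficients (my sketch; the macro version unrolls this with the coefficients inlined):

```python
from functools import reduce

def horner(coeffs, x):
    # coeffs highest-degree first: [a, b, c] means a*x^2 + b*x + c
    return reduce(lambda acc, c: acc * x + c, coeffs)

print(horner([1, 2, 3], 10))           # (1*10 + 2)*10 + 3 = 123
```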
- Horner’s rule can be implemented as a macro
- taken from PR 2987
- 4x faster than matlab
- 3x faster than scipy
- both of which call C/Fortran libs
- Julia is faster than C!
- compiled julia methods will have inlined constants, which are very optimizable
- C would involve a run-time loop over the array of coeffs
- LLVM can optimize the julia better than the C
- an abstraction which makes things faster
- scientists like easy & fast
- dynamic for easy (but with limits)
- compiled for fast
- multiple dispatch for making fast code concise
- what about R?
- R is slow for things that need loops (like monte carlo)
- does julia have support for openmp?
- I think so?
- you can call libs that use OpenMP
- there’s experimental support for OpenMP within julia
- who needs booleans?
- a boolean just makes a choice! they can just be functions!
true x y = x
false x y = y
- we can then define if-then-else
ifte bool t e = bool t e
- who needs integers?
- a (positive) integer just counts loop iterations
two f x = f (f x)
one f x = f x
zero f x = x
- to recover a “normal” integer…
two (+1) 0
- gives 2!
add m n f x = m f (n f x)
mul m n f x = m (n f) x
- factorial!
fact n =
ifte (iszero n)
one
(mul n (fact (decr n)))
This means: booleans, integers, (and other data structures) can be entirely replaced by functions!
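- the whole encoding runs fine in any language with closures — eg this Python transcription (mine):

```python
true  = lambda x: lambda y: x          # a boolean makes a choice
false = lambda x: lambda y: y
ifte  = lambda b: lambda t: lambda e: b(t)(e)

zero = lambda f: lambda x: x           # an integer counts loop iterations
one  = lambda f: lambda x: f(x)
two  = lambda f: lambda x: f(f(x))
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
mul  = lambda m: lambda n: lambda f: lambda x: m(n(f))(x)

to_int = lambda n: n(lambda i: i + 1)(0)   # "two (+1) 0" gives 2!
print(ifte(true)("yes")("no"))             # -> yes
print(to_int(add(two)(two)))               # -> 4
print(to_int(mul(two)(add(one)(two))))     # -> 6
```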
- “cannot construct the infinite type…”
- fact would need a type like (forall a. (a->a) -> a -> a) -> …
- LISP!
- This got lots of people excited!
- esp Peter Landin (see The Next 700 Programming Languages and ISWIM)
(maplist f (reverse l)) ≡ (reverse (maplist f l))
- …only if f is side-effect free
- John Backus, “Can Programming Be Liberated from the von Neumann
Style? A Functional Style and Its Algebra of Programs”
- Turing award 1977; paper 1978
- “Inherent defects at the most basic level cause them [conventional languages] to be both fat and weak”
- defects such as:
- inherited from the von Neumann machine
- the von Neumann way of designing machines has a terrible effect on writing programs
- don’t focus on the word-at-a-time bottleneck; focus on whole values
- what we would today call “higher order functions”
- he called map f “αf”
- [f1,f2,..,fn] ∘ g ≡ [f1 ∘ g, f2 ∘ g,…, fn ∘ g]
- g ∘ [f1,f2,..,fn] ≡ [g ∘ f1, g ∘ f2,…, g ∘ fn]
Got a bit APL-y:
Def IP = (/ +) ∘ (α ×) ∘ Trans
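- transcribed into Python for the curious (my sketch): Trans is zip, α× maps multiply over the pairs, /+ folds with addition:

```python
from functools import reduce

def inner_product(xs, ys):
    pairs = zip(xs, ys)                           # Trans
    products = map(lambda p: p[0] * p[1], pairs)  # α ×
    return reduce(lambda a, b: a + b, products)   # / +

print(inner_product([1, 2, 3], [4, 5, 6]))        # -> 32
```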
- functional geometry
- escher’s square limit fish
- devised an algorithmic definition of escher’s picture
- operators on pictures such as rot, over, above
- higher-order operators: cycle, quartet
- then: define pictures as functions
- p(a,b,c) is the picture at position a, with height along vector b and width along vector c
- other combinators become easy to describe here:
- eg over (p,q) (a,b,c) becomes p(a,b,c) ∪ q(a,b,c)
- this functional geometry avoids Backus’s defects:
- combining forms
- operations on whole values
- algebra with laws
- and as a bonus: functions as representations
- taking it further: Conal Elliot’s Pan
- Haskell: Paul Hudak got DARPA money to work on Haskell as a prototyping language
- functions as data
type Region = Point -> Bool
- predicate representation of set of points
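- eg (my sketch of the idea): regions as predicates, combinators as set operations:

```python
def disc(cx, cy, r):                   # a Region is just a predicate
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r * r

def union(r1, r2):
    return lambda p: r1(p) or r2(p)    # combinators are set operations

def intersect(r1, r2):
    return lambda p: r1(p) and r2(p)

shape = union(disc(0, 0, 1), disc(2, 0, 1))
print(shape((2, 0)))                   # -> True
print(shape((1, 1)))                   # -> False
```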
- reactions…
- “too cute for its own good”
- “higher-order functions just a trick, probably not useful in other contexts”
- Henderson and Morris, A lazy evaluator
- Friedman and Wise, CONS should not evaluate its arguments
- “The Whole Value” can be infinite!
- the infinite list of natural numbers [0, 1, 2, 3, …]
- all the iterations of a function
iterate f x = [x, f x, f (f x), ...]
- some things can be expressed very nicely with lazy eval
- eg newton-raphson, numerical derivative
- John Hughes, Why Functional Programming Matters
- lazy producer-consumer pattern
- producer: successive approximations of a numerical algorithm
(eg derivative, integral)
- consumer: limit detection
- producer: search space
- consumer: search strategy
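- Python generators give the same lazy producer-consumer split (my sketch of the Hughes examples):

```python
def iterate(f, x):                     # producer: x, f x, f (f x), ...
    while True:
        yield x
        x = f(x)

def within(eps, seq):                  # consumer: detect the limit
    prev = next(seq)
    for cur in seq:
        if abs(cur - prev) <= eps:
            return cur
        prev = cur

def sqrt_newton(n):                    # newton-raphson as producer + consumer
    return within(1e-12, iterate(lambda x: (x + n / x) / 2, 1.0))

print(sqrt_newton(2.0))                # -> 1.4142135623730951
```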
- pretty printing
- make choices between horizontal and vertical layout
- vertical: preserve indentation
- horizontal: discard indentation
- combinators:
- a $$ b puts a above b; a <> b puts a beside b
- combinators obey laws:
- <> and $$ are both associative
- once we understand the laws, the implementation can use the laws to make the best layout
- Koen Claessen, John Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”
- randomly generate test cases from properties (properties based on laws)
- then shrink failing test cases to minimal counterexamples
- This is another lazy producer-consumer!
- producer: space of all possible tests
- consumer: QuickCheck search strategy
- Introduction to VLSI systems
- was intended to revolutionize programming
- did revolutionize hardware design
- A VLSI circuit is not a million miles away from Escher’s square
limit!
- we used functional geometry ideas to design VLSI circuits
- we need to be certain we haven’t made a mistake as the feedback cycle from the fab was too slow (9 months)
- use Backus’s FP to create a language that describes streaming
circuits
- then use algebraic laws of the language to optimize the circuits
- symbolic simulator to test
- muFP
- hardware sorting based on doubly-recursive definition:
- having a language of combining forms makes it much easier to
reason about sorting networks
- though there are still outstanding unsolved problems
- Satnam Singh: Lava language http://blog.raintown.org/p/lava.html
- Pentium float division bug
- very costly mistake
- they suddenly got interested in formal verification
- forte
- bluespec bsv
- bluecheck
- QuickCheck in hardware design!
- iterative deepening and shrinking on FPGA designs!
- functions as representations
- whole values
- simple laws
- combining forms
- subtitle: Adventures in Java Concurrency
- @mjpt777
- latency: responding in a timely manner
- the more things you try to juggle, the harder it becomes to predict how responsive you can be
- if you don’t respond in a timely manner, you’re effectively unavailable
- concurrent algorithms
- notifying completion
- mutex
- synchronization/rendezvous
- systemic pauses
- JVM Safepoints (GC etc)
- Transparent Huge Pages (linux)
- Hardware
- two approaches: locks and CAS
- both are hard to get right!
- queueing theory
- response time is sum of:
- enqueue time
- latent time (time in queue)
- dequeue time
- service time (time being processed)
- queueing theory focuses on latent time and service time, but sometimes enqueue and dequeue dominate!
- relationship between utilization and increased response time
- evils of blocking
- condition variables
- condition variable RTT (echo service)
- request: ping; response: pong
- measure histogram of response times
- don’t use averages! they lie!
- 90th percentile: 8 µs
- 8 µs is loads of time to talk between two threads!
- 2 µs is enough to go between two machines on the same network!
- but it gets worse at higher percentiles
- max is 5525 µs!
- bad news: this is the best case scenario!
- context for benchmarks:
- https://github.com/real-logic/benchmarks
- java 8u60
- quad core machine
- producers: 1, 2 or 3 (can’t have more on a quad-core machine!)
- measuring: mean and 99th percentile
- baseline (wait-free) vs ArrayBlockingQueue vs LinkedBlockingQueue vs ConcurrentLinkedQueue
- burst length: 1 then 100
- real world traffic is bursty
- send a burst of 100; wait for 1 ack
- other considerations:
- Backpressure (queues should have bounded size)
- size methods
- flow rates
- garbage
- fan out
- in the real world, I cannot use java’s built-in queues, partly because they lack features I need
- if you call .size() on ConcurrentLinkedQueue, it walks the whole list
- get over the garbage problem
- preallocate a ring buffer and reuse it
- producers claim a slot using CAS
- once a slot is claimed, can write to the slot
- once written, can update the cursor
- consumer updates gating
- no need for locks!
- wrinkle: how do I deal with setting the cursor?
- I can’t update cursor to n before it has reached n-1
- thread claims slot; gets swapped out before writing to slot
- blocks whole queue until thread swaps back in (many ms!)
- cursor is updated in a CAS loop
- add an “available” array to mark when things are written
- made a huge difference
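- the claim-then-publish shape, sketched in Python (mine; Python has no CAS, so a lock stands in for the atomic counter bump, and wrap/gating is ignored):

```python
import threading

SIZE = 1024
buffer = [None] * SIZE
available = [False] * SIZE             # per-slot "written" markers
cursor = 0
claim_lock = threading.Lock()          # stands in for CAS/XADD on the counter

def publish(msg):
    global cursor
    with claim_lock:                   # claim a slot...
        slot = cursor
        cursor += 1
    buffer[slot % SIZE] = msg          # ...write to it...
    available[slot % SIZE] = True      # ...then mark it published

def consume(slot):
    while not available[slot % SIZE]:  # a claimed-but-unwritten slot no longer
        pass                           # blocks the whole queue; just spin on it
    msg = buffer[slot % SIZE]
    available[slot % SIZE] = False
    return msg

publish("hello")
print(consume(0))                      # -> hello
```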
- disruptor isn’t best used as a queue (better for coordinating graph of dependencies)
- ManyToOneConcurrentArrayQueue (influences: Lamport + Fast Flow)
- link?
- tail: CAS counter
- head: counter
- reference ring buffer
- Disruptor and ManyToOneConcurrentArrayQueue behave better in low contention
- gets worse under the bursty case
- each over 50 µs at this stage
- spinning CAS loops are the great equalizer
- ring buffer
- spinning CAS on tail counter
- write message once space claimed
- write header to indicate completion
- (zero out memory on consumption to make this work)
- MPSC Ring Buffer
- latency is good but throughput takes a hit from zeroing memory
- http://highscalability.com/blog/2014/11/17/aeron-do-we-really-need-another-messaging-system.html
- we wanted a CRDT
- replicated to another machine
- may be reordered or duplicated
- want to reassemble deterministically
- big shared-memory file
- tail pointer (but no head this time)
- advance tail, write message, write header (just as above)
- does the file really just go on forever?
- problems with this: page faults; page cache churn; VM pressure
- wash one, wear one, dry one
- active file, dirty file, clean file
- XADD instruction
- available in JVM now
- ring buffers can’t use this because the tail can overtake the head
- if you move the tail beyond the end of the buffer, you can detect this
- your responsibility is to rotate the buffer now
- messages persist while active & dirty buffers are around
- append-only data structures can be safe to read without locks
- they can detect when they’ve reached the end
- zeroing problem from before: how do we avoid the throughput
problem?
- do it in a background thread!
- the old “dirty” becomes the new “clean”
- aeron IPC is slower than ManyToOneConcurrentArrayQueue in the happy case (no burst, one producer, average time)
- for bursty, 3 producers, 99th percentile:
- ManyToOneConcurrentArrayQueue breaks below 50µs to 34µs
- less “False Sharing”
- inlined data vs Reference Array
- less “Card Marking”
- Aeron gets even further down to 13µs (ish?)
- also mean is 10µs so this is phenomenal predictability
- avoids the spinning CAS
- the abominations we have in Java for logging are disgusting!
- the APIs, the design, etc
- what design should we use?
- C vs Java
- Spin loops
- modern processors do out-of-order speculation
- with spin loops, they speculate wrong
- Proposed: Thread.spinYieldHint()
- data dependent loads
- heap aggregates: objectlayout
- stack allocation - value types
- memory copying
- baseline against memcpy() for differing alignment
- queue interface
- break conflated concerns to reduce blocking actions
- API design mistakes
- conflates “is something available?” with “is it empty?”
- should not have used normal collections API for concurrent stuff
- AWS recently announced x1 instances
- 2TB RAM + 100+ vcores!
- lots of spinning wheels
- can you tell us more about XADD in java?
- AtomicInteger, AtomicLong, etc (getAndAdd maps to XADD)
- “Consistency guarantees? hahaha” (cue manic laughter)
- I’m writing a book!
- @martinkl
- 1975: IBM System R
- 2008? NoSQL
- 2012? “NewSQL”
- eg SQL on top of HBase
- really a movement of transactions-or-not
- Härder & Reuter, 1983
- “More mnemonic than precise” - Brewer, 2012
- archive tape?
- fsync to disk?
- replication?
- not the same as C in CAP Theorem!
- aside: critique of cap theorem
- “tossed in to make the acronym work”
- not the same as in “atomic compare-and-swap”
- not about concurrency
- it’s about fault tolerance
- rolls back writes on abort
- better word might have been Abortability
- aborts are super helpful because they collapse a whole bunch
of error classes:
- crashes, power failures
- constraint violation
- network fault
- deadlock
- this is where we start talking about concurrency
- = SERIALIZABLE?
- in practice, multiple isolation levels
- Repeatable Read
- Snapshot Isolation
- Read Committed
- Read Uncommitted
- can anyone here describe the difference off the top of their head? (one hand goes up)
- this is a configurable property of many databases
- though many don’t go as high as Serializable
- and some call it Serializable but are actually weaker than this
- read committed prevents these anomalies
- default in postgres, oracle, …
- that’s enough for transactions, right?
- what are the other weird isolation levels?
- this can happen under read committed!
- concurrent transactions:
- txn 1 (writer): add 100 to x; subtract 100 from y; commit
- txn 2 (reader): read y; read x
- reader can see old value of y, but new value of x
- Repeatable Read and Snapshot Isolation were invented to prevent this
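- the interleaving as a runnable toy (my sketch; a dict stands in for the database, no real transactions):

```python
db = {"x": 500, "y": 500}              # invariant: x + y == 1000

def writer():                          # the concurrent transaction
    db["x"] += 100
    db["y"] -= 100

y_seen = db["y"]                       # reader: old y (500)
writer()                               # writer commits in between the reads
x_seen = db["x"]                       # reader: new x (600)
print(x_seen + y_seen)                 # -> 1100: a total that never existed
```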
- Traditional RR implementation is based on locking
- SI is used more often now to avoid locks
- poorly named in most DBs
- invariant: at least one doctor on call at all times
- doctors can trade shifts
- transaction:
- count how many doctors are on call
- if >=2:
- update
- commit
- but: two txns operate concurrently for two different doctors:
- both see n >= 2
- both remove themselves from rota
- invariant violated!
- pattern:
- measure
- decide
- write
- commit
- by the time the write is committed, the premise of the decision is no longer true!
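- the write-skew pattern as a runnable toy (my sketch; sets stand in for snapshots, and both commits succeed because they touch different rows):

```python
on_call = {"alice", "bob"}             # invariant: at least one doctor on call

def trade_shift(doctor, snapshot):
    if len(snapshot) >= 2:             # measure + decide against the snapshot
        return ("remove", doctor)      # write derived from a stale premise
    return None

snap_a = set(on_call)                  # both txns read the same snapshot
snap_b = set(on_call)
for write in (trade_shift("alice", snap_a), trade_shift("bob", snap_b)):
    if write:                          # both commit: no write-write conflict
        on_call.discard(write[1])

print(on_call)                         # -> set(): invariant violated
```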
- solutions
- two-phase locking (pessimistic)
- shared lock on measure step
- block entire table for writes, whenever you make a large read
- poor for perf
- motivation for weaker isolation levels
- System R team just didn’t hold locks for so long
- this is what created their crazy isolation levels
- H-Store (Stonebraker et al)
- VoltDB, datomic
- literally execute your txns in a serial order
- trivially correct
- as long as you make your txns fast enough!
- otherwise your throughput will be abysmal
- SSI (Cahill, Röhm & Fekete)
- Postgres
- detect conflicts & abort
- optimistic counterpart to pessimistic 2PL
- Idea: locks like in 2PL
- but locks don’t block, just gather information
- microservices?
- stream processing?
- serializability across services?
- strong connection between all of these things
- coordination
- failure amplification
- geographic distribution makes this worse
- distributed txns don’t work particularly well
- so where are we now?
- compensating transactions
- ≅ abort/rollback at app level
- looks a bit like ACID A
- apologies
- detect & fix constraint violations after the fact, rather than preventing
- overselling beyond stock level; issue voucher instead
- looks a bit like ACID C
every sufficiently large deployment of microservices contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of transactions
- ordering of events
- unfriend
- post nastygram to remaining friends
- friends service and posts service are separate services
- notifications services pushing out emails etc, reading from posts and friends services
- reordering due to delays
- notification goes to old set of friends, not new set of friends. oops!
- implicit causal relationships leading to violated expectations
- things between strict serializability and eventual consistency
- causality:
- partial order of reads/writes
- can be maintained without global coordination
- “consistent snapshot” in SI
- consistent with causality
- but also efficient?
- isolation levels are divided into those that require coordination and those that don’t
- four slides’ worth!
- how did we end up with this mess? naming etc
- on the academic side, the database community and distributed systems community have historically been completely separate
- evolved separate language for similar concepts
- wouldn’t have many of current problems if they had talked to each other in 1980s
- SQL standard had problems with isolation levels
- failed to describe them in an implementation-neutral way
- this is why Postgres uses SI but calls it REPEATABLE READ
- also see hermitage, a tool to test what kind of anomalies different isolation levels of dbs allow