Time, Clocks, and the Ordering of Events in a Distributed System
Consensus on Transaction Commit
An empirical study on the correctness of formally verified distributed systems
Byzantizing Paxos by Refinement
Impossibility of Distributed Consensus with One Faulty Process
an asynchronous consensus algorithm cannot be guaranteed to be both safe and live. This is called the FLP Impossibility Result
Blockchains from a Distributed Computing Perspective ht
Seeing is Believing: A Client-Centric Specification of Database Isolation
CASPaxos: Replicated State Machines without logs
On Designing and Deploying Internet-Scale Services
Composition: A Way to Make Proofs Harder mentioned here
Paxos Made Live - An Engineering Perspective.
Red Hat contributes etcd, the cornerstone of Kubernetes, to the Cloud Native Computing Foundation. HN.
Building Consistent Transactions with Inconsistent Replication.
A generalised solution to distributed consensus. HN
Atul Adya’s PhD thesis. Natacha Crooks.
Atul Adya’s PhD thesis gives a precise definition of the SQL standard isolation levels based on how reads and writes from different transactions may be interleaved. However, these definitions are given from the point of view of the system. The recent work by Natacha Crooks et al. gives elegant and precise definitions from the point of view of the user.
Interactive checks for coordination avoidance
scaling replicated state machines
Because serializability allows arbitrary reordering of operations (so long as the order appears atomic), it is not particularly useful in real applications. Most databases which claim to provide serializability actually provide strong serializability, which has the same time bounds as linearizability. To complicate matters further, what most SQL databases term the SERIALIZABLE consistency level actually means something weaker, like repeatable read, cursor stability, or snapshot isolation.
Consistency means linearizability, and in particular, a linearizable register.
Serializability, linearizability, and locality
Difference between Linearizability and Serializability
Linearizability and Serializability in context of Software Transactional Memory
Linearizability versus Serializability
Serializability is the traditional “I,” or isolation, in ACID. If users’ transactions each preserve application correctness (“C,” or consistency, in ACID), a serializable execution also preserves correctness. Therefore, serializability is a mechanism for guaranteeing database correctness.
Granted, serializability is (more or less) the most general means of maintaining database correctness. In what’s becoming one of my favorite “underground” (i.e., relatively poorly-cited) references, H.T. Kung and Christos Papadimitriou dropped a paper in SIGMOD 1979 on “An Optimality Theory of Concurrency Control for Databases.” In it, they essentially show that, if all you have are transactions’ syntactic modifications to database state (e.g., read and write) and no information about application logic, serializability is, in some sense, “optimal”: in effect, a schedule that is not serializable might modify the database state in a way that produces inconsistency for some (arbitrary) notion of correctness that is not known to the database.
Combining serializability and linearizability yields strict serializability: transaction behavior is equivalent to some serial execution, and the serial order corresponds to real time. For example, say I begin and commit transaction T1, which writes to item x, and you later begin and commit transaction T2, which reads from x. A database providing strict serializability for these transactions will place T1 before T2 in the serial ordering, and T2 will read T1’s write. A database providing serializability (but not strict serializability) could order T2 before T1.
As Herlihy and Wing note, “linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.”
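To make the real-time constraint concrete, here is a small illustrative sketch (timestamps and transaction names are made up, not taken from any of the linked posts) that checks whether a proposed serial order could be a strictly serializable one:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    name: str
    start: float   # invocation (begin) time
    commit: float  # completion time

def respects_real_time(serial_order):
    """A serial order is admissible under strict serializability only if it
    never places T before U when U already committed before T even began."""
    for i, t in enumerate(serial_order):
        for u in serial_order[i + 1:]:
            if u.commit < t.start:  # u finished before t began, yet is ordered after t
                return False
    return True

t1 = Txn("T1 writes x", start=1.0, commit=2.0)
t2 = Txn("T2 reads x",  start=3.0, commit=4.0)

print(respects_real_time([t1, t2]))  # True: the only strictly serializable order
print(respects_real_time([t2, t1]))  # False: allowed by plain serializability, not strict
```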
Linearizability, serializability, transaction isolation and consistency models
Please stop calling databases CP or AP
Moreover, databases with snapshot isolation/MVCC are intentionally non-linearizable, because enforcing linearizability would reduce the level of concurrency that the database can offer. For example, PostgreSQL’s SSI provides serializability but not linearizability, and Oracle provides neither. Just because a database is branded “ACID” doesn’t mean it meets the CAP theorem’s definition of consistency. Is this quote really correct?
Serializable but not linearizable
How does two phase commit recover from a participant's failure?
How ACID is the two-phase commit protocol?
Notes on distributed systems for youngbloods some additional commentary
Testing Distributed Systems for Linearizability
Scalability Cheatsheet: The Road to Paxos
The Limits of the CAP Theorem HN
The only time that a CAP-Available system would be available when a CAP-Consistent one would not is when one of the datacenters can’t talk to the other replicas, but can talk to clients, and the load balancer keeps sending it traffic.
the 'A' in CAP is boring. It does not mean what you think it means. Lynch et al. probably chose the definition because it's one for which the 'theorem' is both true and easy to prove. This is not the impossibility result with which designers of distributed systems should be most concerned.
The CAP Theorem - Foundation DB
Clarifications On The CAP Theorem And Data-Related Errors
Paxos actually solves a more general problem than 3PC.
Consensus on Transaction Commit
Delivering Billions of Messages Exactly Once HN
I don't want to ever see the phrase "Exactly Once" without several asterisks behind it. It might be exactly once from an "overall" point of view, but the client effectively needs infinitely durable infinite memory to perform the "distributed transaction" of acting on the message and responding to the server.
how do you cut a monolith in half? HN
Exactly-once Semantics are Possible: Here’s How Kafka Does it reddit nope
Providing API for building applications that have transactions and help with idempotent producing is really exciting. This will help lower a lot of pains associated with building stream processing systems. Doing it with very little performance overhead is amazing.
Indeed. Idempotent operations are the mainstay of managing "at least once" message delivery. If you had a true "exactly once" system, idempotency would be unnecessary.
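Since "make the handler idempotent" is the standard answer to at-least-once delivery, a minimal sketch of the pattern (the message shape and `apply_side_effect` are hypothetical stand-ins; a real system would keep the processed-ID set in durable storage, ideally updated in the same transaction as the effect itself):

```python
def apply_side_effect(message):
    print("applying", message["id"])    # stand-in for the real business logic

processed_ids = set()                   # in production: durable storage keyed by message ID

def handle(message):
    if message["id"] in processed_ids:  # duplicate delivery: safe to drop
        return
    apply_side_effect(message)          # do the work
    processed_ids.add(message["id"])    # record only after the effect succeeds

handle({"id": "m-1"})
handle({"id": "m-1"})                   # redelivered: second call is a no-op
```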
Manage Kubernetes Clusters on AWS Using Kops
The problem with renting boxes is the hidden costs if you want to do it right. First of all, if you have anything mission critical, you need to run it in a high-availability config. This is easy for stateless microservices, but when it comes to running your DB, you start renting three boxes instead of one or two and configuring them accordingly. Then you set up your backup infrastructure for disaster recovery, since Glacier needs a replacement after all. No problem, just more disks(?) on a few more boxes(?) and Bacula(?), ideally in a different datacenter just to be on the safe side; it would be nasty if your whole rack got fried and your data with it. Don't forget to back up your configuration, all of it: load balancers, server environment variables, network (do you have an internal DNS?), crontabs, and some businesses need their audit logs stored, etc.

On the infrastructure level there is lots and lots of stuff you can do, and you won't ever really need AWS; you'll just spend significantly more time finding and administering the right solutions than you would using the AWS solutions, where you'll find a treasure trove of great tutorials and can relatively cheaply pay for support. If you then pay someone on top for 24/7 management/monitoring of your dedicated stack, so that your team doesn't have to get up at 3 am because one of your VM's disks fills up with some stray logfile, many of the savings you had by setting it up on a dedicated server go out of the window, because the management partner needs to train their people to look into your infrastructure. AWS-only management partners are just light-years cheaper because they can streamline their processes much better. You could also hire your own team of admins...

Sure, AWS is a beast with its own surprises, but overall the cost/benefit ratio is still very fair even if you factor in all the "surprises" (many of which your management partner will probably know about). Having layered support is really beneficial as well. If something is wonky with RDS, you get to call your management partner (if they didn't detect it before you), who, if they can't tackle it themselves, can call AWS technicians. This gets you much, much further than you would get elsewhere; the rest of the world is paying for (for example) Percona's consultants or someone similar if the problems grow over their team's head.

Sure, at some point in a company's growth, depending on how technical the operation is, there might be a time where an admin team and colocation/dedicated boxes make sense, and where AWS technicians will scratch their heads, especially if you have some very, very specific tasks you need to do. But for most people this is far off, if ever.
Serializability, linearizability, and locality
Recall that strict serializability is essentially serializability plus linearizability’s real-time constraint: transactions cannot be arbitrarily re-ordered, but must appear to take place atomically at some time between their invocation and completion. When we add real-time constraints to sequential consistency, we get linearizability: a local property. Why can’t we add real-time constraints to serializability and obtain locality? Why don’t real-time multi-object transactions compose?
We can view strict serializability as linearizability plus the multi-object transactions of serializability. But in another sense, linearizability is strict serializability with the constraint that transactions are constrained to act on a single object, because that restriction provides locality.
Linearizability: A Correctness Condition for Concurrent Objects (1990)
Sequential consistency is equivalent to linearizability without condition L2. Serializability is analogous to sequential consistency with transactions, and strict serializability is analogous to linearizability with transactions. Sequential consistency, serializability, and strict serializability do not have the same locality and non-blocking properties as linearizability. Moreover, serializability and linearizability are for different domains. Serializability works well for databases because application developers should be able to easily express complex transactions. Linearizability is better for infrastructure in which the developer is willing to spend considerable effort to maximize concurrency.
Consensus Protocols: Two-Phase Commit
Consensus Protocols: Three-Phase Commit
3PC works very well when nodes may crash and come to a halt – leaving the protocol permanently when they encounter a fault. This is called the fail-stop fault model, and certainly describes a number of failures that we see every day. However, especially in networked systems, this isn’t the only way in which nodes crash. They may instead, upon encountering a fault, crash and then recover from the fault, to begin executing happily from the point that they left off (remember that, with stable storage that persists between crashes, there’s no reason that a restarted node couldn’t simply pick up the protocol from where it crashed). This is the fail-recover fault model, which is more general than fail-stop, and therefore a bit more tricky to deal with.
Similarly, heavy network load can delay the delivery of a message for an arbitrary period. In a synchronous network model, there is a bound on the amount of time a remote host can take to process and respond to a message. In an asynchronous model, no such bound exists. The key problem that asynchronicity causes is that timeouts can no longer be reliably used as a proxy for failure detection.
There are other ways that Paxos can go wrong. Acceptors need to keep a record of the highest proposal they have agreed to, and the value of any proposals they have accepted, in stable storage. If that storage should crash then the acceptor cannot take part in the protocol for fear of corrupting the execution (by effectively lying about the proposals it has already seen). This is a kind of Byzantine failure, where an acceptor deviates from the protocol.
Rather than sacrifice correctness, which some variants (including the one I described) of 3PC do, Paxos sacrifices liveness, i.e. guaranteed termination, when the network is behaving asynchronously and terminates only when synchronicity returns.
In distributed systems, what is a simple explanation of the Paxos algorithm?
3PC is fail-stop resilient, but not fail-recover resilient. Unfortunately real life requires fail-recover and hence we need a more general solution. This is where Paxos comes in.
What happens if we mandate only one Leader at a time in Paxos, and also mandate that instead of majority, all nodes must vote? You are right – we get 2PC. 2PC is a specific case of Paxos.
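A toy coordinator makes that comparison concrete: 2PC needs a unanimous yes vote rather than a Paxos-style majority, and it has no answer for a participant that goes silent mid-protocol. The `Participant` class below is purely illustrative, not from any of the linked posts.

```python
class Participant:
    """Illustrative resource manager: votes in phase 1, obeys the decision in phase 2."""
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes = name, will_vote_yes
    def prepare(self):
        return self.will_vote_yes
    def commit(self):
        print(self.name, "commit")
    def abort(self):
        print(self.name, "abort")

def two_phase_commit(participants):
    """Toy 2PC coordinator; ignores logging, timeouts, and recovery, which real 2PC needs."""
    # Phase 1: collect votes; unanimity is required, unlike a Paxos majority.
    votes = [p.prepare() for p in participants]   # blocks indefinitely if anyone hangs
    decision = all(votes)
    # Phase 2: broadcast the outcome.
    for p in participants:
        (p.commit if decision else p.abort)()
    return decision

two_phase_commit([Participant("db"), Participant("queue")])         # both commit
two_phase_commit([Participant("db"), Participant("queue", False)])  # one no vote aborts all
```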
Byzantine faults, Paxos, and consensus
Paxos is in a sense sort of like RSA. RSA is expensive, so you use it as a wrapper to a cheaper protocol. With Paxos you elect a leader to be in charge of a large-scale event. The coordinator is a single point of failure, so you can use paxos to come up with a new coordinator. Most of the time you don’t run paxos, you only run it when something bad happens to prevent any one thing to be a single point of failure.
Why is memory reclamation so important? HN
You Can’t Sacrifice Partition Tolerance
Fear and Loathing in lock-free programming
Google's Cloud Spanner: how does it stack up?
Spanner was originally built by Google to handle workloads like AdWords and Google Play, that were, according to Google, previously running on massive, manually sharded MySQL implementations
So Paxos is a better, more general version of 2PC, and unlike 3PC, it has been proved correct?
A Comparison of Advanced, Modern Cloud Databases
Concurrent ACID: Whether the database supports ACID (atomicity, consistency, isolation, and durability) guarantees across multiple operations. ACID is a powerful tool for system correctness, and until recently has been a long-sought but elusive chimera for distributed databases. I use the term “concurrent ACID” because technically Cosmos guarantees ACID, but only within the confines of a single operation.
modern techniques can achieve CP while still keeping availability that’s incredibly good. Like five or more 9s of good. This result is so optimal that modern databases seem to be converging on it. Every database on the list above is CP with varying levels of A
Time-based consistency. Sophisticated distributed systems like Spanner and CockroachDB tend to need a little more time to coordinate and verify the accuracy of the results that will be returned from any given node, and this makes them less suitable for low latency operations.
Spanner, TrueTime and the CAP Theorem HN
Why you should pick strong consistency, whenever possible
distributed systems reading list
A History of Transaction Histories
Data Laced with History: Causal Trees & Operational CRDTs
Standing on Distributed Shoulders of Giants
Tweets about distributed transactions
The Four Most Expensive Words in the English Language Proofs of Space
The Horrors of Upgrading Etcd Beneath Kubernetes HN
Complex Event Flows in Distributed Systems
An Illustrated Proof of the CAP Theorem hn
Two Generals and Time Machines
Augmenting Agile with Formal Methods
1 like = 1 distributed systems paper
Diagnosing A Weak Memory Ordering Bug
Mind Your State for Your State of Mind at another blog
Distributed Agreement on Random Order
Exploring Stretch Clusters for Red Hat OpenShift Dedicated
New Multi-AZ clusters can be deployed to AWS Regions that have at least three AZs. This allows for the control plane to be distributed with one node in each AZ (one master and one infra node in each AZ). In the event of an AZ outage, etcd quorum is not lost and the cluster can continue to operate normally.
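The arithmetic behind that claim is just Raft's majority quorum; a tiny sketch:

```python
def quorum(members: int) -> int:
    """etcd (Raft) can only make progress with a strict majority of voting members."""
    return members // 2 + 1

members = 3                        # one control-plane/etcd node per AZ
print(quorum(members))             # 2: losing one AZ still leaves a quorum
print(members - quorum(members))   # 1: only a single simultaneous AZ failure is tolerated
```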
Partitioned consensus and its impact on Spanner’s latency. HN.
An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems.
Introduction to TLA+ Model Checking in the Command Line. HN.
Time to Move on from Two Phase Commit. HN.
Reliable Microservices Data Exchange With the Outbox Pattern.
A Critical Look at Event-Driven Systems
history of protocols distributed systems
Toward Domain-Specific Solvers for Distributed Consistency
Demystifying Database Systems: An Introduction to Transaction Isolation Levels. more on isolation levels. Read committed Snapshot VS Snapshot Isolation Level. follow up.
Real Transactions are Serializable
Lamport about Practical TLA+ hn
Building Robust Systems With ACID and Constraints.
Using Randomized Communication for Robust, Scalable Systems
Panel: The Promises and Perils of Eschewing Distributed Coordination
An explanation of the difference between Isolation levels vs. Consistency levels HN
Distributed consensus reading list great resource!
Gray Failure: The Achilles Heel of Cloud-Scale Systems
Atomic Replication Changes in etcd/Raft
Foundational distributed systems papers
a primer on memory consistency and cache coherence
Distributed Systems, Failures, and Consensus
CIS 505: Software Systems Lecture Note on Consensus
map of consistency models tweet
How to scale a distributed system
Symmetry breaking (with leader election as an example)
Small control plane, big data plane
Health Checks and Graceful Degradation in Distributed Systems
Envoy Service Mesh Case Study: Mitigating Cascading Failure at Lyft
Rate limiting and circuit breaking based on static thresholds and limits can prove to be error-prone and brittle from both correctness and scalability standpoints.
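For reference, this is roughly what a fixed-threshold breaker looks like; a minimal, hypothetical sketch of exactly the kind of static limit the quote calls brittle (the constants are arbitrary):

```python
import time

class CircuitBreaker:
    """Fixed-threshold circuit breaker: open after `max_failures` consecutive errors,
    fail fast while open, and allow one trial call after `reset_after` seconds."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None              # half-open: let one request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                      # any success closes the circuit again
        return result
```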
NewSQL databases fail to guarantee consistency and I blame Spanner. HN.
Consensus Systems with Ethan Buchman
Help! I Accidentally Distributed My System!.
Using TLA+ to Understand Xen Vchan
mapConcurrently-alike with concurrency limit, and ordered job start.
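The linked snippet is Haskell; a rough Python/asyncio analogue of the same idea (the function names here are mine) is a semaphore-bounded map that starts jobs in input order and returns results in input order:

```python
import asyncio

async def map_concurrently(func, items, limit):
    """Apply `func` to each item with at most `limit` calls in flight.
    Tasks are started in input order (the semaphore wakes waiters FIFO) and
    results come back in input order (gather preserves argument order)."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await func(item)

    return await asyncio.gather(*(run(item) for item in items))

async def demo():
    async def square(n):
        await asyncio.sleep(0.01)
        return n * n
    print(await map_concurrently(square, range(8), limit=3))

asyncio.run(demo())
```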
Principles of Distributed Computing (lecture collection)
Distributed Systems W4995-1 Fall 2014 Roxana Geambasu lecture 17
3PC trades safety for liveness, 2PC trades liveness for safety. Paxos is "largely live" and blocks only in exceptional circumstances
Notes on Theory of Distributed Systems CPSC 465/565: Fall 2017
Languages and Abstractions for Distributed Programming CMPS290S, Fall 2018.
Correctness Anomalies Under Serializable Isolation
formal foundations of serverless computing
the huge costs of coordination in our systems
Replication theory and practice
Stumbling over Consensus Research: Misunderstandings and Issues
Implementing Trustworthy Services Using Replicated State Machines
Selected Results from the Latest Decade of Quorum Systems Research
Emily Short - Five Strategies For Collaborating With A Machine [PROCJAM 2016]
Distributed systems theory for the distributed systems engineer
An Overview of Distributed Systems and the Consensus Problem
Our concurrent past, our distributed future
Distributed Systems vs Compositionality—Roland Kuhn
Consistency without consensus in production systems https://www.youtube.com/watch?v=lsKaNDj4TrE
Martin Kleppmann - Conflict Resolution for Eventual Consistency GOTO 2015
Coordination-Free Computations GOTO 2014
How the Bitcoin Protocol Actually Works
Distributed Systems Theory for Practical Engineers
Applied Distributed Research in Apache Cassandra
What Came First: The Ordering of Events in Systems
Four Distributed Systems Architectural Patterns by Tim Berglund
[17:00] If you are a service, like Slack, which is composed of a series of organizations which are not as big...
Distributed Systems in One Lesson by Tim Berglund
Think Twice before Dropping ACID
The Practice and Theory of TLA+ more more
Consensus: Why Can't We All Just Agree?
The Future of Distributed Databases Is Relational
Streaming SQL to Unify Batch & Stream Processing w/ Apache Flink
Cluster Consensus: When Aeron Met Raft
CRDTs and Distributed Consensus
How to Build Observable Distributed Systems
Jepsen 7: Anna Concurrenina by Kyle Kingsbury
JavaZone talk: Transactions and Concurrency Control Patterns
Help! I Accidentally Distributed My System!.
A Raft implementation in Haskell.
Everything about distributed systems is terrible | Code Mesh LDN 18.
Apache Zookeeper As A Building Block For Distributed Systems.
VIDEO: Consensus algorithms, Paxos and Raft.
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59.
High reliability infrastructure migrations
Time, Clocks and Ordering of Events in a Dist. System by Dan Rubenstein [PWL NYC]
Building A "Simple" Distributed System - Formal Verification.
What happens if the server dies after db commit but before saving event to kafka?.
Serializability vs “Strict” Serializability: The Dirty Secret of Database Isolation Levels. HN.
Impossibility of Distributed Consensus with One Faulty Process.
towards language support for distributed systems.
Designing Distributed Systems with TLA+
Lecture 14: Optimistic Concurrency Control and Snapshot Isolation
Patterns for Decoupling in Distributed Systems: Summary Event
Distributed Systems Engineering with Apache Kafka
Correctness Anomalies Under Serializable Isolation
Transactional Outbox pattern - A piece of the eventual consistency puzzle.
Those Who Forget the Past Are Doomed to Repeat It
Strange Loop 2019: "Correctness proofs of distributed systems with Isabelle" by Martin Kleppmann
Distributed consensus, the ability to reach agreement in the face of failures and asynchrony, is a fundamental and powerful primitive for constructing reliable distributed systems from unreliable components. For over two decades, the Paxos algorithm has been synonymous with distributed consensus. Paxos is widely deployed in production systems, yet it is poorly understood and it proves to be heavyweight, unscalable and unreliable in practice.
mergeable replicated datatypes
Designing Distributed Cache - Part I
Correctness proofs of distributed systems with Isabelle
Consistency in Non-Transactional Distributed Storage Systems
CS 144: Introduction to Computer Networking, Fall 2019
My Distributed Systems Seminar's reading list for Spring 2020
A walkthrough tutorial of TLA+ and its tools: analyzing a blocking queue
PigPaxos: Devouring the communication bottlenecks in distributed consensus
AVOID REUSABILITY ACROSS PROCESS BOUNDARIES
Friday afternoon distributed systems thread: quorums and latency
Retries in distributed systems: good and bad parts
"A fault-tolerance shim for serverless computing"
eventual consistency isn't for streaming
How to Build a Highly Available System Using Consensus
One of many timeless works from Butler Lampson. The patterns are still widely in use to this day. The trade offs between lease times and availability are so clearly put, reading this could save a lot of hard earned lessons.
a review of consensus protocols
How you could have come up with Paxos yourself
Helios is Microsoft's high speed stream ingestion and indexing system, but it's more than just that - it's also a reference architecture for their next-generation big data systems
Advanced Join Patterns for the Actor Model Based on CEP Techniques
That's the easy part and typically not entirely what you want.
https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/
https://news.ycombinator.com/item?id=25061901
Fairness in multi-tenant systems
I had to work around the issue by storing a high-water mark in the config ID and then polling an external system to find the expected high-water mark before I knew I could safely read and update.
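A minimal sketch of that workaround, with `read_config` and `get_expected_hwm` as hypothetical stand-ins for the real store and the external system:

```python
import time

def wait_until_caught_up(read_config, get_expected_hwm, timeout=10.0, poll=0.2):
    """Poll until the locally visible config carries a high-water mark at least as
    new as the one the external system reports, then return it; only then is a
    read-modify-write safe."""
    expected = get_expected_hwm()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        config = read_config()
        if config["high_water_mark"] >= expected:
            return config
        time.sleep(poll)
    raise TimeoutError("replica never reached the expected high-water mark")
```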
distributed systems reading list
a list of distributed transaction protocols
scalability comes from avoiding coordination
1. No shared memory. 2. No shared clock.
too much concurrency & lock thrashing
We found and fixed a rare race condition in our session handling
how to test a database - simulation testing. testing distributed systems
Internal Consistency in Streaming Systems
Tail Latency Might Matter More Than You Think
Metastable Failures in Distributed Systems
Getting To Know Logical Clocks By Implementing Them
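For the one-screen version of what such an implementation boils down to, the textbook Lamport clock rule (this is the generic rule, not code from the article):

```python
class LamportClock:
    """Lamport logical clock: every local event or send bumps the counter; every
    receive jumps past the sender's timestamp. Comparing timestamps gives an
    order consistent with the 'happened-before' relation."""
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event or message send
        self.time += 1
        return self.time

    def receive(self, sender_time):  # message receipt
        self.time = max(self.time, sender_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()          # A sends a message stamped 1
t_recv = b.receive(t_send)
print(t_send, t_recv)      # 1 2: the receive is ordered after the send
```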
Scalable but Wasteful or Why Fast Replication Protocols are Actually Slow