Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.
For textbooks, "Introduction to Reliable and Secure Distributed Programming" (Book Site) is a superb introduction to distributed computing from a formal perspective; it's really not about "programming" or "engineering" but about distributed systems fundamentals like consensus, distributed registers, and broadcast. It is used in Berkeley's Distributed Computing course (HT to @lalithsuresh).
Notes from courses like Lorenzo Alvisi's Distributed Computing class can be great.
There are a bunch of classics on causality, Paxos (and more practical takes on Paxos), and distributed snapshots.
Edit: aside from the papers below, Alex Feinberg and Henry Robinson's lists in this Quora post contain a bunch of good, practically oriented but theoretically grounded papers.
Practical databases:
- Consistency in Partitioned Networks PDF ACM A nice, practical discussion of techniques database systems can employ to ensure consistency under partitions. This survey predates CAP by about fifteen years but is well-written and summarizes several important ideas.
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services PDF Megastore gives a reasonable example of a Paxos-based database architecture.
- Consistency Tradeoffs in Modern Distributed Database System Design PDF IEEE is a great paper from Daniel Abadi reminding us that, aside from behavior during failures, highly available ("AP") systems also achieve low latency.
- Many "AP" systems today carry remnants of the Bayou project. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is The Bayou Architecture: Support for Data Sharing among Mobile Users. Also good is Managing update conflicts in Bayou, a weakly connected replicated storage system, which is definitely a more practically oriented paper. Optimistic Replication PDF ACM is a great survey of similar techniques.
More formal stuff:
- Unreliable failure detectors for reliable distributed systems PDF ACM A very theoretical but highly celebrated paper relating the problems of failure detection and consensus; together with The Weakest Failure Detector for Solving Consensus PDF ACM, it makes for a great if tough tutorial on failure detectors (you may be better off reading a textbook).
- (Even better, A short introduction to failure detectors for asynchronous distributed systems PS.GZ ACM)
- The Byzantine Generals Problem PDF ACM introduces the problem of Byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story).