This course is at an undergraduate level, likely situated in the third or fourth year. Students should be familiar with programming language concepts, including recursion, as well as proof techniques, including induction.
- Introduction: Clouds, MapReduce, Key-value stores
- Classical Precursors: Peer-to-peer systems, Grids
- Widely-used algorithms: Gossip, Membership, Paxos
- Classical algorithms: Time and Ordering, Snapshots, Multicast
- Fun: Interviews with leading managers and researchers, from both industry and academia
- Cloud computing, Amazon AWS, Google Compute Engine, Microsoft Azure, etc.; public and private clouds
- What is a cloud -- cluster, supercomputer, datastore, all of the above; massive scale, on-demand
- HaaS (hardware), IaaS, PaaS, SaaS
- Trends -- data-centers, time-sharing, grids, clouds
- Economics of cloud computing
- Terms are borrowed from functional languages
- WordCount application as a MapReduce task (see the sketch below)
- Map processes a large number of input records in parallel
- Reduce merges all intermediate results
- More applications -- Distributed Grep, Reverse Web-Link Graph, URL access frequency, Sort
- YARN scheduler -- Global Resource Manager (RM), per-server Node Manager (NM), per-application Application Master (AM)
- Problems -- server/application failure (fault tolerance), slow servers (stragglers), locality and data replication
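A minimal WordCount sketch in the MapReduce style, written as plain Python rather than against the Hadoop API; `map_fn`, `reduce_fn`, and `word_count` are illustrative names:

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in one input record (a line of text).
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: merge all intermediate counts for one key.
def reduce_fn(word, counts):
    return (word, sum(counts))

def word_count(lines):
    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:                       # map phase (parallel in a real run)
        for word, count in map_fn(line):
            groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())   # reduce phase

print(word_count(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```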
- Multicast protocol for fault tolerance and scalability, centralized or tree-based (build a spanning tree)
- Epidemic multicast, or Gossip -- each node periodically transmits the multicast to b random targets, and nodes that receive it then do the same (see the push-gossip sketch below)
- Push gossip, also pull, and hybrid variants
- Properties -- light-weight, fault-tolerant, can be topology-aware
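A minimal push-gossip sketch for a single node; the `send` transport and the membership list are assumed to exist and are not tied to any specific library:

```python
import random

GOSSIP_FANOUT = 2          # b: number of random targets per period

class GossipNode:
    def __init__(self, node_id, members, send):
        self.node_id = node_id
        self.members = members        # known peers (membership list)
        self.send = send              # send(target, message) -- assumed transport
        self.received = set()         # message ids already seen

    def on_receive(self, msg_id, payload):
        # Deliver and re-gossip only the first time a message is seen.
        if msg_id in self.received:
            return
        self.received.add(msg_id)
        self.gossip(msg_id, payload)

    def gossip(self, msg_id, payload):
        # Push the message to b random other members each period.
        peers = [m for m in self.members if m != self.node_id]
        for target in random.sample(peers, k=min(GOSSIP_FANOUT, len(peers))):
            self.send(target, (msg_id, payload))
```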
- How to deal with failures -- fault detectors, membership protocol to disseminate membership and failures
- Distributed fault detectors, desired properties:
- Completeness -- each failure is detected
- Accuracy -- there is no mistaken detection
- Speed -- time to first detection of failure
- Scale -- equal load on each member, network load
- Impossible to achieve perfect completeness and accuracy at the same time; in practice completeness is guaranteed and accuracy is only partial (probabilistic)
- What is the best failure detector -- how close do real protocols come to the optimal load for a given detection time and accuracy
- SWIM failure detection algorithm -- uses delegation to ask other nodes to ping the target indirectly (sketched below)
- Dissemination options -- multicast, point-to-point, piggyback on pings, "infection" style dissemination
- Suspect a failure before declaring it as such
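A sketch of one SWIM-style protocol period; `ping`, `ping_req`, and `suspect` are hypothetical helpers standing in for the real network plumbing:

```python
import random

K_DELEGATES = 3   # number of members asked to ping the target indirectly

def protocol_period(self_id, members, ping, ping_req, suspect):
    """One SWIM protocol period at one member (a sketch, not the full protocol)."""
    target = random.choice([m for m in members if m != self_id])

    # 1. Direct ping; ping() is assumed to return True if an ack arrives in time.
    if ping(target):
        return

    # 2. Delegation: ask k other members to ping the target on our behalf.
    others = [m for m in members if m not in (self_id, target)]
    delegates = random.sample(others, k=min(K_DELEGATES, len(others)))
    if any(ping_req(d, target) for d in delegates):
        return

    # 3. No direct or indirect ack: suspect the target before declaring failure.
    suspect(target)
```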
- Grid applications and scheduling -- two-level scheduling with inter-site and intra-site protocols
- Grids are federated, no single entity controls the entire infrastructure
- In clouds the focus is on failures, scale, on-demand
- Peer-to-peer systems were the first distributed systems that seriously focused on scalability with respect to the number of nodes
- Napster stores a directory -- filenames with pointers to their locations; no security, and Napster was held responsible for its users' copyright violations
- The centralized server is a single point of failure and a congestion bottleneck
- Gnutella
- Eliminates the servers so clients search and retrieve among themselves
- Five message types -- ping, pong, query, query-hit and push
- Avoids excessive forwarding by remembering recently received messages (duplicates are not re-forwarded)
- Problems -- ping-pong consumes 50% of traffic, repeated searches, free-loaders, flooding
- FastTrack is a hybrid between Napster and Gnutella -- uses supernodes to store file pointers, and any peer can become a supernode provided it has earned enough reputation
- BitTorrent
- Chord -- distributed hash table (DHT); uses a ring of peers to achieve O(log N) memory and O(log N) lookup cost (see the sketch below)
- Peer pointers -- successors and finger table
- Search under peer failure
- Deal with dynamic changes, churn
- Stabilization protocol and rounds
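A small sketch of Chord-style finger tables and greedy lookup on an m-bit identifier ring; the node identifiers and ring size are illustrative:

```python
M = 7                       # identifiers live on a ring of size 2^M
RING = 2 ** M

def successor(node_ids, key):
    """The first node at or clockwise after `key` on the ring."""
    for n in sorted(node_ids):
        if n >= key:
            return n
    return min(node_ids)    # wrap around past zero

def next_node(node_ids, n):
    """n's successor pointer: the node immediately after n."""
    return successor(node_ids, (n + 1) % RING)

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

def finger_table(node_ids, n):
    """Finger i points at successor(n + 2^i): O(log N) routing state per node."""
    return [successor(node_ids, (n + 2 ** i) % RING) for i in range(M)]

def lookup(node_ids, start, key):
    """Greedy routing toward the node responsible for `key`; O(log N) hops w.h.p."""
    n, hops = start, 0
    while not in_interval(key, n, next_node(node_ids, n)):
        # Forward to the closest finger that precedes the key, else to the successor.
        step = next(
            (f for f in reversed(finger_table(node_ids, n))
             if in_interval(f, n, (key - 1) % RING)),
            next_node(node_ids, n),
        )
        n, hops = step, hops + 1
    return next_node(node_ids, n), hops

nodes = [10, 32, 60, 90, 115]
print(lookup(nodes, start=10, key=75))   # (90, 1): node 90 is responsible for key 75
```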
- Pastry -- routing tables based on prefix matching, hops are short (in the underlying network)
- Kelips -- k affinity groups, each node hashed to a group; filename hashed to a group, any node in the group may store the file
- Key-value store -- a distributed dictionary data structure, similar to a DHT; a stripped-down kind of database
- Today's workloads -- lots of random reads and writes, joins infrequent, foreign keys rarely needed, need to scale out not scale up
- NoSQL => Not Only SQL, operations like get and put, may not have schema, don't always support joins or have foreign keys
- NoSQL systems often use column-oriented storage
- Cassandra uses a ring-based DHT but without finger tables or routing, replication strategy is SimpleStrategy or NetworkTopologyStrategy for multi-DC deployments
- Snitch maps IPs to racks and DCs -- options include a config-file map (PropertyFileSnitch), SimpleSnitch (topology-unaware), RackInferringSnitch, and EC2Snitch
- Writes go to a Coordinator (chosen per-key, per-client, or per-query) that uses the Partitioner to send the query to all replica nodes responsible for the key
- When a replica is down, the Coordinator writes to all other replicas and keeps the write locally (hinted handoff) until the replica comes back up
- Writes on replica nodes go to the commit log, then to the memtable -- a write-back cache as opposed to write-through; when full, memtables are flushed to an SSTable (Sorted String Table); also uses an index file and a Bloom filter; SSTables and logs need to be compacted (see the toy write-path sketch below)
- Deletes don't remove items right away; instead a tombstone is written, and the item is removed later during compaction
- Reads are similar to writes -- they go through a coordinator, but may have to touch multiple SSTables, so reads are typically slower than writes
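A toy model of the write path above (commit log, write-back memtable, flush to immutable sorted runs, deletes as tombstones); real Cassandra internals differ in many details:

```python
TOMBSTONE = object()          # marker written instead of deleting in place

class ToyStore:
    def __init__(self, memtable_limit=4):
        self.commit_log = []               # append-only log, written first
        self.memtable = {}                 # in-memory write-back cache
        self.sstables = []                 # immutable sorted runs on "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)           # tombstone, removed later by compaction

    def flush(self):
        # The memtable becomes an immutable SSTable, sorted by key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        value = self.memtable.get(key)
        if value is None:
            for table in reversed(self.sstables):
                hit = dict(table).get(key)
                if hit is not None:
                    value = hit
                    break
        return None if value is TOMBSTONE else value
```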
- CAP Theorem -- distributed systems can satisfy at most 2 out of 3 guarantees -- consistency, availability, partition tolerance
- NoSQL databases provide eventual (weak) consistency, while a traditional RDBMS favors strong consistency over availability under a partition
- Cassandra consistency levels -- ANY, ALL, ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM
- Newer consistency levels -- causal consistency, per-key sequential, red-blue consistency, sequential consistency
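The rule behind quorum-based levels: a write quorum W and a read quorum R over N replicas always intersect when W + R > N, so a quorum read sees the latest quorum-acknowledged write. A tiny check with illustrative numbers:

```python
def quorums_overlap(n_replicas, write_quorum, read_quorum):
    """True if any read quorum must intersect any write quorum."""
    return write_quorum + read_quorum > n_replicas

# N = 3 with QUORUM reads and writes (2 each): quorums overlap.
print(quorums_overlap(3, 2, 2))   # True
# N = 3 with ONE for both reads and writes: a read may miss the latest write.
print(quorums_overlap(3, 1, 1))   # False
```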
- HBase -- open-source implementation of Google's BigTable design (Yahoo open-sourced it; now Apache HBase)
- Time synchronization is required for correctness and fairness; processes in Internet-based systems follow an asynchronous system model
- Clock skew -- relative difference between clock values; clock drift -- relative difference between clock frequencies
- Need to synchronize any pair of clocks at least once every: (maximum acceptable skew) / (2 × maximum drift rate)
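A worked instance of the resynchronization bound above; the numbers are illustrative:

```python
# Keep any two clocks within max_skew, given that each clock may drift from
# real time at up to max_drift_rate (so two clocks diverge at up to 2x that rate).
max_skew = 0.1          # acceptable skew in seconds (100 ms)
max_drift_rate = 1e-4   # seconds of drift per second of real time

sync_period = max_skew / (2 * max_drift_rate)
print(sync_period)      # 500.0 seconds between synchronizations
```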
- External synchronization -- Cristian's Algorithm and NTP; synchronization is allowed to increase but should never decrease a clock's value; the error can never be zero; can we avoid synchronizing clocks altogether and still be able to order events?
- Instead assign timestamps that are not absolute time -- as long as they obey causality that should work
- Logical ordering proposed by Lamport in the 1970s, used in almost all distributed systems today
- Logical relation Happens-Before defines partial ordering, not all events are related to each other
- A process increments its counter on each send or local instruction; on receive, the counter is set to the maximum of the local and received timestamps and then incremented
- Lamport timestamps not guaranteed to be ordered or unequal for concurrent events
- Vector timestamps -- on a send or instruction event at process i, only the i-th element of the vector clock is incremented; on receive, the element-wise maximum with the received vector is taken and then the i-th element is incremented (both clock rules are sketched below)
- VT1 = VT2, VT1 <= VT2, VT1 < VT2 (causally related), or NOT (VT1 <= VT2) and NOT (VT2 <= VT1) (concurrent)
- Obeys causality, uses more space but can also identify concurrent events
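A compact sketch of both clock rules (Lamport scalar clocks and vector clocks) for a fixed set of N processes:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                      # local instruction or send event
        self.time += 1
        return self.time

    def on_receive(self, msg_time):      # max with sender's timestamp, then increment
        self.time = max(self.time, msg_time) + 1
        return self.time


class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n                 # one entry per process
        self.i = i                       # this process's index

    def tick(self):                      # local or send event: bump own entry only
        self.v[self.i] += 1
        return list(self.v)

    def on_receive(self, msg_v):         # element-wise max, then bump own entry
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.v[self.i] += 1
        return list(self.v)


def causally_before(v1, v2):
    """v1 < v2: v1 <= v2 element-wise and strictly smaller in at least one entry."""
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return not causally_before(v1, v2) and not causally_before(v2, v1)
```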
- Global snapshot = Global State = individual state of each process + individual state of each communication channel in the distributed system
- Naive solution -- synchronize all clocks and ask processes to record their states at a known time
- Time synchronization always has errors
- Does not capture state of messages in the channels
- Synchronization is not required, causality is enough
- System model -- N processes, two unidirectional channels between each ordered process pair, channels are FIFO, no failures, no transmission errors (corruption or duplication)
- Snapshots should not interfere with the normal application actions
- Chandy-Lamport global snapshot algorithm uses "marker" messages (marker handling is sketched after these bullets)
- Consistent cuts
- Cut is a time frontier at each process and at each channel
- Events that happen before the cut are in the cut, otherwise out of the cut
- Consistent iff for each pair of events e,f if e is in the cut and f -> e, then f is also in the cut
- Any run of the Chandy-Lamport global snapshot algorithm creates a consistent cut
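A sketch of the marker-handling rule at one process; FIFO channels are assumed, and `send` and `record_own_state` are hypothetical hooks for the transport and local state capture:

```python
class SnapshotProcess:
    """Marker handling at one process in a Chandy-Lamport-style snapshot."""
    def __init__(self, incoming_channels, outgoing_channels, send, record_own_state):
        self.incoming = list(incoming_channels)
        self.outgoing = list(outgoing_channels)
        self.send = send                          # send(channel, message) -- assumed
        self.record_own_state = record_own_state  # captures local process state
        self.recorded = False
        self.recording = set()                    # incoming channels still being recorded
        self.channel_state = {}                   # channel -> messages caught in transit

    def initiate(self):
        # The snapshot initiator records itself and sends markers first.
        self._record_and_send_markers()

    def on_message(self, channel, msg):
        if msg == "MARKER":
            if not self.recorded:
                self._record_and_send_markers()
            # A marker closes recording on the channel it arrived on.
            self.recording.discard(channel)
        elif self.recorded and channel in self.recording:
            # Application message caught "in the channel" belongs to the snapshot.
            self.channel_state[channel].append(msg)

    def _record_and_send_markers(self):
        self.recorded = True
        self.record_own_state()
        self.recording = set(self.incoming)
        self.channel_state = {ch: [] for ch in self.incoming}
        for ch in self.outgoing:
            self.send(ch, "MARKER")

    def done(self):
        # The local part of the snapshot is complete once all incoming channels are closed.
        return self.recorded and not self.recording
```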
- Correctness in distributed systems:
- Liveness -- something good will happen, eventually
- Safety -- something bad will never happen
- Difficult to satisfy both liveness and safety in an asynchronous distributed system
- Multicast problem definition (unicast, broadcast, multicast), applications; 3 ordering protocols -- FIFO, causal, and total ordering
- FIFO ordering -- multicasts from each sender are received in the order they are sent
- Causal ordering -- multicasts whose send events are causally related must be received in the same causality-obeying order at all receivers
- Causal ordering => FIFO ordering, but the reverse is not true
- Total ordering -- ensures all processes receive all multicasts in the same order
- Implementation
- FIFO -- each receiver maintains a per-sender sequence number (see the sketch below)
- Total -- one process elected as leader or sequencer that maintains the order
- Causal -- each receiver maintains a vector of per-sender sequence numbers
- Reliability of multicasting is orthogonal to ordering
- To reliably send multicasts -- sender sequentially sends a reliable unicast message to all group recipients, and each receiver also sequentially sends the message to all group processes; not efficient but reliable
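A sketch of FIFO-ordered delivery at one receiver: per-sender sequence numbers, with out-of-order messages buffered until their predecessors arrive; the `deliver` callback is assumed:

```python
from collections import defaultdict

class FifoReceiver:
    def __init__(self, deliver):
        self.deliver = deliver                      # application-level delivery callback
        self.next_seq = defaultdict(int)            # next expected seq per sender
        self.buffer = defaultdict(dict)             # sender -> {seq: message}

    def on_multicast(self, sender, seq, msg):
        self.buffer[sender][seq] = msg
        # Deliver as many in-order messages from this sender as possible.
        while self.next_seq[sender] in self.buffer[sender]:
            expected = self.next_seq[sender]
            self.deliver(sender, self.buffer[sender].pop(expected))
            self.next_seq[sender] += 1

r = FifoReceiver(deliver=lambda sender, msg: print(sender, msg))
r.on_multicast("p1", 1, "second")   # buffered: seq 0 from p1 not delivered yet
r.on_multicast("p1", 0, "first")    # delivers "first", then "second"
```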
- Virtual Synchrony combines a membership protocol with a multicast protocol; it cannot implement consensus because VSync groups are susceptible to partitioning
- Problem -- a group of processes attempting to elect a leader, ensure mutually exclusive access to a resource, receive updates in the same order; coordinate and reach agreement
- What is consensus -- N processes, each process has input variable xp and output variable yp, design a protocol so that at the end all processes set yp to all 0s or all 1s
- Constraints -- validity, integrity, non-triviality
- Two models of distributed systems
- Synchronous -- each message is delivered within bounded time, local clock drift has a known bound, and each process step takes bounded time
- Asynchronous -- no bounds for process execution, drift rate of a clock is arbitrary, no bounds on message transmission delays
- In the synchronous system model consensus is solvable
- For a system with at most f processes that may crash the algorithm proceeds in f+1 rounds (with timeout), using reliable communication for all members
- In the asynchronous model consensus is impossible to solve -- one cannot distinguish a failed process from one that is merely very slow
- Paxos does not solve the consensus problem but provides safety and eventual liveness
- Paxos has rounds, each with a unique ballot id
- Rounds are asynchronous, time synchronization is not required
- Each round is broken into phases -- leader election, leader proposes a value and processes acknowledge (bill), leader multicasts final value (law)
- Point of no return -- once a majority of processes has accepted the leader's proposed value, the decided value can no longer change (see the acceptor sketch below)
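A much-simplified sketch of the acceptor side of a Paxos round (single-decree, with message plumbing and leader logic omitted); it is meant only to illustrate the promise/accept phases described above:

```python
class Acceptor:
    """Acceptor state for single-decree Paxos (simplified sketch)."""
    def __init__(self):
        self.promised_ballot = -1     # highest ballot we promised not to undercut
        self.accepted_ballot = -1     # ballot of the value we last accepted
        self.accepted_value = None

    def on_prepare(self, ballot):
        # Phase 1 (leader election / prepare): promise to ignore lower ballots,
        # and report any value already accepted so the leader must re-propose it.
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return ("promise", ballot, self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised_ballot)

    def on_accept(self, ballot, value):
        # Phase 2 (proposal / "bill"): accept unless a higher ballot was promised.
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("accepted", ballot)
        return ("reject", self.promised_ballot)

# Once a majority of acceptors answer "accepted" for a ballot, the value is chosen
# (the point of no return); the leader then multicasts the final value ("law").
```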
- Fischer, Lynch, and Paterson, 1983 (FLP) impossibility proof