This course is at an undergraduate level, likely situated in the third or fourth year. Students should be familiar with programming language concepts, including recursion, as well as proof techniques, including induction.
- Introduction: Clouds, MapReduce, Key-value stores
- Classical Precursors: Peer-to-peer systems, Grids
- Widely-used algorithms: Gossip, Membership, Paxos
- Classical algorithms: Time and Ordering, Snapshots, Multicast
- Fun: Interviews with leading managers and researchers, from both industry and academia
- Cloud computing, Amazon AWS, Google Compute Engine, Microsoft Azure, etc.; public and private clouds
- What is a cloud -- cluster, supercomputer, datastore, all of the above; massive scale, on-demand
- HaaS (hardware), IaaS, PaaS, SaaS
- Trends -- data-centers, time-sharing, grids, clouds
- Economics of cloud computing
- Terms are borrowed from functional languages
- WordCount application as a MapReduce task (see the sketch below)
- Map processes a large number of input records in parallel
- Reduce merges all intermediate results
- More applications -- Distributed Grep, Reverse Web-Link Graph, URL access frequency, Sort
- YARN scheduler -- Global Resource Manager (RM), per-server Node Manager (NM), per-application Application Master (AM)
- Problems -- server/application failure (fault tolerance), slow servers (stragglers), locality and data replication
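A minimal WordCount sketch in the MapReduce style, written as plain Python rather than against the Hadoop API; `map_fn`, `reduce_fn`, and `word_count` are illustrative names:

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in one input record (a line of text).
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: merge all intermediate counts for one key.
def reduce_fn(word, counts):
    return (word, sum(counts))

def word_count(lines):
    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:                       # map phase (parallel in a real run)
        for word, count in map_fn(line):
            groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())   # reduce phase

print(word_count(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```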
- Multicast protocol for fault tolerance and scalability, centralized or tree-based (build a spanning tree)
- Epidemic multicast, or Gossip -- each node periodically transmits the multicast to b random targets, and nodes that receive it then do the same (see the push-gossip sketch below)
- Push gossip, also pull, and hybrid variants
- Properties -- light-weight, fault-tolerant, can be topology-aware
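A minimal push-gossip sketch for a single node; the `send` transport and the membership list are assumed to exist and are not tied to any specific library:

```python
import random

GOSSIP_FANOUT = 2          # b: number of random targets per period

class GossipNode:
    def __init__(self, node_id, members, send):
        self.node_id = node_id
        self.members = members        # known peers (membership list)
        self.send = send              # send(target, message) -- assumed transport
        self.received = set()         # message ids already seen

    def on_receive(self, msg_id, payload):
        # Deliver and re-gossip only the first time a message is seen.
        if msg_id in self.received:
            return
        self.received.add(msg_id)
        self.gossip(msg_id, payload)

    def gossip(self, msg_id, payload):
        # Push the message to b random other members each period.
        peers = [m for m in self.members if m != self.node_id]
        for target in random.sample(peers, k=min(GOSSIP_FANOUT, len(peers))):
            self.send(target, (msg_id, payload))
```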
- How to deal with failures -- fault detectors, membership protocol to disseminate membership and failures
- Distributed fault detectors, desired properties:
- Completeness -- each failure is detected
- Accuracy -- there is no mistaken detection
- Speed -- time to first detection of failure
- Scale -- equal load on each member, network load
- Impossible to achieve perfect completeness and accuracy at the same time; in practice completeness is guaranteed and accuracy is only partial (probabilistic)
- What is the best failure detector -- how close do real protocols come to the optimal load for a given detection time and accuracy
- SWIM failure detection algorithm -- uses delegation to ask other nodes to ping the target indirectly (sketched below)
- Dissemination options -- multicast, point-to-point, piggyback on pings, "infection" style dissemination
- Suspect a failure before declaring it as such
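A sketch of one SWIM-style protocol period; `ping`, `ping_req`, and `suspect` are hypothetical helpers standing in for the real network plumbing:

```python
import random

K_DELEGATES = 3   # number of members asked to ping the target indirectly

def protocol_period(self_id, members, ping, ping_req, suspect):
    """One SWIM protocol period at one member (a sketch, not the full protocol)."""
    target = random.choice([m for m in members if m != self_id])

    # 1. Direct ping; ping() is assumed to return True if an ack arrives in time.
    if ping(target):
        return

    # 2. Delegation: ask k other members to ping the target on our behalf.
    others = [m for m in members if m not in (self_id, target)]
    delegates = random.sample(others, k=min(K_DELEGATES, len(others)))
    if any(ping_req(d, target) for d in delegates):
        return

    # 3. No direct or indirect ack: suspect the target before declaring failure.
    suspect(target)
```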
- Grid applications and scheduling -- two-level scheduling with inter-site and intra-site protocols
- Grids are federated, no single entity controls the entire infrastructure
- In clouds the focus is on failures, scale, on-demand
- Peer-to-peer systems were the first distributed systems that seriously focused on scalability with respect to the number of nodes
- Napster stores a directory -- filenames with pointers to their locations; no security, and Napster was held responsible for its users' copyright violations
- The centralized server is a single point of failure and a congestion bottleneck
- Gnutella
- Eliminates the servers so clients search and retrieve among themselves
- Five message types -- ping, pong, query, query-hit and push
- Avoids excessive forwarding by remembering recently received messages (duplicates are not re-forwarded)
- Problems -- ping-pong consumes 50% of traffic, repeated searches, free-loaders, flooding
- FastTrack is a hybrid between Napster and Gnutella -- uses supernodes to store file pointers, and any peer can become a supernode provided it has earned enough reputation
- BitTorrent
- Chord -- distributed hash table (DHT); uses a ring of peers to achieve O(log N) memory and O(log N) lookup cost (see the sketch below)
- Peer pointers -- successors and finger table
- Search under peer failure
- Deal with dynamic changes, churn
- Stabilization protocol and rounds
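A small sketch of Chord-style finger tables and greedy lookup on an m-bit identifier ring; the node identifiers and ring size are illustrative:

```python
M = 7                       # identifiers live on a ring of size 2^M
RING = 2 ** M

def successor(node_ids, key):
    """The first node at or clockwise after `key` on the ring."""
    for n in sorted(node_ids):
        if n >= key:
            return n
    return min(node_ids)    # wrap around past zero

def next_node(node_ids, n):
    """n's successor pointer: the node immediately after n."""
    return successor(node_ids, (n + 1) % RING)

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

def finger_table(node_ids, n):
    """Finger i points at successor(n + 2^i): O(log N) routing state per node."""
    return [successor(node_ids, (n + 2 ** i) % RING) for i in range(M)]

def lookup(node_ids, start, key):
    """Greedy routing toward the node responsible for `key`; O(log N) hops w.h.p."""
    n, hops = start, 0
    while not in_interval(key, n, next_node(node_ids, n)):
        # Forward to the closest finger that precedes the key, else to the successor.
        step = next(
            (f for f in reversed(finger_table(node_ids, n))
             if in_interval(f, n, (key - 1) % RING)),
            next_node(node_ids, n),
        )
        n, hops = step, hops + 1
    return next_node(node_ids, n), hops

nodes = [10, 32, 60, 90, 115]
print(lookup(nodes, start=10, key=75))   # (90, 1): node 90 is responsible for key 75
```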
- Pastry -- routing tables based on prefix matching, hops are short (in the underlying network)
- Kelips -- k affinity groups, each node hashed to a group; filename hashed to a group, any node in the group may store the file
- Key-value store -- a distributed dictionary data structure, similar to a DHT; a stripped-down kind of database
- Today's workloads -- lots of random reads and writes, joins infrequent, foreign keys rarely needed, need to scale out not scale up
- NoSQL => Not Only SQL, operations like get and put, may not have schema, don't always support joins or have foreign keys
- NoSQL systems often use column-oriented storage
- Cassandra uses a ring-based DHT but without finger tables or routing, replication strategy is SimpleStrategy or NetworkTopologyStrategy for multi-DC deployments
- Snitch maps IPs to racks and DCs -- options include a config-file map (PropertyFileSnitch), SimpleSnitch (topology-unaware), RackInferringSnitch, and EC2Snitch
- Writes go to a Coordinator (chosen per-key, per-client, or per-query) that uses the Partitioner to send the query to all replica nodes responsible for the key
- When a replica is down, the Coordinator writes to all other replicas and keeps the write locally (hinted handoff) until the replica comes back up
- Writes on replica nodes go to the commit log, then to the memtable -- a write-back cache as opposed to write-through; when full, memtables are flushed to an SSTable (Sorted String Table); also uses an index file and a Bloom filter; SSTables and logs need to be compacted (see the toy write-path sketch below)
- Deletes don't remove items right away; instead a tombstone is written, and the item is removed later during compaction
- Reads are similar to writes -- they go through a coordinator, but may have to touch multiple SSTables, so reads are typically slower than writes
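A toy model of the write path above (commit log, write-back memtable, flush to immutable sorted runs, deletes as tombstones); real Cassandra internals differ in many details:

```python
TOMBSTONE = object()          # marker written instead of deleting in place

class ToyStore:
    def __init__(self, memtable_limit=4):
        self.commit_log = []               # append-only log, written first
        self.memtable = {}                 # in-memory write-back cache
        self.sstables = []                 # immutable sorted runs on "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)           # tombstone, removed later by compaction

    def flush(self):
        # The memtable becomes an immutable SSTable, sorted by key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        value = self.memtable.get(key)
        if value is None:
            for table in reversed(self.sstables):
                hit = dict(table).get(key)
                if hit is not None:
                    value = hit
                    break
        return None if value is TOMBSTONE else value
```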
- CAP Theorem -- distributed systems can satisfy at most 2 out of 3 guarantees -- consistency, availability, partition tolerance
- NoSQL databases provide eventual (weak) consistency, while a traditional RDBMS favors strong consistency over availability under a partition
- Cassandra consistency levels -- ANY, ALL, ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM
- Newer consistency levels -- causal consistency, per-key sequential, red-blue consistency, sequential consistency
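The rule behind quorum-based levels: a write quorum W and a read quorum R over N replicas always intersect when W + R > N, so a quorum read sees the latest quorum-acknowledged write. A tiny check with illustrative numbers:

```python
def quorums_overlap(n_replicas, write_quorum, read_quorum):
    """True if any read quorum must intersect any write quorum."""
    return write_quorum + read_quorum > n_replicas

# N = 3 with QUORUM reads and writes (2 each): quorums overlap.
print(quorums_overlap(3, 2, 2))   # True
# N = 3 with ONE for both reads and writes: a read may miss the latest write.
print(quorums_overlap(3, 1, 1))   # False
```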
- HBase -- open-source implementation of Google's BigTable design (Yahoo open-sourced it; now Apache HBase)
- Time synchronization is required for correctness and fairness; processes in Internet-based systems follow an asynchronous system model
- Clock skew -- relative difference between clock values; clock drift -- relative difference between clock frequencies
- Need to synchronize any pair of clocks at least once every: (maximum acceptable skew) / (2 × maximum drift rate)
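A worked instance of the resynchronization bound above; the numbers are illustrative:

```python
# Keep any two clocks within max_skew, given that each clock may drift from
# real time at up to max_drift_rate (so two clocks diverge at up to 2x that rate).
max_skew = 0.1          # acceptable skew in seconds (100 ms)
max_drift_rate = 1e-4   # seconds of drift per second of real time

sync_period = max_skew / (2 * max_drift_rate)
print(sync_period)      # 500.0 seconds between synchronizations
```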
- External synchronization -- Cristian's Algorithm and NTP; synchronization is allowed to increase but should never decrease a clock's value; the error can never be zero; can we avoid synchronizing clocks altogether and still be able to order events?
- Instead assign timestamps that are not absolute time -- as long as they obey causality that should work
- Logical ordering proposed by Lamport in the 1970s, used in almost all distributed systems today
- Logical relation Happens-Before defines partial ordering, not all events are related to each other
- A process increments its counter on each send or local instruction; on receive, the counter is set to the maximum of the local and received timestamps and then incremented
- Lamport timestamps not guaranteed to be ordered or unequal for concurrent events
- Vector timestamps -- on a send or instruction event at process i, only the i-th element of the vector clock is incremented; on receive, the element-wise maximum with the received vector is taken and then the i-th element is incremented (both clock rules are sketched below)
- VT1 = VT2, VT1 <= VT2, VT1 < VT2 (causally related), or NOT (VT1 <= VT2) and NOT (VT2 <= VT1) (concurrent)
- Obeys causality, uses more space but can also identify concurrent events
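A compact sketch of both clock rules (Lamport scalar clocks and vector clocks) for a fixed set of N processes:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                      # local instruction or send event
        self.time += 1
        return self.time

    def on_receive(self, msg_time):      # max with sender's timestamp, then increment
        self.time = max(self.time, msg_time) + 1
        return self.time


class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n                 # one entry per process
        self.i = i                       # this process's index

    def tick(self):                      # local or send event: bump own entry only
        self.v[self.i] += 1
        return list(self.v)

    def on_receive(self, msg_v):         # element-wise max, then bump own entry
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.v[self.i] += 1
        return list(self.v)


def causally_before(v1, v2):
    """v1 < v2: v1 <= v2 element-wise and strictly smaller in at least one entry."""
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return not causally_before(v1, v2) and not causally_before(v2, v1)
```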
- Global snapshot = Global State = individual state of each process + individual state of each communication channel in the distributed system
- Naive solution -- synchronize all clocks and ask processes to record their states at a known time
- Time synchronization always has errors
- Does not capture state of messages in the channels
- Synchronization is not required, causality is enough
- System model -- N processes, two unidirectional channels between each ordered process pair, channels are FIFO, no failures, no transmission errors (corruption or duplication)
- Snapshots should not interfere with the normal application actions
- Chandy-Lamport global snapshot algorithm uses "marker" messages (marker handling is sketched after these bullets)
- Consistent cuts
- Cut is a time frontier at each process and at each channel
- Events that happen before the cut are in the cut, otherwise out of the cut
- Consistent iff for each pair of events e,f if e is in the cut and f -> e, then f is also in the cut
- Any run of the Chandy-Lamport global snapshot algorithm creates a consistent cut
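A sketch of the marker-handling rule at one process; FIFO channels are assumed, and `send` and `record_own_state` are hypothetical hooks for the transport and local state capture:

```python
class SnapshotProcess:
    """Marker handling at one process in a Chandy-Lamport-style snapshot."""
    def __init__(self, incoming_channels, outgoing_channels, send, record_own_state):
        self.incoming = list(incoming_channels)
        self.outgoing = list(outgoing_channels)
        self.send = send                          # send(channel, message) -- assumed
        self.record_own_state = record_own_state  # captures local process state
        self.recorded = False
        self.recording = set()                    # incoming channels still being recorded
        self.channel_state = {}                   # channel -> messages caught in transit

    def initiate(self):
        # The snapshot initiator records itself and sends markers first.
        self._record_and_send_markers()

    def on_message(self, channel, msg):
        if msg == "MARKER":
            if not self.recorded:
                self._record_and_send_markers()
            # A marker closes recording on the channel it arrived on.
            self.recording.discard(channel)
        elif self.recorded and channel in self.recording:
            # Application message caught "in the channel" belongs to the snapshot.
            self.channel_state[channel].append(msg)

    def _record_and_send_markers(self):
        self.recorded = True
        self.record_own_state()
        self.recording = set(self.incoming)
        self.channel_state = {ch: [] for ch in self.incoming}
        for ch in self.outgoing:
            self.send(ch, "MARKER")

    def done(self):
        # The local part of the snapshot is complete once all incoming channels are closed.
        return self.recorded and not self.recording
```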
- Correctness in distributed systems:
- Liveness -- something good will happen, eventually
- Safety -- something bad will never happen
- Difficult to satisfy both liveness and safety in an asynchronous distributed system
- Multicast problem definition (unicast, broadcast, multicast), applications; 3 ordering protocols -- FIFO, causal, and total ordering
- FIFO ordering -- multicasts from each sender are received in the order they are sent
- Causal ordering -- multicasts whose send events are causally related must be received in the same causality-obeying order at all receivers
- Causal ordering => FIFO ordering, but the reverse is not true
- Total ordering -- ensures all processes receive all multicasts in the same order
- Implementation
- FIFO -- each receiver maintains a per-sender sequence number (see the sketch below)
- Total -- one process elected as leader or sequencer that maintains the order
- Causal -- each receiver maintains a vector of per-sender sequence numbers
- Reliability of multicasting is orthogonal to ordering
- To reliably send multicasts -- sender sequentially sends a reliable unicast message to all group recipients, and each receiver also sequentially sends the message to all group processes; not efficient but reliable
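A sketch of FIFO-ordered delivery at one receiver: per-sender sequence numbers, with out-of-order messages buffered until their predecessors arrive; the `deliver` callback is assumed:

```python
from collections import defaultdict

class FifoReceiver:
    def __init__(self, deliver):
        self.deliver = deliver                      # application-level delivery callback
        self.next_seq = defaultdict(int)            # next expected seq per sender
        self.buffer = defaultdict(dict)             # sender -> {seq: message}

    def on_multicast(self, sender, seq, msg):
        self.buffer[sender][seq] = msg
        # Deliver as many in-order messages from this sender as possible.
        while self.next_seq[sender] in self.buffer[sender]:
            expected = self.next_seq[sender]
            self.deliver(sender, self.buffer[sender].pop(expected))
            self.next_seq[sender] += 1

r = FifoReceiver(deliver=lambda sender, msg: print(sender, msg))
r.on_multicast("p1", 1, "second")   # buffered: seq 0 from p1 not delivered yet
r.on_multicast("p1", 0, "first")    # delivers "first", then "second"
```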
- Virtual Synchrony combines a membership protocol with a multicast protocol; it cannot implement consensus because VSync groups are susceptible to partitioning
- Problem -- a group of processes attempting to elect a leader, ensure mutually exclusive access to a resource, receive updates in the same order; coordinate and reach agreement
- What is consensus -- N processes, each process has input variable xp and output variable yp, design a protocol so that at the end all processes set yp to all 0s or all 1s
- Constraints -- validity, integrity, non-triviality
- Two models of distributed systems
- Synchronous -- each message is delivered within bounded time, local clock drift has a known bound, and each process step takes bounded time
- Asynchronous -- no bounds for process execution, drift rate of a clock is arbitrary, no bounds on message transmission delays
- In the synchronous system model consensus is solvable
- For a system with at most f processes that may crash the algorithm proceeds in f+1 rounds (with timeout), using reliable communication for all members
- In the asynchronous model consensus is impossible to solve -- one cannot distinguish a failed process from one that is merely very slow
- Paxos does not solve the consensus problem but provides safety and eventual liveness
- Paxos has rounds, each with a unique ballot id
- Rounds are asynchronous, time synchronization is not required
- Each round is broken into phases -- leader election, leader proposes a value and processes acknowledge (bill), leader multicasts final value (law)
- Point of no return -- once a majority of processes has accepted the leader's proposed value, the decided value can no longer change (see the acceptor sketch below)
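A much-simplified sketch of the acceptor side of a Paxos round (single-decree, with message plumbing and leader logic omitted); it is meant only to illustrate the promise/accept phases described above:

```python
class Acceptor:
    """Acceptor state for single-decree Paxos (simplified sketch)."""
    def __init__(self):
        self.promised_ballot = -1     # highest ballot we promised not to undercut
        self.accepted_ballot = -1     # ballot of the value we last accepted
        self.accepted_value = None

    def on_prepare(self, ballot):
        # Phase 1 (leader election / prepare): promise to ignore lower ballots,
        # and report any value already accepted so the leader must re-propose it.
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return ("promise", ballot, self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised_ballot)

    def on_accept(self, ballot, value):
        # Phase 2 (proposal / "bill"): accept unless a higher ballot was promised.
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("accepted", ballot)
        return ("reject", self.promised_ballot)

# Once a majority of acceptors answer "accepted" for a ballot, the value is chosen
# (the point of no return); the leader then multicasts the final value ("law").
```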
- Fischer, Lynch, and Paterson, 1983 (FLP) impossibility proof