data-intensive apps are built from standard building blocks:
- db - store data to find it again later
- caches - remember res of expensive calculation
- search indexes - allow users to search via keyword or other ways
- stream processing - send a msg to another process to be handled async
- batch processing - periodically crunch a bunch of accumulated data.
- how do you make sure data remains correct and complete, even when things go wrong internally?
- how do you provide consistently good performance, even when parts of the system are degraded?
- how do you handle an increase in load?
The tenets
- Reliability - should still work as-expected in the face of adversity (software/hardware faults + human error)
- Scalability - should be able to deal with growth as-expected
- Maintainability - everyone should be able to work on the system productively
faults - things that can go wrong...a system that can cope w this is resilient or 'fault-tolerant'. a fault's not the same thing as failure:
- fault = one component of the system deviates from its spec
- failure = the system as a whole stops providing the required service (you're screwed)
Faults are gonna happen, so you should design a system to allow it to keep going even when faults inevitably occur.
- first response = when a phys piece of hardware goes down, just add more of the same thing to the system (aka redundancy)...this usually works. however, as we've got way more data/demands on our machines, hardware faults become more and more common (+ VM services like AWS have VMs go down regularly, since they prioritize system flexibility over single-machine reliability). this means the move is to handle hardware faults not just via hardware redundancy but at the software level too.
- by doing so, you can route traffic to another machine while one machine has say, planned downtime/maintenance
human error is one of the biggest sources of faults...the best systems cope w it like this:
- design systems to minimize opps for err - ex: well-designed abstractions, APIs that make it easy for ppl to 'do the right thing'...but this is more an art than a science
- decouple places where ppl make most mistakes from places where they shouldn't - aka 'sandbox' envs...places where ppl can experiment, test, etc outside of prod (local/staging/dev/etc)
- testing - unit/int/e2e
- allow quick & easy recovery from errs - rollbacks for config changes, incremental additions via CI, Github rollbacks, etc
- monitoring - performance metrics and error rates
- good management in general
coping w increased load...but how do you describe load?
load parameters - actual terms of load, ex: requests per second, ratio of reads/writes to db, cache hit rate, etc. (ex on p.12)
once you understand the load on your system, you can understand what happens when it increases in 2 ways:
- when you increase a load param and keep the existing system resources unchanged (CPU, memory, network bandwidth) how is the performance affected?
- when you increase a load param, how much do you need to increase the resources so that the performance is unchanged?
performance description examples
- batch processing system - we care about throughput aka the amount of stuff that can be processed per second
- most web services - we care about client/server req/response time
Latency and response time = lumped together, but NOT the same!
- latency = duration that a req is waiting to be handled, during which it is latent or awaiting service
- response time = what the client actually sees: the time to process the req (service time) plus network delays and queueing delays
How do we measure response times? If you're thinking of just grabbing a ton of req/res and finding the average, that's not the best idea. Outliers skew averages and aren't representative of a user's actual experience. Instead, line them up from lowest-highest response times and calculate percentiles, starting with the median (aka the 50th percentile)...this shows what response time is most likely for users.
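quick toy sketch of the nearest-rank percentile idea (my own made-up numbers, not from the book):

```python
import math

def percentile(response_times_ms, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(response_times_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))   # smallest rank covering p% of samples
    return ordered[rank - 1]

samples = [30, 31, 33, 35, 38, 42, 45, 60, 95, 1200]   # made-up response times in ms
print(sum(samples) / len(samples))   # mean ~160.9 ms, badly skewed by the one slow outlier
print(percentile(samples, 50))       # 38 ms: the median, what a typical request sees
print(percentile(samples, 99))       # 1200 ms: the tail latency the unluckiest users see
```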
Likewise, percentiles allow you to look at the outliers more closely and target them to be fixed: higher percentiles (ex: 95%, 99%, 99.99%), aka 'tail latencies', can be extremely important to look at as those scenarios are when users are experiencing big pain points, often w a ton of data (users with a ton of data are likely long-standing users of your stuff...you wanna keep those folks happy).
ex: Amazon describes response time requirements in terms of 99.9th percentile or 1 in 1,000 requests...this is bc the customers w the slowest requests are often those w the most data on their accounts as they've made a ton of purchases (keep them happy!!)
percentiles are used in SLOs (service level objectives) and SLAs (service level agreements) to define expected performance. Violating the metrics defined in an SLA means the customer can demand a refund or whatever else is stipulated in the SLA...super important stuff.
p.17 - figure 1-5 is helpful
- scale up - vertical scaling, add more power to existing machines
- scale out - horiz scaling, add more machines
in general: keep your system to a single machine until you absolutely need to do otherwise, as distributing across multiple machines introduces a lot more complexity.
Most apps are built by layering one data model on top of another. ex:
- app dev looks at the real world business domain (ppl, orgs, goods, actions, money flows, etc) and models it in terms of objects or data structures and APIs that manipulate those structures...this is all specific to your app and what it's trying to do
- when you wanna store those structures in a DB, you do so often via JSON docs or tables in a relational DB or via a graph model
- engs who built your DB software decided on representing the above data in terms of bytes in memory, on disk, or on a network...this representation may allow the data to be queried, searched, manipulated, or processed in various ways
- on an even lower level, hardware engs have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, etc.
^ each layer hides the complexity below it.
Relational DBs came about via 'business data processing' done on mainframes in the 60's and 70's: transaction processing (sales/banking transactions, airline reservations, etc) and batch processing (customer invoicing, payroll, etc)...other DBs at the time forced app devs to think about the internal representation of the data in the DB...relational model = hid this behind a cleaner interface
Relational has held up well ofc.
NoSQL - storing all stuff together via JSON/XML...came thru in 2010s, why?
- a need for greater scalability than relational DBs can easily achieve, ex: very large datasets or very high write throughput
- ppl wanting open source solutions rather than proprietary DB software
- frustration with the restrictiveness of relational schemas
impedance mismatch - app code works w objects (often serialized as JSON/XML) and needs a translation layer to map them into a relational DB's tables, rows, and columns
^ hence the desire for the document model (noSQL). Also, having everything in a single JSON object reduces the need to go look up stuff in diff places (ex: consider multiple joins in diff tables for a query)...but this locality comes at the expense of then not being able to link lots of initially unrelated stuff together, and this is one of those areas where relational model reeeeally shines.
Given the rise in popularity of document model, relational DB solutions have introduced document model-esque datatypes, ex: JSONB in postgreSQL
if the chosen DB doesn't support JOINs, you'll have to do so at the application level, resulting in multiple queries to the DB. This becomes even more likely as you introduce more features to your app, as data has a tendency to become more connected.
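rough sketch of what an application-level "join" looks like (the `db.find_one`/`db.find_many` helpers here are made up, not a real client API):

```python
# Emulating a join in application code: two queries, then stitch the results together.
def resume_with_school_names(db, user_id):
    user = db.find_one("users", {"id": user_id})            # query 1: the user document
    school_ids = [e["school_id"] for e in user["education"]]
    schools = db.find_many("schools", {"id": school_ids})   # query 2: the referenced entities
    names = {s["id"]: s["name"] for s in schools}
    for entry in user["education"]:                         # the "join" happens in app code
        entry["school_name"] = names.get(entry["school_id"])
    return user
```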
For ex: maybe in building your own linkedIn, you want to have individual pages for schools rather than just having them as individual string values in person profiles? you'd want organizations and schools as actual entities in your DB just as people...
also consider recommendations: a user can write a recc for another user and it's placed in that user's resume...if the user who wrote the recc changes his profile pic, you want to see this reflected in the initial user's CV.
^ What we're talking about here is many-to-one relationships: many ppl live in one particular city, many ppl go to one particular school, many ppl can write a recommendation for one particular person....
"The main args in favor of the document model are schema flexibility, better performance due to data locality, and for some apps it's closer to the data structures used by the app (reduced impedance mismatch). The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships."
Which is best and makes life simpler for me? Realistically, that depends on the relationships that exist b/w data items in your app. If the data is more connected, the document model just becomes awkward and the relational model (and the graph model, for that matter) is a more natural fit.
most document DBs dont enforce a schema on the data in documents. This means that arbitrary keys/values can be added to a document and when reading, clients have no guarantee as to what fields the documents can contain.
Document DBs are sometimes called 'schemaless' but that's misleading -- the code that reads the data which comes back from the DB assumes some sort of structure, so this is basically an 'implicit structure'...so a more accurate term for document DBs is 'schema-on-read', as you can write anything to it but there is an assumed structure upon read.
schema-on-read is advantageous in some scenarios:
- if the items in the collection don't all have the same structure for some reason, ex:
- there are many diff types of objects and it's not practical to put each type of object in its own table
- the structure of the data is controlled by external systems which you don't control and may change at any time
^ in scenarios like these, a schema can hurt more than help, and schemaless documents are a more natural data model. But in cases where records need the same structure, schemas are a useful mechanism for enforcing it.
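tiny sketch of what "schema-on-read" looks like in app code (field names are hypothetical) - the reading code imposes the structure and has to tolerate older documents:

```python
# Schema-on-read: the reader, not the DB, decides what structure to expect.
# Older documents may lack `full_name` and only carry `first_name`/`last_name`.
def display_name(doc: dict) -> str:
    if "full_name" in doc:          # newer documents
        return doc["full_name"]
    # fall back for documents written before the (implicit) schema change
    return f"{doc.get('first_name', '')} {doc.get('last_name', '')}".strip()

print(display_name({"full_name": "Ada Lovelace"}))
print(display_name({"first_name": "Alan", "last_name": "Turing"}))
```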
a document is usually stored as a single string, encoded as JSON or XML. If your app needs to access the entire doc, there's an obvious performance advantage for the storage locality of a document DB. If data's split across multiple tables, this may require more disk seeks and take more time.
This advantage is only a thing if you need large parts of the document data all at once.
Heads up: Document and Relational models have converged...instead of seeing them at odds, they really do complement each other, ex:
- consider JSONB datatypes in relational DBs
- some document DBs now support relational-like JOINs in their query language
before SQL, query languages were pretty much imperative and you had to have a solid understanding of your DB, your data, and how it worked. SQL introduced a declarative way of querying your DB, so you state the 'what' that should be done rather than the 'how'...this abstraction means you don't have to think about optimizing queries yourself, as the DB's query optimizer can introduce any improvements it wishes without the end user needing to know.
Something I did not think about: Declarative languages also lend themselves to parallel execution!!!!
- imperative languages specify what should be done in what order --> declarative languages have a better chance to be performed in parallel execution bc they only specify what needs to be done, not the algo to do it...this frees the DB to use a parallel implementation of the query language, if appropriate.
if your app has mostly one-to-many relationships (tree-structured data) or no relationships b/w records, document model is appropriate. BUT...what if many-to-many relationships are common in your data? The relational model can handle simple cases of this, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a GRAPH.
Graphs consist of 2 kinds of objects:
- Vertices - aka 'nodes' or entities
- Edges - aka relationships
ex:
- social graphs - vertices are ppl, edges indicate who knows who
- web graph - vertices are web pages, edges indicate HTML links to other pages
algos can work really well on graphs - ex:
- determine the shortest path between 2 points on a road network
- PageRank can use the web graph to determine web page popularity for search result ranking
^ In these examples, data is homogenous...data all represents the same thing (ex: each vertex is a webpage or each vertex is a road junction), but graphs can also represent different types of data and their relationships within a single datastore, ex:
- Facebook maintains a single graph w many diff vertice types and edges:
- Vertices: ppl, locations, events, check-ins, user comments
- Edges: who's friends w who, which check-in happened at what location, who attended what event, etc
Let's look at one way: property graphs.
In a property graph model:
- Each vertex contains:
- unique identifier
- a set of outgoing edges (who do I relate to?)
- a set of incoming edges (who relates to me?)
- A collection of properties (key/value pairs)
- Each edge contains:
- unique identifier
- The vertex at which the edge starts (the 'tail vertex')
- The vertex at which the edge ends (the 'head vertex')
- A label to describe the relationship b/w the 2 vertices
- A collection of properties
^ Think of a graph store as consisting of 2 relational tables, one for vertices and one for edges:
CREATE TABLE vertices (
vertex_id integer PRIMARY KEY,
properties json
);
CREATE TABLE edges (
edge_id integer PRIMARY KEY,
tail_vertex integer REFERENCES vertices (vertex_id),
head_vertex integer REFERENCES vertices (vertex_id),
label text,
properties json
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
Here's some really important aspects of the graph model:
- Any vertex can have an edge connecting it with any other vertex. There's no schema that restricts what can/cannot be associated
- Given any vertex, you can efficiently find both its incoming AND outgoing edges and thus traverse the graph...this is why we created the edges_tails and edges_heads indexes above
- by using different labels for different kinds of relationships, you can store several diff kinds of info in a single graph, while maintaining a clean data model.
^ This makes graphs crazy flexible for data modeling, which is really helpful for stuff that would be difficult to model in a relational way, ex: diff regional structures in diff countries, quirks of history like a country within a country, varying granularity of data .
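to make the traversal idea concrete, here's a toy in-memory version of those 2 tables in Python (a sketch with made-up data, not a real graph store):

```python
# Toy property graph: same shape as the vertices/edges tables above.
vertices = {
    1: {"type": "person", "name": "Lucy"},
    2: {"type": "city", "name": "Idaho Falls"},
    3: {"type": "state", "name": "Idaho"},
}
edges = [
    {"tail": 1, "head": 2, "label": "born_in"},
    {"tail": 2, "head": 3, "label": "within"},
]

def outgoing(v):  # edges leaving vertex v (the edges_tails index in the SQL version)
    return [e for e in edges if e["tail"] == v]

def incoming(v):  # edges arriving at vertex v (the edges_heads index)
    return [e for e in edges if e["head"] == v]

# walk "born_in" then "within" to find which state Lucy was born in
city = outgoing(1)[0]["head"]
state = outgoing(city)[0]["head"]
print(vertices[state]["name"])   # Idaho
```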
Graphs are great for evolvability: as you add features to your app, a graph can easily be extended to accommodate the changes, ex:
- say you've got a social graph for 2 diff people: David and Lucy...you could have a ton of granular data about where Lucy is from and her interests and education, not much there for David, but no worries...as time goes on, you could also use this 'social' graph to have food allergen entities that can be linked to David and/or Lucy based on their own allergies. Afterwards, you could link those to foods that they could eat together without having to worry about adverse effects, and you could overlay this data on menus from local restaurants based on their location....in general, graphs make it really easy to link initially unrelated data and extend the model. P cool.
Graphs are often queried with the Cypher Query language rather than SQL...while you can model a graph within a relational DB, querying it via SQL can be like going around your ass to get to your elbow. Cypher was created specifically for Graph querying...use the right tool (p.52-55)
Summary on p.63 is great.
When a data format or schema changes, a change to the app code usually follows (ex: schema migrations in relational DBs)...but in a large app, these things don't just happen instantaneously:
- Staged rollout - deploying a new version of a server-side app to a few nodes at a time, checking that the new version is running smoothly, and working your way up to deploying to all nodes...this allows new versions to be deployed w/o service downtime.
since old and new versions of the code may exist at the same time, we need compatibility in both directions:
- backward compatibility - new code can read data written by older code
- usually easier bc you know what the old code is and you have context
- forward compatibility - vice versa -
- this can be trickier bc it requires the old code to ignore additions made by new code
In memory, data is kept in objects, structs, etc...when we wanna send it over the wire, we have to encode it in a universal format (JSON, XML)...translating from the in-memory representation to a byte sequence is called encoding (aka serialization/marshalling), and the reverse is decoding (aka parsing/unmarshalling).
JSON note - JSON distinguishes strings and nums but it does not distinguish integers and floating-point numbers, and it doesn't specify a precision!!!!
JSON/XML has solid support for Unicode character strings (ex: human readable text) but otherwise for binary strings (ex: images) you're kinda screwed...to get around this, ppl encode binary data via base64.
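quick illustration of both gotchas (toy values, exact number behavior varies by JSON parser/language):

```python
import base64, json

# JSON numbers: integers above 2**53 can silently lose precision in decoders
# that use floats. Python keeps them exact, but not every language does.
big = 2**53 + 1
print(json.loads(json.dumps(big)) == big)    # True here, not guaranteed everywhere

# Binary data has no native JSON type, so it's commonly smuggled in as Base64 text.
image_bytes = b"\x89PNG..."                   # pretend binary payload
doc = {"avatar": base64.b64encode(image_bytes).decode("ascii")}
print(base64.b64decode(doc["avatar"]) == image_bytes)   # True
```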
Thought: Data often outlives code - the data can stay the same for years on end, though the code written around it can change wildly.
Service - an API exposed by the server
REST - not a protocol, but a design philosophy that builds upon the principles of HTTP:
- simple data formats
- using URLS for identifying resources
- using HTTP features for auth, cache control, content-type negotiation
an API designed according to REST principles is considered RESTful.
RPC model - tries to make a req to a remote network service look the same as calling a fn in your local language (ex: http reqs in JS), this can be problematic bc:
- you ultimately dont control what happens outside your realm
- a local fn either returns res or an err or doesn't return at all (ex: infinite loop), but in calling a remote resource, a timeout could occur...you just dont know what happened at all, as you get no response from the resource
- idempotence issues could arise (ex: what if I accidentally transfer $200 to person X twice?)
- RPC fns take much longer than local fns
gRPC - an RPC implementation using protocol buffers. gRPC supports streams where a call consists of not just one req/res but a series of them over time....
Custom RPC protocols w binary encoding format can achieve better performance than something generic like JSON over REST...BUT a RESTful API has other advantages: widely supported, vast ecosystem of tools, cURL it, etc.
message-broker (aka message queue) - stores a request (a message) temporarily. Ex: Kafka, NSQ
advantages over RPC:
- can act as a buffer if the recipient (ex: bitly2 service) is temporarily unavailable or overloaded, which improves the system reliability
- can automatically re-deliver message to a previously-crashed process
- allows one message to easily go to any number of recipients
- avoids having to expose the IP address and port number of the recipient(s)
- decouples the sender from the recipient
^ this is async: the sender just sends the message and forgets about it (no res needed)
how this actually works:
- one process sends a message to a named queue or topic
- broker ensures that message is delivered to one or more subscribers/consumers of that topic (there can be many producers and consumers of a topic)
^ 1-way dataflow, but a consumer can itself publish messages to another topic or to a reply queue that is consumed by the sender of the original message (allowing a req/res dataflow, if necessary).
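toy sketch of the topic/publish/consume flow above (in-memory only; real brokers like Kafka add durability, redelivery, consumer groups, etc):

```python
from collections import defaultdict, deque

class Broker:
    """Producers append to a named topic queue; consumers pull at their own pace,
    so the sender never waits on them (fire-and-forget)."""
    def __init__(self):
        self.topics = defaultdict(deque)        # topic name -> queue of messages

    def publish(self, topic, message):          # async from the producer's point of view
        self.topics[topic].append(message)

    def consume(self, topic):                   # a consumer drains messages later
        while self.topics[topic]:
            yield self.topics[topic].popleft()

broker = Broker()
broker.publish("shortened-urls", {"url": "https://example.com", "code": "abc123"})
for msg in broker.consume("shortened-urls"):
    print("consumer got:", msg)
```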
- scalability - if data volume or read/write load grows bigger than a single machine, spread it out across multiple machines
- fault tolerance/high availability - your app needs to keep working even if one/multiple nodes (or even a whole datacenter if needed) goes down...when one fails, another should be able to take over.
- latency - if you've got servers around the world, you might wanna spread servers to various locations to reduce network packet wait times
'shared-nothing architectures' aka horizontal scaling - no special hardware is required here, so you can use whatever makes the most financial sense per performance. This allows you to use machines across the world to reduce latency and survive entire datacenter outages.
shared-nothing architectures can be complex and they require a lot of caution, bc coordination has to happen at the software layer (i.e. 'it's your problem now!')
- replication - keeping a copy of the same data on multiple nodes (redundancy = safety...can also improve perf)
- partitioning - split a big DB into subsets/partition so that each can be assigned to diff nodes (aka 'sharding')
These 2 notions often go hand-in-hand...you can have partition replicas in diff nodes.
aka 'keeping a copy on multiple machines'
why?
- keep data geographically close to your users
- allow system to keep working when some parts fail (aka increase availability)
- scale the num of machines that can serve read queries (aka increase throughput)
if the data never changes, no biggie, just replicate the data to all your nodes and call it a day. However, if that's not the case (prob isn't), you gotta consider HOW you'll handle changes to the data across all of your nodes..this is where it gets tricky
There's 3 popular algos for replicating data changes across nodes, and almost all distributed DBs use one of these approaches:
- single-leader
- multi-leader
- leaderless
replica - a node that stores a copy of the DB
every DB write needs to be processed by every replica..most common solution to do this is leader-based replication:
- one replica is designated as leader; when clients wanna write to the DB, they must send their reqs to the leader, which first writes the new data to its local storage
- all other replicas are followers and they just do what the leader does...when the leader writes new data to its DB, it sends the data change to all of its followers via a 'replication log', and they write the data to their own local DBs in the order described in the log.
- while writes only go through the leader, reads can come from the leader or any follower
^ This mode of replication is built-into a lot of relational DBs like postgreSQL, and isn't just restricted to DBs either: message brokers like kafka can do the same thing.
But when does replication occur? Does this happen sync or async?
- sync - ex: client writes to leader db, client has to wait till all replicas have updated before getting a response
- async - client writes to a leader db and gets a res back after leader has written successfully (dont have to wait for replicas)
obviously, there's a speed advantage to being async, but what if the leader goes down before the followers have replicated that write? The write is lost and you're kinda screwed. In that case, sync replication works better bc we know the data is on all nodes. However, in a fully sync scenario, if a single follower goes down or is slow, writes can't complete at all (ex: I've got 1 leader, 4 followers and the first follower goes down, so the write blocks and never finishes).
Bc of the above it's impractical for ALL followers to be synchronous (any node outage borks the entire system)...we can have a middle ground here: what if just one follower is synchronous and the rest are async? If this sync follower goes down or is slow, we convert an async follower into a sync one. <-- This guarantees that we always have an up-to-date copy of the data on at least 2 nodes: leader and 1 sync follower....This is called semi-synchronous replication.
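rough sketch of that semi-synchronous write path (the `Follower.apply` call is hypothetical and blocking here; real systems ship a replication log and do the async part in the background):

```python
class Follower:
    def __init__(self, name):
        self.name, self.data = name, {}

    def apply(self, key, value):
        self.data[key] = value
        return True                                # ack back to the leader

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, key, value):
        self.data[key] = value                     # 1. write locally
        ok = self.sync_follower.apply(key, value)  # 2. wait for the ONE sync follower
        for f in self.async_followers:             # 3. the rest are best-effort / async
            f.apply(key, value)                    #    (a background task in reality)
        return ok                                  # respond once leader + 1 follower have it

leader = Leader(Follower("f1"), [Follower("f2"), Follower("f3")])
print(leader.write("user:42", {"name": "Lucy"}))   # True
```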
OFTEN - leader-based replication is completely async -- this has its own problems: if the leader fails and isn't recoverable, any writes that haven't been replicated are lost. BUT async configurations mean that a leader can keep on processing writes even if followers have fallen behind.
Setting up new followers: how can you do this without downtime?
Just copying data over from one node to another doesn't work, bc you could have writes happen at this time that would be lost as a standard file copy might not have the updated data. You could lock the leader DB so no writes could occur, but that would severely limit its availability.
Here's how this typically works:
- take a snapshot of the leader's DB at a point in time
- copy the snapshot to the new follower node
- follower connects to the leader and requests all data changes since the snapshot occurred. This means that the snapshot is associated with an exact position in the leader's replication log (called a 'log sequence number' in PostgreSQL)
- When new follower has processed backlog of data changes since the snapshot, it's fully caught up.
we wanna make sure the system stays good even if nodes fail, how do we ensure high availability with leader-based replication?
each follower has a log of changes it's gotten from the leader. If something goes south, it can then (upon restart) reconnect to the leader and request all data changes from the leader since the last record in its own changelog
failover = current leader blows it, another node has to become the new leader...this can occur manually or automatically...
automatic failover steps:
- determine leader has failed - if a node doesn't respond to another in an allotted amount of time, it's presumed dead
- choose new leader - best candidate is a follower w the most up-to-date data..getting all nodes to agree on the new leader is a consensus problem
- reconfigure system to use new leader - send write reqs to the new leader. If the old leader comes back for some reason, the system needs to ensure it steps down and becomes a follower.
A lot of issues can occur w failover:
- if we're doing async replication (prob), the new leader might not have the most up-to-date data from the old leader. If the former leader returns, what happens to all of its writes that never got replicated? Most commonly ppl will just dump the old leader's writes...which is kinda messed up from a durability perspective
- discarding writes can be really problematic if DB writes need to be coordinated w an outside system, ex at github:
- out-of-date follower got promoted as a leader, DB used an auto-incrementor to assign primary keys to new rows but bc new leader's counter lagged behind the old leader, it reused previously assigned keys...meaning that some users got read access to other users' data 😱
- What's the right timeout to actually decide that the leader is dead?
Several methods:
- Statement-based replication - every write req (aka statement) a leader DB executes is written to a log and sent to followers (ex: every INSERT/UPDATE/DELETE)...sounds good, but this can be problematic:
- any statement that uses a non-deterministic (never the same every time) fn like NOW() could get diff values on each replica
- if statements depend on other data in the DB (ex: UPDATE ... WHERE some condition), they must be executed in exactly the same order on each replica or else there can be a diff effect
- statements that have side effects (triggers, procedures, etc) may have unwanted side effects if they are not completely deterministic
- Write-ahead log shipping (WAL) - DB storage engines represent data on a disk and usually every write is appended to a log:
- log-structured storage engine - the log is the main place for storage; log segments are compacted and garbage-collected in the bg
- B-tree storage engine, this overwrites individual disk blocks, so every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.
^ Regardless of case, the log is an append-only sequence of bytes of writes to a DB. We can use the existing write log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers...this is used in PostgreSQL
Disadvantage to WAL method - WAL log contains details of which bytes were changed in which disk blocks...this closely couples the replication details to the storage engine...if the DB changes its storage format to another version, its often not possible to run diff. versions of DB software on the leader and the followers.
- Logical (row-based) replication - use diff logs for replication and for the DB storage engine...the replication log is called a 'logical log'...it's basically a sequence of records describing DB writes to tables at row granularity:
- inserted row? log contains all new vals for all columns
- deleted row? log contains enough info to determine specific row deleted (ex: primary key)
- updated row? log contains enough info to detect chosen row and all columns' updated vals
Since the logical log is decoupled from the actual DB storage engine, this means nodes can run on diff versions of the DB software unbothered (allowing backwards compatibility)
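rough idea of what row-level "logical log" records might look like and how a follower replays them (field names are mine, not any specific DB's format):

```python
# Hypothetical row-level change records, in log order.
log = [
    {"op": "insert", "table": "users", "row": {"id": 7, "name": "Lucy"}},
    {"op": "update", "table": "users", "key": 7, "set": {"name": "Lucy K."}},
    {"op": "delete", "table": "users", "key": 7},
]

def apply(tables, record):
    """A follower replays records in log order against its own copy of the data."""
    t = tables.setdefault(record["table"], {})
    if record["op"] == "insert":
        t[record["row"]["id"]] = record["row"]
    elif record["op"] == "update":
        t[record["key"]].update(record["set"])
    elif record["op"] == "delete":
        del t[record["key"]]

follower = {}
for rec in log:
    apply(follower, rec)
print(follower)   # {'users': {}} -- the row was inserted, updated, then deleted
```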
"eventual consistency" - all follower nodes will eventually catch up with the data state of the leader.
Ofc, there can be replication lag and here's a few problems that can occur from it:
- read your own writes - if you write to the db via the leader and then try to read it back, the read can come from the leader or a follower...since there's more followers than leaders, it'll likely come from a follower which may not have the new data replicated yet. Here we need read-after-write consistency, meaning that if a user writes to the db, their subsequent reads should always contain that new data (ex: user reloads the page).
- when reading something the user may have modified, always read from the leader, else use a follower...this means you actually gotta know what may have been modified without actually querying it
- ex: reading your own profile info on a social media site? read from the leader bc you may have just edited it
- you could also track the time since the user's last update and use that to decide whether to read from the leader or a follower
- you could also monitor the replication lag on followers and prevent any queries on followers that are more than X behind the leader
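sketch of the time-based trick from the list above: remember when a user last wrote, and route their reads to the leader for a short window (names and thresholds here are made up):

```python
import random
import time

RECENT_WRITE_WINDOW = 60      # seconds; pick something > your worst-case replication lag
last_write_at = {}            # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica_for_read(user_id, leader, followers):
    if time.time() - last_write_at.get(user_id, 0) < RECENT_WRITE_WINDOW:
        return leader                      # user wrote recently: their read must see it
    return random.choice(followers)        # otherwise any follower is fine

record_write("user-42")
print(choose_replica_for_read("user-42", "leader", ["f1", "f2"]))   # 'leader'
print(choose_replica_for_read("user-7",  "leader", ["f1", "f2"]))   # some follower
```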
- Monotonic reads - make sure you only see data moving forward in time - ex: imagine you write a comment on a post, then someone else replies...if successive read queries hit followers with different amounts of lag, you might see the reply on one refresh and then have it vanish on the next...you want reads to never appear to go backwards chronologically.
- How to achieve this? make sure that each user always reads from the same replica (ex: replica is chosen based on a hash of the user ID)
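tiny sketch of that pinning idea (using a stable hash, since Python's built-in hash() is randomized per process and would un-pin users across restarts):

```python
import hashlib

def replica_for(user_id: str, followers: list):
    """Monotonic reads: pin each user to one follower so their reads never go 'back in time'."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return followers[int.from_bytes(digest[:4], "big") % len(followers)]

print(replica_for("user-42", ["follower-1", "follower-2", "follower-3"]))
```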
- Consistent Prefix Reads - if a sequence of writes happens in a certain order, the corresponding reads will be seen in that same order...this can be a problem in sharded DBs, where a related sequence of writes happens in diff partitions/shards, so there is no global ordering of writes....when a user reads from the DB, they may see some parts of the DB in an older state and some in a newer state
- Ex: imagine seeing a conversation where the replies appear before the questions...this is that kind of situation.
- solution? Write to the same partition if possible...but in some apps, this cant be done efficiently...you'll have to have an algo to explicitly keep track of these 'causal dependencies'.
when working with an 'eventually consistent' system, we need to think about how the app behaves under various amounts of replication lag: a few mins/hours/etc?...if the result is a bad experience, we need to think about designing the system to provide a stronger guarantee (ex: read-after-write)
It'd be better if app devs dont have to worry about this at all, and this is why transactions exist....this helps DBs provide stronger guarantees all around.
having a single leader introduces a single point of failure in the system, so the natural thought is what if we have multiple leaders, ex: "1 leader per datacenter, each leader talks w the others in other datacenters to catch up"...this has some interesting points:
- performance: allow a better performance from using datacenters from where users are geographically-based
- datacenter outages: you'll always have backups
- network issues: traffic b/w datacenters goes over the public internet, which is less reliable than a local network...w a leader in each datacenter, writes are handled locally and replicated async to the other datacenters, so a temporary inter-datacenter network problem doesn't block writes
The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required, ex: "user1 changes A -->B and user2 changes A -->C...if each write happens on a different leader, is applied locally and then broadcast, a conflict will occur."...in a regular single-leader approach, user2 would have to wait for user1. Here we could make the conflict detection synchronous (i.e. wait for the write to be replicated to all replicas before providing a successful response), but this defeats the purpose of using multi-leader replication (to accept writes and keep it moving)...in that case, you could've just used the single-leader method.
how can we help this?
- Avoid the conflict entirely: make sure users always go to the same leader, same datacenter for reads/writes (ex: based on geographic location)...but then again, what if a user goes to another location closer to another datacenter? they'd need to be moved and now we cannot avoid the potential of concurrent writes on diff leaders :(
- you could give each write a unique ID (ex: timestamp, UUID), pick the write with the highest ID as the winner and throw away the other writes. If the ID is a timestamp, this is known as 'last write wins' (see the sketch after this list)
- There's plenty of other ways (p173)
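minimal sketch of the 'last write wins' approach from the list above (toy values):

```python
# Every write carries a timestamp; on conflict the highest timestamp survives
# and the other writes are silently discarded.
conflicting_writes = [
    {"value": "B", "ts": 1700000001.2, "leader": "us-east"},   # user1's write
    {"value": "C", "ts": 1700000001.9, "leader": "eu-west"},   # user2's write
]

winner = max(conflicting_writes, key=lambda w: w["ts"])
print(winner["value"])   # 'C' -- user1's write to 'B' is lost, which is the big caveat of LWW
```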
client sends writes to several replicas OR a coordinator node does so on behalf of the client. Unlike a leader DB, the coordinator doesn't enforce a particular ordering of writes...this is wild...and when a read occurs, the read request is sent to several nodes in parallel, ex: "I've got 3 nodes, I send a read req to all 3; if node 2 has value 'hi' and nodes 1 and 3 have 'hello', then we can assert that 1 and 3 are correct and 2 is stale data."
How do we fix stale data? All data should be copied to every replica right? If a node goes down, how does it catch up?
- read repair - on read (ex above, change node 2 to 'hello')...works really well for values that are frequently read
- anti-entropy process: some datastores have a bg process that constantly looks for differences in data b/w replicas and copies any missing data from one replica to another. Note: this doesn't copy writes in any particular order and there may be a long delay before data is copied
^ So that's reads...what about writes? Writes are also sent to all replicas and considered successful once enough nodes (a quorum) have acknowledged them (similar idea to above)
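toy sketch tying the quorum + read repair ideas together (n=3, made-up version numbers):

```python
replicas = {
    "node1": {"value": "hello", "version": 2},
    "node2": {"value": "hi",    "version": 1},   # stale replica
    "node3": {"value": "hello", "version": 2},
}

def quorum_write(key_replicas, value, version, w=2):
    acks = 0
    for node in key_replicas:                      # in reality these are parallel requests
        key_replicas[node] = {"value": value, "version": version}
        acks += 1
    return acks >= w                               # success once w nodes have acknowledged

def quorum_read(key_replicas, r=2):
    responses = list(key_replicas.items())         # in reality: wait for at least r responses
    newest = max(responses, key=lambda kv: kv[1]["version"])[1]
    for node, stored in responses:                 # read repair: push newest value to stale nodes
        if stored["version"] < newest["version"]:
            key_replicas[node] = dict(newest)
    return newest["value"]

print(quorum_read(replicas))                        # 'hello' -- node2 is outvoted, then repaired
print(quorum_write(replicas, "hello world", 3))     # True once w replicas ack
```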
leader-based replication - the db typically exposes metrics for replication lag that you can feed into a monitoring system. We can do this bc writes happen on the leader and followers in the same order, and each node has a position in the replication log, so you can diff the leader's and a follower's positions to measure the amount of replication lag. Leaderless replication - there's no fixed order in which writes are applied + if the db only uses read-repair (no bg anti-entropy process) the data could be suuuper old...no good answer here.
DBs w appropriately configured quorums can tolerate individual node failures without having to do failovers (i.e. 'appoint a new leader'). They can also deal w individual nodes going slower bc reqs dont have to wait for ALL the nodes to respond, they can return when the necessary 'read' amount or 'write' amount of nodes have responded.
For this reason, leaderless replication can be a good choice for use cases that need high availability and low latency AND can tolerate occasional stale reads.
But this still has some issues: imagine a network interruption cuts a client off from a large number of the nodes, preventing a quorum from being reached...other clients may still be able to reach those nodes, but what should that first client do?...idk about this section (this is where the book gets into 'sloppy quorums')...
sharding - breaking big datasets into partitions
why do we do this? Scalability. Diff partitions can be put on diff nodes in a shared-nothing cluster. Ex:
- queries on a single partition - each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes
- queries on multiple partitions - can be potentially parallelized out across many nodes, but this becomes much, much harder (ex: I'm trying to do a query on data across 4 diff partitions)
partitioning is often combined with replication so data is split up and redundant'ed for fault tolerance.
A node can store more than one partition.
peep p201 diagram to see a leader-follower replication model with replication + partition....everything we talked about re: replication applies to partitioning as well (ex: I have 3 copies of partition 2 spread across 3 diff nodes and I have a single leader-follower replication setup for these partitions)...some partitions are leaders and some are followers.
Our goal w partitioning = spread the data and query load evenly across nodes! ex: 10 nodes should be able to handle 10x as much read/write throughput of a single node.
if partitioning is unfair and not even, we call that being skewed, this makes partitioning less effective...a partition w a disproportionately high load is known as a hot spot.
how can we avoid hot spots?
- assign data randomly across nodes - problem = wanna find some data? you've got no clue which node its on
- split it up like a dictionary - note: partition boundaries need to map to the data, not the notion of how it should be split up (ex: dictionary as 1 book with A-B and another book with T, U, V, W,X,Y,Z as A-B is huuge and T-Z isn't)
Partition boundaries can either be chosen manually or via the DB itself.
Bc of the risk of skew and hot spots, many distributed datastores use a hash fn to determine the partition for a given key (each partition gets a range of hashes and every key whose hash falls within a partition's range will be stored in that partition).
^ technique is really good at distributing keys fairly among partitions...peep diagram on p.204
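minimal sketch of hash-based partition assignment (modulo used here for simplicity; many stores assign each partition a contiguous range of the hash space instead):

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """A stable hash of the key spreads keys evenly across partitions,
    regardless of how skewed the raw key values are."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

for key in ["user:1", "user:2", "2024-01-01T00:00:00"]:
    print(key, "-> partition", partition_for(key))
```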
drawback of hash-based partitioning = we cannot as easily do range queries (ex: "give me all the data from x-y timestamps"...if we initially key off of a timestamp and partitioned the data based on that timestamp, its p easy to go find/grab that data...after all keys have been hashed/assigned...not so much).
^ Ex: in MongoDB, in hash-based sharding mode, any range query has to be sent to all partitions.
hash-based sharding is good but it doesn't exactly solve the issue of skewing/hot spots bc in the extreme case where all reads/writes are for the same key, you still end up w all requests hitting the same partition...ex: Imagine Paris Hilton's twitter in '05 is on a single partition and some scandal about her drops...you could have a massive hot spot problem here.
So far we've discussed a key-value data model of partitioning. If records are only accessed via their primary key, we can determine the partition and route read/write reqs to the associated one. But this becomes more tricky when secondary indexes get involved
- secondary index - instead of having an index by a primary key (ex: 191 --> { color: "red", make: "Honda", model: "Civic"} ) a secondary index contains methods to search specific stuff within your data (ex: "gimme all your red cars" --> [191, 306], or "find all articles with the word 'your mom' in it")
Unfortunately secondary indexes dont mesh too well with partitioning...here's some options:
you can create 'local' secondary indexes that are specific to the partition (gimme all your red cars in THIS partition) but maybe not all red cars live in one partition..maybe red cars are spread across several partitions (makes sense) and so then you'd have to send the query to ALL partitions and then combine all the results you get back.
This approach to querying a partitioned DB is known as scatter/gather and it can be really expensive. Most DBs recommend that you structure your partitioning scheme so that secondary index queries can be served from a single partition but as we see above that's not always possible.
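toy sketch of scatter/gather over partition-local secondary indexes (made-up car data):

```python
# Each partition keeps its own ("local") secondary index, so a query on that
# index has to be sent to every partition and the partial results merged.
partitions = [
    {"191": {"color": "red"}, "214": {"color": "black"}},
    {"306": {"color": "red"}, "515": {"color": "silver"}},
]
local_color_index = [
    {"red": ["191"], "black": ["214"]},     # index for partition 0 only
    {"red": ["306"], "silver": ["515"]},    # index for partition 1 only
]

def find_by_color(color):
    results = []
    for idx in local_color_index:            # scatter: hit every partition's index
        results.extend(idx.get(color, []))   # gather: merge the partial results
    return results

print(find_by_color("red"))   # ['191', '306'] -- red cars live on both partitions
```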
Rather than each partition having its own local index, we can have a single global index that we then partition across nodes (i.e. one encyclopedia broken up across diff rooms). We can either partition the index by the term itself (ex: color:black), which is handy for range scans over terms, or by a hashed value of the term, which gives a more even distribution of load...it's really about your app-specific needs here.
- Pros - instead of doing scatter/gather across all partitions, you can do a range query going to just 1 partition
- Cons - writes are slower bc upon write, you then have to find and update the secondary global index (ex: "eh..which partition is that slice of the global index on?") and it might not be on the node in which you write the change to the DB
Ideally, upon each write, the db's secondary index would be updated immediately and then we'd give a response. However, in a term-partitioned index that'd require a distributed transaction across all partitions affected by the write.
In real world practice, updates to global secondary indexes are often async.
over time, things change in a DB
- more ppl have more queries (throughput increases) so you wanna add more CPUs to handle the load
- dataset increases, you need to add more disks and RAM to store it
- some machine fails so another one needs to come thru to pick up the slack
^ This means that data and reqs need to be moved from one node to another = rebalancing
In rebalancing, we've still got some table-stakes requirements:
- after rebalancing, the load (reads/writes/storage) should be evenly distributed across nodes in a cluster
- while rebalancing, reads/writes should still be possible
- No more data than necessary should be moved between nodes (be efficient)
- Fixed num of partitions - simple method: instead of moving around the data b/w partitions, just create way more partitions than there are nodes so that when you need to add new nodes, it can grab some partitions easily and rebalance the cluster's load. The ONLY thing that changes is the assignment of partitions to nodes, not the amount of partitions or the assignment of keys to partitions.
- ex: 10 nodes w 1000 partitions split up evenly, then we add another 10 nodes, so now instead of 100 partitions/node we have 50 partitions/node
- Ideally, you can even account for mismatched hardware in your cluster: by assigning more partitions to more powerful nodes, you can force them to take a greater share of the load
This sounds nice in theory but a big problem is that if the data is variable (ex: "I didn't have much data but then my app took off and I have a ton now"), it's hard to calculate the number of total partitions from the start:
- if partitions are too big, rebalancing and recovery from node failures are expensive
- if too small, they incur too much overhead.
It's hard to find the sweet spot here.
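sketch of the fixed-number-of-partitions approach above (the "steal a fair share" logic is simplified; real systems pick which partitions to move more carefully):

```python
# 1000 partitions exist from day one; adding a node only changes which node
# owns some of them, never the number of partitions or the key->partition mapping.
NUM_PARTITIONS = 1000

def add_node(assignment, new_node):
    nodes = sorted(set(assignment.values())) + [new_node]
    target = NUM_PARTITIONS // len(nodes)          # fair share per node after the join
    counts = {}
    for owner in assignment.values():
        counts[owner] = counts.get(owner, 0) + 1
    stolen = 0
    for partition, owner in sorted(assignment.items()):
        if stolen >= target:
            break
        if counts[owner] > target:                 # only steal from nodes above their share
            assignment[partition] = new_node
            counts[owner] -= 1
            stolen += 1

assignment = {p: f"node{p % 10}" for p in range(NUM_PARTITIONS)}   # 10 nodes, 100 partitions each
add_node(assignment, "node10")
print(sum(1 for owner in assignment.values() if owner == "node10"))  # ~90 partitions moved, nothing else touched
```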
- Dynamic partitioning - some DBs do this automatically...when a partition grows past its configured size, it's split into 2 partitions...on the flip side, if a partition shrinks below a given threshold, it gets merged into an adjacent partition.
- each partition is assigned to one node, each node can handle multiple partitions. After a partition is split into 2, the second partition goes to another node to balance load.
This partition amount flexibility is really nice..dynamic partitioning works really well with key-range partitioning (ex: encyclopedia-like: abr- apl) but also hash-partitioning.
- Partitioning proportionally to nodes - think about what we've got above:
- fixed num of partitions - the size of each partition grows with the dataset
- dynamic partitioning - the num of partitions grows/shrinks with the size of the dataset
In both of these scenarios, the num of partitions is completely independent to the number of nodes in the cluster. Another option is to make the num of partitions proportional to the num of nodes, i.e. have a fixed amount of partitions/node
- when a new node joins the cluster - it randomly chooses a fixed num of existing partitions to split, then takes ownership of one half of each split partition while leaving the other half in place.
Picking partition boundaries randomly requires hash-based partitioning (so the boundaries can be picked from the range of nums produced by the original hash fn)
Note: re: automatic vs manual rebalancing, it's not an either/or scenario and there's a fine gradient between the two, it makes sense to always allow for the possibility of manual rebalancing. Rebalancing is an expensive operation, requiring re-routing requests and moving a large amount of data from one node to another...if we don't do this carefully, this can overload the network or the nodes and mess up the performance of requests while rebalancing is in progress.
automating rebalancing can have unintended consequences in combination w automatic failure detection, ex:
- a node is overloaded, becomes temporarily slow to respond, nodes conclude its dead, rebalance cluster to spread its load elsewhere, putting additional work on the already-overloaded node and all other nodes + the network, making the situation worse
Always good to have the possibility for manual rebalancing to prevent operational surprises.
so we've chatted at length about rebalancing strategies, but when a client wants some data, how do they know which node they should go to now to get it?? Since partitions change locations, someone has to be in the loop to where they went ("I wanna read the key of 'foo' to get my data, so which port and IP address should I connect to now?")
This is an instance of a more general problem called service discovery (which isn't limited to just DBs).
here's some potential approaches to this problem (see diagram on p.215):
- client can hit up any node, if node doesn't have the primary key, it forwards the request to the appropriate node which processes + returns the response
- send all clients to a routing tier first, which determines the node responsible and forwards it accordingly
- require that clients be aware of the partitioning changes so they can always directly connect to the correct node
Many distributed data systems rely on approach num 2 above, using a separate coordination service (ex: ZooKeeper) to keep track of which partitions live on which nodes along with their port and IP address...the routing tier subscribes to this info in the coordination service and forwards each client request to the right node.