- Geographically distributed, read-optimized, graph data store.
- Favors availability and efficiency over consistency.
- Developed by and used within Facebook (social graph).
- Link to paper.
- Facebook's servers directly accessed MySQL to read/write the social graph.
- Memcache used as a look-aside cache.
- This design had several issues:
- Inefficient edge lists - A key-value store is not a good design for storing a list of edges.
- Distributed control logic - In a look-aside cache architecture, the control logic runs on the clients, which increases the number of failure modes.
- Expensive read-after-write consistency - Facebook used asynchronous master-slave replication for MySQL, which introduced a time lag before the latest data was reflected in the local replicas.
Objects
- Typed nodes (type is denoted by `otype`) - Identified by 64-bit integers.
- Contains data in the form of key-value pairs.
- Models users and repeatable actions (eg `comments`).
Associations
- Typed directed edges between objects (type is denoted by `atype`) - Identified by the source object `id1`, `atype`, and the destination object `id2`.
- Contains data in the form of key-value pairs.
- Contains a 32-bit time field.
- Models actions that happen at most once or record a state transition (eg `like`).
- Often the inverse association is also meaningful (eg `like` and `liked by`).
- Support to create, retrieve, update and delete objects and associations.
- Support to get all associations (`assoc_get`) or their count (`assoc_count`) based on starting node, time, index, and limit parameters.
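The object/association model and the queries above can be sketched as a minimal in-memory store. All class and field names here are hypothetical; the real system is a cache layered over sharded MySQL.

```python
from collections import defaultdict

class TinyTAO:
    def __init__(self):
        self.objects = {}                 # id -> (otype, {key: value})
        self.assocs = defaultdict(list)   # (id1, atype) -> [(time, id2)]

    def obj_add(self, oid, otype, data):
        self.objects[oid] = (otype, dict(data))

    def assoc_add(self, id1, atype, id2, time):
        self.assocs[(id1, atype)].append((time, id2))
        self.assocs[(id1, atype)].sort(reverse=True)   # newest first

    def assoc_get(self, id1, atype, pos=0, limit=10, low=0, high=2**31):
        # Filter by time window, then apply index/limit pagination.
        hits = [a for a in self.assocs[(id1, atype)] if low <= a[0] <= high]
        return hits[pos:pos + limit]

    def assoc_count(self, id1, atype):
        return len(self.assocs[(id1, atype)])
```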
- Objects and associations stored in MySQL.
- TAO API mapped to SQL queries.
- Data divided into logical shards.
- Objects bound to the shard for their lifetime (`shard_id` is embedded in `id`).
- Associations stored on the shard of `id1` (for faster association queries).
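A possible bit layout for embedding the shard id in an object id; the notes only say `shard_id` is embedded in `id`, not where, so the 16-bit low field here is an assumption.

```python
SHARD_BITS = 16  # assumed width of the embedded shard id

def make_id(sequence, shard_id):
    # Pack a per-shard sequence number above the shard id bits.
    return (sequence << SHARD_BITS) | shard_id

def shard_of(oid):
    # Recover the shard id from the low bits of the object id.
    return oid & ((1 << SHARD_BITS) - 1)
```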
- Consists of multiple cache servers (together forming a `tier`).
- In-memory LRU cache stores objects, association lists, and association counts.
- A write operation on an association list with an inverse involves writing 2 shards (for `id1` and `id2`).
- The client sends the query to the cache layer, which issues the inverse write to `id2`'s shard; once that completes, the forward write is made to `id1`'s shard.
- A write failure leads to hanging associations, which are repaired by an asynchronous job.
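The two-shard write order above can be sketched as follows: the inverse edge goes to `id2`'s shard first, then the forward edge to `id1`'s shard. A crash between the two steps leaves the hanging association mentioned above. `shard_of` is a modulo stand-in for the shard id embedded in the real ids.

```python
def shard_of(oid, nshards):
    # Placeholder for the shard_id embedded in the object id.
    return oid % nshards

def assoc_add_with_inverse(shards, id1, atype, inv_atype, id2, nshards):
    # Inverse edge written to id2's shard first...
    shards[shard_of(id2, nshards)].append((id2, inv_atype, id1))
    # ...then the forward edge to id1's shard.
    shards[shard_of(id1, nshards)].append((id1, atype, id2))
```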
- A single, large tier is prone to hot spots, and its all-to-all connections grow quadratically.
- Cache split into 2 levels - one leader tier and multiple follower tiers.
- Clients communicate only with the followers.
- In the case of a read miss or a write, followers forward the request to the leader, which communicates with the storage layer.
- Eventual consistency is maintained by sending cache maintenance messages from leaders to followers.
- An object update in the leader results in an `invalidation message` to the followers.
- The leader sends a `refill message` to notify followers about an association write.
- Leaders also serialize concurrent writes and mediate thundering herds.
- Since the workload is read-intensive, read misses are serviced locally at the expense of data freshness.
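The leader-to-follower maintenance messages above can be sketched like this: an invalidation drops an object key from a follower's cache, while a refill re-propagates an association value. All names are hypothetical, and the fan-out is synchronous here, whereas the real messages are asynchronous.

```python
class Follower:
    def __init__(self, leader):
        self.leader, self.cache = leader, {}

    def read(self, key):
        if key not in self.cache:            # read miss -> go to leader
            self.cache[key] = self.leader.data.get(key)
        return self.cache[key]

class Leader:
    def __init__(self):
        self.data, self.followers = {}, []

    def write(self, key, value, is_assoc=False):
        self.data[key] = value
        for f in self.followers:             # asynchronous in the real system
            if is_assoc:
                f.cache[key] = value         # refill message
            else:
                f.cache.pop(key, None)       # invalidation message
```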
- In the multi-region configuration, there are master-slave regions for each shard and each region has its own followers, leader, and database.
- Database in the local region is a replica of the database in the master region.
- In the case of a read miss, the leader always queries the local region's database (irrespective of whether it is the master or a slave).
- In the case of a write, the leader in the local region forwards the request to the database in the master region.
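The multi-region rule above, sketched minimally: read misses always go to the local region's database, writes are forwarded to the master region. Replication lag is collapsed into a shared dict here, so this sketch cannot show stale replicas.

```python
class Region:
    def __init__(self, is_master, master=None):
        self.master = self if is_master else master
        # A slave's db is a replica view of the master's (shared here).
        self.db = {} if is_master else self.master.db

    def read_miss(self, key):
        return self.db.get(key)          # always the local database

    def write(self, key, value):
        self.master.db[key] = value      # forwarded to the master region
```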
- RAM is partitioned into `arenas` to extend the lifetime of important data types.
- For small, fixed-size items (eg association counts), a direct-mapped 8-way associative cache is maintained to avoid the use of pointers.
- Each `atype` is mapped to a 16-bit value to reduce the memory footprint.
- Load is balanced among followers through `shard cloning` (reads to a shard are served by multiple followers in a tier).
- Responses to queries include the object's access rate and version number. If the access rate is too high, the client caches the object itself; on the next query, the data is omitted from the reply if it has not changed since the previous version.
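The version-based client cache above can be sketched as follows. The hot-rate threshold, the reply shape, and all names are assumptions for illustration.

```python
HOT_RATE = 100  # hypothetical accesses/sec threshold for "hot"

def server_reply(store, key, client_version):
    # store[key] = (version, access_rate, value)
    version, rate, value = store[key]
    if client_version == version:
        return {"version": version, "rate": rate}           # data omitted
    return {"version": version, "rate": rate, "value": value}

def client_read(store, key, cache):
    cached = cache.get(key)                                 # (version, value)
    reply = server_reply(store, key, cached[0] if cached else None)
    value = reply.get("value", cached[1] if cached else None)
    if reply["rate"] >= HOT_RATE:                           # cache hot objects
        cache[key] = (reply["version"], value)
    return value
```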
- In the case of `assoc_count`, the edge direction is chosen on the basis of which node (source or destination) has the lower degree (to optimize reading the association list).
- For an `assoc_get` query, only those associations are searched where time > the object's creation time.
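The creation-time pruning rule above, as a small sketch: the lower bound of the search window is raised to the object's creation time, since no association can predate the object. Field names are assumptions.

```python
def assoc_get(assocs, obj_ctime, low, high):
    # Raise the lower bound: associations older than the object's
    # creation time cannot exist, so they are never searched.
    lo = max(low, obj_ctime)
    return [a for a in assocs if lo < a["time"] <= high]
```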
- Aggressive network timeouts are used to detect potentially failed nodes.
- In the case of master failure, one of the slaves takes over automatically.
- In the case of slave failure, cache misses are redirected to the TAO leader in the region hosting the database master.
- When a leader cache server fails, followers route read miss directly to the database and write to a replacement leader (chosen randomly from the leader tier).
- Refill and invalidation are sent asynchronously.
- If a follower is unavailable, the messages are queued on the leader's disk.
- These messages will be lost in case of leader failure.
- To maintain consistency, all the shards mapping to a failed leader are invalidated.
- Each TAO client is configured with a primary and backup follower tier.
- In normal mode, requests are made to the primary tier; in the case of its failure, requests go to the backup tier.
- Read-after-write consistency may be violated when failing over between different tiers (a read can reach the failover target before the writer's `refill` or `invalidate`).
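The failover hazard above can be shown with a tiny sketch: a write invalidates only the primary tier, so a read served by the backup tier after failover can return stale data. Names are hypothetical.

```python
class Tier:
    def __init__(self, cache):
        self.cache, self.up = cache, True

def read(primary, backup, key):
    # Clients use the primary follower tier while it is healthy,
    # and fail over to the backup tier otherwise.
    tier = primary if primary.up else backup
    return tier.cache.get(key)
```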