- Geographically distributed, read-optimized, graph data store.
- Favors availability and efficiency over consistency.
- Developed by and used within Facebook (social graph).
- Link to paper.
- Facebook's servers directly accessed MySQL to read/write the social graph.
- Memcache used as a look-aside cache.
- This design had several issues:
- Inefficient edge lists - A key-value store is not a good design for storing a list of edges.
- Distributed control logic - In a look-aside cache architecture, the control logic runs on the clients, which increases the number of failure modes.
- Expensive read-after-write consistency - Facebook used asynchronous master-slave replication for MySQL, which introduced a time lag before the latest data was reflected in the local replicas.
Objects
- Typed nodes (type is denoted by `otype`) - Identified by 64-bit integers.
- Contains data in the form of key-value pairs.
- Models users and repeatable actions (eg `comments`).
Associations
- Typed directed edges between objects (type is denoted by `atype`) - Identified by the source object `id1`, `atype`, and the destination object `id2`.
- Contains data in the form of key-value pairs.
- Contains a 32-bit time field.
- Models actions that happen at most once or record a state transition (eg `like`).
- Often the inverse association is also meaningful (eg `like` and `liked by`).
- Support to create, retrieve, update and delete objects and associations.
- Support to get all associations (`assoc_get`) or their count (`assoc_count`) based on starting node, time, index, and limit parameters.
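The object/association model and the queries above can be sketched as a minimal in-memory store. All class and field names here are hypothetical; the real system is a cache layered over sharded MySQL.

```python
from collections import defaultdict

class TinyTAO:
    def __init__(self):
        self.objects = {}                 # id -> (otype, {key: value})
        self.assocs = defaultdict(list)   # (id1, atype) -> [(time, id2)]

    def obj_add(self, oid, otype, data):
        self.objects[oid] = (otype, dict(data))

    def assoc_add(self, id1, atype, id2, time):
        self.assocs[(id1, atype)].append((time, id2))
        self.assocs[(id1, atype)].sort(reverse=True)   # newest first

    def assoc_get(self, id1, atype, pos=0, limit=10, low=0, high=2**31):
        # Filter by time window, then apply index/limit pagination.
        hits = [a for a in self.assocs[(id1, atype)] if low <= a[0] <= high]
        return hits[pos:pos + limit]

    def assoc_count(self, id1, atype):
        return len(self.assocs[(id1, atype)])
```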
- Objects and associations stored in MySQL.
- TAO API mapped to SQL queries.
- Data divided into logical shards.
- Objects bound to the shard for their lifetime (`shard_id` is embedded in `id`).
- Associations stored on the shard of `id1` (for faster association queries).
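A possible bit layout for embedding the shard id in an object id; the notes only say `shard_id` is embedded in `id`, not where, so the 16-bit low field here is an assumption.

```python
SHARD_BITS = 16  # assumed width of the embedded shard id

def make_id(sequence, shard_id):
    # Pack a per-shard sequence number above the shard id bits.
    return (sequence << SHARD_BITS) | shard_id

def shard_of(oid):
    # Recover the shard id from the low bits of the object id.
    return oid & ((1 << SHARD_BITS) - 1)
```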
- Consists of multiple cache servers (together forming a `tier`).
- In-memory LRU cache stores objects, association lists, and association counts.
- A write operation on an association list with an inverse involves writing 2 shards (for `id1` and `id2`).
- The client sends the query to the cache layer, which issues the inverse write to `id2`'s shard; once that completes, the forward write is made to `id1`'s shard.
- A write failure leads to hanging associations, which are repaired by an asynchronous job.
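The two-shard write order above can be sketched as follows: the inverse edge goes to `id2`'s shard first, then the forward edge to `id1`'s shard. A crash between the two steps leaves the hanging association mentioned above. `shard_of` is a modulo stand-in for the shard id embedded in the real ids.

```python
def shard_of(oid, nshards):
    # Placeholder for the shard_id embedded in the object id.
    return oid % nshards

def assoc_add_with_inverse(shards, id1, atype, inv_atype, id2, nshards):
    # Inverse edge written to id2's shard first...
    shards[shard_of(id2, nshards)].append((id2, inv_atype, id1))
    # ...then the forward edge to id1's shard.
    shards[shard_of(id1, nshards)].append((id1, atype, id2))
```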
- A single, large tier is prone to hot spots, and its all-to-all connections grow quadratically.
- Cache split into 2 levels - one leader tier and multiple follower tiers.
- Clients communicate only with the followers.
- In the case of a read miss or a write, followers forward the request to the leader, which communicates with the storage layer.
- Eventual consistency is maintained by sending cache maintenance messages from leaders to followers.
- An object update in the leader results in an `invalidation message` to the followers.
- The leader sends a `refill message` to notify followers about an association write.
- Leaders also serialize concurrent writes and mediate thundering herds.
- Since the workload is read-intensive, read misses are serviced locally at the expense of data freshness.
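The leader-to-follower maintenance messages above can be sketched like this: an invalidation drops an object key from a follower's cache, while a refill re-propagates an association value. All names are hypothetical, and the fan-out is synchronous here, whereas the real messages are asynchronous.

```python
class Follower:
    def __init__(self, leader):
        self.leader, self.cache = leader, {}

    def read(self, key):
        if key not in self.cache:            # read miss -> go to leader
            self.cache[key] = self.leader.data.get(key)
        return self.cache[key]

class Leader:
    def __init__(self):
        self.data, self.followers = {}, []

    def write(self, key, value, is_assoc=False):
        self.data[key] = value
        for f in self.followers:             # asynchronous in the real system
            if is_assoc:
                f.cache[key] = value         # refill message
            else:
                f.cache.pop(key, None)       # invalidation message
```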
- In the multi-region configuration, there are master-slave regions for each shard and each region has its own followers, leader, and database.
- Database in the local region is a replica of the database in the master region.
- In the case of a read miss, the leader always queries the local region's database (irrespective of whether it is the master or a slave).
- In the case of a write, the leader in the local region forwards the request to the database in the master region.
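The multi-region rule above, sketched minimally: read misses always go to the local region's database, writes are forwarded to the master region. Replication lag is collapsed into a shared dict here, so this sketch cannot show stale replicas.

```python
class Region:
    def __init__(self, is_master, master=None):
        self.master = self if is_master else master
        # A slave's db is a replica view of the master's (shared here).
        self.db = {} if is_master else self.master.db

    def read_miss(self, key):
        return self.db.get(key)          # always the local database

    def write(self, key, value):
        self.master.db[key] = value      # forwarded to the master region
```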
- RAM is partitioned into `arenas` to extend the lifetime of important data types.
- For small, fixed-size items (eg association counts), a direct-mapped 8-way associative cache is maintained to avoid the use of pointers.
- Each `atype` is mapped to a 16-bit value to reduce the memory footprint.
- Load is balanced among followers through `shard cloning` (reads to a shard are served by multiple followers in a tier).
- Responses to queries include the object's access rate and version number. If the access rate is too high, the client caches the object itself; on the next query, the data is omitted from the reply if it has not changed since the previous version.
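The version-based client cache above can be sketched as follows. The hot-rate threshold, the reply shape, and all names are assumptions for illustration.

```python
HOT_RATE = 100  # hypothetical accesses/sec threshold for "hot"

def server_reply(store, key, client_version):
    # store[key] = (version, access_rate, value)
    version, rate, value = store[key]
    if client_version == version:
        return {"version": version, "rate": rate}           # data omitted
    return {"version": version, "rate": rate, "value": value}

def client_read(store, key, cache):
    cached = cache.get(key)                                 # (version, value)
    reply = server_reply(store, key, cached[0] if cached else None)
    value = reply.get("value", cached[1] if cached else None)
    if reply["rate"] >= HOT_RATE:                           # cache hot objects
        cache[key] = (reply["version"], value)
    return value
```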
- In the case of `assoc_count`, the edge direction is chosen on the basis of which node (source or destination) has the lower degree (to optimize reading the association list).
- For an `assoc_get` query, only those associations are searched where time > the object's creation time.
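The creation-time pruning rule above, as a small sketch: the lower bound of the search window is raised to the object's creation time, since no association can predate the object. Field names are assumptions.

```python
def assoc_get(assocs, obj_ctime, low, high):
    # Raise the lower bound: associations older than the object's
    # creation time cannot exist, so they are never searched.
    lo = max(low, obj_ctime)
    return [a for a in assocs if lo < a["time"] <= high]
```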
- Aggressive network timeouts are used to detect potentially failed nodes.
- In the case of master failure, one of the slaves takes over automatically.
- In the case of slave failure, cache misses are redirected to the TAO leader in the region hosting the database master.
- When a leader cache server fails, followers route read miss directly to the database and write to a replacement leader (chosen randomly from the leader tier).
- Refill and invalidation are sent asynchronously.
- If a follower is unavailable, the messages are queued on the leader's disk.
- These messages will be lost in case of leader failure.
- To maintain consistency, all the shards mapping to a failed leader are invalidated.
- Each TAO client is configured with a primary and backup follower tier.
- In normal mode, requests are made to the primary tier; in the case of its failure, requests go to the backup tier.
- Read-after-write consistency may be violated when failing over between different tiers (a read can reach the failover target before the writer's `refill` or `invalidate`).
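The failover hazard above can be shown with a tiny sketch: a write invalidates only the primary tier, so a read served by the backup tier after failover can return stale data. Names are hypothetical.

```python
class Tier:
    def __init__(self, cache):
        self.cache, self.up = cache, True

def read(primary, backup, key):
    # Clients use the primary follower tier while it is healthy,
    # and fail over to the backup tier otherwise.
    tier = primary if primary.up else backup
    return tier.cache.get(key)
```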