jpillora/peer-data-sync.md

Last active April 12, 2017 13:48

Star (4) You must be signed in to star a gist
Fork (1) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/jpillora/6040172.js"></script>
Save jpillora/6040172 to your computer and use it in GitHub Desktop.

Download ZIP

A write up on Synchronising Data across a set of Networked Peers

Raw

peer-data-sync.md

Peer Data Synchronisation

One-way data synchronisation is trivial. Forward all mutating operations to your slave store to also perform them. Master-Slave replication complete.

Two-way data synchronisation - on the other hand - is hard. It involves conflict resolution, varying of complexity, from choosing the newest operation, most relavant operation, or using operational transforms. All of these require vaguely accurate time synchronisation which then produce another set of problems, solvable with comparing timers, using a shared time server, or using atomic clocks.

The current problem

Currently, I'm trying to synchronise session data across multiple authentication servers. I've added to the complexity of the problem by assuming that each node is transient and may only exist for a limited amount of time (i.e. AWS deciding they need to unplug my instance). Also, the connections between each node may fail.

The current solutions

The most common solution to this is Redis. Then possibly MongoDB. Redis would be the safest choice right now, it's stable and many people use it in production. By using Redis or Mongo however, we're adding an extra piece into an already complex puzzle. I think it'd be much simpler if the applications themselves could simply share a small chunk of memory?

The proposition

A node module which runs on a set of node applications, which will open a server and also become a client to each connected application, creating a network of peers, allowing them to synchronise shared "buckets" of data between them.

Goals

Maintain high performance
Allow arbitrary "backend" data-stores
Allow peers to join the network at any time and retrieve all data to date
Allow peers to ask for partial chunks of data from another specific peer

Here's what I've got so far: https://github.com/jpillora/node-peer-store though, definitely not production ready.

Questions

To the reader, I've got a few problems that I'm facing, please give your two-cents in the comments. Questions numbered for easy referencing :)

What are the bottlenecks and possible ways to solve them ?

With what I've currently build and tested, in one case, I found that a set of local servers (not a great test!), performs best at about 1.2k operations/sec. When you ramp it up though, it dives down into the 400-500 region. It would be nice to be able to detect where the bottleneck is and then attempt to balance system resources. Maybe, using compression to trade CPU for Network bandwidth, or using an in-memory data-store vs disk data-store for faster access for less capacity?

In this case, what's the best way resolve conflicts?

I'm currently attempting to synchronise time between each node (using system clocks, correcting for network latency), then when new data comes in, times are compared and the newer one wins.

Follow on question, what's the best way to synchronise time (while not being Google)?
When a new peer joins, during high volume, how do you splice them into the data flow efficiently and effectively?

I've drawn an algorithm for this though haven't implemented it. Basically, it's: join, grab all existing data, broadcast readiness, accept new data. However I'm going for 100% synchronisation and there are many gaps for packets to split by.

When a peer drops for a short time then reconnects, it will be missing some data, how do we retrieve only what's missing?

Currently, I'm storing a giant history array of all operations (local and remote), with time, operation and key, so at the moment, I can query everything between a given time period. The aforementioned algorithm will change this though, history will only store local operations, remote operations are just confirmed with a sequence number. So receiving a seq10 when you're expecting seq8 means, you missed seq8 and seq9. Will allow queries by sequence number.

I’m slightly stealing some of the nice ideas from the SLEEP specification, about using a History table to essentially allow for revisions, is there a better way?
I'd like any "backend" data-store behind this module, however each will come with different use cases, thoughts?
Scaling up. How do Google and Facebook solve these problems? So many uses hitting the same data, they have redundancy and replication, but how? Can it be applied here?

I'm guessing some form of eventual consistency... Currently, my set() method only calls back when it gets confirmation from all peers. Having more than 3-4 peers in a network would become very chatty, however, I'm sure I could do better somehow.

dominictarr commented Jul 19, 2013

This is a hard question... and depends a great deal on how your application is used in practice.
This also depends greatly on the meaning of the data - best to to avoid this entirely. For example, have append only data - where only one user/node is able to write to one set of data (like, in twitter, other people cannot write to your feed). letting the last writer win is another simple approach that works in many situations, but does have the clock skew problem.
The simplest and probably best way to synchronize time is probably to have a GPS on each node... GPS is based on synchronized clocks... otherwise, maybe you can ping-pong back and forth to pick a reasonable time...
see http://en.wikipedia.org/wiki/Network_Time_Protocol
(& 5 & 6) This is the heart of a data replication protocol. this paper describes a simple and effective approach http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf . I implemented this here https://github.com/dominictarr/scuttlebutt . There are more approaches, but this sounds like what you are looking for.
Checkout https://github.com/rvagg/node-levelup it's great because it can run in the browser too, see https://github.com/maxogden/level.js
Waiting for all the responses is not scalable. Although, it's doable within a datacenter.
Actually, what you are describing as your current approach sounds rather like zookeeper.
However, if you are intending to replicate this data into web browsers, or mobile devices, you'll want something that can be eventually consistent. This paper: http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf describes a range of eventually consistent data structures. It's very lengthy, but the ideas are very simple, and you can pick them up by skimming paper.
The thing you havn't mentioned is security. Inside a datacenter you could reasonably assume that you can trust all the nodes, since they are behind a firewall. Outside of that, anything can happen. Can nodes pretend to be other nodes, or pretend that another node's messages didn't exist? An approach more like git is potentially much more secure, but works on completely different principles.

Also, check out the dynamo paper http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf which explains how a bunch of these ideas are combined into a large scale system.

Author

jpillora commented Jul 20, 2013

@dominictarr, thanks for response:

True, I've been finding there is such a large range of use cases and many call for a completely different implementation.
For sharing session data, there aren't many cases where I need to conflict resolve, though here's one: a user hit's a page which triggers a modification of their session data, followed soon after by the user logging out, if this data is going across two data centres (AWS-A AWS-B, for example) and there is latency and bad luck, then the logout might appear first, so that's why I need to synchronise time reasonably well.
I guess it isn't too expensive to buy a GPS chip. Good idea, though in the context of a data centre, this isn't really feasible :(
I did briefly look at that paper and scuttlebutt, however since it's a "way of replicating
state that does not have strong consistency requirements", and I wanted it for small but important session data, I wanted to be sure that it was replicated. Seeing the added complexities of perfect consistency, it may be worth my time to just use "sticky sessions", so a given user would only require their data to be consistent on one node, with other nodes acting as fallbacks.
LevelUP is awesome, Rodd gave a talk at the Sydney node meetup which was very interesting. For this case, I wanted to get it working in memory first then I'd move onto persisting data, also as sesssions are quite transient, memory is a good fit at the moment. Going to be using it extensively in https://github.com/jpillora/node-king for all configuration and persisted state. On the reddit post @substack mentioned: https://github.com/dominictarr/level-merkle for "for something that works better with large amounts of data or a long history log", this seems quite promising :)
I imagined it wouldn't, I'll most likely going between data centres, so probably in the 10-15ms area. Not what I'm going to do, though with 10 connected nodes, would each operation would trigger 10x the incoming data size to be put back on the network, maybe it could be workable, though yes, it seems like I should be moving towards eventual consistency and make the application work around that.
When using dnode, I generally just think to myself "oh, I'll just add TLS later", though yes, in reality it needs to be carefully thought out. Currently my security is TCP, the fact that your can't spoof your IP and that each node is identified in memory by IP. A rogue node could always jump in and say "DELETE EVERYTHING" and I'd be screwed through that's a future problem, most likely solved with preshared keys.

Thanks for links, will look at both papers tonight, the Dynamo one looks really interesting :)

Author

jpillora commented Jul 20, 2013

Also, I forgot to ask, with scuttlebutt, it seems like the single point of failure is the server? Like clients can go down and reconnect and that's fine, though what if the server dies? Should I use it with client and server at both ends? Finally, how do I ensure that data has been replicated (on at least one node)?

Basically I'm trying to keep http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing in mind

jpillora/peer-data-sync.md

Peer Data Synchronisation

The current problem

The current solutions

The proposition

Goals

Questions

dominictarr commented Jul 19, 2013

Uh oh!

jpillora commented Jul 20, 2013

Uh oh!

jpillora commented Jul 20, 2013

Uh oh!