alright so I thought about this thing for a while but I think I got the Gist of it haha
Efficient syncing for huge lists, accounting for new items, changes and deletions.
Connection basis is between server and clients, one-to-many. Might be most efficient in websocket format, because of all the back-and-forth.
I find that the most difficult part of this problem is determining what the source of truth is, because the clients can request a sync at literally any time.
For syncing many items in a list, each item needs to have some sort of unique identifier. This could be a timestamp, UUID, maybe even a combination of the two. For the sake of this concept, let's justr say it's a combination of timestamp in ms since Unix epoch and a UUID, both individually available.
This approach for syncing uses a tree traversing method for eliminating the need for syncing items that haven't been deleted or haven't changed.
Okay, so, let's say, the server has many entries in this list. The client has none because it has never synced before.
- The client sends a
lastSync
timestamp of 0, indicating it has never synced before. Could maybe also be anull
value, but that's up to implementation. - The server checks its list for items which have a timestamp larger than client's
lastSync
, which should be all in the list. - The server sends these items.
- The client receives these items.
- The client saves its latest sync time as
lastSync
if the data was transmitted correctly.
ezpz
The server has many entries. The client does as well. The client initiates the sync. Important: the sync should account for past-deleted and past-changed items.
- The client calculates a checksum of all its items.
- The client sends this checksum to the server.
- The server calculates a checksum of all its items.
- The server compares the received checksum to the calculated one.
Here, multiple things could happen:
Nothing changed between the current client's version of the list and the server's. The server sends an acknowledgement (e.g. its calculated checksum).
Oh noes! Now what?
-
The server sends its checksum to the sync-requesting client.
-
The client sees the mismatch and splits the list into two.
-
The client calculates the checksums for both lists.
-
The client sends both checksums, along with a bit indicating which half of the split corresponds to which checksum. e.g.
{ "0": "7b163918-df86-4c2f-8a0c-36f510d716a2", "1": "c012e9ce-abb3-45db-b10c-0ffcbc0980aa" }
-
The server does the same for its list and compares those.
Again, there's multiple options that could be happening here.- Likely, when the list in question is in stack form (notably last-in) or ordered by timestamp from old to new, the checksum of the second half of the split won't match when the first doesn't, indicating deletion of an item in the first half.
If deletions only occurred in the second half and the checksum of the first half matches, the first half can logically be ignored. - If editing any item is possible and the checksum of one of the halves is changed, it is not possible to apply above logic.
- Likely, when the list in question is in stack form (notably last-in) or ordered by timestamp from old to new, the checksum of the second half of the split won't match when the first doesn't, indicating deletion of an item in the first half.
-
For each non-matching half, re-iterate the steps for this section until
- a given depth is reached, or
- the list is not splittable any further (checksums of individual items are reached).
The only difference is that the server and client should include the entire taken path on every next split, not re-sending known path checksums. e.g.
{ "00": "6435f174-d038-49d7-b1f6-798149317bb8", "01": "9240a64f-ffad-46d5-81d5-d90f6645161b", "10": "5b249bbb-dad6-4630-846a-8b7b79fe4e64", "11": "d26c5a2c-5200-49c8-ad32-2900b0552ee8" }
When the end of the tree traversing is reached, the server should figure out which items need to be sent. The most practical would be to send every item whose checksum isn't included in the client's. Data would include:
- the item
- optional: the item's checksum (mostly only practical when you're using a key/value store (key: checksum, value: item) instead of a list-esque format for transmission)
- optional: the item before this item's checksum (practical for ordering)
The client should then figure out the ordering of the items and update the items where needed, by unique identifier.
I think that's it. There's probably a more efficient way to sync lists, but this is what i came up with, and it seems decent I think