Skip to content

Instantly share code, notes, and snippets.

@hexaclue
Created January 22, 2025 19:42
Show Gist options
  • Save hexaclue/7a2a06689515798e945dcb0321d45a02 to your computer and use it in GitHub Desktop.
Save hexaclue/7a2a06689515798e945dcb0321d45a02 to your computer and use it in GitHub Desktop.
Efficient tree traversing-based list syncing concept

alright so I thought about this thing for a while but I think I got the Gist of it haha

Goal

Efficient syncing for huge lists, accounting for new items, changes and deletions.

Context

Connection basis is between server and clients, one-to-many. Might be most efficient in websocket format, because of all the back-and-forth.

I find that the most difficult part of this problem is determining what the source of truth is, because the clients can request a sync at literally any time.

Item format

For syncing many items in a list, each item needs to have some sort of unique identifier. This could be a timestamp, UUID, maybe even a combination of the two. For the sake of this concept, let's justr say it's a combination of timestamp in ms since Unix epoch and a UUID, both individually available.

Approach

This approach for syncing uses a tree traversing method for eliminating the need for syncing items that haven't been deleted or haven't changed.

Syncing concept

Scenario 1: First time sync

Okay, so, let's say, the server has many entries in this list. The client has none because it has never synced before.

  1. The client sends a lastSync timestamp of 0, indicating it has never synced before. Could maybe also be a null value, but that's up to implementation.
  2. The server checks its list for items which have a timestamp larger than client's lastSync, which should be all in the list.
  3. The server sends these items.
  4. The client receives these items.
  5. The client saves its latest sync time as lastSync if the data was transmitted correctly.

ezpz

Scenario 2: nth time sync

The server has many entries. The client does as well. The client initiates the sync. Important: the sync should account for past-deleted and past-changed items.

  1. The client calculates a checksum of all its items.
  2. The client sends this checksum to the server.
  3. The server calculates a checksum of all its items.
  4. The server compares the received checksum to the calculated one.

Here, multiple things could happen:

✅ The checksums match

Nothing changed between the current client's version of the list and the server's. The server sends an acknowledgement (e.g. its calculated checksum).

❌ The checksums don't match

Oh noes! Now what?

  1. The server sends its checksum to the sync-requesting client.

  2. The client sees the mismatch and splits the list into two.

  3. The client calculates the checksums for both lists.

  4. The client sends both checksums, along with a bit indicating which half of the split corresponds to which checksum. e.g.

    {
      "0": "7b163918-df86-4c2f-8a0c-36f510d716a2",
      "1": "c012e9ce-abb3-45db-b10c-0ffcbc0980aa"
    }
  5. The server does the same for its list and compares those.
    Again, there's multiple options that could be happening here.

    • Likely, when the list in question is in stack form (notably last-in) or ordered by timestamp from old to new, the checksum of the second half of the split won't match when the first doesn't, indicating deletion of an item in the first half.
      If deletions only occurred in the second half and the checksum of the first half matches, the first half can logically be ignored.
    • If editing any item is possible and the checksum of one of the halves is changed, it is not possible to apply above logic.
  6. For each non-matching half, re-iterate the steps for this section until

    • a given depth is reached, or
    • the list is not splittable any further (checksums of individual items are reached).

    The only difference is that the server and client should include the entire taken path on every next split, not re-sending known path checksums. e.g.

    {
      "00": "6435f174-d038-49d7-b1f6-798149317bb8",
      "01": "9240a64f-ffad-46d5-81d5-d90f6645161b",
      "10": "5b249bbb-dad6-4630-846a-8b7b79fe4e64",
      "11": "d26c5a2c-5200-49c8-ad32-2900b0552ee8"
    }

When the end of the tree traversing is reached, the server should figure out which items need to be sent. The most practical would be to send every item whose checksum isn't included in the client's. Data would include:

  • the item
  • optional: the item's checksum (mostly only practical when you're using a key/value store (key: checksum, value: item) instead of a list-esque format for transmission)
  • optional: the item before this item's checksum (practical for ordering)

The client should then figure out the ordering of the items and update the items where needed, by unique identifier.

I think that's it. There's probably a more efficient way to sync lists, but this is what i came up with, and it seems decent I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment