@pesterhazy
Last active November 15, 2024 10:17
Building an offline realtime sync engine

So you want to write a sync system for a web app with offline and realtime support? Good luck. You might find the following resources useful.

Overview articles

Sync engines in the industry

Off-the-shelf sync systems

Every app has slightly different needs. So my guess is that you will need to build your own system. But it's worth taking a look at off-the-shelf sync engines, either as buy-over-build or to steal ideas:

  • Firebase

    Confusingly, Google offers two realtime databases under the Firebase brand: Cloud Firestore and Firebase Realtime Database. Lots of good ideas, but they're hard to disentangle from the marketing copy.

  • CouchDB / PouchDB

    CouchDB is a database based around replication. The CouchDB book, while a bit dated, is well written and worth reading.

    PouchDB offers an in-browser database with server replication, and did so before it was cool. It's still pretty impressive today, and free software.

  • RxDB

    RxDB is a local-first, NoSQL-database for JavaScript Applications like Websites, hybrid Apps, Electron-Apps, Progressive Web Apps, Deno and Node.js.

  • Fluid Framework

    Fluid Framework is a "collection of client libraries for distributing and synchronizing shared state. These libraries allow multiple clients to simultaneously create and operate on shared data structures using coding patterns similar to those used to work with local data." Fluid is developed by Microsoft.

  • WatermelonDB - an offline-first local database for React and React Native, backed by IndexedDB or SQLite

    WatermelonDB offers a client-side sync implementation and API; you still need to write your own backend.

  • AWS AppSync

    AppSync is AWS's attempt at offering automated realtime GraphQL updates. It's one of the weirder AWS offerings. As is often the case with marketing-driven AWS products, it's hard to figure out from the docs whether it's a viable option for a given problem.

  • fireproof

    Fireproof is an embedded database for collaborative applications. Install it in your front-end app, or use it in any serverless cloud or edge function. Fireproof’s document API includes live updates, flexible queries, binary attachments, encrypted block replication, and multi-user sync. Fireproof enables developers to ship interactive features faster in any deployment environment.

  • Electric SQL

    SQLite in the browser, automatically and seamlessly synced with Postgres. That's a big promise – can ElectricSQL deliver?

  • PowerSync

    Another offering for syncing Postgres with a local SQLite.

  • MongoDB Device Sync

    Syncs a MongoDB database with local devices based on the Realm embedded database

  • Instant

    Sync system as a service inspired by Clojure and Datomic

Databases

When you're building a sync engine, you're essentially building a database with replication - whether you realize it or not. So it's a good idea to review some of the literature on databases and replication.

  • Datomic

    Learn as much as you can about Datomic - Datalog vs SQL, inserts/updates as pure data structures, pull syntax, EAV tuples, immutable facts, database as a value, unbundling the database. There's so much to learn. Datomic may not be the right database for your backend (although maybe it is? Check it out) but it's without a doubt one of the best-designed systems out there.

    Datomic with Rich Hickey

    The docs are excellent.

    Datascript is a client-side version of Datomic (but there's no built-in sync engine).

  • Postgres is the most mature RDBMS out there (but doesn't by itself help you replicate data to the client). There's a lot to learn from decades of research and production experience. The chapter in the Postgres manual on isolation levels is excellent.
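
To make a few of the Datomic ideas listed above more concrete (EAV tuples, immutable facts, database as a value), here is a toy sketch in Python. It is not Datomic's API: the names `Fact`, `transact` and `entity_as_of` are made up for illustration, and it assumes every attribute holds a single value.

```python
from collections import namedtuple

# One immutable fact: entity, attribute, value, transaction number, and
# whether the fact is asserted (True) or retracted (False).
# Hypothetical shape for illustration only.
Fact = namedtuple("Fact", "entity attribute value tx added")


def transact(log, tx, changes):
    """Return a NEW log with the changes appended; older logs stay valid."""
    return log + tuple(Fact(e, a, v, tx, added) for (e, a, v, added) in changes)


def entity_as_of(log, entity, tx):
    """Rebuild one entity's attributes from all facts up to transaction tx."""
    attrs = {}
    for fact in log:
        if fact.entity != entity or fact.tx > tx:
            continue
        if fact.added:
            attrs[fact.attribute] = fact.value
        elif attrs.get(fact.attribute) == fact.value:
            # A retraction removes the value only if it is the current one.
            del attrs[fact.attribute]
    return attrs


# Transaction 1 creates a todo; transaction 2 marks it as done by
# retracting the old value and asserting the new one.
db1 = transact((), 1, [("todo-1", "title", "Buy milk", True),
                       ("todo-1", "done", False, True)])
db2 = transact(db1, 2, [("todo-1", "done", False, False),
                        ("todo-1", "done", True, True)])

before = entity_as_of(db2, "todo-1", 1)
after = entity_as_of(db2, "todo-1", 2)
```

Because nothing is overwritten, `db1` is still a complete, consistent value after `db2` exists, and the state as of any earlier transaction can be recomputed from the log.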

SaaS in the sync space

Some SaaS products are trying to offer multiplayer, syncing, caching, etc. as a service. No clear winner yet. But even if you're not going to use these, it's worth reading the API docs for inspiration.

  • replicache "is a JavaScript framework for building high-performance, offline-capable, collaborative web apps"
  • reflect "is a high-performance sync for multiplayer web apps"
  • Liveblocks "is a set of APIs and tools that helps you create performant and reliable multiplayer experiences in minutes"
  • Supabase

Realtime updates

The simplest way to replicate data is for the client to poll for updates every second or so. But that introduces unnecessary server load and latencies of up to a second or more, so eventually you will want to switch to a push model. On the web that usually means using WebSockets or Server-Sent Events to allow the server to push updates to the clients as they arrive.

Notice that realtime updates are conceptually distinct from data replication - you can simplify the system by building a polling-based solution first and adding push updates as a separate step. The realtime update can take the form of a lightweight shoulder tap, a message that wakes up the client, causing it to pull the latest updates from the server. Realtime notifications can be seen as optional - when the WebSocket connection is not available, you can fall back to polling.
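
The separation between replication and notification can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the in-memory `Server` stands in for a real backend, the cursor is simply a position in an append-only change log, and the names `changes_since`, `on_poll_timer` and `on_shoulder_tap` are invented. The point is that the poll timer and the shoulder tap run the same pull.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """In-memory stand-in for the backend: an append-only change log."""
    log: list = field(default_factory=list)

    def write(self, key, value):
        self.log.append((key, value))

    def changes_since(self, cursor):
        """Return the changes after `cursor` and the new cursor."""
        return self.log[cursor:], len(self.log)


@dataclass
class Client:
    server: Server
    cursor: int = 0
    data: dict = field(default_factory=dict)

    def pull(self):
        """Fetch and apply everything this client has not seen yet."""
        changes, self.cursor = self.server.changes_since(self.cursor)
        for key, value in changes:
            self.data[key] = value
        return len(changes)

    # Both triggers share one code path; push only reduces latency.
    def on_poll_timer(self):
        return self.pull()

    def on_shoulder_tap(self):
        return self.pull()


server = Server()
client = Client(server)

server.write("title", "Draft")
applied_by_poll = client.on_poll_timer()

server.write("title", "Final")
applied_by_tap = client.on_shoulder_tap()
applied_again = client.on_shoulder_tap()  # a duplicate tap is harmless
```

Since the tap carries no data, a lost or duplicated notification cannot corrupt client state; the worst case is that the client catches up on the next poll.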

Instead of rolling your own WebSocket server, consider using a hosted service like pusher.com or ably.com. They do all the heavy lifting for you (like keeping thousands of simultaneous TCP connections open) and provide a request/response style interface for your server to send messages to connected clients.

Literature

  • Differential Synchronization, Neil Fraser, 2009 (link)

    There's also a YouTube talk by the author on the same topic.

  • Kleppmann, Designing Data-Intensive Applications, O'Reilly (2017)

    If you don't have this book yet, drop everything and read it now. Every chapter is full of insights and summaries of how to use and build databases, much of it applicable to sync engines.

Local-first software is a rebranding of offline-first systems. The name emphasizes the philosophical stance that data lives on your device first and foremost, with implications for privacy, ownership and decentralization.

  • Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan: “Local-first software: You own your data, in spite of the cloud”. ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! ’19), October 2019 (link)

Conflict-free replicated data types (CRDTs)

I recommend learning about CRDTs. Not because a sync system should be built on CRDTs (it probably shouldn't unless you're building a truly decentralized, peer-to-peer system), but because everyone's talking about them and it's useful to understand their limitations (and strengths). Also, the literature contains many concepts that are helpful even if you don't need costly decentralization because you can rely on a centralized server.

While CRDTs offer advantages like operation without a centralized server, they often suffer from problems with regard to performance, memory usage and implementation complexity.
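
For a feel of what "conflict-free" means in practice, here is a minimal sketch of one of the simplest CRDTs from the literature, a grow-only counter. The function names and replica IDs are illustrative, not taken from any library. Each replica increments only its own slot, and merging takes the per-replica maximum, so merges can happen in any order and any number of times.

```python
def increment(state, replica_id, amount=1):
    """Return a new state with this replica's own slot increased."""
    new_state = dict(state)
    new_state[replica_id] = new_state.get(replica_id, 0) + amount
    return new_state


def merge(a, b):
    """Combine two states by taking the maximum count per replica."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(state):
    """The counter's value is the sum over all replica slots."""
    return sum(state.values())


# Two devices count independently while offline.
phone = increment(increment({}, "phone"), "phone")
laptop = increment({}, "laptop")

merged_ab = merge(phone, laptop)
merged_ba = merge(laptop, phone)         # order does not matter
merged_twice = merge(merged_ab, laptop)  # re-delivery does not matter
total = value(merged_ab)
```

Note that the state keeps one slot per replica that has ever incremented the counter, a small-scale version of the memory overhead mentioned above.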

  • Martin Kleppmann, Alastair R. Beresford: "A Conflict-Free Replicated JSON Datatype". IEEE Transactions on Parallel and Distributed Systems 28(10):2733–2746, April 2017 (link)

    Very accessible paper on what ended up being published as automerge

  • Marijn Haverbeke, Collaborative Editing in CodeMirror

    Pragmatic discussion of the (according to the author) false dichotomy between OT and CRDTs, in the context of collaborative text editing

Many more papers and other resources are available on crdt.tech.

Distributed systems

CS students study this topic in college. If, like me, you skipped this part of your education, it's worth learning the basic theory to get a better overview of the problem space.

I recommend Lindsey Kuper's lectures at UC Santa Cruz, which she's generously made available on YouTube. The course also has a website.

It's fun and you'll learn about Lamport diagrams and consistency models.

There's also aphyr's braindump of interesting ideas in distributed systems. He also has a nice page describing consistency models.

IndexedDB

If you're storing data in a browser for offline use, it's probably going to end up in IndexedDB. Here's a lot more information about this janky corner of the web platform.

@amjadbouhouch

Thanks for all of this

@kobiebotha

Update: The entry for PowerSync should be "Another offering for syncing Postgres, MongoDB or MySQL with a local SQLite.", since two additional backend DB connectors have been added.

@zarkone

zarkone commented Nov 15, 2024

hey Paulus! Thank you!

I'm currently working in https://anytype.io/ where we are building a distributed, p2p sync engine on top of https://github.com/anyproto/any-sync -- just decided to mention in case it deserves a place in your wonderful, comprehensive list 😊

An open-source protocol designed to create high-performance, local-first, peer-to-peer, end-to-end encrypted applications that facilitate seamless collaboration among multiple users and devices
