@fry69
Last active October 25, 2025 16:23
Notes for estimating an atproto CAR file size

Findings

  • The PDS serves repository exports only through com.atproto.sync.getRepo, which opens the actor’s SQLite DB, instantiates a SqlRepoReader, and streams the CAR via writeCarStream (packages/pds/src/api/com/atproto/sync/getRepo.ts:14-57). Because the stream is produced on the fly, no Content-Length or other size calculation is performed in this path.

  • Each DAG-CBOR block that can appear in a CAR file is persisted in the actor DB’s repo_block table with an explicit size column (packages/pds/src/actor-store/db/schema/repo-block.ts:3-8). The transactor always records the raw byte length of the block when inserting rows (packages/pds/src/actor-store/repo/sql-repo-transactor.ts:33-61), so the DB already knows the exact payload size per block.

  • SqlRepoReader simply pages through those rows (respecting since) and feeds their content blobs into writeCarStream (packages/pds/src/actor-store/repo/sql-repo-reader.ts:79-138). The only auxiliary method there, countBlocks, is unused; there is no helper that sums size, nor is there any API (e.g., getRepoStatus) that surfaces a byte estimate.

  • The CAR writer adds deterministic overhead: a CBOR header plus, for each block, varint(cid.bytes.length + block.length) + cid.bytes + block. In this implementation (packages/repo/src/car.ts:16-34), repo CIDs are CIDv1/sha256 (36 bytes), so each block carries roughly 37–39 bytes of overhead beyond its stored size: 36 for the CID plus a 1–3 byte length varint. That means you can estimate the finished CAR as: total = headerLen + Σ[varintLen(36 + block.size) + 36 + block.size] for all rows selected by repoRev > since (or all rows if since is unset).
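As a sanity check on the arithmetic above, the overhead can be computed directly. This is a sketch; varintLen, perBlockOverhead, and estimateCarBytes are illustrative names, not helpers that exist in the codebase:

```typescript
// Length in bytes of an unsigned LEB128 varint, the encoding CAR uses
// for the length prefix in front of each section.
function varintLen(n: number): number {
  let len = 1
  while (n >= 0x80) {
    n = Math.floor(n / 0x80)
    len++
  }
  return len
}

// Overhead the CAR writer adds on top of one stored block:
// a varint over (36-byte CID + payload length), plus the CID bytes themselves.
function perBlockOverhead(blockSize: number): number {
  return varintLen(36 + blockSize) + 36
}

// Total for a list of repo_block sizes, matching the formula above.
// 60 bytes is used as a safe upper bound for the CAR header.
function estimateCarBytes(blockSizes: number[], headerLen = 60): number {
  return blockSizes.reduce((sum, s) => sum + perBlockOverhead(s) + s, headerLen)
}
```

For example, perBlockOverhead(50) is 37, perBlockOverhead(1000) is 38, and blocks approaching the 128 KiB range come out at 39, matching the 37–39 byte range above.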

How to Get an Estimate

  1. For a specific DID, open its actor DB and run SELECT SUM(size) FROM repo_block WHERE repoRev > ? (omit the predicate for a full export). That gives the exact sum of CAR payload bytes.

  2. Add overhead:

    • Header is tiny (<60 bytes) because it’s just {version:1, roots:[rootCid]} encoded as CBOR plus its length varint.
    • For each block, add 36 bytes for the CID plus the length of the preceding varint (1–5 bytes depending on block size; repo blocks are usually <128 KiB so the varint is 2–3 bytes).
  3. If you need a precise value, you can reproduce the stream in memory and count bytes by piping writeCarStream into a counter, but that defeats the purpose of estimating ahead of time.
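Step 3 above amounts to draining the chunk stream and counting bytes. A minimal sketch, assuming the stream is an async iterable of Uint8Array chunks (as writeCarStream produces); countCarBytes is an illustrative name:

```typescript
// Consume an async iterable of chunks and return the total byte count.
// This reproduces the entire CAR chunk-by-chunk just to measure it,
// so it is only worthwhile when an exact figure is required.
async function countCarBytes(
  stream: AsyncIterable<Uint8Array>,
): Promise<number> {
  let total = 0
  for await (const chunk of stream) {
    total += chunk.byteLength
  }
  return total
}
```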

Example (using Kysely inside the PDS codebase):

  const totalBlockBytes = await actorDb.db
    .selectFrom('repo_block')
    .select(({ fn }) => fn.sum<number>('size').as('sum'))
    .$if(since != null, (qb) => qb.where('repoRev', '>', since))
    .executeTakeFirst()
    .then((row) => Number(row?.sum ?? 0))

  const blockCount = await actorDb.db
    .selectFrom('repo_block')
    .select(({ fn }) => fn.countAll<number>().as('count'))
    .$if(since != null, (qb) => qb.where('repoRev', '>', since))
    .executeTakeFirst()
    .then((row) => Number(row?.count ?? 0))

  // headerBytes (< 60) and averageVarintLen (2-3 for typical block sizes)
  // are the deterministic CAR overheads described in the findings above.
  const estimatedCarBytes =
    headerBytes +
    blockCount * (36 + averageVarintLen) + // per-block CID + length varint
    totalBlockBytes

Because SQLite pages, indexes, and WAL files add their own overhead, the on-disk .sqlite file will always be larger than the exported CAR. The only reliable way to relate them is through the summed repo_block.size described above. If this needs to be exposed publicly, you’d add an endpoint (or extend getRepoStatus) that runs those aggregate queries and returns the estimate before streaming.
