@fry69
Last active October 25, 2025 16:23
Notes for estimating an atproto CAR file size

Findings

  • The PDS serves repository exports only through com.atproto.sync.getRepo, which opens the actor’s SQLite DB, instantiates a SqlRepoReader, and streams the CAR via writeCarStream (packages/pds/src/api/com/atproto/sync/getRepo.ts:14-57). Because the stream is produced on the fly, no Content-Length or other size calculation is performed in this path.

  • Each DAG-CBOR block that can appear in a CAR file is persisted in the actor DB’s repo_block table with an explicit size column (packages/pds/src/actor-store/db/schema/repo-block.ts:3-8). The transactor always records the raw byte length of the block when inserting rows (packages/pds/src/actor-store/repo/sql-repo-transactor.ts:33-61), so the DB already knows the exact payload size per block.

  • SqlRepoReader simply pages through those rows (respecting since) and feeds their content blobs into writeCarStream (packages/pds/src/actor-store/repo/sql-repo-reader.ts:79-138). The only auxiliary method there, countBlocks, is unused; there is no helper that sums size, nor is there any API (e.g., getRepoStatus) that surfaces a byte estimate.

  • The CAR writer adds deterministic overhead: a CBOR header plus, for each block, varint(cid.bytes.length + block.length) + cid.bytes + block. In this implementation (packages/repo/src/car.ts:16-34), repo CIDs are CIDv1/sha256 (36 bytes), so each block carries roughly 37–39 bytes of overhead beyond its stored size: 36 for the CID plus a 1–3 byte length varint. That means you can estimate the finished CAR as: total = headerLen + Σ[varintLen(36 + block.size) + 36 + block.size] for all rows selected by repoRev > since (or all rows if since is unset).
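As a sanity check on the arithmetic above, the overhead can be computed directly. This is a sketch; varintLen, perBlockOverhead, and estimateCarBytes are illustrative names, not helpers that exist in the codebase:

```typescript
// Length in bytes of an unsigned LEB128 varint, the encoding CAR uses
// for the length prefix in front of each section.
function varintLen(n: number): number {
  let len = 1
  while (n >= 0x80) {
    n = Math.floor(n / 0x80)
    len++
  }
  return len
}

// Overhead the CAR writer adds on top of one stored block:
// a varint over (36-byte CID + payload length), plus the CID bytes themselves.
function perBlockOverhead(blockSize: number): number {
  return varintLen(36 + blockSize) + 36
}

// Total for a list of repo_block sizes, matching the formula above.
// 60 bytes is used as a safe upper bound for the CAR header.
function estimateCarBytes(blockSizes: number[], headerLen = 60): number {
  return blockSizes.reduce((sum, s) => sum + perBlockOverhead(s) + s, headerLen)
}
```

For example, perBlockOverhead(50) is 37, perBlockOverhead(1000) is 38, and blocks approaching the 128 KiB range come out at 39, matching the 37–39 byte range above.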

How to Get an Estimate

  1. For a specific DID, open its actor DB and run SELECT SUM(size) FROM repo_block WHERE repoRev > ? (omit the predicate for a full export). That gives the exact sum of CAR payload bytes.

  2. Add overhead:

    • Header is tiny (<60 bytes) because it’s just {version:1, roots:[rootCid]} encoded as CBOR plus its length varint.
    • For each block, add 36 bytes for the CID plus the length of the preceding varint (1–5 bytes depending on block size; repo blocks are usually <128 KiB so the varint is 2–3 bytes).
  3. If you need a precise value, you can reproduce the stream in memory and count bytes by piping writeCarStream into a counter, but that defeats the purpose of estimating ahead of time.
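Step 3 above amounts to draining the chunk stream and counting bytes. A minimal sketch, assuming the stream is an async iterable of Uint8Array chunks (as writeCarStream produces); countCarBytes is an illustrative name:

```typescript
// Consume an async iterable of chunks and return the total byte count.
// This reproduces the entire CAR chunk-by-chunk just to measure it,
// so it is only worthwhile when an exact figure is required.
async function countCarBytes(
  stream: AsyncIterable<Uint8Array>,
): Promise<number> {
  let total = 0
  for await (const chunk of stream) {
    total += chunk.byteLength
  }
  return total
}
```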

Example (using Kysely inside the PDS codebase):

  const totalBlockBytes = await actorDb.db
    .selectFrom('repo_block')
    .select(({ fn }) => fn.sum<number>('size').as('sum'))
    .$if(since != null, (qb) => qb.where('repoRev', '>', since))
    .executeTakeFirst()
    .then((row) => Number(row?.sum ?? 0))

  const blockCount = await actorDb.db
    .selectFrom('repo_block')
    .select(({ fn }) => fn.countAll<number>().as('count'))
    .$if(since != null, (qb) => qb.where('repoRev', '>', since))
    .executeTakeFirst()
    .then((row) => Number(row?.count ?? 0))

  // headerBytes (< 60) and averageVarintLen (2-3 for typical block sizes)
  // are the deterministic CAR overheads described in the findings above.
  const estimatedCarBytes =
    headerBytes +
    blockCount * (36 + averageVarintLen) + // per-block CID + length varint
    totalBlockBytes

Because SQLite pages, indexes, and WAL files add their own overhead, the on-disk .sqlite file will always be larger than the exported CAR. The only reliable way to relate them is through the summed repo_block.size described above. If this needs to be exposed publicly, you’d add an endpoint (or extend getRepoStatus) that runs those aggregate queries and returns the estimate before streaming.
