Findings

- The PDS serves repository exports only through com.atproto.sync.getRepo, which opens the actor's SQLite DB, instantiates a SqlRepoReader, and streams the CAR via writeCarStream (packages/pds/src/api/com/atproto/sync/getRepo.ts:14-57). Because the stream is produced on the fly, no Content-Length or other size calculation is performed in this path.
- Each DAG-CBOR block that can appear in a CAR file is persisted in the actor DB's repo_block table with an explicit size column (packages/pds/src/actor-store/db/schema/repo-block.ts:3-8). The transactor records the raw byte length of every block it inserts (packages/pds/src/actor-store/repo/sql-repo-transactor.ts:33-61), so the DB already knows the exact payload size per block.
- SqlRepoReader simply pages through those rows (respecting since) and feeds their content blobs into writeCarStream (packages/pds/src/actor-store/repo/sql-repo-reader.ts:79-138). The only auxiliary method there, countBlocks, is unused; there is no helper that sums size, nor any API (e.g., getRepoStatus) that surfaces a byte estimate.
- The CAR writer adds deterministic overhead: a CBOR header plus, for each block, varint(cid.bytes.length + block.length) + cid.bytes + block. In this implementation (packages/repo/src/car.ts:16-34), repo CIDs are CIDv1/sha-256 (36 bytes), so the per-block overhead is roughly 37–39 bytes (the 36-byte CID plus a 1–3 byte varint) beyond the stored size. That means you can estimate the finished CAR as:

  total = headerLen + Σ [varintLen(36 + block.size) + 36 + block.size]

  summed over all rows selected by repoRev > since (or all rows if since is unset).
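That formula can be sketched as a small helper. Note that HEADER_LEN here is a pessimistic constant standing in for the sub-60-byte CBOR header, not its exact size, and the varint is the unsigned LEB128 encoding used by CAR framing:

```typescript
// Byte length of the unsigned LEB128 varint encoding of n.
function varintLen(n: number): number {
  let len = 1
  while (n >= 0x80) {
    n >>>= 7
    len++
  }
  return len
}

const CID_LEN = 36 // CIDv1 + sha2-256 multihash, as used for repo blocks
const HEADER_LEN = 60 // upper bound on the CAR header, not the exact size

// Estimate total CAR size from the per-block sizes stored in repo_block.
function estimateCarSize(blockSizes: number[]): number {
  let total = HEADER_LEN
  for (const size of blockSizes) {
    total += varintLen(CID_LEN + size) + CID_LEN + size
  }
  return total
}
```

For a 100-byte block, for example, the framing adds a 2-byte varint and a 36-byte CID on top of the payload.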
How to Get an Estimate

- For a specific DID, open its actor DB and run SELECT SUM(size) FROM repo_block WHERE repoRev > ? (omit the predicate for a full export). That gives the exact sum of CAR payload bytes.
- Add overhead:
  - The header is tiny (<60 bytes) because it is just {version: 1, roots: [rootCid]} encoded as CBOR, plus its length varint.
  - For each block, add 36 bytes for the CID plus the length of the preceding varint (1–5 bytes depending on block size; repo blocks are usually <128 KiB, so the varint is 2–3 bytes).
- If you need a precise value, you can reproduce the stream in memory and count bytes by piping writeCarStream into a counter, but that defeats the purpose of estimating ahead of time.
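A minimal counter for that exact-measurement route might look like the following; it assumes the CAR writer's output can be consumed as an AsyncIterable of Uint8Array chunks (the actual return type of writeCarStream should be checked in packages/repo/src/car.ts):

```typescript
// Consume a chunked byte stream and return its exact total length.
// Works with any AsyncIterable<Uint8Array>, e.g. a Node Readable
// wrapping the CAR writer's output.
async function countStreamBytes(
  stream: AsyncIterable<Uint8Array>,
): Promise<number> {
  let total = 0
  for await (const chunk of stream) {
    total += chunk.byteLength
  }
  return total
}
```

This buffers nothing, but it does pay the full cost of producing the CAR once, which is exactly what an estimate is supposed to avoid.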
Example (using Kysely inside the PDS codebase):

```ts
const totalBlockBytes = await actorDb.db
  .selectFrom('repo_block')
  .select(({ fn }) => fn.sum<number>('size').as('sum'))
  .$if(since !== undefined, (qb) => qb.where('repoRev', '>', since!))
  .executeTakeFirst()
  .then((row) => Number(row?.sum ?? 0))

const blockCount = await actorDb.db
  .selectFrom('repo_block')
  .select(({ fn }) => fn.countAll<number>().as('count'))
  .$if(since !== undefined, (qb) => qb.where('repoRev', '>', since!))
  .executeTakeFirst()
  .then((row) => Number(row?.count ?? 0))

const estimatedCarBytes =
  headerBytes + // <60 bytes for the CBOR header
  blockCount * (36 + averageVarintLen) + // per-block CID + varint overhead
  totalBlockBytes
```

Because SQLite pages, indexes, and WAL files add their own overhead, the on-disk .sqlite file will always be larger than the exported CAR. The only reliable way to relate them is through the summed repo_block.size described above. If this needs to be exposed publicly, you'd add an endpoint (or extend getRepoStatus) that runs these aggregate queries and returns the estimate before streaming.