Open-Athena/tomat#1 v3 patch tokenizer: per-patch translated atoms, P=19, M=64, drop redundant preamble blocks

Background: tokenizer evolution so far

| Version | Dataset prefix | Codec | Size (train) | Notes |
| --- | --- | --- | --- | --- |
| v1 | `train-full` | `two_token_9_12` (2 tokens / voxel, float) | 21.1 GB | Original; replaced by LMQ. |
| v2-prelim | `train-full-lmq` | LMQ (initial fit) | 11.6 GB | Replaced by Lloyd-Max log-spaced fit (`dbb0312`). |
| v2 | `train-full-lmq-v2` | LMQ (log-spaced, vocab ~16k) | 13.4 GB | Workhorse for most 200M / 1B runs. |
| v2-vocab-32k | `train-full-lmq-v2-32k` | LMQ vocab=32k | 15.8 GB | Vocab sweep. |
| v2-vocab-65k | `train-full-lmq-v2-65k` | LMQ vocab=65k | 18.2 GB | Vocab sweep. |
| v2-lat | `train-full-lmq-v2-lat` | LMQ + 6 lattice constants in preamble (`9c3e85f`) | 13.4 GB | Current default for fresh runs. |

All v2 variants share P=14, M=32, 1 patch / sequence, sequences padded to 8192. Per-sequence preamble:

[BOS]
[GRID_START]    nx ny nz
[LATTICE_START] a b c α β γ                  ← v2-lat only
[ATOMS_START]   Z₁ … Zₙ [ATOMS_END]
[POS_START]     ⟨x₁ y₁ z₁⟩ … (frac coords, 3-byte codec) [POS_END]
[SHAPE_START]   P P P
[OFFSET_START]  ix iy iz
[HI_START]      hx hy hz                      ← (ix+P-1) mod nx (PBC wrap)
[DENS_START]    P³ density tokens (1 / voxel under LMQ)
[EOS]

GCS root: gs://marin-eu-west4/tomat/tokenized/. Codecs: gs://marin-eu-west4/tomat/codecs/lmq-v2-{16,32,65}k.npz.

Why retokenize again

Three problems with v2(-lat) we want to fix together:

Atom positions in the preamble are origin-anchored, not patch-anchored. A given atom contributes the same fractional-coord tokens to every patch from the same material, so the model has to re-learn the relative geometry per patch from the OFFSET / HI tokens. That is work the tokenizer can do directly.
SHAPE, OFFSET, and HI blocks are redundant or near-redundant once we adopt per-patch translation. P is constant across all patches in a run. OFFSET only matters for the model to "place" the patch in the global frame; once atoms are translated to the patch frame, OFFSET / HI become hints we don't need.
Voxel coverage is ~5%. With M × P³ = 32 × 14³ ≈ 87.8k voxels touched per material, against a typical volume of ~1.2M voxels, training sees at most ~7% of voxels per epoch, and closer to ~5% after accounting for coupon-collector overlap between patches. Most voxels never appear in training, no matter how many epochs. Coverage increases roughly linearly with M.

Design

Per-patch translated atoms

For a patch with low-corner offset (ix, iy, iz) and grid dims (nx, ny, nz):

frac_translated = (frac_coords - (ix/nx, iy/ny, iz/nz)) mod 1

Each patch gets its own atom-position block. Atom identities (Zs) are unchanged across patches; only the coords differ.
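
As a minimal sketch of the translation above, assuming atom positions are held as an `(n_atoms, 3)` NumPy array of fractional coordinates. The function name `translate_frac_coords` is illustrative and not the actual `patch.py` API.

```python
import numpy as np

def translate_frac_coords(frac_coords, offset, grid):
    """Shift fractional coords so the patch's low corner becomes the origin."""
    frac = np.asarray(frac_coords, dtype=np.float64)
    # Patch low corner (ix, iy, iz) expressed in fractional units of the cell.
    shift = np.asarray(offset, dtype=np.float64) / np.asarray(grid, dtype=np.float64)
    out = np.mod(frac - shift, 1.0)
    # Float rounding can make np.mod return exactly 1.0 for tiny negative
    # inputs; fold that back to 0.0 so results stay in [0, 1).
    out[out >= 1.0] = 0.0
    return out

# One atom at the cell centre, patch low corner at voxel (10, 0, 30) of a 40^3 grid.
coords = np.array([[0.5, 0.5, 0.5]])
translated = translate_frac_coords(coords, offset=(10, 0, 30), grid=(40, 40, 40))
# The atom lands at (0.25, 0.5, 0.75) in the patch frame.
```

Only the coordinates pass through this function; the Z tokens are emitted unchanged for every patch.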

Because positions are now patch-relative, OFFSET and HI become unnecessary for understanding patch geometry. Drop them.

SHAPE is also dropped: P is fixed at config-time and the same for all patches in a run, so it's pure preamble bloat. (Exception: see fallback patch below.)

`P=19³` with `(18, 19, 19)` fallback

Bumping P from 14 → 19 raises the voxel count per patch from 2744 → 6859 (2.5×). That's the dense block that actually exposes the model to density learning, so this is the high-leverage change.

Context budget: with an 8192-token context, the density block alone takes 19³ = 6859 tokens, which leaves 1333 for the preamble. At 6 lattice + 3 grid + ~2-token start/end markers + n_atoms Z tokens + 3 × n_atoms position tokens, this fits comfortably for n_atoms ≤ ~80, which covers ~99.7% of MP train materials per data/mpdb.sqlite.

For the long tail (max=154 atoms, p99=88): use a fallback patch shape (18, 19, 19) = 6498 density tokens (saves 361 tokens), to which we prepend a SHAPE block restored just for the fallback case. SHAPE remains absent in the default-P case.

This introduces two preamble shapes the model will see, which is a mild bimodality concern. Mitigation: the fallback is rare (≪ 1% of mats based on n_atoms distribution; need to confirm with a tokenize dry-run), and the only delta is the SHAPE block, which is small and salient.
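
The budget arithmetic and the fallback rule can be sketched as below. This is a rough illustration under assumptions: the v3 preamble is taken to be the v2-lat layout minus OFFSET / HI (and minus SHAPE in the default case), and the position cost per atom is a parameter defaulting to the 3 tokens quoted above. The real codec may use more tokens per atom, so the exact cut-over point should come from a tokenize dry-run. All names here are hypothetical.

```python
from math import prod

CONTEXT = 8192
DEFAULT_SHAPE = (19, 19, 19)
FALLBACK_SHAPE = (18, 19, 19)

def preamble_tokens(n_atoms, with_shape, pos_tokens_per_atom=3):
    """Token count of everything except the density block (assumed layout)."""
    fixed = (
        1        # BOS
        + 1 + 3  # GRID_START nx ny nz
        + 1 + 6  # LATTICE_START a b c alpha beta gamma
        + 2      # ATOMS_START / ATOMS_END
        + 2      # POS_START / POS_END
        + 1      # DENS_START
        + 1      # EOS
    )
    shape_block = 4 if with_shape else 0   # SHAPE_START + 3 dims, fallback only
    per_atom = 1 + pos_tokens_per_atom     # one Z token plus position tokens
    return fixed + shape_block + per_atom * n_atoms

def choose_patch_shape(n_atoms, pos_tokens_per_atom=3):
    """Return (shape, with_shape_block) for the first layout that fits the context."""
    for shape, with_shape in ((DEFAULT_SHAPE, False), (FALLBACK_SHAPE, True)):
        total = preamble_tokens(n_atoms, with_shape, pos_tokens_per_atom) + prod(shape)
        if total <= CONTEXT:
            return shape, with_shape
    raise ValueError(f"{n_atoms} atoms do not fit in {CONTEXT} tokens, even with the fallback shape")
```

Keeping the shape decision in one function means the trainer and the eval script can call the same rule and stay in agreement about when the SHAPE block is present.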

`M=64`

Doubles voxel coverage per epoch from ~5% → ~10%. Coverage is linear in M, so this is the cheapest knob. Together with P = 14 → 19 (2.5× patch volume), per-material coverage goes from 32 × 14³ / V → 64 × 19³ / V, i.e. ≈ 5× more voxels seen per epoch.

Size estimate for train-full-v3: v2-lat is 13.4 GB at P=14, M=32. v3 is roughly 2.5× × 2× = ~5× larger ≈ 65–70 GB.
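
The coverage and size arithmetic can be checked with a few lines. The unique-coverage function is an idealized model, assuming M patches placed independently and uniformly with periodic wrap in a single grid of V voxels. The real figure depends on each material's grid size, so treat it as a sanity check rather than a prediction.

```python
def raw_voxels(p, m):
    """Voxels touched per material, counting overlaps more than once."""
    return m * p ** 3

def expected_unique_fraction(p, m, volume):
    """Expected fraction of distinct voxels covered by m random periodic patches."""
    q = p ** 3 / volume             # chance one patch covers a given voxel
    return 1.0 - (1.0 - q) ** m     # complement of "missed by all m patches"

V = 1_200_000                       # typical volume quoted above
v2 = raw_voxels(14, 32)             # 87,808
v3 = raw_voxels(19, 64)             # 438,976
ratio = v3 / v2                     # just under 5x
size_v3_gb = 13.4 * ratio           # scale the v2-lat size by the voxel ratio

print(f"raw coverage v2: {v2 / V:.1%}, v3: {v3 / V:.1%}")
print(f"expected unique v2: {expected_unique_fraction(14, 32, V):.1%}, "
      f"v3: {expected_unique_fraction(19, 64, V):.1%}")
print(f"estimated train-full-v3 size: {size_v3_gb:.0f} GB")
```

The size estimate ignores preamble tokens, which is reasonable here because the density block dominates each sequence.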

Pluggable sample weighting

Sampler interface in the tokenizer that picks M patches per material according to a configurable weight:

uniform — every patch eligible (subject to PBC wrap), uniform random. Default for the v3 a2a vs v2-lat baseline.
electrons — patch weight ∝ Σ ρ in patch (proxy for "fraction of material's electrons in this region").
(Future, opt-in) other weightings — TBD as we learn from results.

Why uniform-by-default for the first v3 run: we want to attribute any NMAE delta vs v2-lat to the tokenization changes (translation, P=19, M=64, dropped preamble blocks), not confound it with a weighting change. The sampling-weight study is its own issue.
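
One possible shape for the sampler interface, as a sketch only. It assumes the density is available as a 3D NumPy array `rho`, that every voxel is a valid low corner under periodic wrap, and that patches are drawn with replacement. The names `WEIGHT_STRATEGIES` and `sample_patch_offsets` are hypothetical, not the planned `sampler.py` API.

```python
import numpy as np

def periodic_box_sum(rho, shape):
    """Sum of rho over the periodic patch whose low corner is at each voxel."""
    out = np.asarray(rho, dtype=np.float64)
    for axis, p in enumerate(shape):
        # Window sum along one axis: add the array shifted by 0..p-1 with wraparound.
        out = sum(np.roll(out, -k, axis=axis) for k in range(p))
    return out

def uniform_weights(rho, shape):
    """Every low-corner offset is equally likely."""
    return np.ones(np.shape(rho), dtype=np.float64)

def electron_weights(rho, shape):
    """Weight each offset by the total density inside its patch."""
    return np.clip(periodic_box_sum(rho, shape), 0.0, None)

# New strategies plug in by adding an entry here.
WEIGHT_STRATEGIES = {"uniform": uniform_weights, "electrons": electron_weights}

def sample_patch_offsets(rho, shape, m, strategy="uniform", rng=None):
    """Draw m low-corner offsets (ix, iy, iz), with replacement, as an (m, 3) array."""
    rng = np.random.default_rng() if rng is None else rng
    weights = WEIGHT_STRATEGIES[strategy](rho, shape).ravel()
    probs = weights / weights.sum()
    flat = rng.choice(weights.size, size=m, replace=True, p=probs)
    return np.stack(np.unravel_index(flat, np.shape(rho)), axis=1)
```

Passing an explicit `rng` (for example `np.random.default_rng(seed)`) keeps patch selection reproducible across tokenize reruns.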

Coverage histogram artifact

Add a coverage_histogram.json per-worker in the tokenize job output: how many voxels were covered exactly 0, 1, 2, … times by the M sampled patches per material. Aggregate across workers post-hoc. This is the ground truth on coverage and lets us validate the coupon-collector estimate empirically.
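
A sketch of how the histogram could be computed and aggregated, assuming each material's sampled offsets are available as `(ix, iy, iz)` tuples. The JSON schema (times-covered as string keys, voxel counts as values) is an assumption, as are all function names.

```python
import json
from collections import Counter

import numpy as np

def coverage_counts(grid, offsets, shape):
    """Count how many sampled patches touch each voxel, with periodic wrap."""
    counts = np.zeros(grid, dtype=np.int64)
    for offset in offsets:
        axes = [np.arange(o, o + p) % n for o, p, n in zip(offset, shape, grid)]
        counts[np.ix_(*axes)] += 1  # indices are unique per patch when p <= n
    return counts

def coverage_histogram(counts):
    """Map times-covered -> number of voxels for one material."""
    values, freqs = np.unique(counts, return_counts=True)
    return Counter({int(v): int(f) for v, f in zip(values, freqs)})

def merge_histograms(histograms):
    """Aggregate per-material or per-worker histograms by summing voxel counts."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total

def dump_histogram(hist, path):
    """Write the histogram as JSON; keys become strings, as JSON requires."""
    with open(path, "w") as f:
        json.dump({str(k): v for k, v in sorted(hist.items())}, f)
```

The fraction of voxels seen at least once is then `1 - hist[0] / sum(hist.values())`, which is the number to compare against the coupon-collector estimate.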

Implementation surface

src/tomat/tokenizers/patch.py — translate atom coords, drop SHAPE / OFFSET / HI blocks (default), restore SHAPE for fallback shape.
src/tomat/tokenizers/sampler.py (new) — pluggable weight strategy.
scripts/tokenize_patches.py — wire up new flags / config.
marin/eval_mat_nmae.py — update preamble construction for v3 to match training format.
marin/train_tomat_tpu.py — read v3 meta.json shape (no SHAPE etc.).

Datasets to produce

| Label | Set | Size estimate |
| --- | --- | --- |
| `train-full-v3` | train | ~65–70 GB |
| `val-full-v3` | val | ~3–4 GB |
| `test-full-v3` | test | (deferred; see MPDB issue) |

Tokenization runs on Modal and should take on the order of hours.

Pre-registered comparison

Apples-to-apples (a2a) comparison: train-full-v3 vs train-full-lmq-v2-lat, both at 200M and 1B, with matched LR / BS / steps, on the same eval set (val_200, mat-NMAE).

Hypothesis: v3 reduces NMAE by ≥ 0.3 percentage points at the matched budget, primarily from increased voxel coverage.

If v3 doesn't beat v2-lat, that itself is informative: it tells us the remaining gap to ChargE3Net (0.523%) is mostly architecture / data-prior, not coverage / tokenization.

Out of scope

Equivariance (architecture, not tokenization).
Charge conservation hooks (post-process / loss).
Rotation augmentation (orthogonal to tokenization).
v3 sampling-weight ablations (separate issue).
Test-set tokenize and eval (deferred until MPDB has test n_atoms).
