@ryan-williams
Created May 6, 2026 00:52

Open-Athena/tomat#1 v3 patch tokenizer: per-patch translated atoms, P=19, M=64, drop redundant preamble blocks

Background: tokenizer evolution so far

| Version | Dataset prefix | Codec | Size (train) | Notes |
|---|---|---|---|---|
| v1 | train-full | two_token_9_12 (2 tokens / voxel, float) | 21.1 GB | Original; replaced by LMQ. |
| v2-prelim | train-full-lmq | LMQ (initial fit) | 11.6 GB | Replaced by Lloyd-Max log-spaced fit (dbb0312). |
| v2 | train-full-lmq-v2 | LMQ (log-spaced, vocab ~16k) | 13.4 GB | Workhorse for most 200M / 1B runs. |
| v2-vocab-32k | train-full-lmq-v2-32k | LMQ vocab=32k | 15.8 GB | Vocab sweep. |
| v2-vocab-65k | train-full-lmq-v2-65k | LMQ vocab=65k | 18.2 GB | Vocab sweep. |
| v2-lat | train-full-lmq-v2-lat | LMQ + 6 lattice constants in preamble (9c3e85f) | 13.4 GB | Current default for fresh runs. |

All v2 variants share P=14 (patch edge, in voxels), M=32 (patches per material), 1 patch / sequence, with sequences padded to 8192 tokens. Per-sequence preamble:

[BOS]
[GRID_START]    nx ny nz
[LATTICE_START] a b c α β γ                  ← v2-lat only
[ATOMS_START]   Z₁ … Zₙ [ATOMS_END]
[POS_START]     ⟨x₁ y₁ z₁⟩ … (frac coords, 3-byte codec) [POS_END]
[SHAPE_START]   P P P
[OFFSET_START]  ix iy iz
[HI_START]      hx hy hz                      ← (ix+P-1) mod nx (PBC wrap)
[DENS_START]    P³ density tokens (1 / voxel under LMQ)
[EOS]

GCS root: gs://marin-eu-west4/tomat/tokenized/. Codecs: gs://marin-eu-west4/tomat/codecs/lmq-v2-{16,32,65}k.npz.

Why retokenize again

Three problems with v2(-lat) we want to fix together:

  1. Atom positions in the preamble are origin-anchored, not patch-anchored. A given atom contributes the same fractional-coord tokens to every patch from the same material, so the model has to re-learn the relative geometry per patch from the OFFSET / HI tokens. That is a lot to ask of the model when the tokenizer can solve it directly.
  2. SHAPE, OFFSET, and HI blocks are redundant or near-redundant once we adopt per-patch translation. P is constant across all patches in a run. OFFSET only lets the model "place" the patch in the global frame; once atoms are translated to the patch frame, OFFSET / HI become placement hints we don't need.
  3. Voxel coverage is ~5%. With M=32 patches of P³ = 14³ voxels each, 87.8k voxels are touched per material; against a typical ~1.2M-voxel volume, training sees at most ~7% of voxels per epoch, and closer to ~5% after accounting for coupon-collector overlap between patches. Most voxels never appear in training, no matter how many epochs. Coverage increases roughly linearly in M.
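The coverage arithmetic in item 3 can be sketched as follows. This is a rough model, not the tokenizer's own accounting: it assumes patches are placed independently and uniformly at random, and it uses a single 1.2M-voxel volume as a stand-in for the real distribution of grid sizes, so its output is illustrative and need not match the ~5% figure quoted above. The function names are hypothetical.

```python
def naive_coverage(patch_edge: int, n_patches: int, n_voxels: int) -> float:
    """Fraction of voxels touched if no two patches overlapped (upper bound)."""
    return n_patches * patch_edge**3 / n_voxels


def expected_coverage(patch_edge: int, n_patches: int, n_voxels: int) -> float:
    """Expected covered fraction for independent, uniformly placed patches.

    Each voxel lies inside a given patch with probability p = P^3 / V, so it
    is missed by all M patches with probability (1 - p)^M.
    """
    p = patch_edge**3 / n_voxels
    return 1.0 - (1.0 - p) ** n_patches


V = 1_200_000  # illustrative volume taken from the text; real grids vary
print(f"voxels touched: {32 * 14**3}")  # 87808
print(f"naive coverage: {naive_coverage(14, 32, V):.3f}")
print(f"with overlap:   {expected_coverage(14, 32, V):.3f}")
```

The coverage histogram described later in this document is the empirical check on any such estimate.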

Design

Per-patch translated atoms

For a patch with low-corner offset (ix, iy, iz) and grid dims (nx, ny, nz):

frac_translated = (frac_coords - (ix/nx, iy/ny, iz/nz)) mod 1

Each patch gets its own atom-position block. Atom identities (Zs) are unchanged across patches; only the coords differ.
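A minimal sketch of the translation above, in plain Python. The function name and argument layout are hypothetical, not the API of src/tomat/tokenizers/patch.py; the guard against a result of exactly 1.0 is an added assumption to keep outputs in [0, 1) under floating-point rounding.

```python
def translate_frac_coords(frac_coords, offset, grid):
    """Shift fractional coords so the patch's low corner becomes the origin.

    frac_coords: iterable of (x, y, z) fractional coordinates in [0, 1)
    offset:      (ix, iy, iz) low-corner voxel index of the patch
    grid:        (nx, ny, nz) grid dimensions
    """
    shifts = [i / n for i, n in zip(offset, grid)]

    def wrap(value):
        wrapped = value % 1.0
        # A tiny negative input can round up to exactly 1.0; fold it to 0.0.
        return 0.0 if wrapped == 1.0 else wrapped

    return [
        tuple(wrap(c - s) for c, s in zip(atom, shifts))
        for atom in frac_coords
    ]


atoms = [(0.5, 0.5, 0.5), (0.0, 0.25, 0.75)]
print(translate_frac_coords(atoms, offset=(25, 0, 50), grid=(100, 100, 100)))
# [(0.25, 0.5, 0.0), (0.75, 0.25, 0.25)]
```

The second atom shows the periodic wrap: its x coordinate sits "behind" the patch corner and reappears at 0.75.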

Because positions are now patch-relative, OFFSET and HI become unnecessary for understanding patch geometry. Drop them.

SHAPE is also dropped: P is fixed at config-time and the same for all patches in a run, so it's pure preamble bloat. (Exception: see fallback patch below.)

P=19³ with (18, 19, 19) fallback

Bumping P from 14 → 19 raises the number of density voxels per patch from 2744 → 6859 (2.5×). That dense block is what actually exposes the model to density learning, so this is the high-leverage change.

Context budget: of 8192 tokens, 19³ = 6859 go to density, leaving 1333 for the preamble. At 6 lattice + 3 grid + ~2-token start/end markers + n_atoms Z tokens + 3 × n_atoms position tokens, this fits comfortably for n_atoms ≤ ~80, which covers ~99.7% of MP train mats per data/mpdb.sqlite.

For the long tail (max = 154 atoms, p99 = 88), use a fallback patch shape of (18, 19, 19) = 6498 density tokens, which frees 361 tokens, and prepend a SHAPE block that is restored only for the fallback case. SHAPE remains absent in the default-P case.

This introduces two preamble shapes the model will see, which is a mild bimodality concern. Mitigation: the fallback is rare (≪ 1% of mats based on n_atoms distribution; need to confirm with a tokenize dry-run), and the only delta is the SHAPE block, which is small and salient.
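The default-versus-fallback decision can be sketched as a token-budget check. Everything beyond the numbers quoted above is an assumption: the function names, the count of 9 marker tokens, the 4-token SHAPE block, and the position tokens per atom, which is exposed as a parameter because the real count depends on the position codec.

```python
CONTEXT = 8192
DEFAULT_SHAPE = (19, 19, 19)   # 6859 density tokens
FALLBACK_SHAPE = (18, 19, 19)  # 6498 density tokens


def preamble_tokens(n_atoms, pos_tokens_per_atom=3, marker_tokens=9, with_shape=False):
    """Estimate preamble length. All per-block counts are assumptions."""
    fixed = marker_tokens + 3 + 6  # markers + grid dims + lattice constants
    if with_shape:
        fixed += 1 + 3  # assumed SHAPE marker + three patch dims
    return fixed + n_atoms * (1 + pos_tokens_per_atom)  # Z token + position tokens


def choose_patch_shape(n_atoms, pos_tokens_per_atom=3, marker_tokens=9):
    """Return the default shape if the sequence fits, else the fallback."""
    for shape, with_shape in ((DEFAULT_SHAPE, False), (FALLBACK_SHAPE, True)):
        density = shape[0] * shape[1] * shape[2]
        preamble = preamble_tokens(n_atoms, pos_tokens_per_atom, marker_tokens, with_shape)
        if density + preamble <= CONTEXT:
            return shape
    raise ValueError(f"{n_atoms} atoms do not fit in {CONTEXT} tokens")


print(choose_patch_shape(80))                          # (19, 19, 19)
print(choose_patch_shape(154, pos_tokens_per_atom=9))  # (18, 19, 19)
```

Where the cut-over lands is sensitive to the tokens-per-atom assumption, which is one more reason to confirm the fallback rate with a tokenize dry-run.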

M=64

Doubles voxel coverage per epoch from ~5% → ~10%. Coverage is linear in M, so this is the cheapest knob. Together with P=14 → P=19 (2.5× patch volume), the voxels touched per material go from 32 × 14³ → 64 × 19³, i.e. ≈ 5× more voxels seen per epoch for a given volume V.

Size estimate for train-full-v3: v2-lat is 13.4 GB at P=14, M=32. v3 is roughly 2.5× × 2× = ~5× larger ≈ 65–70 GB.
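The scaling behind the size estimate can be checked in a few lines. This assumes dataset size scales with the number of density tokens per material and ignores preamble tokens and any compression effects; the helper name is hypothetical.

```python
def density_tokens(patch_edge: int, n_patches: int) -> int:
    """Density tokens per material: M patches of P^3 voxels, 1 token / voxel."""
    return n_patches * patch_edge**3


v2 = density_tokens(14, 32)  # 87808
v3 = density_tokens(19, 64)  # 438976
scale = v3 / v2

print(f"{scale:.2f}x")            # 5.00x
print(f"~{13.4 * scale:.0f} GB")  # ~67 GB
```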

Pluggable sample weighting

Sampler interface in the tokenizer that picks M patches per material according to a configurable weight:

  • uniform: every patch eligible (subject to PBC wrap), uniform random. Default for the apples-to-apples (a2a) comparison of v3 against the v2-lat baseline.
  • electrons: patch weight ∝ Σ ρ in patch (proxy for "fraction of material's electrons in this region").
  • (Future, opt-in) other weightings: TBD as we learn from results.

Why uniform-by-default for the first v3 run: we want to attribute any NMAE delta vs v2-lat to the tokenization changes (translation, P=19, M=64, dropped preamble blocks), not confound it with a weighting change. The sampling-weight study is its own issue.
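One way the pluggable weighting could look, as a sketch only. The names (`WEIGHTS`, `sample_offsets`, `patch_sum`) are hypothetical and not the planned sampler.py interface; the density grid is a nested list to keep the example dependency-free, and sampling is with replacement, which is an assumption.

```python
import random
from itertools import product


def patch_sum(rho, offset, shape):
    """Sum rho over the patch at `offset`, wrapping at periodic boundaries."""
    dims = (len(rho), len(rho[0]), len(rho[0][0]))
    return sum(
        rho[(offset[0] + dx) % dims[0]][(offset[1] + dy) % dims[1]][(offset[2] + dz) % dims[2]]
        for dx, dy, dz in product(*(range(p) for p in shape))
    )


# Each strategy maps (rho, offset, shape) to a non-negative patch weight.
WEIGHTS = {
    "uniform": lambda rho, offset, shape: 1.0,
    "electrons": patch_sum,
}


def sample_offsets(rho, shape, m, weighting="uniform", seed=0):
    """Pick m patch low-corner offsets according to the named weighting."""
    dims = (len(rho), len(rho[0]), len(rho[0][0]))
    offsets = list(product(*(range(n) for n in dims)))
    weight_fn = WEIGHTS[weighting]
    weights = [weight_fn(rho, o, shape) for o in offsets]
    return random.Random(seed).choices(offsets, weights=weights, k=m)


# Toy 4x4x4 grid with all density in one voxel.
rho = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
rho[1][1][1] = 1.0
print(sample_offsets(rho, (2, 2, 2), m=4, weighting="electrons"))
```

Enumerating every offset and summing each patch directly is only workable on toy grids; at realistic grid sizes the patch sums would need a box filter or prefix-sum approach.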

Coverage histogram artifact

Add a coverage_histogram.json per-worker in the tokenize job output: how many voxels were covered exactly 0, 1, 2, … times by the M sampled patches per material. Aggregate across workers post-hoc. This is the ground truth on coverage and lets us validate the coupon-collector estimate empirically.
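A sketch of how such a histogram could be computed and merged. Function names and the JSON layout (coverage count as key, number of voxels as value) are assumptions, not the final artifact format.

```python
import json
from collections import Counter
from itertools import product


def coverage_histogram(grid, shape, offsets):
    """Map k -> number of voxels covered exactly k times by the given patches."""
    hits = Counter()
    for ix, iy, iz in offsets:
        for dx, dy, dz in product(*(range(p) for p in shape)):
            voxel = ((ix + dx) % grid[0], (iy + dy) % grid[1], (iz + dz) % grid[2])
            hits[voxel] += 1
    hist = Counter(hits.values())
    hist[0] = grid[0] * grid[1] * grid[2] - len(hits)  # voxels never touched
    return hist


def merge_histograms(hists):
    """Aggregate per-material or per-worker histograms by summing counts."""
    total = Counter()
    for h in hists:
        total.update(h)
    return total


hist = coverage_histogram((4, 4, 4), (2, 2, 2), [(0, 0, 0), (1, 0, 0)])
print(json.dumps(dict(sorted(hist.items()))))  # {"0": 52, "1": 8, "2": 4}
```

Because the histograms are plain counts, merging across workers is a key-wise sum and can be done in any order.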

Implementation surface

  • src/tomat/tokenizers/patch.py — translate atom coords, drop SHAPE / OFFSET / HI blocks (default), restore SHAPE for fallback shape.
  • src/tomat/tokenizers/sampler.py (new) — pluggable weight strategy.
  • scripts/tokenize_patches.py — wire up new flags / config.
  • marin/eval_mat_nmae.py — update preamble construction for v3 to match training format.
  • marin/train_tomat_tpu.py — read v3 meta.json shape (no SHAPE etc.).

Datasets to produce

| Label | Set | Size estimate |
|---|---|---|
| train-full-v3 | train | ~65–70 GB |
| val-full-v3 | val | ~3–4 GB |
| test-full-v3 | test | (deferred; see MPDB issue) |

Tokenization runs on Modal; expected wall-clock time is on the order of hours.

Pre-registered comparison

Apples-to-apples comparison of train-full-v3 vs train-full-lmq-v2-lat, both at 200M and 1B, with matched LR / BS / steps, on the same eval set (val_200, mat-NMAE).

Hypothesis: v3 reduces NMAE by ≥ 0.3 percentage points at the matched budget, primarily from increased voxel coverage.

If v3 doesn't beat v2-lat, that itself is informative: it tells us the remaining gap to ChargE3Net (0.523%) is mostly architecture / data-prior, not coverage / tokenization.

Out of scope

  • Equivariance (architecture, not tokenization).
  • Charge conservation hooks (post-process / loss).
  • Rotation augmentation (orthogonal to tokenization).
  • v3 sampling-weight ablations (separate issue).
  • Test-set tokenize and eval (deferred until MPDB has test n_atoms).