Open-Athena/tomat#1 v3 patch tokenizer: per-patch translated atoms, P=19, M=64, drop redundant preamble blocks
| Version | Dataset prefix | Codec | Size (train) | Notes |
|---|---|---|---|---|
| v1 | `train-full` | `two_token_9_12` (2 tokens / voxel, float) | 21.1 GB | Original; replaced by LMQ. |
| v2-prelim | `train-full-lmq` | LMQ (initial fit) | 11.6 GB | Replaced by Lloyd-Max log-spaced fit (dbb0312). |
| v2 | `train-full-lmq-v2` | LMQ (log-spaced, vocab ~16k) | 13.4 GB | Workhorse for most 200M / 1B runs. |
| v2-vocab-32k | `train-full-lmq-v2-32k` | LMQ vocab=32k | 15.8 GB | Vocab sweep. |
| v2-vocab-65k | `train-full-lmq-v2-65k` | LMQ vocab=65k | 18.2 GB | Vocab sweep. |
| v2-lat | `train-full-lmq-v2-lat` | LMQ + 6 lattice constants in preamble (9c3e85f) | 13.4 GB | Current default for fresh runs. |
All v2 variants share P=14, M=32, 1 patch / sequence, sequences padded
to 8192. Per-sequence preamble:
[BOS]
[GRID_START] nx ny nz
[LATTICE_START] a b c α β γ ← v2-lat only
[ATOMS_START] Z₁ … Zₙ [ATOMS_END]
[POS_START] ⟨x₁ y₁ z₁⟩ … (frac coords, 3-byte codec) [POS_END]
[SHAPE_START] P P P
[OFFSET_START] ix iy iz
[HI_START] hx hy hz ← (ix+P-1) mod nx (PBC wrap)
[DENS_START] P³ density tokens (1 / voxel under LMQ)
[EOS]
GCS root: gs://marin-eu-west4/tomat/tokenized/.
Codecs: gs://marin-eu-west4/tomat/codecs/lmq-v2-{16,32,65}k.npz.
Three problems with v2(-lat) we want to fix together:
- Atom positions in the preamble are origin-anchored, not patch-anchored. A given atom contributes the same fractional-coord tokens to every patch from the same material, so the model has to re-learn the relative geometry per patch from the OFFSET / HI tokens. The tokenizer can solve this directly.
- `SHAPE`, `OFFSET`, `HI` blocks are redundant or near-redundant once we adopt per-patch translation. P is constant across all patches in a run; OFFSET only matters for the model to "place" the patch in the global frame, but if atoms are translated to the patch frame, OFFSET / HI become orientation hints we don't need.
- Voxel coverage is ~5%. With `P=14³ × M=32` = 87.8k voxels touched per material, against a ~1.2M-voxel typical volume, training only ever sees ~7% of voxels per epoch, and closer to ~5% after accounting for coupon-collector overlap. Most voxels never appear in training, no matter how many epochs. Increasing `M` linearly increases coverage.
For a patch with low-corner offset (ix, iy, iz) and grid dims (nx, ny, nz):
frac_translated = (frac_coords - (ix/nx, iy/ny, iz/nz)) mod 1
Each patch gets its own atom-position block. Atom identities (Zs) are
unchanged across patches; only the coords differ.
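The translation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in `patch.py`: the function name `translate_frac_coords` and its argument layout are assumptions made for the example.

```python
import numpy as np

def translate_frac_coords(frac_coords, offset, grid_dims):
    """Shift fractional coords so the patch's low corner becomes the origin.

    frac_coords: (n_atoms, 3) fractional coordinates in [0, 1).
    offset:      (ix, iy, iz) low-corner voxel offset of the patch.
    grid_dims:   (nx, ny, nz) voxel grid dimensions.
    Hypothetical helper; name and signature are not from the repo.
    """
    frac = np.asarray(frac_coords, dtype=np.float64)
    shift = np.asarray(offset, dtype=np.float64) / np.asarray(grid_dims, dtype=np.float64)
    out = np.mod(frac - shift, 1.0)  # periodic wrap back into [0, 1)
    # Floating-point guard: np.mod can return exactly 1.0 for tiny negative inputs.
    out[out >= 1.0] = 0.0
    return out
```

Only the position block changes per patch; the Z list can be computed once per material and reused.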
Because positions are now patch-relative, OFFSET and HI become unnecessary for understanding patch geometry. Drop them.
SHAPE is also dropped: P is fixed at config-time and the same for all patches in a run, so it's pure preamble bloat. (Exception: see fallback patch below.)
Bumping P from 14 → 19 raises the voxel count per patch from 14³ = 2744 to 19³ = 6859 (2.5×). That is the dense block that actually exposes the model to density learning, so this is the high-leverage change.
8192 context budget for density alone: 6859 = 19³ leaves 1333 for
preamble. At 6 lat + 3 grid + ~2-token start/end markers + Z atoms +
3 × n_atoms position tokens, this fits comfortably for n_atoms ≤ ~80,
which covers ~99.7% of MP train mats per data/mpdb.sqlite.
For the long tail (max=154 atoms, p99=88): use a fallback patch shape
(18, 19, 19) = 6498 density tokens (saves 361 tokens), to which we
prepend a SHAPE block restored just for the fallback case. SHAPE
remains absent in the default-P case.
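The budget arithmetic and the default / fallback choice can be sketched as below. This is a sketch under stated assumptions: `pick_patch_shape` is a hypothetical helper, `preamble_tokens` is taken to mean every non-density token in the sequence (markers, BOS / EOS, lattice, grid, atoms, positions) excluding SHAPE, and the SHAPE block is assumed to cost 4 tokens (`[SHAPE_START]` plus three dims, as in the v2 layout).

```python
CONTEXT = 8192                  # sequence length from the text
DEFAULT_SHAPE = (19, 19, 19)    # 6859 density tokens
FALLBACK_SHAPE = (18, 19, 19)   # 6498 density tokens
SHAPE_BLOCK_TOKENS = 4          # assumption: [SHAPE_START] + 3 dims

def density_tokens(shape):
    """Number of density tokens for a patch (1 token / voxel under LMQ)."""
    px, py, pz = shape
    return px * py * pz

def pick_patch_shape(preamble_tokens, context=CONTEXT):
    """Return the default shape if it fits, else the fallback, else raise.

    preamble_tokens: all non-density tokens, excluding the SHAPE block
    (hypothetical accounting; the real tokenizer may count differently).
    """
    if preamble_tokens + density_tokens(DEFAULT_SHAPE) <= context:
        return DEFAULT_SHAPE
    fallback_total = preamble_tokens + SHAPE_BLOCK_TOKENS + density_tokens(FALLBACK_SHAPE)
    if fallback_total <= context:
        return FALLBACK_SHAPE
    raise ValueError(f"preamble of {preamble_tokens} tokens does not fit in {context}")
```

Taking the preamble length as an input keeps the helper independent of how many tokens each atom costs, which should be confirmed against the position codec in the tokenize dry-run.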
This introduces two preamble shapes the model will see, which is a mild bimodality concern. Mitigation: the fallback is rare (≪ 1% of mats based on n_atoms distribution; need to confirm with a tokenize dry-run), and the only delta is the SHAPE block, which is small and salient.
Raising M from 32 → 64 doubles voxel coverage per epoch from ~5% → ~10%. Coverage is linear in M, so this is the cheapest knob. Together with P=14³ → P=19³ (2.5× volume), per-mat coverage goes from 32 × 14³ / V → 64 × 19³ / V, i.e. ≈ 5× more voxels seen per epoch.
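For checking the coverage numbers, a small calculator helps. The sketch below assumes the M patches are placed independently and uniformly at random with PBC wrap, so any given voxel lands in a given patch with probability P³ / V; under that assumption the expected covered fraction is 1 − (1 − P³/V)^M. The function names are illustrative, and V varies per material, so the "typical" 1.2M figure is only a reference point.

```python
def raw_coverage(P, M, V):
    """Voxels touched / total voxels, ignoring overlap between patches."""
    return M * P**3 / V

def expected_coverage(P, M, V):
    """Expected fraction of distinct voxels covered by M uniform random patches.

    Assumes independent, uniformly placed P^3 patches with periodic wrap,
    so each voxel is inside a given patch with probability P^3 / V.
    """
    p = P**3 / V
    return 1.0 - (1.0 - p) ** M

if __name__ == "__main__":
    V = 1.2e6  # "typical" volume from the text; real materials vary
    for P, M in [(14, 32), (14, 64), (19, 64)]:
        print(P, M, raw_coverage(P, M, V), expected_coverage(P, M, V))
```

The gap between the two functions grows with M × P³ / V, so the overlap correction matters more for v3 than for v2; the coverage histogram is the empirical check.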
Size estimate for train-full-v3: v2-lat is 13.4 GB at P=14, M=32.
v3 is roughly 2.5× × 2× = ~5× larger ≈ 65–70 GB.
Sampler interface in the tokenizer that picks M patches per material
according to a configurable weight:
- `uniform`: every patch eligible (subject to PBC wrap), uniform random. Default for the v3 a2a vs v2-lat baseline.
- `electrons`: patch weight ∝ Σ ρ in patch (proxy for "fraction of material's electrons in this region").
- (Future, opt-in) other weightings: TBD as we learn from results.
Why uniform-by-default for the first v3 run: we want to attribute any NMAE delta vs v2-lat to the tokenization changes (translation, P=19, M=64, dropped preamble blocks), not co-confounded with a weighting change. Sampling-weight study is its own issue.
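A minimal sketch of the two strategies, assuming the density is available as a 3-D NumPy array and that patches are drawn with replacement (the text does not specify either). `patch_sums` and `sample_offsets` are hypothetical names, not the interface of `sampler.py`.

```python
import numpy as np

def patch_sums(rho, P):
    """Sum of rho over each P*P*P patch, indexed by low-corner offset, with PBC wrap."""
    out = np.asarray(rho, dtype=np.float64)
    for axis in range(3):
        acc = np.zeros_like(out)
        for s in range(P):
            # np.roll(out, -s)[i] == out[(i + s) % n], so acc[i] sums P wrapped neighbours.
            acc += np.roll(out, -s, axis=axis)
        out = acc
    return out

def sample_offsets(rho, P, M, strategy="uniform", rng=None):
    """Draw M low-corner offsets (ix, iy, iz); returns an (M, 3) int array.

    Assumption: sampling with replacement; every offset is eligible under PBC.
    """
    rng = np.random.default_rng() if rng is None else rng
    rho = np.asarray(rho)
    if strategy == "uniform":
        probs = None
    elif strategy == "electrons":
        # Clip small negative densities so the weights are valid probabilities.
        w = np.clip(patch_sums(rho, P), 0.0, None).ravel()
        probs = w / w.sum()  # undefined if rho is identically zero
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    flat = rng.choice(rho.size, size=M, replace=True, p=probs)
    return np.stack(np.unravel_index(flat, rho.shape), axis=1)
```

Passing an explicit `rng` (for example `np.random.default_rng(seed)`) keeps the patch selection reproducible across tokenize reruns.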
Add a coverage_histogram.json per-worker in the tokenize job output:
how many voxels were covered exactly 0, 1, 2, … times by the M sampled
patches per material. Aggregate across workers post-hoc. This is the
ground truth on coverage and lets us validate the coupon-collector
estimate empirically.
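One way the per-material histogram and the post-hoc merge could look is sketched below. The JSON layout (string count keys mapped to voxel totals) and the function names are assumptions for illustration; the sketch also assumes each patch dimension is no larger than the corresponding grid dimension, so a patch never overlaps itself under wrap.

```python
import json
import numpy as np

def coverage_histogram(grid_dims, offsets, patch_shape):
    """Count how many voxels were covered exactly 0, 1, 2, ... times.

    grid_dims:   (nx, ny, nz)
    offsets:     iterable of (ix, iy, iz) low-corner offsets
    patch_shape: (px, py, pz), each assumed <= the matching grid dim
    """
    counts = np.zeros(grid_dims, dtype=np.int64)
    for off in offsets:
        # Wrapped index ranges along each axis (PBC).
        axes = [(o + np.arange(p)) % n for o, p, n in zip(off, patch_shape, grid_dims)]
        counts[np.ix_(*axes)] += 1
    hist = np.bincount(counts.ravel())
    return {str(k): int(n) for k, n in enumerate(hist)}

def merge_histograms(hists):
    """Sum per-material (or per-worker) histograms key by key."""
    total = {}
    for h in hists:
        for k, n in h.items():
            total[k] = total.get(k, 0) + n
    return total

if __name__ == "__main__":
    h = coverage_histogram((4, 4, 4), [(0, 0, 0), (3, 3, 3)], (2, 2, 2))
    print(json.dumps(merge_histograms([h, h]), sort_keys=True))
```

The fraction of voxels in bucket `"0"` is directly comparable to one minus the coupon-collector coverage estimate.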
- `src/tomat/tokenizers/patch.py`: translate atom coords, drop SHAPE / OFFSET / HI blocks (default), restore SHAPE for fallback shape.
- `src/tomat/tokenizers/sampler.py` (new): pluggable weight strategy.
- `scripts/tokenize_patches.py`: wire up new flags / config.
- `marin/eval_mat_nmae.py`: update preamble construction for v3 to match training format.
- `marin/train_tomat_tpu.py`: read v3 meta.json shape (no SHAPE etc.).
| Label | Set | Size estimate |
|---|---|---|
| `train-full-v3` | train | ~65–70 GB |
| `val-full-v3` | val | ~3–4 GB |
| `test-full-v3` | test | (deferred; see MPDB issue) |
Tokenization runs Modal-side; expected wall-clock is on the order of hours.
Apples-to-apples (a2a) comparison: train-full-v3 vs train-full-lmq-v2-lat, both at 200M and 1B,
with matched LR / BS / steps, on the same eval set (val_200, mat-NMAE).
Hypothesis: v3 reduces NMAE by ≥ 0.3 percentage points at the matched budget, primarily from increased voxel coverage.
If v3 doesn't beat v2-lat, that itself is informative: it tells us the remaining gap to ChargE3Net (0.523%) is mostly architecture / data-prior, not coverage / tokenization.
- Equivariance (architecture, not tokenization).
- Charge conservation hooks (post-process / loss).
- Rotation augmentation (orthogonal to tokenization).
- v3 sampling-weight ablations (separate issue).
- Test-set tokenize and eval (deferred until MPDB has test n_atoms).