Ran scripts/sampling_distribution_preview.py against MPDB v2's
77,427-material train split. Per-material sample counts M_i under each
candidate weighting (mean target M̄ = 64):
| weighting | min | p10 | p50 | p90 | max | max:min |
|---|---|---|---|---|---|---|
| uniform | 64 | 64 | 64 | 64 | 64 | 1× |
| electrons | 0.14 | 16.8 | 47.0 | 134.4 | 609 | 4328× |
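The normalization behind those M_i columns can be sketched as follows. The electron counts below are synthetic stand-ins (the real values come from MPDB), and `per_material_counts` is a hypothetical helper, not the actual preview script:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-material electron counts; stand-ins for MPDB's
# n_electrons column, for illustration only.
n_electrons = rng.lognormal(mean=4.0, sigma=1.0, size=1000)

def per_material_counts(weights, mean_target=64.0):
    """Scale raw weights so the mean per-material sample count M_i
    equals mean_target (the M-bar above)."""
    w = np.asarray(weights, dtype=float)
    return mean_target * w / w.mean()

M = per_material_counts(n_electrons)
print({
    "min": round(M.min(), 2),
    "p10": round(np.percentile(M, 10), 1),
    "p50": round(np.percentile(M, 50), 1),
    "p90": round(np.percentile(M, 90), 1),
    "max": round(M.max(), 1),
    "max:min": round(M.max() / M.min()),
})
```

Uniform weighting is the degenerate case: every weight equal, so every M_i collapses to M̄ = 64 and max:min = 1×.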
[Open-Athena/tomat#2] MPDB: backfill n_atoms / n_electrons, publish to R2
data/mpdb.sqlite is the materials-metadata database for MP entries we
train on (built by scripts/build_mpdb.py). Current schema includes
mp_id, split, grid dims nx/ny/nz, and computed virtual columns
for cube_seq_pN. Train rows have n_atoms populated; val and test
rows have n_atoms = NULL.
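A minimal sketch of such a schema, assuming SQLite ≥ 3.31 for generated columns. The column names (mp_id, split, nx/ny/nz, n_atoms) come from the description above; the expression for cube_seq_p19 is a hypothetical patch-count formula, not the real one from scripts/build_mpdb.py:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mats (
    mp_id   TEXT PRIMARY KEY,
    split   TEXT NOT NULL,   -- 'train' | 'val' | 'test'
    nx      INTEGER NOT NULL,
    ny      INTEGER NOT NULL,
    nz      INTEGER NOT NULL,
    n_atoms INTEGER,         -- NULL for val/test rows until backfilled
    -- VIRTUAL generated column: computed on read, not stored.
    -- Hypothetical ceil-division patch count for P=19; placeholder only.
    cube_seq_p19 INTEGER GENERATED ALWAYS AS
        (((nx + 18) / 19) * ((ny + 18) / 19) * ((nz + 18) / 19)) VIRTUAL
);
""")
conn.execute(
    "INSERT INTO mats (mp_id, split, nx, ny, nz, n_atoms) VALUES (?,?,?,?,?,?)",
    ("mp-149", "train", 60, 60, 60, 2),
)
row = conn.execute(
    "SELECT cube_seq_p19 FROM mats WHERE mp_id = 'mp-149'"
).fetchone()
print(row[0])  # ceil(60/19)^3 = 4^3 = 64 under the placeholder formula
```

A VIRTUAL (rather than STORED) generated column keeps the file small and lets the formula change without a table rewrite, which fits a derived quantity like a per-patch-size sequence length.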
[Open-Athena/tomat#1] v3 patch tokenizer: per-patch translated atoms, P=19, M=64, drop redundant preamble blocks
| Version | Dataset prefix | Codec | Size (train) | Notes |
|---|---|---|---|---|
| v1 | train-full | two_token_9_12 (2 tokens / voxel, float) | 21.1 GB | Original; replaced by LMQ. |
| v2-prelim | train-full-lmq | LMQ (initial fit) | 11.6 GB | Replaced by Lloyd-Max log-spaced fit (dbb0312). |
| v2 | train-full-lmq-v2 | LMQ (log-spaced, vocab ~16k) | 13.4 GB | Workhorse for most 200M / 1B runs. |
| v2-vocab-32k | train-full-lmq-v2-32k | LMQ vocab=32k | 15.8 GB | Vocab sweep. |