- Remove the optimizer state and save the result to $HOME, for example:
MODEL_DIR=/large_experiments/xlmg/models/moe/52B/xlmg.52b.fp16.bm_none.tps2048.transformer_lm_gpt2_bigger.dl24.demb1024.dffn4096.moe_w0.01.all.share.adam.b2_0.98.eps1e-08.cl0.0.lr0.0003.sqrt_world_size.wu715.dr0.0.atdr0.0.wd0.01.ms2.uf1.mu572204.s1.ngpu128
python scripts/remove_opt_state.py \
$MODEL_DIR/checkpoint_1_105000/checkpoint_1_105000 \
checkpoint_1_105000_eval \
--nproc 4 --resume-failed
Note: you can use a larger nproc (--nproc 32) on learnfair, but if you do that on devfair the code sometimes hangs.
- Move the result to somewhere in /large_experiments (I am fuzzy on the chown command).
- Update model_configs.py.
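
The move step above could look roughly like the sketch below. It assumes the stripped checkpoint files all share the output prefix passed to remove_opt_state.py (checkpoint_1_105000_eval in the example); the helper name move_checkpoint and the destination path are hypothetical, and ownership or permission changes are left out because the notes above do not specify them.

```shell
# move_checkpoint PREFIX DEST_DIR
# Hypothetical helper: moves every file whose path starts with PREFIX into DEST_DIR.
move_checkpoint() {
  prefix=$1
  dest_dir=$2
  # Expand the glob into the function's positional parameters.
  set -- "$prefix"*
  # If nothing matched, the pattern is left unexpanded and does not exist on disk.
  if [ ! -e "$1" ]; then
    echo "no files match ${prefix}*" >&2
    return 1
  fi
  mkdir -p "$dest_dir"
  mv -- "$@" "$dest_dir"/
}

# Example call (the destination directory is a placeholder, not a real path):
# move_checkpoint "$HOME/checkpoint_1_105000_eval" /large_experiments/your/target/dir
```

Failing when nothing matches the prefix makes a mistyped prefix visible right away, rather than leaving an empty destination directory behind.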