sshleifer/remove_opt_state_instructions.md

Last active August 5, 2021 20:11


Remove the optimizer state and save the result to $HOME, for example:

MODEL_DIR=/large_experiments/xlmg/models/moe/52B/xlmg.52b.fp16.bm_none.tps2048.transformer_lm_gpt2_bigger.dl24.demb1024.dffn4096.moe_w0.01.all.share.adam.b2_0.98.eps1e-08.cl0.0.lr0.0003.sqrt_world_size.wu715.dr0.0.atdr0.0.wd0.01.ms2.uf1.mu572204.s1.ngpu128

python scripts/remove_opt_state.py \
    $MODEL_DIR/checkpoint_1_105000/checkpoint_1_105000 \
    checkpoint_1_105000_eval \
    --nproc 4 --resume-failed

Note: a larger nproc (e.g. --nproc 32) works on learnfair, but on devfair the code sometimes hangs.

Move the resulting checkpoint to somewhere in /large_experiments (I am fuzzy on the chown command).
Update model_configs.py to point at the new location.
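
The move step might look roughly like the sketch below. Everything here is an assumption rather than part of the original notes: the helper name `move_checkpoint` is hypothetical, the destination directory is yours to choose, and it is assumed that the script writes files whose names start with the given output prefix. Since the exact chown command is not known, the sketch uses `chmod` to make the files group-readable instead; adjust to whatever the cluster actually requires.

```shell
# Hypothetical helper: move all files sharing a checkpoint prefix into a
# destination directory and make them group-readable.
move_checkpoint() {
    src_prefix="$1"   # e.g. "$HOME/checkpoint_1_105000_eval" (assumed output prefix)
    dest_dir="$2"     # a directory of your choice under /large_experiments

    mkdir -p "$dest_dir"

    # Assumption: every output file name starts with the prefix.
    mv "$src_prefix"* "$dest_dir"/

    # Stand-in for the chown step the notes are unsure about:
    # grant group read access (and directory traversal) on the moved files.
    chmod -R g+rX "$dest_dir"
}
```

Usage would be something like `move_checkpoint "$HOME/checkpoint_1_105000_eval" /large_experiments/<your_dir>`, with `<your_dir>` left as a placeholder.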
