edwardzjl/pretrain.md

Last active May 15, 2024 07:45

Continual Pretraining

Goals

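Validate the environment setup with a small-sample run on a single node with multiple GPUs
Use PipelineParallel or TensorParallel to determine the largest model this cluster can train, and the training time it requires
Measure the degree of catastrophic forgetting
Measure how well the new data is learned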
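
A back-of-the-envelope sketch for the model-size and training-time goal. It assumes mixed-precision Adam training (about 16 bytes of model state per parameter: fp16 weights and gradients plus fp32 weights, momentum, and variance) and the common 6 × parameters × tokens approximation for training FLOPs. Activation memory is only covered by a headroom factor, and the GPU count, memory size, peak throughput, and utilization below are hypothetical placeholders, not measurements from this cluster.

```python
# Rough sizing sketch; every hardware number here is a hypothetical placeholder.

# fp16 weights (2) + fp16 grads (2) + fp32 weights, momentum, variance (12)
BYTES_PER_PARAM = 16


def max_params(num_gpus: int, gpu_mem_gib: float, usable_fraction: float = 0.7) -> float:
    """Largest parameter count whose model states fit when sharded evenly
    across num_gpus (as tensor or pipeline parallelism would do).

    usable_fraction is an assumed headroom for activations and buffers.
    """
    usable_bytes = num_gpus * gpu_mem_gib * 1024**3 * usable_fraction
    return usable_bytes / BYTES_PER_PARAM


def training_days(params: float, tokens: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float = 0.4) -> float:
    """Estimated wall-clock days using total FLOPs ~= 6 * params * tokens."""
    total_flops = 6 * params * tokens
    flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / flops_per_second / 86400


if __name__ == "__main__":
    n = max_params(num_gpus=8, gpu_mem_gib=80)
    print(f"approx. max model size: {n / 1e9:.1f}B parameters")
    d = training_days(params=7e9, tokens=1e11, num_gpus=8, peak_flops_per_gpu=312e12)
    print(f"approx. training time: {d:.1f} days")
```

These estimates only bound the search; the actual limit still has to be confirmed by running the parallel configurations on the cluster.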

Training framework and preparation

Python and venv (conda)
DeepSpeed latest (0.14.2) compiled from source with all C++/CUDA ops
OS user, hostfile, and SSH connections between all nodes
Data storage: deploy NFS first to get the project started quickly, and monitor I/O to decide whether to deploy Ceph / S3
Monitoring: TensorBoard / wandb
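
A minimal sketch of generating the hostfile mentioned above. DeepSpeed's launcher reads one `<hostname> slots=<gpu_count>` entry per line; the node names and GPU counts here are hypothetical and should be replaced with the real cluster inventory.

```python
def render_hostfile(nodes: dict[str, int]) -> str:
    """Render a DeepSpeed-style hostfile: '<hostname> slots=<gpu_count>' per line."""
    lines = []
    for host, slots in nodes.items():
        if slots <= 0:
            raise ValueError(f"{host}: slots must be positive, got {slots}")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names and GPU counts.
    nodes = {"worker-1": 8, "worker-2": 8}
    print(render_hostfile(nodes), end="")
```

The resulting file can be passed to the launcher with `deepspeed --hostfile=<path>`, which relies on the training user being able to SSH to every listed node without a password prompt.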

Training data

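Textbook content: currently stored on the research institute's supercomputer NFS, 100Mb; scp is enough to copy it over
m-a-p/Matrix: roughly estimated at ~20T in total, 240G downloaded so far; the remaining download time and storage are both major problems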
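
To put a number on the remaining download time, a small sketch follows. It assumes "T" and "G" above mean decimal terabytes and gigabytes, and the sustained bandwidth is a hypothetical placeholder that should be replaced with a measured value.

```python
TB = 10**12
GB = 10**9
MB = 10**6


def remaining_download_days(total_bytes: float, done_bytes: float,
                            bytes_per_second: float) -> float:
    """Days needed to fetch the remaining bytes at a constant throughput."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    remaining = max(total_bytes - done_bytes, 0)
    return remaining / bytes_per_second / 86400


if __name__ == "__main__":
    # 20 TB total and 240 GB done come from the notes; 50 MB/s is an assumed rate.
    days = remaining_download_days(20 * TB, 240 * GB, 50 * MB)
    print(f"approx. remaining download time: {days:.1f} days")
```

The same remaining-bytes figure also gives the extra storage that would have to be provisioned before the download can finish.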

Model validation

Catastrophic forgetting: scores on benchmarks such as ARC / HellaSwag / MMLU / TruthfulQA / Winogrande / GSM8K should not differ much from those of the base model
Learning of new data: how?
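
A sketch of how both checks could be scripted. The forgetting check flags benchmarks where the continually pretrained model falls more than a tolerance below the base model. For the open question of measuring new-data learning, one possible option (an assumption, not something these notes settle on) is to compare perplexity on held-out text from the new corpus before and after training. The scores and the 0.02 tolerance are hypothetical placeholders.

```python
import math


def forgetting_report(base: dict[str, float], tuned: dict[str, float],
                      max_drop: float = 0.02) -> dict[str, float]:
    """Return {benchmark: drop} for benchmarks where the tuned model scores
    more than max_drop below the base model."""
    regressions = {}
    for name, base_score in base.items():
        drop = base_score - tuned[name]
        if drop > max_drop:
            regressions[name] = drop
    return regressions


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    if not token_nlls:
        raise ValueError("token_nlls must not be empty")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    base = {"MMLU": 0.60, "GSM8K": 0.35}
    tuned = {"MMLU": 0.55, "GSM8K": 0.34}
    print("regressions:", forgetting_report(base, tuned))
```

A lower held-out perplexity on the new corpus after training, with no flagged regressions on the general benchmarks, would suggest the model learned the new data without heavy forgetting.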
