Continual Pretraining

Goals

  • Validate the environment setup with a small data sample on a single node with multiple GPUs
  • Use PipelineParallel or TensorParallel to determine the largest model this cluster can train, and the training time required (see the rough estimate after this list)
  • Measure the degree of catastrophic forgetting
  • Measure how well the new data has been learned
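
A rough, back-of-the-envelope sketch for the second point, assuming mixed-precision Adam-style training with roughly 16 bytes of model/optimizer state per parameter and 80 GB GPUs (both assumptions, adjust to the actual cluster); activation memory and pipeline/tensor-parallel overhead are ignored, so this is only a lower bound:

```python
import math

# Lower bound on GPUs needed to hold model + optimizer states for
# mixed-precision Adam training: ~16 bytes/param (fp16 weights + fp16 grads
# + fp32 master weights + fp32 Adam moments), assumed perfectly sharded.
# Activations, buffers, and PP/TP overhead are not counted.
def min_gpus_needed(n_params_billion: float, mem_per_gpu_gb: float = 80.0) -> int:
    bytes_per_param = 16
    total_gb = n_params_billion * 1e9 * bytes_per_param / 1024**3
    return math.ceil(total_gb / mem_per_gpu_gb)

for size in (7, 13, 34, 70):
    print(f"{size}B params -> at least {min_gpus_needed(size)} x 80 GB GPUs for states alone")
```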

Training Framework and Preparation

  • Python and venv (conda)
  • Latest DeepSpeed (0.14.2) compiled from source with all C++/CUDA ops
  • OS user, hostfile, and SSH connections between all nodes (see the launch sketch after this list)
  • Data storage: deploy NFS first to get the project running quickly, while monitoring IO to decide whether Ceph / S3 is needed
  • Monitoring: tensorboard / wandb
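
A minimal multi-node launch sketch, assuming a Hugging Face causal LM, ZeRO-3, and bf16; the model name, hostfile path, batch sizes, and learning rate are placeholders, and TensorBoard logging is enabled through DeepSpeed's monitor config:

```python
# train.py -- skeleton only; launch via the DeepSpeed launcher, which reads
# the hostfile and starts ranks on every node over passwordless SSH, e.g.:
#   deepspeed --hostfile=/path/to/hostfile train.py
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "gradient_accumulation_steps": 8,      # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    # Monitoring: DeepSpeed can emit TensorBoard events directly.
    "tensorboard": {"enabled": True, "output_path": "./tb_logs", "job_name": "continual_pretrain"},
}

model = AutoModelForCausalLM.from_pretrained("base-model-name")  # placeholder
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Training loop sketch; the dataloader is whatever the project ends up using.
# for batch in dataloader:
#     loss = engine(**batch).loss
#     engine.backward(loss)
#     engine.step()
```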

Training Data

  • Textbook content: already on the research institute's supercomputer NFS, ~100 MB; scp is enough
  • m-a-p/Matrix: roughly ~20 TB in total, 240 GB downloaded so far; the remaining download time and storage are both major concerns (one mitigation is sketched below)
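
One way to sidestep the download and storage problem while the storage layer is still being decided is to stream the corpus instead of materializing it; this assumes m-a-p/Matrix can be consumed via the Hugging Face `datasets` streaming mode, which depends on how the repo is packaged and should be verified on a small slice first:

```python
from datasets import load_dataset

# Stream records instead of downloading the full ~20 TB to local disk.
# Split name and streaming support are assumptions -- check the dataset card.
matrix = load_dataset("m-a-p/Matrix", split="train", streaming=True)

for example in matrix:
    print(example.keys())  # inspect the schema on the first record
    break
```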

Model Evaluation

  • Catastrophic forgetting: scores on benchmarks such as ARC / HellaSwag / MMLU / TruthfulQA / Winogrande / GSM8K should not drop much relative to the base model (see the evaluation sketch after this list)
  • Learning of the new data: how?
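
For the catastrophic-forgetting check, one option is to run lm-evaluation-harness on both the base model and the continually pretrained checkpoint and compare the numbers; the checkpoint path and exact task identifiers below are assumptions and should be checked against the installed harness version:

```python
import lm_eval

# Run the same benchmark suite on the base model and on each continually
# pretrained checkpoint, then compare scores side by side.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/checkpoint,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```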