We train LLMs with this code and report the training speed under different settings (see the table below). The machine has 8× A800 GPUs, 1 TB of CPU memory, and 2× Intel Xeon Platinum 8358 CPUs. On the software side, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
Table. Benchmark of LLaMA-7B training using the DeepSpeed-based training code. The sequence length is 4096.
ZeRO Stage | Ckpt.[^1] | Optim. Off.[^2] | Param. Off.[^3] | ZeRO++[^4] | BS[^5] | CPU Mem. (GB)[^6] | GPU Mem. (GB)[^7] | Throughput |
---|---|---|---|---|---|---|---|---|
2 | × | × | × | × | 1/64 | 320.1 | 19.4/44.8 | 5.33 |
2 | ✓ | × | × | × | 1/64 | 320.0 | 19.4/23.5 | 4.19 |
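
For context, the toggles in the table correspond to standard DeepSpeed configuration keys. Below is a minimal sketch of how one row could be expressed, assuming a Hugging Face Transformers model; the model ID and batch-size values are illustrative, not the exact benchmark script.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Illustrative DeepSpeed config: the zero_optimization keys are standard
# DeepSpeed options; the batch-size values are assumptions, not necessarily
# the exact settings behind the "BS 1/64" column.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 64,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                                # "ZeRO Stage" column
        # Uncomment to enable the corresponding table columns:
        # "offload_optimizer": {"device": "cpu"},  # "Optim. Off."
        # "offload_param": {"device": "cpu"},      # "Param. Off." (stage 3 only)
        # "zero_quantized_weights": True,          # ZeRO++ (qwZ)
        # "zero_quantized_gradients": True,        # ZeRO++ (qgZ)
        # "zero_hpz_partition_size": 8,            # ZeRO++ (hpZ)
    },
}

# Hypothetical model ID standing in for the LLaMA-7B checkpoint used above.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.bfloat16
)
# "Ckpt." column: activation checkpointing is enabled on the model side,
# not in the DeepSpeed config.
model.gradient_checkpointing_enable()

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Note that `offload_param` only takes effect under ZeRO stage 3, which is why the stage-2 rows above combine optimizer offload settings with on-GPU parameters.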