PyTorch LLM perf

stories15M

All values are in tokens/sec.

| dtype | SOTA | 2.2.2+eager | 2.3.0+eager | 2.3.0+compile | trunk + compile |
|---|---|---|---|---|---|
| bfloat16 (M1) | | 111 | 110 | 80 | |
| float32 (M1) | 687 | 165 | 176 | | |
| float16 (M1) | 1106 | 50 | 187 | | |
| float16 (LinX86) | | 40 | 43 | 173 | |
| float32 (LinX86) | | 38 | 40 | 179 | |
| bfloat16 (LinX86) | | 73 | 78 | 180 | |
| bfloat16 (M2Pro) | | 137 | 147 | 116 | 228 |
| float32 (M2Pro) | 947 | 176 | 301 | 121 | 460 |
| float16 (M2Pro) | 1330 | 56 | 330 | 116 | 420 |

Eager numbers are collected by running the following script (gpt-fast gives a slightly higher number on eager, as it preallocates the KVCache even if it's longer than the model's context length):

 python run_llama.py --model-path stories15M.pt --random-seed 42 --dtype float16

Compile numbers are collected by running the following script:

python generate.py --checkpoint_path checkpoints/stories15M/stories15M.pt --prompt "Once upon a time" --compile --device cpu

P.S. Compile does not work out of the box on Mac with 2.3.0 right now; one needs to symlink libiomp into the right location.

LinX86 is Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz.
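For readers who want to reproduce a tokens/sec figure independently of the two scripts above, here is a minimal sketch of how such a number can be measured. It is not taken from `run_llama.py` or `generate.py`; the names `tokens_per_sec` and `fake_generate` are hypothetical, and the warm-up pass is an assumption meant to keep one-time costs (such as compilation) out of the timed runs.

```python
import time
from typing import Callable, List, Sequence


def tokens_per_sec(
    generate: Callable[[int], Sequence[int]],
    max_new_tokens: int,
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return the best observed throughput of `generate`, in tokens/sec."""
    # Untimed warm-up runs absorb one-time costs (assumption: e.g. compilation).
    for _ in range(warmup):
        generate(max_new_tokens)

    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        tokens = generate(max_new_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            # Count the tokens actually produced, not the number requested.
            best = max(best, len(tokens) / elapsed)
    return best


def fake_generate(max_new_tokens: int) -> List[int]:
    """Hypothetical stand-in for a model: sleeps about 1 ms per token."""
    out: List[int] = []
    for i in range(max_new_tokens):
        time.sleep(0.001)
        out.append(i)
    return out


if __name__ == "__main__":
    print(f"{tokens_per_sec(fake_generate, 50):.0f} tokens/sec")
```

To time a real model, replace `fake_generate` with a callable that runs generation and returns the produced token ids. Reporting the best of several runs is one possible choice; a mean or median would be equally defensible.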

llama2-7b / open_llama_7b

All values are in tokens/sec.

| dtype | SOTA | 2.3.0+eager | 2.3.0+compile | trunk + compile |
|---|---|---|---|---|
| float16 (LinX86) | | | | |
| float32 (LinX86) | 3 | 4 (5?) | 4 | |
| bfloat16 (LinX86) | 5 | 3 (6?) | 6 | |
| float16 (M2Pro) | 8 | 1.5 | 0.9 | 1.5 |
| float16 (M2Pro+MPS) | 13 | 9.7 | | |
| float32 (M2Pro) | | | | |

malfet/pytorch-perf.md
