dtype | SOTA | 2.2.2+eager | 2.3.0+eager | 2.3.0+compile | trunk + compile |
---|---|---|---|---|---|
bfloat16 (M1) | 111 tokens/sec | 110 tokens/sec | 80 tokens/sec | ||
float32 (M1) | 687 tokens/sec | 165 tokens/sec | 176 tokens/sec | ||
float16 (M1) | 1106 tokens/sec | 50 tokens/sec | 187 tokens/sec | ||
float16 (LinX86) | 40 tokens/sec | 43 tokens/sec | 173 tokens/sec | ||
float32 (LinX86) | 38 tokens/sec | 40 tokens/sec | 179 tokens/sec | ||
bfloat16 (LinX86) | 73 tokens/sec | 78 tokens/sec | 180 tokens/sec | ||
bfloat16 (M2Pro) | 137 tokens/sec | 147 tokens/sec | 116 tokens/sec | 228 tokens/sec | |
float32 (M2Pro) | 947 tokens/sec | 176 tokens/sec | 301 tokens/sec | 121 tokens/sec | 460 tokens/sec |
float16 (M2Pro) | 1330 tokens/sec | 56 tokens/sec | 330 tokens/sec | 116 tokens/sec | 420 tokens/sec |

Eager numbers are collected by running the following script (gpt-fast gives a slightly higher number on eager because it preallocates KVCache even if it is longer than the model's context length):
python run_llama.py --model-path stories15M.pt --random-seed 42 --dtype float16
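
The table covers three dtypes, while the command above shows only `float16`. Below is a minimal sketch of building the same command for each dtype. It assumes `run_llama.py` accepts `bfloat16` and `float32` for `--dtype` in the same way, which the command above does not confirm. `build_eager_commands` and `run_all` are hypothetical helpers, not part of either repository.

```python
import subprocess
import sys

# dtype names are taken from the table rows. Only float16 appears in the
# command above, so the other two values are an assumption about --dtype.
DTYPES = ["bfloat16", "float32", "float16"]


def build_eager_commands(model_path="stories15M.pt", seed=42, dtypes=DTYPES):
    """Return one argv list per dtype, mirroring the eager command above."""
    return [
        [sys.executable, "run_llama.py",
         "--model-path", model_path,
         "--random-seed", str(seed),
         "--dtype", dtype]
        for dtype in dtypes
    ]


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for argv in commands:
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_all(build_eager_commands())` from the directory that contains `run_llama.py` and the checkpoint would run the three configurations back to back; the helper itself does no timing.
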
Compile numbers are collected by running the following script:
python generate.py --checkpoint_path checkpoints/stories15M/stories15M.pt --prompt "Once upon a time" --compile --device cpu
P.S. Compile does not work out of the box on Mac with 2.3.0 right now; one needs to symlink libiomp into the right location.

LinX86 is Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz.

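
To make the eager regressions and improvements in the first table easier to compare, the sketch below computes the 2.3.0+eager over 2.2.2+eager ratio for a few rows. The numbers are copied from that table, reading the columns in the order given by its header; `EAGER` and `speedup` are names introduced here for illustration only.

```python
# Tokens/sec copied from the first table: (2.2.2+eager, 2.3.0+eager).
EAGER = {
    "float32 (M1)": (165, 176),
    "float16 (M1)": (50, 187),
    "float32 (M2Pro)": (176, 301),
    "float16 (M2Pro)": (56, 330),
}


def speedup(old, new):
    """Ratio of new to old throughput; values above 1.0 mean 2.3.0 is faster."""
    return new / old


for name, (old, new) in EAGER.items():
    print(f"{name}: {speedup(old, new):.2f}x")
```

With these inputs the float16 rows show the largest change (3.74x on M1, 5.89x on M2Pro), while float32 on M1 is close to flat at 1.07x.
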
dtype | SOTA | 2.2.2+eager | 2.3.0+eager | 2.3.0+compile | trunk | trunk + compile |
---|---|---|---|---|---|---|
float16 (LinX86) | ||||||
float32 (LinX86) | 3 tokens/sec | 4(5?) tokens/sec | 4 tokens/sec | |||
bfloat16 (LinX86) | 5 tokens/sec | 3(6?) tokens/sec | 6 tokens/sec | |||
float16 (M2Pro) | 8 tokens/sec | 1.5 tokens/sec | 0.9 tokens/sec | 1.5 tokens/sec | ||
float16 (M2Pro+MPS) | 13 tokens/sec | 9.7 tokens/sec | ||||
float32 (M2Pro) |