All speeds captured on Apple M4 Max, 128GB Ram, using LM Studio
184.45 tok/sec, 8718 tokens, 0.04s to first token
Config: 10,000 token context limit, flash attention, evaluation batch size 4096
63.15 tok/sec, 2585 tokens, 0.14s to first token
Config: 10,000 token context limit, flash attention, evaluation batch size 4096
https://screen.studio/share/yVIhJeDz
70.15 tok/sec, 3142 tokens, 0.62s to first token
Config: 10,000 token context limit, flash attention, evaluation batch size 4096
https://screen.studio/share/DUHG3k2J
11.72 tok/sec, 5554 tokens, 0.81s to first token
Config: 10,000 token context limit, flash attention, evaluation batch size 4096
8.08 tok/sec, 8731 tokens, 1.87s to first token
Config: 10,000 token context limit, flash attention, evaluation batch size 4096
14.49 tok/sec, 1594 tokens, 1.71s to first token
Config params: Evaluation Batch Size 4096, Flash attention on, no speculative decoding, context length 131072
Labeled Config 1 in Uploads
12.99 tok/sec, 1733 tokens, 0.79s to first token
Config Params: Evaluation batch size 4096, flash attention off, Context Size 10000 Labeled Config 2 in Uploads
100.54 tok/sec, 1803 tokens, 0.14s to first token
Config params, Evaluation bathc size 4096, flash attention off, context size 4096
https://screen.studio/share/P85E0Scw
49.67 tok/sec, 1074 tokens, 0.26s to first token
Config: evaluation batch size 4096, flash attention off, Context 10k