./main -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -n 500 --ignore-eos -f prompts/chat-dishes.txt
./main -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -t 3 -n 500 --ignore-eos -f prompts/chat-dishes.txt
The first command uses the default thread count (4 on this machine); the second pins 3 threads with -t 3.

No BLAS, 4 threads:
llama_print_timings: load time = 459.67 ms
llama_print_timings: sample time = 251.73 ms / 500 runs ( 0.50 ms per token, 1986.24 tokens per second)
llama_print_timings: prompt eval time = 10175.15 ms / 68 tokens ( 149.63 ms per token, 6.68 tokens per second)
llama_print_timings: eval time = 133404.92 ms / 499 runs ( 267.34 ms per token, 3.74 tokens per second)
llama_print_timings: total time = 144601.38 ms
No BLAS, 3 threads:
llama_print_timings: load time = 523.47 ms
llama_print_timings: sample time = 246.28 ms / 500 runs ( 0.49 ms per token, 2030.22 tokens per second)
llama_print_timings: prompt eval time = 12365.74 ms / 68 tokens ( 181.85 ms per token, 5.50 tokens per second)
llama_print_timings: eval time = 117291.59 ms / 499 runs ( 235.05 ms per token, 4.25 tokens per second)
llama_print_timings: total time = 130545.96 ms
BLAS, 4 threads:
llama_print_timings: load time = 541.36 ms
llama_print_timings: sample time = 257.49 ms / 500 runs ( 0.51 ms per token, 1941.80 tokens per second)
llama_print_timings: prompt eval time = 16855.31 ms / 68 tokens ( 247.87 ms per token, 4.03 tokens per second)
llama_print_timings: eval time = 132333.06 ms / 499 runs ( 265.20 ms per token, 3.77 tokens per second)
llama_print_timings: total time = 150086.40 ms
BLAS, 3 threads:
llama_print_timings: load time = 508.27 ms
llama_print_timings: sample time = 247.93 ms / 500 runs ( 0.50 ms per token, 2016.73 tokens per second)
llama_print_timings: prompt eval time = 16314.19 ms / 68 tokens ( 239.91 ms per token, 4.17 tokens per second)
llama_print_timings: eval time = 117396.94 ms / 499 runs ( 235.26 ms per token, 4.25 tokens per second)
llama_print_timings: total time = 134640.20 ms
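On this machine OpenBLAS actually hurts: prompt eval drops from 149.63 to 247.87 ms per token at 4 threads, while generation speed is essentially unchanged (BLAS only kicks in for batched prompt processing, not single-token eval), so the plain build wins. For reference, the two binaries differ only in how they were built; a minimal sketch, assuming the Makefile build of this llama.cpp vintage and a Debian-style system (the package name is an assumption):

sudo apt install libopenblas-dev      # assumption: Debian/Ubuntu package name
make clean && make                    # plain CPU build, no BLAS
make clean && make LLAMA_OPENBLAS=1   # rebuild with OpenBLAS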
Prompt lookup decoding, draft size 2:
./lookup -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -t 3 -n 128 --ignore-eos -f prompts/chat-summary.txt --draft 2 --color
decoded 129 tokens in 33.575 seconds, speed: 3.842 t/s

Draft size 3:
./lookup -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -t 3 -n 128 --ignore-eos -f prompts/chat-summary.txt --draft 3 --color
decoded 129 tokens in 36.597 seconds, speed: 3.525 t/s
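A draft size of 2 beats 3 here (3.842 vs 3.525 t/s), so a larger draft is not automatically better. A quick sweep finds the sweet spot; a sketch reusing only the flags shown above (the 1-4 range is arbitrary):

for d in 1 2 3 4; do
    ./lookup -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -t 3 -n 128 --ignore-eos -f prompts/chat-summary.txt --draft $d
done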
Quantization comparison, same prompt and 500-token generation:

Q4_K_M:
llama_print_timings: load time = 432.79 ms
llama_print_timings: sample time = 242.93 ms / 500 runs ( 0.49 ms per token, 2058.18 tokens per second)
llama_print_timings: prompt eval time = 15477.78 ms / 68 tokens ( 227.61 ms per token, 4.39 tokens per second)
llama_print_timings: eval time = 104818.23 ms / 499 runs ( 210.06 ms per token, 4.76 tokens per second)
llama_print_timings: total time = 120907.47 ms
Q4_K_S:
llama_print_timings: load time = 417.70 ms
llama_print_timings: sample time = 240.47 ms / 500 runs ( 0.48 ms per token, 2079.24 tokens per second)
llama_print_timings: prompt eval time = 14911.31 ms / 68 tokens ( 219.28 ms per token, 4.56 tokens per second)
llama_print_timings: eval time = 101519.09 ms / 499 runs ( 203.45 ms per token, 4.92 tokens per second)
llama_print_timings: total time = 117041.52 ms
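Q4_K_S generates fastest here (4.92 t/s eval, vs 4.76 for Q4_K_M and 4.25 for the 3-thread Q5_K_M run above, assuming the Q4 runs used the same setup). Extra quants can be produced locally with the quantize tool; a sketch, assuming an f16 source GGUF is on hand (the .f16 filename is an assumption):

./quantize models/tinyllama-1.1b-chat-v1.0.f16.gguf models/tinyllama-1.1b-chat-v1.0.Q4_K_S.gguf Q4_K_S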
Launch the inference server with:
./server -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -t 3
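By default the server binds 127.0.0.1:8080 and exposes a POST /completion endpoint; a minimal smoke test (the prompt text is just an example):

curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "List three easy dinner dishes.", "n_predict": 64}'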