Benchmarking llama.cpp inference in local and distributed (RPC) mode, for https://habr.com/ru/articles/843372/

Local testing

Local launch command:

time ./llama-cli -m /app/models/TinyLlama-1.1B-q4_0.gguf --prompt "Once upon a time" --n-predict 1024
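For context, llama-cli here is assumed to be a CUDA-enabled build of llama.cpp. A minimal build sketch, assuming the standard upstream CMake flow and an installed CUDA toolkit:

# Build llama.cpp with CUDA support (standard upstream CMake options;
# exact flags may differ depending on the llama.cpp revision used)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j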

Testing is performed with the TinyLlama-1.1B-q4_0.gguf model.
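For reference, a q4_0 GGUF like this one can be produced from a full-precision GGUF with llama.cpp's llama-quantize tool; the input file name below is a hypothetical placeholder, not the file actually used:

# Quantize an f16 GGUF down to q4_0 (input path is a hypothetical example)
./llama-quantize /app/models/TinyLlama-1.1B-f16.gguf /app/models/TinyLlama-1.1B-q4_0.gguf Q4_0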

1x RTX 4090

llama_perf_sampler_print:    sampling time =      13.13 ms /   521 runs   (    0.03 ms per token, 39680.12 tokens per second)
llama_perf_context_print:        load time =     181.94 ms
llama_perf_context_print: prompt eval time =      25.02 ms /     9 tokens (    2.78 ms per token,   359.78 tokens per second)
llama_perf_context_print:        eval time =    9465.53 ms /   511 runs   (   18.52 ms per token,    53.99 tokens per second)
llama_perf_context_print:       total time =    9520.78 ms /   520 tokens
Log end

real    0m9.784s
user    2m32.726s
sys     0m0.296s

1x RTX 3050

llama_perf_sampler_print:    sampling time =      23.53 ms /   521 runs   (    0.05 ms per token, 22143.83 tokens per second)
llama_perf_context_print:        load time =     355.73 ms
llama_perf_context_print: prompt eval time =      93.66 ms /     9 tokens (   10.41 ms per token,    96.09 tokens per second)
llama_perf_context_print:        eval time =   10937.60 ms /   511 runs   (   21.40 ms per token,    46.72 tokens per second)
llama_perf_context_print:       total time =   11115.95 ms /   520 tokens
Log end

real    0m11.600s
user    3m42.674s
sys     0m0.625s

Distributed testing (RPC)

Distributed (RPC) launch command:

time ./llama-cli -m /app/models/TinyLlama-1.1B-q4_0.gguf --prompt "Once upon a time" --n-predict 1024 --rpc gpu01:50252,gpu02:50252 -ngl 99

Testing is performed with the TinyLlama-1.1B-q4_0.gguf model.

The command is run from a third server on the same network segment as gpu01 and gpu02. Each GPU host must already be running an RPC worker on the port given in --rpc; see the sketch below.
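A minimal sketch of starting those workers, assuming llama.cpp was built with -DGGML_RPC=ON and that the rpc-server binary's -H/-p flags match the revision used here:

# On gpu01 and gpu02: expose the local GPU to remote llama-cli instances
# (bind address and port are assumptions matching the --rpc endpoints above)
./rpc-server -H 0.0.0.0 -p 50252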

1x RTX 3050 + 1x RTX 4090

llama_perf_sampler_print:    sampling time =      68.27 ms /   535 runs   (    0.13 ms per token,  7836.42 tokens per second)
llama_perf_context_print:        load time =   26389.93 ms
llama_perf_context_print: prompt eval time =      21.52 ms /     5 tokens (    4.30 ms per token,   232.37 tokens per second)
llama_perf_context_print:        eval time =    9244.89 ms /   529 runs   (   17.48 ms per token,    57.22 tokens per second)
llama_perf_context_print:       total time =    9451.38 ms /   534 tokens
Log end

real    0m36.024s
user    0m1.772s
sys     0m2.097s