ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/home/chi/src/iree-build/tools/iree-benchmark-module \
--hip_use_streams=true \
--module=f8_.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=1x32xi64=@/sharedfile/prefill/prefill_token_ids_1_32.bin \
--input=1xi64=@/sharedfile/prefill/prefill_seq_lens_1.bin \
--input=1x1xi64=@/sharedfile/prefill/prefill_seq_block_ids_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/prefill/prefill_cache_state_128_2097152.bin
2025-02-05T10:16:24-08:00
Running /home/chi/src/iree-build/tools/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x96)
L1 Instruction 32 KiB (x96)
L2 Unified 1024 KiB (x96)
L3 Unified 32768 KiB (x16)
Load Average: 8.62, 17.49, 16.62
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
BM_prefill_bs1/process_time/real_time      24.4 ms         45.8 ms           30 items_per_second=40.9946/s
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/home/chi/src/iree-build/tools/iree-benchmark-module \
--hip_use_streams=true \
--module=f8_.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=decode_bs1 \
--input=1x1xi64=@/sharedfile/decode/decode_next_tokens_1_1.bin \
--input=1xi64=@/sharedfile/decode/decode_seq_lens_1.bin \
--input=1xi64=@/sharedfile/decode/decode_start_positions_1.bin \
--input=1x1xi64=@/sharedfile/decode/decode_seq_block_ids_tensor_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/decode/decode_cache_state_128_2097152.bin \
--benchmark_repetitions=3
2025-02-05T10:13:54-08:00
Running /home/chi/src/iree-build/tools/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x96)
L1 Instruction 32 KiB (x96)
L2 Unified 1024 KiB (x96)
L3 Unified 32768 KiB (x16)
Load Average: 14.40, 22.54, 17.78
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------
BM_decode_bs1/process_time/real_time             22.9 ms         41.1 ms           33 items_per_second=43.7575/s
BM_decode_bs1/process_time/real_time             22.9 ms         41.0 ms           33 items_per_second=43.5971/s
BM_decode_bs1/process_time/real_time             22.6 ms         40.6 ms           33 items_per_second=44.2047/s
BM_decode_bs1/process_time/real_time_mean        22.8 ms         40.9 ms            3 items_per_second=43.8531/s
BM_decode_bs1/process_time/real_time_median      22.9 ms         41.0 ms            3 items_per_second=43.7575/s
BM_decode_bs1/process_time/real_time_stddev     0.163 ms        0.237 ms            3 items_per_second=0.314846/s
BM_decode_bs1/process_time/real_time_cv           0.72 %          0.58 %            3 items_per_second=0.72%
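
The aggregate rows for decode_bs1 (mean, median, stddev, cv) can be cross-checked from the three per-repetition items_per_second values. The sketch below is not part of the benchmark tooling; it only recomputes the statistics with Python's standard library. It assumes the reported stddev is the sample standard deviation (divisor n - 1), which agrees with the logged value to about four decimal places given that the inputs are already rounded.

```python
import statistics

# items_per_second values copied from the three decode_bs1 repetitions in the log.
items_per_second = [43.7575, 43.5971, 44.2047]

mean = statistics.mean(items_per_second)
median = statistics.median(items_per_second)
# Assumption: sample standard deviation (divides by n - 1), not population.
stddev = statistics.stdev(items_per_second)
# Coefficient of variation, expressed as a percentage of the mean.
cv_percent = 100.0 * stddev / mean

print(f"mean={mean:.4f}/s median={median:.4f}/s "
      f"stddev={stddev:.4f}/s cv={cv_percent:.2f}%")
```

With only three repetitions the stddev and cv are rough estimates, and the CPU scaling and DEBUG build warnings in the log apply to these numbers as well.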