@AmosLewis
Last active February 5, 2025 18:17

prefill_bs1

ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/home/chi/src/iree-build/tools/iree-benchmark-module \
--hip_use_streams=true \
--module=f8_.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=prefill_bs1 \
--input=1x32xi64=@/sharedfile/prefill/prefill_token_ids_1_32.bin \
--input=1xi64=@/sharedfile/prefill/prefill_seq_lens_1.bin \
--input=1x1xi64=@/sharedfile/prefill/prefill_seq_block_ids_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/prefill/prefill_cache_state_128_2097152.bin
2025-02-05T10:16:24-08:00
Running /home/chi/src/iree-build/tools/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 8.62, 17.49, 16.62
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
BM_prefill_bs1/process_time/real_time       24.4 ms         45.8 ms           30 items_per_second=40.9946/s
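
To compare runs like the one above without reading the table by eye, the result row can be parsed into numbers. The following is a minimal sketch, not part of the original gist: the regular expression, `UNIT_TO_MS`, and `parse_row` are hypothetical helpers written against the console layout shown here, and they assume one whitespace-separated name column followed by time, CPU, iterations, and an optional `items_per_second` counter.

```python
import re

# Hypothetical pattern for one result row of the console table above.
ROW = re.compile(
    r"^(?P<name>\S+)\s+"
    r"(?P<time>[\d.]+) (?P<time_unit>ns|us|ms|s)\s+"
    r"(?P<cpu>[\d.]+) (?P<cpu_unit>ns|us|ms|s)\s+"
    r"(?P<iters>\d+)"
    r"(?:\s+items_per_second=(?P<ips>[\d.]+)/s)?"
)

# Conversion factors from the printed unit to milliseconds.
UNIT_TO_MS = {"ns": 1e-6, "us": 1e-3, "ms": 1.0, "s": 1e3}


def parse_row(line):
    """Return a dict for a result row, or None for headers and other lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return {
        "name": m["name"],
        "real_ms": float(m["time"]) * UNIT_TO_MS[m["time_unit"]],
        "cpu_ms": float(m["cpu"]) * UNIT_TO_MS[m["cpu_unit"]],
        "iterations": int(m["iters"]),
        "items_per_second": float(m["ips"]) if m["ips"] else None,
    }


row = parse_row(
    "BM_prefill_bs1/process_time/real_time       24.4 ms         45.8 ms"
    "           30 items_per_second=40.9946/s"
)

# With one item per invocation, 1000 / items_per_second should land close
# to the rounded real time printed in the Time column.
implied_ms = 1000.0 / row["items_per_second"]
print(row["name"], row["real_ms"], round(implied_ms, 2))
```

For this row the implied latency agrees with the printed 24.4 ms to within rounding, which suggests the counter here is simply invocations per second of wall-clock time. Both warnings in the log (CPU scaling enabled, DEBUG build) still apply to any number derived this way.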

decode_bs1

ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/home/chi/src/iree-build/tools/iree-benchmark-module \
--hip_use_streams=true \
--module=f8_.vmfb \
--parameters=model=fp8.irpa \
--device=hip://4 \
--function=decode_bs1 \
--input=1x1xi64=@/sharedfile/decode/decode_next_tokens_1_1.bin \
--input=1xi64=@/sharedfile/decode/decode_seq_lens_1.bin \
--input=1xi64=@/sharedfile/decode/decode_start_positions_1.bin \
--input=1x1xi64=@/sharedfile/decode/decode_seq_block_ids_tensor_1_1.bin \
--input=128x2097152xf8E4M3FNUZ=@/sharedfile/decode/decode_cache_state_128_2097152.bin \
--benchmark_repetitions=3
2025-02-05T10:13:54-08:00
Running /home/chi/src/iree-build/tools/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 14.40, 22.54, 17.78
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------
BM_decode_bs1/process_time/real_time              22.9 ms         41.1 ms           33 items_per_second=43.7575/s
BM_decode_bs1/process_time/real_time              22.9 ms         41.0 ms           33 items_per_second=43.5971/s
BM_decode_bs1/process_time/real_time              22.6 ms         40.6 ms           33 items_per_second=44.2047/s
BM_decode_bs1/process_time/real_time_mean         22.8 ms         40.9 ms            3 items_per_second=43.8531/s
BM_decode_bs1/process_time/real_time_median       22.9 ms         41.0 ms            3 items_per_second=43.7575/s
BM_decode_bs1/process_time/real_time_stddev      0.163 ms        0.237 ms            3 items_per_second=0.314846/s
BM_decode_bs1/process_time/real_time_cv           0.72 %          0.58 %             3 items_per_second=0.72%
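
The aggregate rows printed for `--benchmark_repetitions=3` can be cross-checked from the three per-repetition `items_per_second` values. This is a small sketch using only the numbers in the table above; it assumes the `_stddev` row is the sample standard deviation (n - 1 in the denominator), which is what reproduces the printed figure here.

```python
import statistics

# items_per_second for the three repetitions, copied from the table above.
items_per_second = [43.7575, 43.5971, 44.2047]

mean = statistics.mean(items_per_second)
median = statistics.median(items_per_second)
stddev = statistics.stdev(items_per_second)  # sample stddev (n - 1), assumed
cv = stddev / mean                           # coefficient of variation

print(f"mean   = {mean:.4f}/s")
print(f"median = {median:.4f}/s")
print(f"stddev = {stddev:.4f}/s")
print(f"cv     = {cv * 100:.2f} %")
```

The results agree with the `_mean`, `_median`, `_stddev`, and `_cv` rows up to the rounding of the inputs. A cv below 1 % means the three repetitions are consistent with each other, although the DEBUG-build and CPU-scaling warnings still limit how far the absolute timings can be trusted.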