@AmosLewis
Last active March 4, 2025 17:35
/home/chi/src/iree-build-trace/tools/iree-compile \
/sharedfile/attn/128/fp8_attn.mlir \
--iree-hip-target=gfx942 \
-o=/sharedfile/attn/128/fp8_attn.vmfb \
--iree-hal-target-device=hip \
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
--iree-hal-indirect-command-buffers=true \
--iree-stream-resource-memory-model=discrete \
--iree-hal-memoization=true \
--iree-opt-strip-assertions
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
iree-benchmark-module \
--hip_use_streams=true \
--module=/sharedfile/attn/128/fp8_attn.vmfb \
--parameters=model=/sharedfile/attn/fp8_attn.irpa \
--device=hip://4 \
--function=prefill_bs4 \
--input=4x128xi64=@/sharedfile/128/prefill/prefill_token_ids_4x128xi64.bin \
--input=4xi64=@/sharedfile/128/prefill/prefill_seq_lens_4xi64.bin \
--input=4x4xi64=@/sharedfile/128/prefill/prefill_seq_block_ids_4x4xi64.bin \
--input=261x2097152xf8E4M3FNUZ=@/sharedfile/128/prefill/prefill_cache_state_261x2097152xf8E4M3FNUZ.bin \
--benchmark_repetitions=3
# [Codegen] Block dyn dims of parallel linalg.generic ops #20091
# +
# BLOCK +
# commit 747c06e68160562e7190ce1c7bf5fe774414b35e (HEAD)
# Author: Ian Wood <[email protected]>
# Date: Wed Feb 26 10:22:25 2025 -0800
# Don't hoist sequence-like ops (#20106)
# 2025-03-03T11:48:15-08:00
# Running /home/chi/src/iree-build/tools/iree-benchmark-module
# Run on (96 X 3810.79 MHz CPU s)
# CPU Caches:
# L1 Data 32 KiB (x96)
# L1 Instruction 32 KiB (x96)
# L2 Unified 1024 KiB (x96)
# L3 Unified 32768 KiB (x16)
# Load Average: 14.30, 10.88, 8.43
# ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
# ***WARNING*** Library was built as DEBUG. Timings may be affected.
# -------------------------------------------------------------------------------------------------------
# Benchmark                                           Time         CPU  Iterations  UserCounters...
# -------------------------------------------------------------------------------------------------------
# BM_prefill_bs4/process_time/real_time            98.9 ms      103 ms           7  items_per_second=10.1145/s
# BM_prefill_bs4/process_time/real_time            99.3 ms      104 ms           7  items_per_second=10.0698/s
# BM_prefill_bs4/process_time/real_time            99.5 ms      105 ms           7  items_per_second=10.0548/s
# BM_prefill_bs4/process_time/real_time_mean       99.2 ms      104 ms           3  items_per_second=10.0797/s
# BM_prefill_bs4/process_time/real_time_median     99.3 ms      104 ms           3  items_per_second=10.0698/s
# BM_prefill_bs4/process_time/real_time_stddev    0.305 ms    0.808 ms           3  items_per_second=0.0310233/s
# BM_prefill_bs4/process_time/real_time_cv          0.31 %      0.78 %           3  items_per_second=0.31%
# [Codegen] Block dyn dims of parallel linalg.generic ops #20091
# +
# commit 1aff06df0a70b454fea33278bee00705291cdadc (HEAD)
# Author: Zhuoran Yin <[email protected]>
# Date: Wed Feb 26 09:17:11 2025 -0500
# [codegen][gpu] Adding conv filter layout fhwc to preprocessing pipeline (#19974)
# 2025-03-03T11:56:26-08:00
# Running /home/chi/src/iree-build/tools/iree-benchmark-module
# Run on (96 X 3810.79 MHz CPU s)
# CPU Caches:
# L1 Data 32 KiB (x96)
# L1 Instruction 32 KiB (x96)
# L2 Unified 1024 KiB (x96)
# L3 Unified 32768 KiB (x16)
# Load Average: 4.28, 5.18, 6.54
# ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
# ***WARNING*** Library was built as DEBUG. Timings may be affected.
# -------------------------------------------------------------------------------------------------------
# Benchmark                                           Time         CPU  Iterations  UserCounters...
# -------------------------------------------------------------------------------------------------------
# BM_prefill_bs4/process_time/real_time            34.8 ms     38.1 ms          20  items_per_second=28.7463/s
# BM_prefill_bs4/process_time/real_time            34.9 ms     38.0 ms          20  items_per_second=28.6736/s
# BM_prefill_bs4/process_time/real_time            34.9 ms     37.2 ms          20  items_per_second=28.6206/s
# BM_prefill_bs4/process_time/real_time_mean       34.9 ms     37.8 ms           3  items_per_second=28.6802/s
# BM_prefill_bs4/process_time/real_time_median     34.9 ms     38.0 ms           3  items_per_second=28.6736/s
# BM_prefill_bs4/process_time/real_time_stddev    0.077 ms    0.474 ms           3  items_per_second=0.0630962/s
# BM_prefill_bs4/process_time/real_time_cv          0.22 %      1.26 %           3  items_per_second=0.22%
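# The two runs above differ by roughly a factor of three in mean real_time.
# A minimal sketch of that arithmetic follows, with the mean values copied by
# hand from the logs. The variable names are illustrative and not part of any
# IREE tool. Both runs report a DEBUG build with CPU scaling enabled, so the
# ratio is only a rough indication, and the logs do not say which change
# accounts for the difference.

```shell
# Mean real_time (ms) copied from the real_time_mean rows of the two runs.
mean_747c06e_ms=99.2   # run at commit 747c06e6 (header lists #20091 + BLOCK)
mean_1aff06d_ms=34.9   # run at commit 1aff06df (header lists #20091 only)

# Ratio of the slower run to the faster run, rounded to two decimals.
ratio=$(awk -v a="$mean_747c06e_ms" -v b="$mean_1aff06d_ms" \
  'BEGIN { printf "%.2f", a / b }')

echo "prefill_bs4 mean real_time ratio: ${ratio}x"
```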