Created October 12, 2025 23:44
Triton MLP Benchmark Output - Dense and MoE Performance Results
# MLP Benchmark Output
# Command: python bench_mlp.py
# Date: 2025-10-12

## Summary
Benchmarks completed successfully for:
- Dense MLP (fp8x-mx4w-TP1-EP1)
- MoE with FP8 weights (fp8x-fp8w-TP1-EP1)
- MoE with MX4 weights at TP=1, 2, and 4

The TP=8 configuration failed with "AssertionError: strides must be 16-byte aligned", raised while building the TMA tensor descriptor for the matmul_ogs output. This stride alignment issue is unrelated to the syntax fixes.

## Full Output
=========================================
logs/dense/fp8x-mx4w-TP1-EP1.csv...
=========================================
batch_per_expt: 128 | MS: 4.27 | TFLOPS: 603.9 | TBPS: 1.25
batch_per_expt: 256 | MS: 4.39 | TFLOPS: 1173. | TBPS: 1.29
batch_per_expt: 384 | MS: 4.52 | TFLOPS: 1711. | TBPS: 1.32
batch_per_expt: 512 | MS: 4.72 | TFLOPS: 2183. | TBPS: 1.33
batch_per_expt: 640 | MS: 7.60 | TFLOPS: 1695. | TBPS: 0.87
batch_per_expt: 768 | MS: 7.58 | TFLOPS: 2040. | TBPS: 0.91
batch_per_expt: 896 | MS: 7.71 | TFLOPS: 2340. | TBPS: 0.94
batch_per_expt: 1024 | MS: 7.94 | TFLOPS: 2598. | TBPS: 0.95
batch_per_expt: 1152 | MS: 8.31 | TFLOPS: 2790. | TBPS: 0.95
batch_per_expt: 1280 | MS: 11.03 | TFLOPS: 2336. | TBPS: 0.74
batch_per_expt: 1408 | MS: 11.12 | TFLOPS: 2550. | TBPS: 0.76
batch_per_expt: 1536 | MS: 11.33 | TFLOPS: 2729. | TBPS: 0.78
batch_per_expt: 1664 | MS: 11.93 | TFLOPS: 2807. | TBPS: 0.76
batch_per_expt: 1792 | MS: 12.71 | TFLOPS: 2838. | TBPS: 0.74
batch_per_expt: 1920 | MS: 15.90 | TFLOPS: 2431. | TBPS: 0.61
batch_per_expt: 2048 | MS: 14.67 | TFLOPS: 2811. | TBPS: 0.69
batch_per_expt: 2176 | MS: 14.88 | TFLOPS: 2945. | TBPS: 0.70
batch_per_expt: 2304 | MS: 15.01 | TFLOPS: 3090. | TBPS: 0.71
batch_per_expt: 2432 | MS: 15.12 | TFLOPS: 3239. | TBPS: 0.73
batch_per_expt: 2560 | MS: 17.93 | TFLOPS: 2874. | TBPS: 0.63
batch_per_expt: 2688 | MS: 18.12 | TFLOPS: 2986. | TBPS: 0.64
batch_per_expt: 2816 | MS: 18.06 | TFLOPS: 3139. | TBPS: 0.66
batch_per_expt: 2944 | MS: 18.17 | TFLOPS: 3262. | TBPS: 0.68
batch_per_expt: 3072 | MS: 21.16 | TFLOPS: 2922. | TBPS: 0.59
batch_per_expt: 3200 | MS: 21.14 | TFLOPS: 3048. | TBPS: 0.61
batch_per_expt: 3328 | MS: 21.33 | TFLOPS: 3141. | TBPS: 0.62
batch_per_expt: 3456 | MS: 21.52 | TFLOPS: 3233. | TBPS: 0.63
batch_per_expt: 3584 | MS: 21.49 | TFLOPS: 3358. | TBPS: 0.64
batch_per_expt: 3712 | MS: 24.53 | TFLOPS: 3047. | TBPS: 0.58
batch_per_expt: 3840 | MS: 24.73 | TFLOPS: 3127. | TBPS: 0.59
batch_per_expt: 3968 | MS: 24.78 | TFLOPS: 3224. | TBPS: 0.60
batch_per_expt: 4096 | MS: 24.69 | TFLOPS: 3340. | TBPS: 0.61
batch_per_expt: 4224 | MS: 25.31 | TFLOPS: 3360. | TBPS: 0.61
batch_per_expt: 4352 | MS: 27.74 | TFLOPS: 3158. | TBPS: 0.57
batch_per_expt: 4480 | MS: 28.01 | TFLOPS: 3220. | TBPS: 0.57
batch_per_expt: 4608 | MS: 28.48 | TFLOPS: 3258. | TBPS: 0.57
batch_per_expt: 4736 | MS: 39.55 | TFLOPS: 2411. | TBPS: 0.42
batch_per_expt: 4864 | MS: 28.15 | TFLOPS: 3479. | TBPS: 0.60
batch_per_expt: 4992 | MS: 71.33 | TFLOPS: 1409. | TBPS: 0.24
batch_per_expt: 5120 | MS: 74.75 | TFLOPS: 1379. | TBPS: 0.24
batch_per_expt: 5248 | MS: 71.19 | TFLOPS: 1484. | TBPS: 0.25
batch_per_expt: 5376 | MS: 63.53 | TFLOPS: 1704. | TBPS: 0.29
batch_per_expt: 5504 | MS: 82.03 | TFLOPS: 1351. | TBPS: 0.23
batch_per_expt: 5632 | MS: 82.24 | TFLOPS: 1379. | TBPS: 0.23
batch_per_expt: 5760 | MS: 83.14 | TFLOPS: 1395. | TBPS: 0.23
batch_per_expt: 5888 | MS: 82.93 | TFLOPS: 1429. | TBPS: 0.24
batch_per_expt: 6016 | MS: 78.04 | TFLOPS: 1552. | TBPS: 0.25
batch_per_expt: 6144 | MS: 91.62 | TFLOPS: 1350. | TBPS: 0.22
batch_per_expt: 6272 | MS: 91.08 | TFLOPS: 1386. | TBPS: 0.22
batch_per_expt: 6400 | MS: 89.21 | TFLOPS: 1444. | TBPS: 0.23
batch_per_expt: 6528 | MS: 92.63 | TFLOPS: 1419. | TBPS: 0.23
batch_per_expt: 6656 | MS: 87.88 | TFLOPS: 1525. | TBPS: 0.24
batch_per_expt: 6784 | MS: 69.12 | TFLOPS: 1976. | TBPS: 0.31
batch_per_expt: 6912 | MS: 42.63 | TFLOPS: 3265. | TBPS: 0.52
batch_per_expt: 7040 | MS: 51.83 | TFLOPS: 2735. | TBPS: 0.43
batch_per_expt: 7168 | MS: 100.23 | TFLOPS: 1440. | TBPS: 0.23
batch_per_expt: 7296 | MS: 94.11 | TFLOPS: 1561. | TBPS: 0.24
batch_per_expt: 7424 | MS: 100.39 | TFLOPS: 1489. | TBPS: 0.23
batch_per_expt: 7552 | MS: 98.44 | TFLOPS: 1545. | TBPS: 0.24
batch_per_expt: 7680 | MS: 100.44 | TFLOPS: 1539. | TBPS: 0.24
batch_per_expt: 7808 | MS: 101.64 | TFLOPS: 1547. | TBPS: 0.24
batch_per_expt: 7936 | MS: 105.66 | TFLOPS: 1512. | TBPS: 0.23
batch_per_expt: 8064 | MS: 110.69 | TFLOPS: 1467. | TBPS: 0.22
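
The rows above are easier to inspect once parsed. The following is a minimal sketch, not part of bench_mlp.py: the helper names (`parse_rows`, `peak_tflops`, `latency_jumps`) and the 2x jump threshold are assumptions chosen for illustration. It reads rows in the format printed above, reports the row with the highest TFLOPS, and flags batch sizes where latency at least doubles relative to the previous row. The sample embeds four rows copied from the log.

```python
import re

# Matches one result row as printed in the log, e.g.
# "batch_per_expt: 128 | MS: 4.27 | TFLOPS: 603.9 | TBPS: 1.25"
ROW_RE = re.compile(
    r"batch_per_expt:\s*(\d+)\s*\|\s*MS:\s*([\d.]+)\s*\|"
    r"\s*TFLOPS:\s*([\d.]+)\s*\|\s*TBPS:\s*([\d.]+)"
)


def parse_rows(text):
    """Return (batch, ms, tflops, tbps) tuples for every result row in text."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in ROW_RE.finditer(text)
    ]


def peak_tflops(rows):
    """Return the row with the highest reported TFLOPS."""
    return max(rows, key=lambda r: r[2])


def latency_jumps(rows, factor=2.0):
    """Return batch sizes whose MS is at least `factor` times the previous row's.

    The default factor of 2.0 is an arbitrary illustrative threshold.
    """
    return [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[1] >= factor * prev[1]]


# Four rows copied verbatim from the dense log above.
SAMPLE = """
batch_per_expt: 4608 | MS: 28.48 | TFLOPS: 3258. | TBPS: 0.57
batch_per_expt: 4736 | MS: 39.55 | TFLOPS: 2411. | TBPS: 0.42
batch_per_expt: 4864 | MS: 28.15 | TFLOPS: 3479. | TBPS: 0.60
batch_per_expt: 4992 | MS: 71.33 | TFLOPS: 1409. | TBPS: 0.24
"""

if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    print("peak:", peak_tflops(rows))
    print("latency jumps at:", latency_jumps(rows))
```

On this sample the only flagged row is batch 4992, where latency goes from 28.15 ms to 71.33 ms. The log alone does not say why throughput drops there.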

## Error on TP=8 Configuration
Traceback (most recent call last):
  File "/home/sam/triton/python/triton_kernels/bench/bench_mlp.py", line 159, in <module>
    roofline_mlp(batch_sizes_moe, 5760, 5760, 128, 4, quantized_dtypes[0], quantized_dtypes[1], TP=8, EP=1,
  File "/home/sam/triton/python/triton_kernels/bench/bench_mlp.py", line 98, in roofline_mlp
    csv_path = roofline.compute_roofline(dim1, dim2, n_expts_tot, n_expts_act, x_dtype, w_dtype, TP, EP, # fixed args
  File "/home/sam/triton/python/triton_kernels/triton_kernels/roofline.py", line 73, in compute_roofline
    perf = inject_proxy_and_call(val, args, kwargs)
  File "/home/sam/triton/python/triton_kernels/triton_kernels/roofline.py", line 64, in inject_proxy_and_call
    return bench_fn(*args_list, **kwargs)
  File "/home/sam/triton/python/triton_kernels/bench/bench_mlp.py", line 86, in bench_mlp
    x = matmul_ogs(x, w1, b1, rdata, gather_indx=gather_indx, precision_config=pc1, fused_activation=act)
  File "/home/sam/triton/python/triton_kernels/triton_kernels/matmul_ogs.py", line 602, in matmul_ogs
    y_tensor_or_tma = y_storage.make_tma(y_tma_block_size, y_tma_mode) if y_has_tma else y_storage.data
  File "/home/sam/triton/python/triton_kernels/triton_kernels/tensor.py", line 68, in make_tma
    return create_ragged_descriptor(self.data, block_shape, ragged_dim=ragged_dim)
  File "/home/sam/triton/python/triton/tools/ragged_tma.py", line 45, in create_ragged_descriptor
    return TensorDescriptor(T, tma_shape, tma_stride, box_shape)
  File "<string>", line 8, in __init__
  File "/home/sam/triton/python/triton/tools/tensor_descriptor.py", line 26, in __post_init__
    assert (stride * elem_bytes) % 16 == 0, "strides must be 16-byte aligned"
AssertionError: strides must be 16-byte aligned
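
The failing assertion can be reproduced in isolation. The sketch below mirrors only the condition shown on the last traceback line, `(stride * elem_bytes) % 16 == 0`. Everything else in it is an assumption. The traceback does not print the offending stride or element size, so the values 360 and 720 with 1-byte elements are hypothetical. They are what 5760 / 8 / 2 and 5760 / 8 would give if the TP-sharded width were halved by a gated activation, which is a guess and not something the log confirms. The helper names `is_16_byte_aligned` and `pad_to_alignment` are illustrative and do not exist in Triton.

```python
from math import gcd


def is_16_byte_aligned(stride_elems, elem_bytes):
    """Same condition as the assertion in the traceback, for a single stride."""
    return (stride_elems * elem_bytes) % 16 == 0


def pad_to_alignment(n_elems, elem_bytes, align_bytes=16):
    """Smallest element count >= n_elems whose size in bytes is a multiple of align_bytes.

    Hypothetical helper: padding is one possible workaround, not necessarily
    the fix the library would want.
    """
    # Any multiple of `step` elements occupies a multiple of align_bytes bytes.
    step = align_bytes // gcd(align_bytes, elem_bytes)
    return -(-n_elems // step) * step  # round n_elems up to a multiple of step


if __name__ == "__main__":
    # Hypothetical strides (in elements) with a 1-byte element type.
    for stride in (720, 360):
        print(stride, is_16_byte_aligned(stride, elem_bytes=1))
    print("360 padded:", pad_to_alignment(360, elem_bytes=1))
```

Under these assumed values, a stride of 720 one-byte elements passes the check and 360 does not, which would be consistent with TP=4 succeeding and TP=8 failing. Confirming that requires printing the actual strides passed to `TensorDescriptor`.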