@sshleifer
Created October 12, 2025 23:44
Triton MLP Benchmark Output - Dense and MoE Performance Results
# MLP Benchmark Output
# Command: python bench_mlp.py
# Date: 2025-10-12
## Summary
Successfully ran benchmarks for:
- Dense MLP (fp8x-mx4w-TP1-EP1)
- MoE with FP8 weights (fp8x-fp8w-TP1-EP1)
- MoE with MX4 weights at TP=1,2,4
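The TP=8 run failed with "AssertionError: strides must be 16-byte aligned", raised while creating the ragged TMA descriptor for the matmul_ogs output (unrelated to the syntax fixes).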
## Full Output
=========================================
logs/dense/fp8x-mx4w-TP1-EP1.csv...
=========================================
batch_per_expt: 128 | MS: 4.27 | TFLOPS: 603.9 | TBPS: 1.25
batch_per_expt: 256 | MS: 4.39 | TFLOPS: 1173. | TBPS: 1.29
batch_per_expt: 384 | MS: 4.52 | TFLOPS: 1711. | TBPS: 1.32
batch_per_expt: 512 | MS: 4.72 | TFLOPS: 2183. | TBPS: 1.33
batch_per_expt: 640 | MS: 7.60 | TFLOPS: 1695. | TBPS: 0.87
batch_per_expt: 768 | MS: 7.58 | TFLOPS: 2040. | TBPS: 0.91
batch_per_expt: 896 | MS: 7.71 | TFLOPS: 2340. | TBPS: 0.94
batch_per_expt: 1024 | MS: 7.94 | TFLOPS: 2598. | TBPS: 0.95
batch_per_expt: 1152 | MS: 8.31 | TFLOPS: 2790. | TBPS: 0.95
batch_per_expt: 1280 | MS: 11.03 | TFLOPS: 2336. | TBPS: 0.74
batch_per_expt: 1408 | MS: 11.12 | TFLOPS: 2550. | TBPS: 0.76
batch_per_expt: 1536 | MS: 11.33 | TFLOPS: 2729. | TBPS: 0.78
batch_per_expt: 1664 | MS: 11.93 | TFLOPS: 2807. | TBPS: 0.76
batch_per_expt: 1792 | MS: 12.71 | TFLOPS: 2838. | TBPS: 0.74
batch_per_expt: 1920 | MS: 15.90 | TFLOPS: 2431. | TBPS: 0.61
batch_per_expt: 2048 | MS: 14.67 | TFLOPS: 2811. | TBPS: 0.69
batch_per_expt: 2176 | MS: 14.88 | TFLOPS: 2945. | TBPS: 0.70
batch_per_expt: 2304 | MS: 15.01 | TFLOPS: 3090. | TBPS: 0.71
batch_per_expt: 2432 | MS: 15.12 | TFLOPS: 3239. | TBPS: 0.73
batch_per_expt: 2560 | MS: 17.93 | TFLOPS: 2874. | TBPS: 0.63
batch_per_expt: 2688 | MS: 18.12 | TFLOPS: 2986. | TBPS: 0.64
batch_per_expt: 2816 | MS: 18.06 | TFLOPS: 3139. | TBPS: 0.66
batch_per_expt: 2944 | MS: 18.17 | TFLOPS: 3262. | TBPS: 0.68
batch_per_expt: 3072 | MS: 21.16 | TFLOPS: 2922. | TBPS: 0.59
batch_per_expt: 3200 | MS: 21.14 | TFLOPS: 3048. | TBPS: 0.61
batch_per_expt: 3328 | MS: 21.33 | TFLOPS: 3141. | TBPS: 0.62
batch_per_expt: 3456 | MS: 21.52 | TFLOPS: 3233. | TBPS: 0.63
batch_per_expt: 3584 | MS: 21.49 | TFLOPS: 3358. | TBPS: 0.64
batch_per_expt: 3712 | MS: 24.53 | TFLOPS: 3047. | TBPS: 0.58
batch_per_expt: 3840 | MS: 24.73 | TFLOPS: 3127. | TBPS: 0.59
batch_per_expt: 3968 | MS: 24.78 | TFLOPS: 3224. | TBPS: 0.60
batch_per_expt: 4096 | MS: 24.69 | TFLOPS: 3340. | TBPS: 0.61
batch_per_expt: 4224 | MS: 25.31 | TFLOPS: 3360. | TBPS: 0.61
batch_per_expt: 4352 | MS: 27.74 | TFLOPS: 3158. | TBPS: 0.57
batch_per_expt: 4480 | MS: 28.01 | TFLOPS: 3220. | TBPS: 0.57
batch_per_expt: 4608 | MS: 28.48 | TFLOPS: 3258. | TBPS: 0.57
batch_per_expt: 4736 | MS: 39.55 | TFLOPS: 2411. | TBPS: 0.42
batch_per_expt: 4864 | MS: 28.15 | TFLOPS: 3479. | TBPS: 0.60
batch_per_expt: 4992 | MS: 71.33 | TFLOPS: 1409. | TBPS: 0.24
batch_per_expt: 5120 | MS: 74.75 | TFLOPS: 1379. | TBPS: 0.24
batch_per_expt: 5248 | MS: 71.19 | TFLOPS: 1484. | TBPS: 0.25
batch_per_expt: 5376 | MS: 63.53 | TFLOPS: 1704. | TBPS: 0.29
batch_per_expt: 5504 | MS: 82.03 | TFLOPS: 1351. | TBPS: 0.23
batch_per_expt: 5632 | MS: 82.24 | TFLOPS: 1379. | TBPS: 0.23
batch_per_expt: 5760 | MS: 83.14 | TFLOPS: 1395. | TBPS: 0.23
batch_per_expt: 5888 | MS: 82.93 | TFLOPS: 1429. | TBPS: 0.24
batch_per_expt: 6016 | MS: 78.04 | TFLOPS: 1552. | TBPS: 0.25
batch_per_expt: 6144 | MS: 91.62 | TFLOPS: 1350. | TBPS: 0.22
batch_per_expt: 6272 | MS: 91.08 | TFLOPS: 1386. | TBPS: 0.22
batch_per_expt: 6400 | MS: 89.21 | TFLOPS: 1444. | TBPS: 0.23
batch_per_expt: 6528 | MS: 92.63 | TFLOPS: 1419. | TBPS: 0.23
batch_per_expt: 6656 | MS: 87.88 | TFLOPS: 1525. | TBPS: 0.24
batch_per_expt: 6784 | MS: 69.12 | TFLOPS: 1976. | TBPS: 0.31
batch_per_expt: 6912 | MS: 42.63 | TFLOPS: 3265. | TBPS: 0.52
batch_per_expt: 7040 | MS: 51.83 | TFLOPS: 2735. | TBPS: 0.43
batch_per_expt: 7168 | MS: 100.23 | TFLOPS: 1440. | TBPS: 0.23
batch_per_expt: 7296 | MS: 94.11 | TFLOPS: 1561. | TBPS: 0.24
batch_per_expt: 7424 | MS: 100.39 | TFLOPS: 1489. | TBPS: 0.23
batch_per_expt: 7552 | MS: 98.44 | TFLOPS: 1545. | TBPS: 0.24
batch_per_expt: 7680 | MS: 100.44 | TFLOPS: 1539. | TBPS: 0.24
batch_per_expt: 7808 | MS: 101.64 | TFLOPS: 1547. | TBPS: 0.24
batch_per_expt: 7936 | MS: 105.66 | TFLOPS: 1512. | TBPS: 0.23
batch_per_expt: 8064 | MS: 110.69 | TFLOPS: 1467. | TBPS: 0.22
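The rows above are easier to compare once parsed. The following is a minimal sketch, not part of bench_mlp.py: the names `ROW_RE`, `parse_rows`, and `latency_jumps` are hypothetical helpers, and the 1.5x threshold is an arbitrary choice. It parses lines in the format shown above and flags consecutive rows where latency jumps sharply, using three rows copied from the output as a sample.

```python
import re

# Matches rows such as:
#   batch_per_expt: 128 | MS: 4.27 | TFLOPS: 603.9 | TBPS: 1.25
# Hypothetical helper pattern, written against the log format shown above.
ROW_RE = re.compile(
    r"batch_per_expt:\s*(\d+)\s*\|\s*MS:\s*([\d.]+)"
    r"\s*\|\s*TFLOPS:\s*([\d.]+)\s*\|\s*TBPS:\s*([\d.]+)"
)


def parse_rows(text):
    """Return one dict per benchmark row found in `text`."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.search(line)
        if m:
            rows.append({
                "batch_per_expt": int(m.group(1)),
                "ms": float(m.group(2)),
                "tflops": float(m.group(3)),  # "3479." parses as 3479.0
                "tbps": float(m.group(4)),
            })
    return rows


def latency_jumps(rows, ratio=1.5):
    """Return (prev_batch, batch) pairs where MS grows by more than `ratio`."""
    jumps = []
    for prev, cur in zip(rows, rows[1:]):
        if cur["ms"] > ratio * prev["ms"]:
            jumps.append((prev["batch_per_expt"], cur["batch_per_expt"]))
    return jumps


# Three rows copied from the dense output above.
sample = """\
batch_per_expt: 4864 | MS: 28.15 | TFLOPS: 3479. | TBPS: 0.60
batch_per_expt: 4992 | MS: 71.33 | TFLOPS: 1409. | TBPS: 0.24
batch_per_expt: 5120 | MS: 74.75 | TFLOPS: 1379. | TBPS: 0.24
"""

rows = parse_rows(sample)
best = max(rows, key=lambda r: r["tflops"])
jumps = latency_jumps(rows)
print(best["batch_per_expt"], jumps)
```

Running the same helpers over the whole log would be one way to locate the latency steps and the throughput drop that starts at batch_per_expt 4992.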
## Error on TP=8 Configuration
Traceback (most recent call last):
  File "/home/sam/triton/python/triton_kernels/bench/bench_mlp.py", line 159, in <module>
    roofline_mlp(batch_sizes_moe, 5760, 5760, 128, 4, quantized_dtypes[0], quantized_dtypes[1], TP=8, EP=1,
  File "/home/sam/triton/python/triton_kernels/bench/bench_mlp.py", line 98, in roofline_mlp
    csv_path = roofline.compute_roofline(dim1, dim2, n_expts_tot, n_expts_act, x_dtype, w_dtype, TP, EP, # fixed args
  File "/home/sam/triton/python/triton_kernels/triton_kernels/roofline.py", line 73, in compute_roofline
    perf = inject_proxy_and_call(val, args, kwargs)
  File "/home/sam/triton/python/triton_kernels/triton_kernels/roofline.py", line 64, in inject_proxy_and_call
    return bench_fn(*args_list, **kwargs)
  File "/home/sam/triton/python/triton_kernels/bench/bench_mlp.py", line 86, in bench_mlp
    x = matmul_ogs(x, w1, b1, rdata, gather_indx=gather_indx, precision_config=pc1, fused_activation=act)
  File "/home/sam/triton/python/triton_kernels/triton_kernels/matmul_ogs.py", line 602, in matmul_ogs
    y_tensor_or_tma = y_storage.make_tma(y_tma_block_size, y_tma_mode) if y_has_tma else y_storage.data
  File "/home/sam/triton/python/triton_kernels/triton_kernels/tensor.py", line 68, in make_tma
    return create_ragged_descriptor(self.data, block_shape, ragged_dim=ragged_dim)
  File "/home/sam/triton/python/triton/tools/ragged_tma.py", line 45, in create_ragged_descriptor
    return TensorDescriptor(T, tma_shape, tma_stride, box_shape)
  File "<string>", line 8, in __init__
  File "/home/sam/triton/python/triton/tools/tensor_descriptor.py", line 26, in __post_init__
    assert (stride * elem_bytes) % 16 == 0, "strides must be 16-byte aligned"
AssertionError: strides must be 16-byte aligned
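
The failing check is the expression shown in the last frame: a stride, measured in elements, times the element size in bytes must be a multiple of 16. The sketch below reproduces that arithmetic in isolation. The helper names `is_tma_aligned` and `padded_stride` are hypothetical and not part of Triton, and the stride values are illustrative only: the output above does not show the strides of the tensor that actually failed.

```python
from math import gcd


def is_tma_aligned(stride, elem_bytes, alignment=16):
    """Same arithmetic as the assertion in the traceback above."""
    return (stride * elem_bytes) % alignment == 0


def padded_stride(stride, elem_bytes, alignment=16):
    """Smallest stride >= `stride` (in elements) whose byte size is aligned.

    Hypothetical helper for illustration; padding the stride is one possible
    workaround, not necessarily how the library would handle it.
    """
    # Number of elements per aligned step, e.g. 16 for 1-byte elements
    # and 8 for 2-byte elements.
    step = alignment // gcd(alignment, elem_bytes)
    return -(-stride // step) * step  # ceiling division, then scale back up


# Illustrative values only (not taken from the failing run).
print(is_tma_aligned(720, elem_bytes=1))  # 720 bytes is a multiple of 16
print(is_tma_aligned(360, elem_bytes=1))  # 360 bytes is not a multiple of 16
print(padded_stride(360, elem_bytes=1))   # next aligned stride for 1-byte elements
```

Printing the strides and element size of the tensor passed to make_tma for the TP=8 case, and running them through a check like this one, would show which dimension breaks the alignment requirement.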