@leslie-fang-intel
Created September 26, 2024 05:01
## All shapes
* input tokens 1024; output tokens 128; BS 1
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 10.8958 ms 100.0%
_weight_int8pack_mm 50.9464 ms 21.4%
SingleProcess AUTOTUNE benchmarking takes 1.0826 seconds and 1.8839 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
cpp_packed_gemm_4 24.0196 ms 100.0%
_weight_int8pack_mm 119.4106 ms 20.1%
SingleProcess AUTOTUNE benchmarking takes 1.8292 seconds and 1.8225 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
cpp_packed_gemm_6 24.5435 ms 100.0%
_weight_int8pack_mm 119.4160 ms 20.6%
SingleProcess AUTOTUNE benchmarking takes 1.8293 seconds and 1.8502 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
cpp_packed_gemm_224 77.5536 ms 100.0%
_weight_int8pack_mm 390.3881 ms 19.9%
SingleProcess AUTOTUNE benchmarking takes 4.4929 seconds and 1.8018 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.0626 ms 100.0%
cpp_packed_gemm_225 0.0790 ms 79.3%
SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 1.7496 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.1422 ms 100.0%
cpp_packed_gemm_229 0.1727 ms 82.3%
SingleProcess AUTOTUNE benchmarking takes 0.2500 seconds and 1.6883 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.1180 ms 100.0%
cpp_packed_gemm_231 0.1612 ms 73.2%
SingleProcess AUTOTUNE benchmarking takes 0.2476 seconds and 1.6865 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 32000x4096, 32000)
_weight_int8pack_mm 0.3984 ms 100.0%
cpp_packed_gemm_449 0.4407 ms 90.4%
SingleProcess AUTOTUNE benchmarking takes 0.2574 seconds and 1.6849 seconds precompiling
```
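Reading the blocks: in each `AUTOTUNE` group the percentage column appears to be the fastest choice's time divided by that row's time (for example, 10.8958 / 50.9464 ≈ 21.4%). The following is a minimal sketch, not part of the original run, that parses log text in this format and reports how `cpp_packed_gemm` compares with `_weight_int8pack_mm`. The function name `parse_autotune` and the regular expressions are illustrative assumptions based only on the lines shown here.

```python
import re

# Patterns inferred from the log lines in this gist (an assumption, not an official format).
HEADER = re.compile(r"^AUTOTUNE (\w+)\((.*)\)$")
CHOICE = re.compile(r"^(\S+) ([\d.]+) ms ([\d.]+)%$")


def parse_autotune(log):
    """Return a list of (shapes, {choice_name: time_ms}), one entry per AUTOTUNE block."""
    blocks = []
    for raw in log.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            blocks.append((header.group(2), {}))
            continue
        choice = CHOICE.match(line)
        if choice and blocks:
            # Drop the numeric kernel suffix: cpp_packed_gemm_0 -> cpp_packed_gemm
            name = re.sub(r"_\d+$", "", choice.group(1))
            blocks[-1][1][name] = float(choice.group(2))
    return blocks


sample = """\
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 10.8958 ms 100.0%
_weight_int8pack_mm 50.9464 ms 21.4%
SingleProcess AUTOTUNE benchmarking takes 1.0826 seconds and 1.8839 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.0626 ms 100.0%
cpp_packed_gemm_225 0.0790 ms 79.3%
SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 1.7496 seconds precompiling
"""

for shapes, times in parse_autotune(sample):
    # A ratio above 1 means cpp_packed_gemm was faster than _weight_int8pack_mm.
    ratio = times["_weight_int8pack_mm"] / times["cpp_packed_gemm"]
    print(f"({shapes}): _weight_int8pack_mm / cpp_packed_gemm = {ratio:.2f}")
```

On the two sample blocks, the ratio is above 1 for the large (4096-row) activation and below 1 for the 4-row activation, which matches the ordering of the rows in the log.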
* input tokens 1024; output tokens 128; BS 2
```
AUTOTUNE _weight_int8pack_mm(8192x4096, 4096x4096, 4096)
cpp_packed_gemm_0 22.8734 ms 100.0%
_weight_int8pack_mm 100.1233 ms 22.8%
SingleProcess AUTOTUNE benchmarking takes 1.8709 seconds and 2.0644 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8192x4096, 11008x4096, 11008)
cpp_packed_gemm_4 48.8860 ms 100.0%
_weight_int8pack_mm 251.6516 ms 19.4%
SingleProcess AUTOTUNE benchmarking takes 3.5835 seconds and 1.7815 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8192x11008, 4096x11008, 4096)
cpp_packed_gemm_6 51.5397 ms 100.0%
_weight_int8pack_mm 239.0896 ms 21.6%
SingleProcess AUTOTUNE benchmarking takes 3.4701 seconds and 1.8435 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8192x4096, 32000x4096, 32000)
cpp_packed_gemm_224 179.3097 ms 100.0%
_weight_int8pack_mm 827.3955 ms 21.7%
SingleProcess AUTOTUNE benchmarking takes 8.9030 seconds and 1.7991 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.1092 ms 100.0%
cpp_packed_gemm_225 0.1345 ms 81.2%
SingleProcess AUTOTUNE benchmarking takes 0.2479 seconds and 1.8676 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.2760 ms 100.0%
cpp_packed_gemm_229 0.3062 ms 90.1%
SingleProcess AUTOTUNE benchmarking takes 0.2507 seconds and 1.6980 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.2312 ms 100.0%
cpp_packed_gemm_231 0.2743 ms 84.3%
SingleProcess AUTOTUNE benchmarking takes 0.2497 seconds and 1.6799 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 32000x4096, 32000)
cpp_packed_gemm_449 0.7865 ms 100.0%
_weight_int8pack_mm 0.8150 ms 96.5%
SingleProcess AUTOTUNE benchmarking takes 0.2627 seconds and 1.6732 seconds precompiling
```
* input tokens 2016; output tokens 32; BS 1
```
AUTOTUNE _weight_int8pack_mm(8064x4096, 4096x4096, 4096)
cpp_packed_gemm_0 21.9915 ms 100.0%
_weight_int8pack_mm 99.0277 ms 22.2%
SingleProcess AUTOTUNE benchmarking takes 1.9343 seconds and 2.0786 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8064x4096, 11008x4096, 11008)
cpp_packed_gemm_4 48.0393 ms 100.0%
_weight_int8pack_mm 259.1158 ms 18.5%
SingleProcess AUTOTUNE benchmarking takes 3.5328 seconds and 1.8023 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8064x11008, 4096x11008, 4096)
cpp_packed_gemm_6 47.7776 ms 100.0%
_weight_int8pack_mm 230.3853 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 3.4548 seconds and 1.8313 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8064x4096, 32000x4096, 32000)
cpp_packed_gemm_224 150.7270 ms 100.0%
_weight_int8pack_mm 873.7302 ms 17.3%
SingleProcess AUTOTUNE benchmarking takes 8.8121 seconds and 1.7985 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.0623 ms 100.0%
cpp_packed_gemm_225 0.0801 ms 77.8%
SingleProcess AUTOTUNE benchmarking takes 0.2484 seconds and 1.8296 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.1414 ms 100.0%
cpp_packed_gemm_229 0.1791 ms 79.0%
SingleProcess AUTOTUNE benchmarking takes 0.2494 seconds and 1.7034 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.1433 ms 100.0%
cpp_packed_gemm_231 0.1613 ms 88.9%
SingleProcess AUTOTUNE benchmarking takes 0.2484 seconds and 1.7163 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 32000x4096, 32000)
_weight_int8pack_mm 0.3981 ms 100.0%
cpp_packed_gemm_449 0.4422 ms 90.0%
SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 1.7106 seconds precompiling
```
* input tokens 2016; output tokens 32; BS 2
```
AUTOTUNE _weight_int8pack_mm(16128x4096, 4096x4096, 4096)
cpp_packed_gemm_0 35.8847 ms 100.0%
_weight_int8pack_mm 198.8356 ms 18.0%
SingleProcess AUTOTUNE benchmarking takes 3.5105 seconds and 2.1279 seconds precompiling
AUTOTUNE _weight_int8pack_mm(16128x4096, 11008x4096, 11008)
cpp_packed_gemm_4 95.3577 ms 100.0%
_weight_int8pack_mm 474.8219 ms 20.1%
SingleProcess AUTOTUNE benchmarking takes 6.7609 seconds and 1.7958 seconds precompiling
AUTOTUNE _weight_int8pack_mm(16128x11008, 4096x11008, 4096)
cpp_packed_gemm_6 93.4362 ms 100.0%
_weight_int8pack_mm 468.7547 ms 19.9%
SingleProcess AUTOTUNE benchmarking takes 6.7345 seconds and 1.8260 seconds precompiling
AUTOTUNE _weight_int8pack_mm(16128x4096, 32000x4096, 32000)
cpp_packed_gemm_224 259.3309 ms 100.0%
_weight_int8pack_mm 1634.8804 ms 15.9%
SingleProcess AUTOTUNE benchmarking takes 17.3738 seconds and 1.8103 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.1086 ms 100.0%
cpp_packed_gemm_225 0.1346 ms 80.7%
SingleProcess AUTOTUNE benchmarking takes 0.2480 seconds and 1.7523 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.2766 ms 100.0%
cpp_packed_gemm_229 0.3047 ms 90.8%
SingleProcess AUTOTUNE benchmarking takes 0.2521 seconds and 1.6917 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.2337 ms 100.0%
cpp_packed_gemm_231 0.2738 ms 85.4%
SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 1.6959 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 32000x4096, 32000)
cpp_packed_gemm_449 0.7893 ms 100.0%
_weight_int8pack_mm 0.8397 ms 94.0%
SingleProcess AUTOTUNE benchmarking takes 0.2816 seconds and 1.6791 seconds precompiling
```
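A note on the shapes: the first operand's row count in every block equals batch size × token count × 4, using the input token count for the large shapes and 1 token for the small ones. The gist does not say where the factor of 4 comes from; a beam width of 4 would be one explanation, but that is an assumption. The sketch below only checks this arithmetic against the logged row counts, and the helper name `activation_rows` is hypothetical.

```python
def activation_rows(batch_size, tokens, factor=4):
    """Rows of the first matmul operand.

    factor=4 is inferred from the logged shapes, not stated in the gist
    (it would be consistent with, for example, a beam width of 4).
    """
    return batch_size * tokens * factor


# (batch size, input tokens) for the four sections above.
configs = [(1, 1024), (2, 1024), (1, 2016), (2, 2016)]

for batch_size, input_tokens in configs:
    large = activation_rows(batch_size, input_tokens)  # full-prompt shapes
    small = activation_rows(batch_size, 1)             # one-token-per-step shapes
    print(f"BS {batch_size}, input tokens {input_tokens}: rows {large} and {small}")
```

The computed row counts (4096, 8192, 8064, 16128 for the large shapes; 4 or 8 for the small ones) line up with the `AUTOTUNE` headers in the four sections.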