Created
September 26, 2024 05:01
## All shapes
* input tokens 1024; output tokens 128; BS 1
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 10.8958 ms 100.0%
_weight_int8pack_mm 50.9464 ms 21.4%
SingleProcess AUTOTUNE benchmarking takes 1.0826 seconds and 1.8839 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
cpp_packed_gemm_4 24.0196 ms 100.0%
_weight_int8pack_mm 119.4106 ms 20.1%
SingleProcess AUTOTUNE benchmarking takes 1.8292 seconds and 1.8225 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
cpp_packed_gemm_6 24.5435 ms 100.0%
_weight_int8pack_mm 119.4160 ms 20.6%
SingleProcess AUTOTUNE benchmarking takes 1.8293 seconds and 1.8502 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
cpp_packed_gemm_224 77.5536 ms 100.0%
_weight_int8pack_mm 390.3881 ms 19.9%
SingleProcess AUTOTUNE benchmarking takes 4.4929 seconds and 1.8018 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.0626 ms 100.0%
cpp_packed_gemm_225 0.0790 ms 79.3%
SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 1.7496 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.1422 ms 100.0%
cpp_packed_gemm_229 0.1727 ms 82.3%
SingleProcess AUTOTUNE benchmarking takes 0.2500 seconds and 1.6883 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.1180 ms 100.0%
cpp_packed_gemm_231 0.1612 ms 73.2%
SingleProcess AUTOTUNE benchmarking takes 0.2476 seconds and 1.6865 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 32000x4096, 32000)
_weight_int8pack_mm 0.3984 ms 100.0%
cpp_packed_gemm_449 0.4407 ms 90.4%
SingleProcess AUTOTUNE benchmarking takes 0.2574 seconds and 1.6849 seconds precompiling
```
* input tokens 1024; output tokens 128; BS 2
```
AUTOTUNE _weight_int8pack_mm(8192x4096, 4096x4096, 4096)
cpp_packed_gemm_0 22.8734 ms 100.0%
_weight_int8pack_mm 100.1233 ms 22.8%
SingleProcess AUTOTUNE benchmarking takes 1.8709 seconds and 2.0644 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8192x4096, 11008x4096, 11008)
cpp_packed_gemm_4 48.8860 ms 100.0%
_weight_int8pack_mm 251.6516 ms 19.4%
SingleProcess AUTOTUNE benchmarking takes 3.5835 seconds and 1.7815 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8192x11008, 4096x11008, 4096)
cpp_packed_gemm_6 51.5397 ms 100.0%
_weight_int8pack_mm 239.0896 ms 21.6%
SingleProcess AUTOTUNE benchmarking takes 3.4701 seconds and 1.8435 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8192x4096, 32000x4096, 32000)
cpp_packed_gemm_224 179.3097 ms 100.0%
_weight_int8pack_mm 827.3955 ms 21.7%
SingleProcess AUTOTUNE benchmarking takes 8.9030 seconds and 1.7991 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.1092 ms 100.0%
cpp_packed_gemm_225 0.1345 ms 81.2%
SingleProcess AUTOTUNE benchmarking takes 0.2479 seconds and 1.8676 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.2760 ms 100.0%
cpp_packed_gemm_229 0.3062 ms 90.1%
SingleProcess AUTOTUNE benchmarking takes 0.2507 seconds and 1.6980 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.2312 ms 100.0%
cpp_packed_gemm_231 0.2743 ms 84.3%
SingleProcess AUTOTUNE benchmarking takes 0.2497 seconds and 1.6799 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 32000x4096, 32000)
cpp_packed_gemm_449 0.7865 ms 100.0%
_weight_int8pack_mm 0.8150 ms 96.5%
SingleProcess AUTOTUNE benchmarking takes 0.2627 seconds and 1.6732 seconds precompiling
```
* input tokens 2016; output tokens 32; BS 1
```
AUTOTUNE _weight_int8pack_mm(8064x4096, 4096x4096, 4096)
cpp_packed_gemm_0 21.9915 ms 100.0%
_weight_int8pack_mm 99.0277 ms 22.2%
SingleProcess AUTOTUNE benchmarking takes 1.9343 seconds and 2.0786 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8064x4096, 11008x4096, 11008)
cpp_packed_gemm_4 48.0393 ms 100.0%
_weight_int8pack_mm 259.1158 ms 18.5%
SingleProcess AUTOTUNE benchmarking takes 3.5328 seconds and 1.8023 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8064x11008, 4096x11008, 4096)
cpp_packed_gemm_6 47.7776 ms 100.0%
_weight_int8pack_mm 230.3853 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 3.4548 seconds and 1.8313 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8064x4096, 32000x4096, 32000)
cpp_packed_gemm_224 150.7270 ms 100.0%
_weight_int8pack_mm 873.7302 ms 17.3%
SingleProcess AUTOTUNE benchmarking takes 8.8121 seconds and 1.7985 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.0623 ms 100.0%
cpp_packed_gemm_225 0.0801 ms 77.8%
SingleProcess AUTOTUNE benchmarking takes 0.2484 seconds and 1.8296 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.1414 ms 100.0%
cpp_packed_gemm_229 0.1791 ms 79.0%
SingleProcess AUTOTUNE benchmarking takes 0.2494 seconds and 1.7034 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.1433 ms 100.0%
cpp_packed_gemm_231 0.1613 ms 88.9%
SingleProcess AUTOTUNE benchmarking takes 0.2484 seconds and 1.7163 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 32000x4096, 32000)
_weight_int8pack_mm 0.3981 ms 100.0%
cpp_packed_gemm_449 0.4422 ms 90.0%
SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 1.7106 seconds precompiling
```
* input tokens 2016; output tokens 32; BS 2
```
AUTOTUNE _weight_int8pack_mm(16128x4096, 4096x4096, 4096)
cpp_packed_gemm_0 35.8847 ms 100.0%
_weight_int8pack_mm 198.8356 ms 18.0%
SingleProcess AUTOTUNE benchmarking takes 3.5105 seconds and 2.1279 seconds precompiling
AUTOTUNE _weight_int8pack_mm(16128x4096, 11008x4096, 11008)
cpp_packed_gemm_4 95.3577 ms 100.0%
_weight_int8pack_mm 474.8219 ms 20.1%
SingleProcess AUTOTUNE benchmarking takes 6.7609 seconds and 1.7958 seconds precompiling
AUTOTUNE _weight_int8pack_mm(16128x11008, 4096x11008, 4096)
cpp_packed_gemm_6 93.4362 ms 100.0%
_weight_int8pack_mm 468.7547 ms 19.9%
SingleProcess AUTOTUNE benchmarking takes 6.7345 seconds and 1.8260 seconds precompiling
AUTOTUNE _weight_int8pack_mm(16128x4096, 32000x4096, 32000)
cpp_packed_gemm_224 259.3309 ms 100.0%
_weight_int8pack_mm 1634.8804 ms 15.9%
SingleProcess AUTOTUNE benchmarking takes 17.3738 seconds and 1.8103 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.1086 ms 100.0%
cpp_packed_gemm_225 0.1346 ms 80.7%
SingleProcess AUTOTUNE benchmarking takes 0.2480 seconds and 1.7523 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 11008x4096, 11008)
_weight_int8pack_mm 0.2766 ms 100.0%
cpp_packed_gemm_229 0.3047 ms 90.8%
SingleProcess AUTOTUNE benchmarking takes 0.2521 seconds and 1.6917 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x11008, 4096x11008, 4096)
_weight_int8pack_mm 0.2337 ms 100.0%
cpp_packed_gemm_231 0.2738 ms 85.4%
SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 1.6959 seconds precompiling
AUTOTUNE _weight_int8pack_mm(8x4096, 32000x4096, 32000)
cpp_packed_gemm_449 0.7893 ms 100.0%
_weight_int8pack_mm 0.8397 ms 94.0%
SingleProcess AUTOTUNE benchmarking takes 0.2816 seconds and 1.6791 seconds precompiling
```
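
To read the logs above as speedups, a small parsing sketch may help. In each AUTOTUNE block the percentage column appears to be the fastest candidate's time divided by that row's time (for example 10.8958 / 50.9464 ≈ 21.4%), so the speedup of the C++ template over `_weight_int8pack_mm` is the inverse ratio. The helper names `parse_autotune` and `speedup_over_baseline` are hypothetical, and the regular expressions assume the row layout shown in this gist rather than any guaranteed log format.

```python
import re

# Assumed row layout, taken from the logs in this gist:
#   "AUTOTUNE <op>(<shapes>)" followed by "<kernel> <time> ms <pct>%" rows.
HEADER = re.compile(r"^\s*AUTOTUNE\s+(.+?)\s*$")
ROW = re.compile(r"^\s*(\S+)\s+([\d.]+)\s+ms\s+([\d.]+)%\s*$")


def parse_autotune(log):
    """Return a list of (header, {kernel_name: time_ms}) per AUTOTUNE block."""
    blocks = []
    for line in log.splitlines():
        header = HEADER.match(line)
        if header:
            blocks.append((header.group(1), {}))
            continue
        row = ROW.match(line)
        if row and blocks:
            blocks[-1][1][row.group(1)] = float(row.group(2))
    return blocks


def speedup_over_baseline(timings, baseline="_weight_int8pack_mm"):
    """Baseline time divided by the fastest non-baseline candidate's time."""
    best = min(t for name, t in timings.items() if name != baseline)
    return timings[baseline] / best


# Two blocks copied from the first log above (BS 1, 1024 input tokens).
sample = """\
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 10.8958 ms 100.0%
_weight_int8pack_mm 50.9464 ms 21.4%
SingleProcess AUTOTUNE benchmarking takes 1.0826 seconds and 1.8839 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4x4096, 4096x4096, 4096)
_weight_int8pack_mm 0.0626 ms 100.0%
cpp_packed_gemm_225 0.0790 ms 79.3%
"""

for header, timings in parse_autotune(sample):
    print(f"{header}: {speedup_over_baseline(timings):.2f}x")
```

A value above 1 means the `cpp_packed_gemm_*` candidate is faster than the baseline; a value below 1 means the baseline wins, as in the small-M blocks above.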