Tracing #30806: torch.cat() performance regression.
benchmark_all_test result, command line:
python -m benchmark_all_test --operators cat --tag_filter all
(pytorch-mingfei) [mingfeim@mlt-skx090 operator_benchmark]$ python -m benchmark_all_test --operators cat --tag_filter all
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M1_N1_K1_dim0_cpu
# Input: M: 1, N: 1, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 5.464
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M256_N512_K1_dim0_cpu
# Input: M: 256, N: 512, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 19.216
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 25.436
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K1_dim0_cpu
# Input: M: 128, N: 128, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 8.929
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K1_dim1_cpu
# Input: M: 128, N: 128, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 10.152
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K1_dim2_cpu
# Input: M: 128, N: 128, K: 1, dim: 2, device: cpu
Forward Execution Time (us) : 24.093
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K2_dim0_cpu
# Input: M: 128, N: 128, K: 2, dim: 0, device: cpu
Forward Execution Time (us) : 12.196
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K2_dim1_cpu
# Input: M: 128, N: 128, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 14.418
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K2_dim2_cpu
# Input: M: 128, N: 128, K: 2, dim: 2, device: cpu
Forward Execution Time (us) : 246.521
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K1_dim0_cpu
# Input: M: 128, N: 1024, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 19.673
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K1_dim1_cpu
# Input: M: 128, N: 1024, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 18.370
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K1_dim2_cpu
# Input: M: 128, N: 1024, K: 1, dim: 2, device: cpu
Forward Execution Time (us) : 50.611
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K2_dim0_cpu
# Input: M: 128, N: 1024, K: 2, dim: 0, device: cpu
Forward Execution Time (us) : 21.160
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K2_dim1_cpu
# Input: M: 128, N: 1024, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 22.290
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K2_dim2_cpu
# Input: M: 128, N: 1024, K: 2, dim: 2, device: cpu
Forward Execution Time (us) : 275.948
(pytorch-mingfei) [mingfeim@mlt-skx090 operator_benchmark]$ python -m benchmark_all_test --operators cat --tag_filter all
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M1_N1_K1_dim0_cpu
# Input: M: 1, N: 1, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 3.267
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M256_N512_K1_dim0_cpu
# Input: M: 256, N: 512, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 122.176
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 263.905
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K1_dim0_cpu
# Input: M: 128, N: 128, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 7.064
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K1_dim1_cpu
# Input: M: 128, N: 128, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 10.002
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K1_dim2_cpu
# Input: M: 128, N: 128, K: 1, dim: 2, device: cpu
Forward Execution Time (us) : 555.969
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K2_dim0_cpu
# Input: M: 128, N: 128, K: 2, dim: 0, device: cpu
Forward Execution Time (us) : 11.476
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K2_dim1_cpu
# Input: M: 128, N: 128, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 15.089
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N128_K2_dim2_cpu
# Input: M: 128, N: 128, K: 2, dim: 2, device: cpu
Forward Execution Time (us) : 569.621
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K1_dim0_cpu
# Input: M: 128, N: 1024, K: 1, dim: 0, device: cpu
Forward Execution Time (us) : 97.550
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K1_dim1_cpu
# Input: M: 128, N: 1024, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 65.711
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K1_dim2_cpu
# Input: M: 128, N: 1024, K: 1, dim: 2, device: cpu
Forward Execution Time (us) : 4446.804
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K2_dim0_cpu
# Input: M: 128, N: 1024, K: 2, dim: 0, device: cpu
Forward Execution Time (us) : 272.241
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K2_dim1_cpu
# Input: M: 128, N: 1024, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 126.997
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_M128_N1024_K2_dim2_cpu
# Input: M: 128, N: 1024, K: 2, dim: 2, device: cpu
Forward Execution Time (us) : 4529.495
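To make the gap between the two runs easier to read, the logs can be parsed and compared per benchmark name. The following is a minimal sketch, not part of the benchmark suite: the helper names `parse_benchmark_log` and `compare_runs` are hypothetical, and the embedded samples are two entries copied from each run above. It only assumes the `# Name:` / `Forward Execution Time (us)` line format shown in the output.

```python
import re


def parse_benchmark_log(text):
    """Map benchmark name -> forward execution time in microseconds."""
    results = {}
    name = None
    for line in text.splitlines():
        m = re.match(r"#\s*Name:\s*(\S+)", line)
        if m:
            name = m.group(1)
            continue
        m = re.match(r"Forward Execution Time \(us\)\s*:\s*([\d.]+)", line)
        if m and name is not None:
            results[name] = float(m.group(1))
            name = None
    return results


def compare_runs(first, second):
    """Return (name, first_us, second_us, second/first), largest ratio first."""
    rows = [
        (name, first[name], second[name], second[name] / first[name])
        for name in first
        if name in second
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Two entries copied from each run above (illustrative subset only).
FIRST_RUN = """\
# Name: cat_M128_N1024_K1_dim2_cpu
Forward Execution Time (us) : 50.611
# Name: cat_M128_N128_K1_dim1_cpu
Forward Execution Time (us) : 10.152
"""

SECOND_RUN = """\
# Name: cat_M128_N1024_K1_dim2_cpu
Forward Execution Time (us) : 4446.804
# Name: cat_M128_N128_K1_dim1_cpu
Forward Execution Time (us) : 10.002
"""

rows = compare_runs(parse_benchmark_log(FIRST_RUN), parse_benchmark_log(SECOND_RUN))
for name, first_us, second_us, ratio in rows:
    print(f"{name}: {first_us:.3f} us -> {second_us:.3f} us ({ratio:.1f}x)")
```

Feeding the full text of both runs into `parse_benchmark_log` instead of the samples would list every configuration, sorted so that the largest slowdowns come first.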
- FB test machine: which environment variables (OMP_NUM_THREADS, KMP_AFFINITY) are set, and is it run with or without jemalloc?
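One way to answer the question above consistently across machines is to dump the relevant settings right before running the benchmark. This is a rough sketch with a hypothetical helper name (`threading_env_report`). It assumes jemalloc, if used, is injected through `LD_PRELOAD`; a build that links jemalloc directly would not be detected this way.

```python
import os


def threading_env_report(environ=os.environ):
    """Collect the threading/allocator settings asked about above."""
    keys = ("OMP_NUM_THREADS", "KMP_AFFINITY")
    report = {key: environ.get(key, "<unset>") for key in keys}
    # Assumption: jemalloc is loaded via LD_PRELOAD rather than linked in.
    report["jemalloc_preloaded"] = "jemalloc" in environ.get("LD_PRELOAD", "")
    return report


for key, value in threading_env_report().items():
    print(f"{key} = {value}")
```

Attaching this output to each benchmark log would make runs from different machines directly comparable.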