Example output from the script above (on a GeForce GTX 1080 Ti):
Activating environment at `~/.julia/dev/TensorCast/Project.toml`
T = Array
transpose (2D)
457.101 μs (7 allocations: 240 bytes)
474.047 μs (10 allocations: 78.45 KiB)
permute (3D)
102.921 ms (7 allocations: 256 bytes)
99.391 ms (20 allocations: 7.63 MiB)
adjoint (2D)
427.734 μs (7 allocations: 240 bytes)
482.036 μs (9 allocations: 78.44 KiB)
view (2D)
190.244 μs (6 allocations: 144 bytes)
56.392 μs (10 allocations: 13.58 KiB)
T = CuArray
transpose (2D)
103.190 μs (64 allocations: 3.28 KiB)
143.827 μs (65 allocations: 3.59 KiB)
permute (3D)
7.104 ms (64 allocations: 3.53 KiB)
11.389 ms (99 allocations: 7.28 KiB)
adjoint (2D)
89.853 μs (63 allocations: 3.27 KiB)
145.562 μs (65 allocations: 3.59 KiB)
view (2D)
38.042 μs (86 allocations: 7.20 KiB)
38.880 μs (88 allocations: 7.30 KiB)