Created November 22, 2017 21:00
Gist: zou3519/c34e46352f13034d5ee0a6b8148e5557
Before/after benchmark numbers from changing the CUDA varInnermostDim kernel to use warp-shuffle reductions
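For readers unfamiliar with the technique named above, here is a minimal pure-Python simulation of a warp-shuffle reduction. It is only a sketch of the general shuffle-down pattern, not the varInnermostDim kernel itself. The helper names (`shfl_down`, `warp_reduce_sum`, `warp_var`) are hypothetical, and reducing a sum and a sum of squares to get the variance is an assumption made for illustration; the real kernel may accumulate differently.

```python
WARP_SIZE = 32  # number of lanes in a CUDA warp


def shfl_down(values, offset):
    """Simulate a shuffle-down: lane i reads lane i + offset.

    Lanes whose source would be out of range keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Sum one warp's values with offsets 16, 8, 4, 2, 1.

    After the loop only lane 0 is guaranteed to hold the full sum.
    """
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(lanes, offset)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        offset //= 2
    return lanes[0]


def warp_var(values):
    """Unbiased variance from two warp reductions (illustrative assumption)."""
    n = len(values)
    total = warp_reduce_sum(values)
    total_sq = warp_reduce_sum([v * v for v in values])
    return (total_sq - total * total / n) / (n - 1)


if __name__ == "__main__":
    data = [float(i) for i in range(WARP_SIZE)]
    print(warp_reduce_sum(data), warp_var(data))
```

The point of the pattern is that each of the five steps exchanges values directly between lanes, so a warp can combine its 32 partial results without going through shared memory.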
# Run in IPython or Jupyter: %timeit is an IPython magic, not plain Python.
# torch.cuda.synchronize() makes each timed loop wait for the kernel to finish.
import torch

# 1-D tensors: variance over the only dimension
tensor = torch.randn(100).cuda()
%timeit tensor.var(0); torch.cuda.synchronize()
tensor = torch.randn(10000).cuda()
%timeit tensor.var(0); torch.cuda.synchronize()

# 3-D tensors: variance over the innermost dimension (dim 2)
tensor = torch.randn(1000, 2, 10).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(10000, 2, 10).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(50000, 2, 10).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(2, 2, 2).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(100, 100, 100).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(1000, 10, 1000).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(5, 2, 10000).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(5, 2, 100000).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()

Before changes:
19.8 µs ± 258 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
250 µs ± 456 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
18.8 µs ± 55.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
41.3 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
139 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.5 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
49.8 µs ± 366 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
373 µs ± 502 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
254 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.28 ms ± 728 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

After changes:
19.3 µs ± 288 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
247 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
18.4 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
38.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
120 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16.8 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
47.5 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
373 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
250 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.25 ms ± 396 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
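
To make the two listings easier to compare, the following sketch computes the relative change per case. It assumes the timings appear in the same order as the benchmark statements, and it uses only the reported means, converted to microseconds. The names `CASES`, `percent_faster`, and `summarize` are hypothetical helpers, not part of the original benchmark.

```python
# (shape, reduced dim, before in µs, after in µs), copied from the listings above.
# Assumption: results are listed in the same order as the benchmark statements.
CASES = [
    ((100,), 0, 19.8, 19.3),
    ((10000,), 0, 250.0, 247.0),
    ((1000, 2, 10), 2, 18.8, 18.4),
    ((10000, 2, 10), 2, 41.3, 38.2),
    ((50000, 2, 10), 2, 139.0, 120.0),
    ((2, 2, 2), 2, 17.5, 16.8),
    ((100, 100, 100), 2, 49.8, 47.5),
    ((1000, 10, 1000), 2, 373.0, 373.0),
    ((5, 2, 10000), 2, 254.0, 250.0),
    ((5, 2, 100000), 2, 3280.0, 3250.0),
]


def percent_faster(before_us, after_us):
    """Relative reduction in mean time, as a percentage of the 'before' time."""
    return (before_us - after_us) / before_us * 100.0


def summarize(cases=CASES):
    """Return (shape, dim, percent improvement) for every benchmark case."""
    return [(shape, dim, percent_faster(b, a)) for shape, dim, b, a in cases]


if __name__ == "__main__":
    for shape, dim, pct in summarize():
        print(f"{str(shape):>18} var({dim}): {pct:5.1f}% faster")
```

Under that ordering assumption, the largest gain is for the (50000, 2, 10) tensor (139 µs to 120 µs, roughly 14%), while the (1000, 10, 1000) case is unchanged at 373 µs and most other cases improve by a few percent.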