@zou3519
Created November 22, 2017 21:00
Before/after timings from changing CUDA varInnermostDim to use warp shuffle reductions
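# Run in IPython/Jupyter: %timeit is an IPython magic, not plain Python.
# torch.cuda.synchronize() waits for queued GPU work, so each timing includes the kernel itself.
import torch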
tensor = torch.randn(100).cuda()
%timeit tensor.var(0); torch.cuda.synchronize()
tensor = torch.randn(10000).cuda()
%timeit tensor.var(0); torch.cuda.synchronize()
tensor = torch.randn(1000, 2, 10).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(10000, 2, 10).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(50000, 2, 10).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(2, 2, 2).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(100, 100, 100).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(1000, 10, 1000).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(5, 2, 10000).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
tensor = torch.randn(5, 2, 100000).cuda()
%timeit tensor.var(2); torch.cuda.synchronize()
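Outside IPython the same measurement can be approximated with the standard `timeit` module. The following is a minimal sketch, not part of the original benchmark: the helper name `bench` and its `number`/`repeat` defaults are assumptions chosen for illustration, and the reported standard deviation is the sample standard deviation over the repeats, which may differ slightly from what `%timeit` prints.

```python
import statistics
import timeit


def bench(fn, sync=lambda: None, number=1000, repeat=7):
    """Return (mean, stdev) of the per-call time in microseconds.

    `bench` is a hypothetical helper: it calls fn() then sync() `number`
    times per run, for `repeat` runs.
    """
    def timed():
        fn()
        sync()  # for CUDA work, wait for the queued kernel to finish

    totals = timeit.repeat(timed, number=number, repeat=repeat)
    per_call_us = [t / number * 1e6 for t in totals]
    return statistics.mean(per_call_us), statistics.stdev(per_call_us)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # the GPU example is skipped when PyTorch is missing

    if torch is not None and torch.cuda.is_available():
        tensor = torch.randn(1000, 2, 10).cuda()
        mean_us, std_us = bench(lambda: tensor.var(2), torch.cuda.synchronize)
        print(f"{mean_us:.1f} us +/- {std_us:.2f} us per call")
```

Passing `torch.cuda.synchronize` as `sync` mirrors the `; torch.cuda.synchronize()` suffix on each `%timeit` line above.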
Before changes:
19.8 µs ± 258 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
250 µs ± 456 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
18.8 µs ± 55.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
41.3 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
139 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.5 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
49.8 µs ± 366 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
373 µs ± 502 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
254 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.28 ms ± 728 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
After changes:
19.3 µs ± 288 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
247 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
18.4 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
38.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
120 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16.8 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
47.5 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
373 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
250 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.25 ms ± 396 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
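To read the two listings side by side, the mean times can be paired with the benchmark shapes and turned into a relative change. This sketch assumes the result lines appear in the same order as the `%timeit` calls (the output does not label them), and it copies only the means, ignoring the reported standard deviations.

```python
# (shape, reduced dim) in the order the benchmark runs them.
cases = [
    ((100,), 0),
    ((10000,), 0),
    ((1000, 2, 10), 2),
    ((10000, 2, 10), 2),
    ((50000, 2, 10), 2),
    ((2, 2, 2), 2),
    ((100, 100, 100), 2),
    ((1000, 10, 1000), 2),
    ((5, 2, 10000), 2),
    ((5, 2, 100000), 2),
]

# Mean times in microseconds, copied from the listings (3.28 ms = 3280 us).
# Assumption: the i-th result line belongs to the i-th benchmark case.
before_us = [19.8, 250, 18.8, 41.3, 139, 17.5, 49.8, 373, 254, 3280]
after_us = [19.3, 247, 18.4, 38.2, 120, 16.8, 47.5, 373, 250, 3250]


def percent_change(before, after):
    """Relative change in percent; negative means the new code is faster."""
    return (after - before) / before * 100.0


changes = [percent_change(b, a) for b, a in zip(before_us, after_us)]

for (shape, dim), b, a, c in zip(cases, before_us, after_us, changes):
    print(f"{str(shape):>18} var({dim}): {b:8.1f} -> {a:8.1f} us ({c:+.1f}%)")
```

Under that ordering assumption, no case gets slower: the largest improvement is the (50000, 2, 10) case at roughly 14%, and the (1000, 10, 1000) case is unchanged.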