Created June 15, 2015 14:46
Function profiling
==================
Message: /home/soren/Documents/experiments/TRANSFORMER_NET/grutranstest.py:271
Time in 5 calls to Function.__call__: 1.034791e+01s
Time in Function.fn.__call__: 1.034693e+01s (99.991%)
Time in thunks: 1.022032e+01s (98.767%)
Total compile time: 4.257923e+01s
Number of Apply nodes: 1224
Theano Optimizer time: 1.321189e+01s
Theano validate time: 8.471053e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.840229e+01s
Import time 5.928850e-02s
Time in all call to theano.grad() 3.997459e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
63.1% 63.1% 6.444s 3.22e-01s Py 20 4 theano.tensor.subtensor.AdvancedIncSubtensor
8.6% 71.6% 0.876s 6.49e-03s C 135 27 theano.sandbox.cuda.basic_ops.HostFromGpu
7.3% 78.9% 0.743s 8.74e-03s C 85 17 theano.sandbox.cuda.basic_ops.GpuFromHost
6.6% 85.5% 0.677s 3.38e-02s Py 20 4 theano.tensor.subtensor.AdvancedSubtensor
3.6% 89.1% 0.366s 3.86e-03s C 95 19 theano.sandbox.cuda.dnn.GpuDnnConvGradW
3.4% 92.6% 0.352s 2.15e-04s C 1640 328 theano.sandbox.cuda.basic_ops.GpuElemwise
1.8% 94.4% 0.185s 3.71e-03s C 50 10 theano.sandbox.cuda.dnn.GpuDnnConvGradI
1.3% 95.7% 0.131s 6.57e-04s C 200 40 theano.sandbox.cuda.basic_ops.GpuContiguous
0.7% 96.4% 0.074s 3.22e-04s C 230 46 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.6% 97.0% 0.065s 1.17e-03s C 55 11 theano.sandbox.cuda.dnn.GpuDnnConv
0.4% 97.4% 0.045s 2.13e-04s C 210 42 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.9% 0.043s 7.74e-04s C 55 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
0.4% 98.3% 0.041s 8.24e-04s C 50 10 theano.tensor.basic.Join
0.3% 98.5% 0.027s 1.09e-04s C 250 50 theano.sandbox.cuda.blas.GpuDot22
0.3% 98.8% 0.027s 1.04e-04s Py 260 52 theano.sandbox.cuda.basic_ops.GpuReshape
0.2% 99.0% 0.021s 3.16e-04s C 65 13 theano.sandbox.cuda.basic_ops.GpuAlloc
0.2% 99.2% 0.020s 3.66e-04s C 55 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.2% 99.4% 0.019s 5.54e-04s C 35 7 theano.tensor.basic.Alloc
0.1% 99.5% 0.014s 2.69e-05s C 535 107 theano.tensor.elemwise.Elemwise
0.1% 99.6% 0.011s 3.22e-04s Py 35 7 theano.sandbox.cuda.basic_ops.GpuSplit
... (remaining 23 Classes account for 0.36%(0.04s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
63.1% 63.1% 6.444s 3.22e-01s Py 20 4 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}
8.6% 71.6% 0.876s 6.49e-03s C 135 27 HostFromGpu
7.3% 78.9% 0.743s 8.74e-03s C 85 17 GpuFromHost
6.6% 85.5% 0.677s 3.38e-02s Py 20 4 AdvancedSubtensor
2.6% 88.1% 0.267s 3.82e-03s C 70 14 GpuDnnConvGradW{inplace=True}
1.8% 90.0% 0.185s 3.71e-03s C 50 10 GpuDnnConvGradI{inplace=True}
1.4% 91.3% 0.141s 5.03e-04s C 280 56 GpuElemwise{mul,no_inplace}
1.3% 92.6% 0.131s 6.57e-04s C 200 40 GpuContiguous
1.0% 93.6% 0.099s 3.96e-03s C 25 5 GpuDnnConvGradW{inplace=False}
0.6% 94.2% 0.065s 1.17e-03s C 55 11 GpuDnnConv{workmem='small', inplace=True}
0.4% 94.6% 0.044s 3.23e-04s C 135 27 GpuElemwise{add,no_inplace}
0.4% 95.1% 0.043s 7.74e-04s C 55 11 GpuDownsampleFactorMaxGrad{(2, 2),False}
0.4% 95.5% 0.041s 8.24e-04s C 50 10 Join
0.4% 95.8% 0.039s 4.94e-04s C 80 16 GpuCAReduce{add}{0,1}
0.3% 96.2% 0.032s 5.78e-04s C 55 11 GpuCAReduce{add}{1,0,1,1}
0.3% 96.5% 0.031s 3.10e-04s C 100 20 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
0.3% 96.8% 0.030s 1.49e-03s C 20 4 GpuElemwise{Composite{((((i0 * i1) + (i2 * i3)) + (i4 * i5)) + (i6 * i7))},no_inplace}
0.3% 97.0% 0.027s 1.09e-04s C 250 50 GpuDot22
0.2% 97.3% 0.024s 1.69e-04s Py 140 28 GpuReshape{2}
0.2% 97.5% 0.021s 5.30e-04s C 40 8 GpuIncSubtensor{InplaceInc;int32:int32:}
... (remaining 136 Ops account for 2.54%(0.26s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
16.0% 16.0% 1.639s 3.28e-01s 5 902 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
15.7% 31.8% 1.606s 3.21e-01s 5 1192 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
15.7% 47.4% 1.602s 3.20e-01s 5 979 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
15.6% 63.1% 1.597s 3.19e-01s 5 837 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
1.9% 65.0% 0.199s 3.98e-02s 5 1190 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.9% 66.9% 0.199s 3.97e-02s 5 835 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.9% 68.9% 0.198s 3.97e-02s 5 977 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.9% 70.8% 0.194s 3.88e-02s 5 900 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.7% 72.5% 0.175s 3.50e-02s 5 563 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.6% 74.1% 0.169s 3.37e-02s 5 685 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.6% 75.8% 0.167s 3.35e-02s 5 310 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.6% 77.4% 0.166s 3.32e-02s 5 437 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.4% 78.9% 0.148s 2.96e-02s 5 686 GpuFromHost(AdvancedSubtensor.0)
1.4% 80.3% 0.148s 2.96e-02s 5 438 GpuFromHost(AdvancedSubtensor.0)
1.4% 81.7% 0.147s 2.94e-02s 5 564 GpuFromHost(AdvancedSubtensor.0)
1.4% 83.2% 0.146s 2.93e-02s 5 311 GpuFromHost(AdvancedSubtensor.0)
0.6% 83.7% 0.057s 1.15e-02s 5 232 HostFromGpu(GpuDimShuffle{0,2,3,1}.0)
0.4% 84.1% 0.038s 7.56e-03s 5 981 GpuFromHost(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}.0)
0.4% 84.5% 0.037s 7.48e-03s 5 839 GpuFromHost(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}.0)
0.4% 84.8% 0.037s 7.36e-03s 5 904 GpuFromHost(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}.0)
... (remaining 1204 Apply instances account for 15.17%(1.55s) of the runtime)
Function profiling
==================
Message: /home/soren/Documents/experiments/TRANSFORMER_NET/grutranstest.py:272
Time in 78 calls to Function.__call__: 2.856662e+01s
Time in Function.fn.__call__: 2.856239e+01s (99.985%)
Time in thunks: 2.766105e+01s (96.830%)
Total compile time: 3.765710e+00s
Number of Apply nodes: 551
Theano Optimizer time: 3.348803e+00s
Theano validate time: 7.856536e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 3.288062e-01s
Import time 2.504110e-03s
Time in all call to theano.grad() 3.997459e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
38.9% 38.9% 10.747s 3.44e-02s Py 312 4 theano.tensor.subtensor.AdvancedSubtensor
33.9% 72.7% 9.370s 1.20e-02s C 780 10 theano.sandbox.cuda.basic_ops.GpuFromHost
6.5% 79.3% 1.804s 2.34e-04s C 7722 99 theano.sandbox.cuda.basic_ops.GpuElemwise
5.8% 85.0% 1.600s 1.21e-03s C 1326 17 theano.sandbox.cuda.basic_ops.GpuContiguous
4.9% 90.0% 1.361s 9.69e-04s C 1404 18 theano.sandbox.cuda.basic_ops.HostFromGpu
3.6% 93.5% 0.990s 1.15e-03s C 858 11 theano.sandbox.cuda.dnn.GpuDnnConv
2.3% 95.8% 0.637s 8.16e-04s C 780 10 theano.tensor.basic.Join
1.1% 96.9% 0.308s 3.59e-04s C 858 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
1.0% 97.9% 0.272s 3.97e-05s C 6864 88 theano.tensor.elemwise.Elemwise
0.7% 98.6% 0.188s 1.01e-04s C 1872 24 theano.sandbox.cuda.blas.GpuDot22
0.5% 99.1% 0.145s 1.43e-04s Py 1014 13 theano.sandbox.cuda.basic_ops.GpuFlatten
0.3% 99.4% 0.078s 1.43e-04s C 546 7 theano.sandbox.cuda.basic_ops.GpuJoin
0.3% 99.7% 0.071s 8.22e-05s C 858 11 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.1% 99.8% 0.039s 2.60e-05s Py 1482 19 theano.sandbox.cuda.basic_ops.GpuReshape
0.1% 99.9% 0.016s 1.99e-04s C 78 1 theano.sandbox.cuda.basic_ops.GpuAlloc
0.0% 99.9% 0.009s 1.88e-05s C 468 6 theano.tensor.basic.Alloc
0.0% 99.9% 0.006s 1.86e-06s C 3276 42 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 99.9% 0.006s 1.31e-06s C 4212 54 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.003s 8.69e-07s C 3900 50 theano.compile.ops.Shape_i
0.0% 100.0% 0.003s 1.42e-05s Py 234 3 theano.tensor.basic.ARange
... (remaining 8 Classes account for 0.03%(0.01s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
38.9% 38.9% 10.747s 3.44e-02s Py 312 4 AdvancedSubtensor
33.9% 72.7% 9.370s 1.20e-02s C 780 10 GpuFromHost
5.8% 78.5% 1.600s 1.21e-03s C 1326 17 GpuContiguous
4.9% 83.4% 1.361s 9.69e-04s C 1404 18 HostFromGpu
3.6% 87.0% 0.990s 1.15e-03s C 858 11 GpuDnnConv{workmem='small', inplace=True}
3.0% 90.0% 0.827s 5.58e-04s C 1482 19 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}
2.3% 92.3% 0.637s 8.16e-04s C 780 10 Join
1.5% 93.7% 0.401s 1.29e-03s C 312 4 GpuElemwise{Composite{((((i0 * i1) + (i2 * i3)) + (i4 * i5)) + (i6 * i7))},no_inplace}
1.1% 94.9% 0.308s 3.59e-04s C 858 11 GpuDownsampleFactorMax{(2, 2),False}
1.0% 95.8% 0.267s 1.80e-04s C 1482 19 Elemwise{Cast{int32}}
0.7% 96.5% 0.188s 1.01e-04s C 1872 24 GpuDot22
0.5% 97.0% 0.146s 1.33e-04s C 1092 14 GpuElemwise{sub,no_inplace}
0.5% 97.5% 0.132s 1.06e-04s C 1248 16 GpuElemwise{mul,no_inplace}
0.3% 97.9% 0.097s 1.55e-04s Py 624 8 GpuFlatten{1}
0.3% 98.1% 0.078s 1.43e-04s C 546 7 GpuJoin
0.3% 98.4% 0.071s 8.22e-05s C 858 11 GpuAllocEmpty
0.2% 98.6% 0.068s 1.46e-04s C 468 6 GpuElemwise{clip,no_inplace}
0.2% 98.9% 0.060s 9.66e-05s C 624 8 GpuElemwise{Composite{(i0 * (i1 + i2) * i3)},no_inplace}
0.2% 99.1% 0.060s 9.65e-05s C 624 8 GpuElemwise{Composite{clip((i0 + i1), i2, i3)},no_inplace}
0.2% 99.3% 0.058s 9.35e-05s C 624 8 GpuElemwise{floor,no_inplace}
... (remaining 63 Ops account for 0.71%(0.20s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
9.8% 9.8% 2.719s 3.49e-02s 78 216 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
9.7% 19.5% 2.688s 3.45e-02s 78 500 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
9.7% 29.2% 2.671s 3.42e-02s 78 312 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
9.7% 38.9% 2.670s 3.42e-02s 78 408 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
8.5% 47.3% 2.343s 3.00e-02s 78 409 GpuFromHost(AdvancedSubtensor.0)
8.4% 55.7% 2.321s 2.98e-02s 78 501 GpuFromHost(AdvancedSubtensor.0)
8.4% 64.1% 2.312s 2.96e-02s 78 313 GpuFromHost(AdvancedSubtensor.0)
8.4% 72.4% 2.311s 2.96e-02s 78 217 GpuFromHost(AdvancedSubtensor.0)
3.3% 75.7% 0.902s 1.16e-02s 78 150 HostFromGpu(GpuDimShuffle{0,2,3,1}.0)
1.5% 77.1% 0.402s 5.15e-03s 78 514 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.4% 78.6% 0.400s 5.13e-03s 78 230 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.4% 80.0% 0.399s 5.11e-03s 78 422 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.4% 81.5% 0.399s 5.11e-03s 78 326 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.3% 82.7% 0.352s 4.52e-03s 78 105 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{workmem='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
0.6% 83.3% 0.161s 2.06e-03s 78 81 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 83.8% 0.138s 1.77e-03s 78 435 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 84.3% 0.138s 1.76e-03s 78 243 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 84.8% 0.138s 1.76e-03s 78 339 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 85.3% 0.137s 1.76e-03s 78 527 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 85.8% 0.129s 1.65e-03s 78 158 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
... (remaining 531 Apply instances account for 14.23%(3.94s) of the runtime)
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 83 calls to Function.__call__: 3.891453e+01s
Time in Function.fn.__call__: 3.890932e+01s (99.987%)
Time in thunks: 3.788138e+01s (97.345%)
Total compile time: 4.634494e+01s
Number of Apply nodes: 1224
Theano Optimizer time: 1.656069e+01s
Theano validate time: 9.256706e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.873110e+01s
Import time 6.179261e-02s
Time in all call to theano.grad() 3.997459e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
30.2% 30.2% 11.424s 3.44e-02s Py 332 8 theano.tensor.subtensor.AdvancedSubtensor
26.7% 56.9% 10.113s 1.17e-02s C 865 27 theano.sandbox.cuda.basic_ops.GpuFromHost
17.0% 73.9% 6.444s 3.22e-01s Py 20 4 theano.tensor.subtensor.AdvancedIncSubtensor
5.9% 79.8% 2.237s 1.45e-03s C 1539 45 theano.sandbox.cuda.basic_ops.HostFromGpu
5.7% 85.5% 2.156s 2.30e-04s C 9362 427 theano.sandbox.cuda.basic_ops.GpuElemwise
4.6% 90.0% 1.731s 1.13e-03s C 1526 57 theano.sandbox.cuda.basic_ops.GpuContiguous
2.8% 92.8% 1.054s 1.15e-03s C 913 22 theano.sandbox.cuda.dnn.GpuDnnConv
1.8% 94.6% 0.678s 8.17e-04s C 830 20 theano.tensor.basic.Join
1.0% 95.6% 0.366s 3.86e-03s C 95 19 theano.sandbox.cuda.dnn.GpuDnnConvGradW
0.9% 96.4% 0.328s 3.60e-04s C 913 22 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.8% 97.2% 0.287s 3.88e-05s C 7399 195 theano.tensor.elemwise.Elemwise
0.6% 97.8% 0.216s 1.02e-04s C 2122 74 theano.sandbox.cuda.blas.GpuDot22
0.5% 98.3% 0.185s 3.71e-03s C 50 10 theano.sandbox.cuda.dnn.GpuDnnConvGradI
0.4% 98.7% 0.155s 1.43e-04s Py 1084 27 theano.sandbox.cuda.basic_ops.GpuFlatten
0.2% 98.9% 0.083s 1.43e-04s C 581 14 theano.sandbox.cuda.basic_ops.GpuJoin
0.2% 99.1% 0.079s 7.99e-05s C 993 38 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.2% 99.3% 0.074s 3.22e-04s C 230 46 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.2% 99.5% 0.066s 3.77e-05s Py 1742 71 theano.sandbox.cuda.basic_ops.GpuReshape
0.1% 99.6% 0.045s 2.13e-04s C 210 42 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.1% 99.7% 0.043s 7.74e-04s C 55 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
... (remaining 23 Classes account for 0.30%(0.12s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
30.2% 30.2% 11.424s 3.44e-02s Py 332 8 AdvancedSubtensor
26.7% 56.9% 10.113s 1.17e-02s C 865 27 GpuFromHost
17.0% 73.9% 6.444s 3.22e-01s Py 20 4 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}
5.9% 79.8% 2.237s 1.45e-03s C 1539 45 HostFromGpu
4.6% 84.3% 1.731s 1.13e-03s C 1526 57 GpuContiguous
2.8% 87.1% 1.054s 1.15e-03s C 913 22 GpuDnnConv{workmem='small', inplace=True}
2.2% 89.3% 0.827s 5.58e-04s C 1482 19 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}
1.8% 91.1% 0.678s 8.17e-04s C 830 20 Join
1.1% 92.2% 0.431s 1.30e-03s C 332 8 GpuElemwise{Composite{((((i0 * i1) + (i2 * i3)) + (i4 * i5)) + (i6 * i7))},no_inplace}
0.9% 93.1% 0.328s 3.60e-04s C 913 22 GpuDownsampleFactorMax{(2, 2),False}
0.7% 93.8% 0.281s 1.78e-04s C 1577 38 Elemwise{Cast{int32}}
0.7% 94.6% 0.273s 1.79e-04s C 1528 72 GpuElemwise{mul,no_inplace}
0.7% 95.3% 0.267s 3.82e-03s C 70 14 GpuDnnConvGradW{inplace=True}
0.6% 95.8% 0.216s 1.02e-04s C 2122 74 GpuDot22
0.5% 96.3% 0.185s 3.71e-03s C 50 10 GpuDnnConvGradI{inplace=True}
0.4% 96.7% 0.155s 1.32e-04s C 1177 31 GpuElemwise{sub,no_inplace}
0.3% 97.0% 0.103s 1.54e-04s Py 669 17 GpuFlatten{1}
0.3% 97.3% 0.099s 3.96e-03s C 25 5 GpuDnnConvGradW{inplace=False}
0.2% 97.5% 0.083s 1.43e-04s C 581 14 GpuJoin
0.2% 97.7% 0.079s 7.99e-05s C 993 38 GpuAllocEmpty
... (remaining 144 Ops account for 2.30%(0.87s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
7.2% 7.2% 2.719s 3.49e-02s 78 216 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
7.1% 14.3% 2.688s 3.45e-02s 78 500 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
7.1% 21.3% 2.671s 3.42e-02s 78 312 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
7.0% 28.4% 2.670s 3.42e-02s 78 408 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
6.2% 34.6% 2.343s 3.00e-02s 78 409 GpuFromHost(AdvancedSubtensor.0)
6.1% 40.7% 2.321s 2.98e-02s 78 501 GpuFromHost(AdvancedSubtensor.0)
6.1% 46.8% 2.312s 2.96e-02s 78 313 GpuFromHost(AdvancedSubtensor.0)
6.1% 52.9% 2.311s 2.96e-02s 78 217 GpuFromHost(AdvancedSubtensor.0)
4.3% 57.2% 1.639s 3.28e-01s 5 902 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
4.2% 61.5% 1.606s 3.21e-01s 5 1192 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
4.2% 65.7% 1.602s 3.20e-01s 5 979 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
4.2% 69.9% 1.597s 3.19e-01s 5 837 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
2.4% 72.3% 0.902s 1.16e-02s 78 150 HostFromGpu(GpuDimShuffle{0,2,3,1}.0)
1.1% 73.3% 0.402s 5.15e-03s 78 514 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.1% 74.4% 0.400s 5.13e-03s 78 230 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.1% 75.4% 0.399s 5.11e-03s 78 422 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.1% 76.5% 0.399s 5.11e-03s 78 326 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
0.9% 77.4% 0.352s 4.52e-03s 78 105 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{workmem='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
0.5% 78.0% 0.199s 3.98e-02s 5 1190 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
0.5% 78.5% 0.199s 3.97e-02s 5 835 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
... (remaining 1755 Apply instances account for 21.52%(8.15s) of the runtime)