@skaae
Created June 15, 2015 14:46
Function profiling
==================
Message: /home/soren/Documents/experiments/TRANSFORMER_NET/grutranstest.py:271
Time in 5 calls to Function.__call__: 1.034791e+01s
Time in Function.fn.__call__: 1.034693e+01s (99.991%)
Time in thunks: 1.022032e+01s (98.767%)
Total compile time: 4.257923e+01s
Number of Apply nodes: 1224
Theano Optimizer time: 1.321189e+01s
Theano validate time: 8.471053e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.840229e+01s
Import time 5.928850e-02s
Time in all call to theano.grad() 3.997459e-01s
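For context, profiles in this format come from Theano's built-in profiler; one way to enable it (a sketch, using the script path from the Message line above) is via environment flags, which also prints the combined summary at exit:

```shell
# Enable Theano's function profiler: each compiled Function prints a profile,
# and a combined "Sum of all printed profiles" summary is emitted at exit.
THEANO_FLAGS=profile=True python grutranstest.py
```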
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
63.1% 63.1% 6.444s 3.22e-01s Py 20 4 theano.tensor.subtensor.AdvancedIncSubtensor
8.6% 71.6% 0.876s 6.49e-03s C 135 27 theano.sandbox.cuda.basic_ops.HostFromGpu
7.3% 78.9% 0.743s 8.74e-03s C 85 17 theano.sandbox.cuda.basic_ops.GpuFromHost
6.6% 85.5% 0.677s 3.38e-02s Py 20 4 theano.tensor.subtensor.AdvancedSubtensor
3.6% 89.1% 0.366s 3.86e-03s C 95 19 theano.sandbox.cuda.dnn.GpuDnnConvGradW
3.4% 92.6% 0.352s 2.15e-04s C 1640 328 theano.sandbox.cuda.basic_ops.GpuElemwise
1.8% 94.4% 0.185s 3.71e-03s C 50 10 theano.sandbox.cuda.dnn.GpuDnnConvGradI
1.3% 95.7% 0.131s 6.57e-04s C 200 40 theano.sandbox.cuda.basic_ops.GpuContiguous
0.7% 96.4% 0.074s 3.22e-04s C 230 46 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.6% 97.0% 0.065s 1.17e-03s C 55 11 theano.sandbox.cuda.dnn.GpuDnnConv
0.4% 97.4% 0.045s 2.13e-04s C 210 42 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.9% 0.043s 7.74e-04s C 55 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
0.4% 98.3% 0.041s 8.24e-04s C 50 10 theano.tensor.basic.Join
0.3% 98.5% 0.027s 1.09e-04s C 250 50 theano.sandbox.cuda.blas.GpuDot22
0.3% 98.8% 0.027s 1.04e-04s Py 260 52 theano.sandbox.cuda.basic_ops.GpuReshape
0.2% 99.0% 0.021s 3.16e-04s C 65 13 theano.sandbox.cuda.basic_ops.GpuAlloc
0.2% 99.2% 0.020s 3.66e-04s C 55 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.2% 99.4% 0.019s 5.54e-04s C 35 7 theano.tensor.basic.Alloc
0.1% 99.5% 0.014s 2.69e-05s C 535 107 theano.tensor.elemwise.Elemwise
0.1% 99.6% 0.011s 3.22e-04s Py 35 7 theano.sandbox.cuda.basic_ops.GpuSplit
... (remaining 23 Classes account for 0.36%(0.04s) of the runtime)
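As a sanity check on how the `<% time>` column relates to the headline figures (an assumption: percentages are taken against total thunk time, not wall time), the top entry can be recomputed from the numbers above:

```python
# Figures copied from the profile above.
time_in_thunks = 1.022032e+01   # "Time in thunks" for this function
adv_inc_time   = 6.444          # AdvancedIncSubtensor apply time

pct = 100.0 * adv_inc_time / time_in_thunks
print(round(pct, 1))  # ~63.1, matching the first "<% time>" row
```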
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
63.1% 63.1% 6.444s 3.22e-01s Py 20 4 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}
8.6% 71.6% 0.876s 6.49e-03s C 135 27 HostFromGpu
7.3% 78.9% 0.743s 8.74e-03s C 85 17 GpuFromHost
6.6% 85.5% 0.677s 3.38e-02s Py 20 4 AdvancedSubtensor
2.6% 88.1% 0.267s 3.82e-03s C 70 14 GpuDnnConvGradW{inplace=True}
1.8% 90.0% 0.185s 3.71e-03s C 50 10 GpuDnnConvGradI{inplace=True}
1.4% 91.3% 0.141s 5.03e-04s C 280 56 GpuElemwise{mul,no_inplace}
1.3% 92.6% 0.131s 6.57e-04s C 200 40 GpuContiguous
1.0% 93.6% 0.099s 3.96e-03s C 25 5 GpuDnnConvGradW{inplace=False}
0.6% 94.2% 0.065s 1.17e-03s C 55 11 GpuDnnConv{workmem='small', inplace=True}
0.4% 94.6% 0.044s 3.23e-04s C 135 27 GpuElemwise{add,no_inplace}
0.4% 95.1% 0.043s 7.74e-04s C 55 11 GpuDownsampleFactorMaxGrad{(2, 2),False}
0.4% 95.5% 0.041s 8.24e-04s C 50 10 Join
0.4% 95.8% 0.039s 4.94e-04s C 80 16 GpuCAReduce{add}{0,1}
0.3% 96.2% 0.032s 5.78e-04s C 55 11 GpuCAReduce{add}{1,0,1,1}
0.3% 96.5% 0.031s 3.10e-04s C 100 20 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
0.3% 96.8% 0.030s 1.49e-03s C 20 4 GpuElemwise{Composite{((((i0 * i1) + (i2 * i3)) + (i4 * i5)) + (i6 * i7))},no_inplace}
0.3% 97.0% 0.027s 1.09e-04s C 250 50 GpuDot22
0.2% 97.3% 0.024s 1.69e-04s Py 140 28 GpuReshape{2}
0.2% 97.5% 0.021s 5.30e-04s C 40 8 GpuIncSubtensor{InplaceInc;int32:int32:}
... (remaining 136 Ops account for 2.54%(0.26s) of the runtime)
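The two dominant ops, AdvancedSubtensor and AdvancedIncSubtensor, are Theano's fancy-indexing gather and indexed increment; both are type Py here, i.e. they execute on the host, which explains the surrounding HostFromGpu/GpuFromHost pairs. A hedged NumPy analogue of the access pattern (the real graph uses theano.tensor with three index vectors, per the Apply table; shapes here are illustrative):

```python
import numpy as np

# Gather with three integer index vectors, like AdvancedSubtensor(x, i, j, k).
x = np.zeros((2, 3, 4, 5))
i = np.array([0, 1])
j = np.array([1, 2])
k = np.array([3, 0])
gathered = x[i, j, k]            # shape (2, 5): one row per index triple

# Scatter-add, like AdvancedIncSubtensor{set_instead_of_inc=False}:
# np.add.at accumulates correctly even when index triples repeat.
np.add.at(x, (i, j, k), 1.0)
print(gathered.shape, x.sum())   # (2, 5) 10.0
```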
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
16.0% 16.0% 1.639s 3.28e-01s 5 902 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
15.7% 31.8% 1.606s 3.21e-01s 5 1192 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
15.7% 47.4% 1.602s 3.20e-01s 5 979 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
15.6% 63.1% 1.597s 3.19e-01s 5 837 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
1.9% 65.0% 0.199s 3.98e-02s 5 1190 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.9% 66.9% 0.199s 3.97e-02s 5 835 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.9% 68.9% 0.198s 3.97e-02s 5 977 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.9% 70.8% 0.194s 3.88e-02s 5 900 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
1.7% 72.5% 0.175s 3.50e-02s 5 563 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.6% 74.1% 0.169s 3.37e-02s 5 685 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.6% 75.8% 0.167s 3.35e-02s 5 310 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.6% 77.4% 0.166s 3.32e-02s 5 437 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
1.4% 78.9% 0.148s 2.96e-02s 5 686 GpuFromHost(AdvancedSubtensor.0)
1.4% 80.3% 0.148s 2.96e-02s 5 438 GpuFromHost(AdvancedSubtensor.0)
1.4% 81.7% 0.147s 2.94e-02s 5 564 GpuFromHost(AdvancedSubtensor.0)
1.4% 83.2% 0.146s 2.93e-02s 5 311 GpuFromHost(AdvancedSubtensor.0)
0.6% 83.7% 0.057s 1.15e-02s 5 232 HostFromGpu(GpuDimShuffle{0,2,3,1}.0)
0.4% 84.1% 0.038s 7.56e-03s 5 981 GpuFromHost(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}.0)
0.4% 84.5% 0.037s 7.48e-03s 5 839 GpuFromHost(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}.0)
0.4% 84.8% 0.037s 7.36e-03s 5 904 GpuFromHost(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}.0)
... (remaining 1204 Apply instances account for 15.17%(1.55s) of the runtime)
Function profiling
==================
Message: /home/soren/Documents/experiments/TRANSFORMER_NET/grutranstest.py:272
Time in 78 calls to Function.__call__: 2.856662e+01s
Time in Function.fn.__call__: 2.856239e+01s (99.985%)
Time in thunks: 2.766105e+01s (96.830%)
Total compile time: 3.765710e+00s
Number of Apply nodes: 551
Theano Optimizer time: 3.348803e+00s
Theano validate time: 7.856536e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 3.288062e-01s
Import time 2.504110e-03s
Time in all call to theano.grad() 3.997459e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
38.9% 38.9% 10.747s 3.44e-02s Py 312 4 theano.tensor.subtensor.AdvancedSubtensor
33.9% 72.7% 9.370s 1.20e-02s C 780 10 theano.sandbox.cuda.basic_ops.GpuFromHost
6.5% 79.3% 1.804s 2.34e-04s C 7722 99 theano.sandbox.cuda.basic_ops.GpuElemwise
5.8% 85.0% 1.600s 1.21e-03s C 1326 17 theano.sandbox.cuda.basic_ops.GpuContiguous
4.9% 90.0% 1.361s 9.69e-04s C 1404 18 theano.sandbox.cuda.basic_ops.HostFromGpu
3.6% 93.5% 0.990s 1.15e-03s C 858 11 theano.sandbox.cuda.dnn.GpuDnnConv
2.3% 95.8% 0.637s 8.16e-04s C 780 10 theano.tensor.basic.Join
1.1% 96.9% 0.308s 3.59e-04s C 858 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
1.0% 97.9% 0.272s 3.97e-05s C 6864 88 theano.tensor.elemwise.Elemwise
0.7% 98.6% 0.188s 1.01e-04s C 1872 24 theano.sandbox.cuda.blas.GpuDot22
0.5% 99.1% 0.145s 1.43e-04s Py 1014 13 theano.sandbox.cuda.basic_ops.GpuFlatten
0.3% 99.4% 0.078s 1.43e-04s C 546 7 theano.sandbox.cuda.basic_ops.GpuJoin
0.3% 99.7% 0.071s 8.22e-05s C 858 11 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.1% 99.8% 0.039s 2.60e-05s Py 1482 19 theano.sandbox.cuda.basic_ops.GpuReshape
0.1% 99.9% 0.016s 1.99e-04s C 78 1 theano.sandbox.cuda.basic_ops.GpuAlloc
0.0% 99.9% 0.009s 1.88e-05s C 468 6 theano.tensor.basic.Alloc
0.0% 99.9% 0.006s 1.86e-06s C 3276 42 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 99.9% 0.006s 1.31e-06s C 4212 54 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.003s 8.69e-07s C 3900 50 theano.compile.ops.Shape_i
0.0% 100.0% 0.003s 1.42e-05s Py 234 3 theano.tensor.basic.ARange
... (remaining 8 Classes account for 0.03%(0.01s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
38.9% 38.9% 10.747s 3.44e-02s Py 312 4 AdvancedSubtensor
33.9% 72.7% 9.370s 1.20e-02s C 780 10 GpuFromHost
5.8% 78.5% 1.600s 1.21e-03s C 1326 17 GpuContiguous
4.9% 83.4% 1.361s 9.69e-04s C 1404 18 HostFromGpu
3.6% 87.0% 0.990s 1.15e-03s C 858 11 GpuDnnConv{workmem='small', inplace=True}
3.0% 90.0% 0.827s 5.58e-04s C 1482 19 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}
2.3% 92.3% 0.637s 8.16e-04s C 780 10 Join
1.5% 93.7% 0.401s 1.29e-03s C 312 4 GpuElemwise{Composite{((((i0 * i1) + (i2 * i3)) + (i4 * i5)) + (i6 * i7))},no_inplace}
1.1% 94.9% 0.308s 3.59e-04s C 858 11 GpuDownsampleFactorMax{(2, 2),False}
1.0% 95.8% 0.267s 1.80e-04s C 1482 19 Elemwise{Cast{int32}}
0.7% 96.5% 0.188s 1.01e-04s C 1872 24 GpuDot22
0.5% 97.0% 0.146s 1.33e-04s C 1092 14 GpuElemwise{sub,no_inplace}
0.5% 97.5% 0.132s 1.06e-04s C 1248 16 GpuElemwise{mul,no_inplace}
0.3% 97.9% 0.097s 1.55e-04s Py 624 8 GpuFlatten{1}
0.3% 98.1% 0.078s 1.43e-04s C 546 7 GpuJoin
0.3% 98.4% 0.071s 8.22e-05s C 858 11 GpuAllocEmpty
0.2% 98.6% 0.068s 1.46e-04s C 468 6 GpuElemwise{clip,no_inplace}
0.2% 98.9% 0.060s 9.66e-05s C 624 8 GpuElemwise{Composite{(i0 * (i1 + i2) * i3)},no_inplace}
0.2% 99.1% 0.060s 9.65e-05s C 624 8 GpuElemwise{Composite{clip((i0 + i1), i2, i3)},no_inplace}
0.2% 99.3% 0.058s 9.35e-05s C 624 8 GpuElemwise{floor,no_inplace}
... (remaining 63 Ops account for 0.71%(0.20s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
9.8% 9.8% 2.719s 3.49e-02s 78 216 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
9.7% 19.5% 2.688s 3.45e-02s 78 500 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
9.7% 29.2% 2.671s 3.42e-02s 78 312 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
9.7% 38.9% 2.670s 3.42e-02s 78 408 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
8.5% 47.3% 2.343s 3.00e-02s 78 409 GpuFromHost(AdvancedSubtensor.0)
8.4% 55.7% 2.321s 2.98e-02s 78 501 GpuFromHost(AdvancedSubtensor.0)
8.4% 64.1% 2.312s 2.96e-02s 78 313 GpuFromHost(AdvancedSubtensor.0)
8.4% 72.4% 2.311s 2.96e-02s 78 217 GpuFromHost(AdvancedSubtensor.0)
3.3% 75.7% 0.902s 1.16e-02s 78 150 HostFromGpu(GpuDimShuffle{0,2,3,1}.0)
1.5% 77.1% 0.402s 5.15e-03s 78 514 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.4% 78.6% 0.400s 5.13e-03s 78 230 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.4% 80.0% 0.399s 5.11e-03s 78 422 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.4% 81.5% 0.399s 5.11e-03s 78 326 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.3% 82.7% 0.352s 4.52e-03s 78 105 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{workmem='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
0.6% 83.3% 0.161s 2.06e-03s 78 81 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 83.8% 0.138s 1.77e-03s 78 435 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 84.3% 0.138s 1.76e-03s 78 243 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 84.8% 0.138s 1.76e-03s 78 339 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 85.3% 0.137s 1.76e-03s 78 527 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
0.5% 85.8% 0.129s 1.65e-03s 78 158 GpuDnnConv{workmem='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0}, Constant{0.0})
... (remaining 531 Apply instances account for 14.23%(3.94s) of the runtime)
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 83 calls to Function.__call__: 3.891453e+01s
Time in Function.fn.__call__: 3.890932e+01s (99.987%)
Time in thunks: 3.788138e+01s (97.345%)
Total compile time: 4.634494e+01s
Number of Apply nodes: 1224
Theano Optimizer time: 1.656069e+01s
Theano validate time: 9.256706e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.873110e+01s
Import time 6.179261e-02s
Time in all call to theano.grad() 3.997459e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
30.2% 30.2% 11.424s 3.44e-02s Py 332 8 theano.tensor.subtensor.AdvancedSubtensor
26.7% 56.9% 10.113s 1.17e-02s C 865 27 theano.sandbox.cuda.basic_ops.GpuFromHost
17.0% 73.9% 6.444s 3.22e-01s Py 20 4 theano.tensor.subtensor.AdvancedIncSubtensor
5.9% 79.8% 2.237s 1.45e-03s C 1539 45 theano.sandbox.cuda.basic_ops.HostFromGpu
5.7% 85.5% 2.156s 2.30e-04s C 9362 427 theano.sandbox.cuda.basic_ops.GpuElemwise
4.6% 90.0% 1.731s 1.13e-03s C 1526 57 theano.sandbox.cuda.basic_ops.GpuContiguous
2.8% 92.8% 1.054s 1.15e-03s C 913 22 theano.sandbox.cuda.dnn.GpuDnnConv
1.8% 94.6% 0.678s 8.17e-04s C 830 20 theano.tensor.basic.Join
1.0% 95.6% 0.366s 3.86e-03s C 95 19 theano.sandbox.cuda.dnn.GpuDnnConvGradW
0.9% 96.4% 0.328s 3.60e-04s C 913 22 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.8% 97.2% 0.287s 3.88e-05s C 7399 195 theano.tensor.elemwise.Elemwise
0.6% 97.8% 0.216s 1.02e-04s C 2122 74 theano.sandbox.cuda.blas.GpuDot22
0.5% 98.3% 0.185s 3.71e-03s C 50 10 theano.sandbox.cuda.dnn.GpuDnnConvGradI
0.4% 98.7% 0.155s 1.43e-04s Py 1084 27 theano.sandbox.cuda.basic_ops.GpuFlatten
0.2% 98.9% 0.083s 1.43e-04s C 581 14 theano.sandbox.cuda.basic_ops.GpuJoin
0.2% 99.1% 0.079s 7.99e-05s C 993 38 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.2% 99.3% 0.074s 3.22e-04s C 230 46 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.2% 99.5% 0.066s 3.77e-05s Py 1742 71 theano.sandbox.cuda.basic_ops.GpuReshape
0.1% 99.6% 0.045s 2.13e-04s C 210 42 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.1% 99.7% 0.043s 7.74e-04s C 55 11 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
... (remaining 23 Classes account for 0.30%(0.12s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
30.2% 30.2% 11.424s 3.44e-02s Py 332 8 AdvancedSubtensor
26.7% 56.9% 10.113s 1.17e-02s C 865 27 GpuFromHost
17.0% 73.9% 6.444s 3.22e-01s Py 20 4 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}
5.9% 79.8% 2.237s 1.45e-03s C 1539 45 HostFromGpu
4.6% 84.3% 1.731s 1.13e-03s C 1526 57 GpuContiguous
2.8% 87.1% 1.054s 1.15e-03s C 913 22 GpuDnnConv{workmem='small', inplace=True}
2.2% 89.3% 0.827s 5.58e-04s C 1482 19 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}
1.8% 91.1% 0.678s 8.17e-04s C 830 20 Join
1.1% 92.2% 0.431s 1.30e-03s C 332 8 GpuElemwise{Composite{((((i0 * i1) + (i2 * i3)) + (i4 * i5)) + (i6 * i7))},no_inplace}
0.9% 93.1% 0.328s 3.60e-04s C 913 22 GpuDownsampleFactorMax{(2, 2),False}
0.7% 93.8% 0.281s 1.78e-04s C 1577 38 Elemwise{Cast{int32}}
0.7% 94.6% 0.273s 1.79e-04s C 1528 72 GpuElemwise{mul,no_inplace}
0.7% 95.3% 0.267s 3.82e-03s C 70 14 GpuDnnConvGradW{inplace=True}
0.6% 95.8% 0.216s 1.02e-04s C 2122 74 GpuDot22
0.5% 96.3% 0.185s 3.71e-03s C 50 10 GpuDnnConvGradI{inplace=True}
0.4% 96.7% 0.155s 1.32e-04s C 1177 31 GpuElemwise{sub,no_inplace}
0.3% 97.0% 0.103s 1.54e-04s Py 669 17 GpuFlatten{1}
0.3% 97.3% 0.099s 3.96e-03s C 25 5 GpuDnnConvGradW{inplace=False}
0.2% 97.5% 0.083s 1.43e-04s C 581 14 GpuJoin
0.2% 97.7% 0.079s 7.99e-05s C 993 38 GpuAllocEmpty
... (remaining 144 Ops account for 2.30%(0.87s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
7.2% 7.2% 2.719s 3.49e-02s 78 216 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
7.1% 14.3% 2.688s 3.45e-02s 78 500 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
7.1% 21.3% 2.671s 3.42e-02s 78 312 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
7.0% 28.4% 2.670s 3.42e-02s 78 408 AdvancedSubtensor(HostFromGpu.0, Join.0, Join.0, Join.0)
6.2% 34.6% 2.343s 3.00e-02s 78 409 GpuFromHost(AdvancedSubtensor.0)
6.1% 40.7% 2.321s 2.98e-02s 78 501 GpuFromHost(AdvancedSubtensor.0)
6.1% 46.8% 2.312s 2.96e-02s 78 313 GpuFromHost(AdvancedSubtensor.0)
6.1% 52.9% 2.311s 2.96e-02s 78 217 GpuFromHost(AdvancedSubtensor.0)
4.3% 57.2% 1.639s 3.28e-01s 5 902 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
4.2% 61.5% 1.606s 3.21e-01s 5 1192 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
4.2% 65.7% 1.602s 3.20e-01s 5 979 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
4.2% 69.9% 1.597s 3.19e-01s 5 837 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Join.0, Join.0, Join.0)
2.4% 72.3% 0.902s 1.16e-02s 78 150 HostFromGpu(GpuDimShuffle{0,2,3,1}.0)
1.1% 73.3% 0.402s 5.15e-03s 78 514 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.1% 74.4% 0.400s 5.13e-03s 78 230 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.1% 75.4% 0.399s 5.11e-03s 78 422 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
1.1% 76.5% 0.399s 5.11e-03s 78 326 GpuContiguous(GpuDimShuffle{0,3,1,2}.0)
0.9% 77.4% 0.352s 4.52e-03s 78 105 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{workmem='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
0.5% 78.0% 0.199s 3.98e-02s 5 1190 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
0.5% 78.5% 0.199s 3.97e-02s 5 835 HostFromGpu(GpuIncSubtensor{InplaceInc;int32::}.0)
... (remaining 1755 Apply instances account for 21.52%(8.15s) of the runtime)
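A quick consistency check on the exit summary (figures copied from the profiles above): its "Time in thunks" should be the sum of the two per-function profiles it aggregates:

```python
# "Time in thunks" from the two per-function profiles and the exit summary.
profile_271 = 1.022032e+01   # grutranstest.py:271 profile
profile_272 = 2.766105e+01   # grutranstest.py:272 profile
combined    = 3.788138e+01   # "Sum of all(2) printed profiles at exit"

print(abs((profile_271 + profile_272) - combined) < 1e-4)  # True
```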