Old profile
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.984119e-03s
Time in Function.fn.__call__: 8.468628e-04s (42.682%)
Total compile time: 5.483155e+00s
Number of Apply nodes: 0
Theano Optimizer time: 1.670289e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.310276e-04s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.781s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
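[Editor's note] The block above reports "No execution time accumulated" because per-thunk timing was disabled; the hint refers to Theano's profiling.time_thunks config flag. For reference, a minimal sketch of how a profile in this format is produced (the toy function below is illustrative, not the one from this log):

import theano
import theano.tensor as T

# Equivalent to THEANO_FLAGS='profile=True,profiling.time_thunks=True'
theano.config.profile = True
theano.config.profiling.time_thunks = True  # accumulate per-Apply-node execution time

x = T.matrix('x')
f = theano.function([x], (x ** 2).sum(), profile=True)  # profile can also be enabled per function
f([[1.0, 2.0], [3.0, 4.0]])
f.profile.summary()  # prints a "Function profiling" report like the ones in this file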
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 11 calls to Function.__call__: 2.355814e-02s
Time in Function.fn.__call__: 2.024937e-02s (85.955%)
Time in thunks: 9.337664e-03s (39.637%)
Total compile time: 6.343132e+00s
Number of Apply nodes: 43
Theano Optimizer time: 3.600280e-01s
Theano validate time: 2.064705e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.223059e-01s
Import time 3.409195e-02s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.781s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.009s 1.97e-05s C 473 43 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.009s 1.97e-05s C 473 43 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.8% 4.8% 0.000s 4.09e-05s 11 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.8% 7.6% 0.000s 2.36e-05s 11 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.8% 10.4% 0.000s 2.34e-05s 11 25 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.7% 13.0% 0.000s 2.27e-05s 11 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 15.7% 0.000s 2.23e-05s 11 27 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 18.3% 0.000s 2.21e-05s 11 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 20.9% 0.000s 2.21e-05s 11 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 23.5% 0.000s 2.19e-05s 11 32 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 26.0% 0.000s 2.19e-05s 11 17 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 28.6% 0.000s 2.17e-05s 11 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 31.1% 0.000s 2.16e-05s 11 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 33.7% 0.000s 2.15e-05s 11 31 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 36.2% 0.000s 2.15e-05s 11 29 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 38.7% 0.000s 2.14e-05s 11 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 41.2% 0.000s 2.11e-05s 11 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 43.7% 0.000s 2.10e-05s 11 28 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 46.2% 0.000s 2.10e-05s 11 36 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 48.6% 0.000s 2.09e-05s 11 33 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 51.1% 0.000s 2.09e-05s 11 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 53.5% 0.000s 2.09e-05s 11 35 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 23 Apply instances account for 46.46%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 43 Apply account for 192B/192B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
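[Editor's note] The memory section above mentions three knobs worth knowing: optimizer_excluding=inplace and allow_gc are Theano flags, and DebugMode is the compilation mode that emits the warnings referred to in the last note. A sketch of how they are typically set (the script name is hypothetical):

# Shell: THEANO_FLAGS='optimizer_excluding=inplace' python train.py   # disable in-place optimizations
# Shell: THEANO_FLAGS='allow_gc=False' python train.py                # keep intermediates allocated between calls
import theano
import theano.tensor as T

theano.config.allow_gc = False  # programmatic equivalent of the allow_gc flag

# DebugMode rechecks every Apply node and warns when an 'inplace'/'view'
# node actually allocates memory:
x = T.vector('x')
f = theano.function([x], x + 1, mode='DebugMode')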
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 10 calls to Function.__call__: 1.226211e-02s
Time in Function.fn.__call__: 1.183033e-02s (96.479%)
Time in thunks: 4.946470e-03s (40.339%)
Total compile time: 6.681131e+00s
Number of Apply nodes: 29
Theano Optimizer time: 1.198421e-01s
Theano validate time: 2.441406e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.311059e-01s
Import time 6.275487e-02s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.787s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
52.6% 52.6% 0.003s 1.73e-05s C 150 15 theano.sandbox.cuda.basic_ops.HostFromGpu
44.3% 96.8% 0.002s 2.43e-05s C 90 9 theano.sandbox.cuda.basic_ops.GpuElemwise
3.2% 100.0% 0.000s 3.13e-06s C 50 5 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
52.6% 52.6% 0.003s 1.73e-05s C 150 15 HostFromGpu
44.3% 96.8% 0.002s 2.43e-05s C 90 9 GpuElemwise{true_div,no_inplace}
3.2% 100.0% 0.000s 3.13e-06s C 50 5 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
10.8% 10.8% 0.001s 5.32e-05s 10 0 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
6.2% 17.0% 0.000s 3.09e-05s 10 15 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
4.4% 21.4% 0.000s 2.20e-05s 10 13 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 25.8% 0.000s 2.14e-05s 10 1 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_critic_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 29.9% 0.000s 2.06e-05s 10 12 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 34.1% 0.000s 2.05e-05s 10 2 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 38.2% 0.000s 2.05e-05s 10 4 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean2_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 42.3% 0.000s 2.03e-05s 10 5 GpuElemwise{true_div,no_inplace}(shared_mean_last_character_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 46.4% 0.000s 2.03e-05s 10 3 GpuElemwise{true_div,no_inplace}(shared_readout_costs_max_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 50.5% 0.000s 2.01e-05s 10 6 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_expected_reward, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 54.1% 0.000s 1.77e-05s 10 19 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.5% 57.5% 0.000s 1.72e-05s 10 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 61.0% 0.000s 1.70e-05s 10 7 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 64.4% 0.000s 1.67e-05s 10 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 67.7% 0.000s 1.67e-05s 10 8 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 71.0% 0.000s 1.64e-05s 10 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 74.4% 0.000s 1.64e-05s 10 26 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 77.7% 0.000s 1.64e-05s 10 21 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 81.0% 0.000s 1.63e-05s 10 27 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 84.2% 0.000s 1.62e-05s 10 20 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
... (remaining 9 Apply instances account for 15.76%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 29 Apply account for 136B/136B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
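[Editor's note] Nearly all thunk time in the block above is GpuElemwise{true_div} followed by HostFromGpu: each monitored scalar is divided on the GPU and then copied back to the host, so the function is dominated by many tiny kernels and transfers. One way to see where such transfer nodes sit in a compiled function is theano.printing.debugprint (the shared-variable names below are illustrative):

import numpy as np
import theano

numerator = theano.shared(np.float32(10.0), name='shared_cost_total')
denominator = theano.shared(np.float32(4.0), name='shared_batch_count')
f = theano.function([], numerator / denominator)
theano.printing.debugprint(f)  # one line per Apply node; on a GPU device, look for HostFromGpu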
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 101 calls to Function.__call__: 1.714706e-02s
Time in Function.fn.__call__: 1.415157e-02s (82.531%)
Time in thunks: 2.484560e-03s (14.490%)
Total compile time: 6.216795e+00s
Number of Apply nodes: 6
Theano Optimizer time: 4.745817e-02s
Theano validate time: 1.499653e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 2.376604e-02s
Import time 1.632404e-02s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.791s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.5% 54.5% 0.001s 3.35e-06s C 404 4 theano.compile.ops.Shape_i
45.5% 100.0% 0.001s 5.60e-06s C 202 2 theano.tensor.basic.Alloc
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.5% 45.5% 0.001s 5.60e-06s C 202 2 Alloc
31.1% 76.6% 0.001s 3.82e-06s C 202 2 Shape_i{1}
23.4% 100.0% 0.001s 2.88e-06s C 202 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
28.2% 28.2% 0.001s 6.94e-06s 101 4 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
19.2% 47.4% 0.000s 4.71e-06s 101 0 Shape_i{1}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
17.3% 64.7% 0.000s 4.27e-06s 101 5 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
12.8% 77.5% 0.000s 3.16e-06s 101 1 Shape_i{0}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.9% 89.4% 0.000s 2.92e-06s 101 2 Shape_i{1}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
10.6% 100.0% 0.000s 2.60e-06s 101 3 Shape_i{0}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
... (remaining 5 Apply account for 992B/2192B (45.26%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.662898e-02s
Time in Function.fn.__call__: 1.507092e-02s (90.630%)
Time in thunks: 1.027775e-02s (61.806%)
Total compile time: 5.965592e+00s
Number of Apply nodes: 2
Theano Optimizer time: 1.966500e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.714872e-03s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.793s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.010s 5.14e-05s C 200 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.010s 5.14e-05s C 200 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
95.6% 95.6% 0.010s 9.82e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
4.4% 100.0% 0.000s 4.54e-06s 100 1 DeepCopyOp(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c DeepCopyOp(shared_recognizer_costs_prediction)
... (remaining 1 Apply account for 960B/2160B (44.44%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 2 calls to Function.__call__: 5.192757e-03s
Time in Function.fn.__call__: 4.395008e-03s (84.637%)
Time in thunks: 1.830101e-03s (35.243%)
Total compile time: 5.798583e+00s
Number of Apply nodes: 31
Theano Optimizer time: 1.590829e-01s
Theano validate time: 1.525164e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 4.815388e-02s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.794s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.002s 2.95e-05s C 62 31 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.002s 2.95e-05s C 62 31 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.6% 4.6% 0.000s 4.20e-05s 2 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.3% 8.9% 0.000s 3.96e-05s 2 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.1% 13.1% 0.000s 3.79e-05s 2 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.1% 17.1% 0.000s 3.74e-05s 2 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.9% 21.0% 0.000s 3.55e-05s 2 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 24.7% 0.000s 3.40e-05s 2 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 28.5% 0.000s 3.40e-05s 2 14 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 32.2% 0.000s 3.40e-05s 2 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 35.8% 0.000s 3.30e-05s 2 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 39.3% 0.000s 3.25e-05s 2 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 42.9% 0.000s 3.25e-05s 2 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 46.4% 0.000s 3.21e-05s 2 15 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 49.9% 0.000s 3.21e-05s 2 9 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 53.3% 0.000s 3.16e-05s 2 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 56.8% 0.000s 3.16e-05s 2 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 60.2% 0.000s 3.15e-05s 2 22 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 63.7% 0.000s 3.15e-05s 2 20 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 67.1% 0.000s 3.15e-05s 2 19 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 70.6% 0.000s 3.15e-05s 2 18 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 74.0% 0.000s 3.11e-05s 2 17 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 11 Apply instances account for 26.04%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 31 Apply account for 140B/140B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 call to Function.__call__: 8.800030e-04s
Time in Function.fn.__call__: 8.380413e-04s (95.232%)
Time in thunks: 3.595352e-04s (40.856%)
Total compile time: 6.387939e+00s
Number of Apply nodes: 21
Theano Optimizer time: 8.277297e-02s
Theano validate time: 1.749992e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.883909e-02s
Import time 4.663944e-03s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.798s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.4% 53.4% 0.000s 1.75e-05s C 11 11 theano.sandbox.cuda.basic_ops.HostFromGpu
42.3% 95.8% 0.000s 2.54e-05s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.2% 100.0% 0.000s 3.81e-06s C 4 4 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.4% 53.4% 0.000s 1.75e-05s C 11 11 HostFromGpu
42.3% 95.8% 0.000s 2.54e-05s C 6 6 GpuElemwise{true_div,no_inplace}
4.2% 100.0% 0.000s 3.81e-06s C 4 4 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
11.7% 11.7% 0.000s 4.20e-05s 1 8 GpuElemwise{true_div,no_inplace}(shared_weights_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.4% 18.0% 0.000s 2.29e-05s 1 1 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.2% 24.2% 0.000s 2.22e-05s 1 3 GpuElemwise{true_div,no_inplace}(shared_mask_density, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 30.3% 0.000s 2.19e-05s 1 7 GpuElemwise{true_div,no_inplace}(shared_mean_attended, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 36.4% 0.000s 2.19e-05s 1 2 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 42.5% 0.000s 2.19e-05s 1 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.9% 48.4% 0.000s 2.12e-05s 1 6 GpuElemwise{true_div,no_inplace}(shared_mean_bottom_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 53.4% 0.000s 1.81e-05s 1 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 58.5% 0.000s 1.81e-05s 1 12 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 63.5% 0.000s 1.79e-05s 1 11 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 68.2% 0.000s 1.72e-05s 1 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 73.0% 0.000s 1.72e-05s 1 13 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 77.7% 0.000s 1.69e-05s 1 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 82.4% 0.000s 1.69e-05s 1 5 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 86.9% 0.000s 1.60e-05s 1 10 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 91.3% 0.000s 1.60e-05s 1 9 HostFromGpu(shared_weights_penalty)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 95.8% 0.000s 1.60e-05s 1 4 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
1.4% 97.1% 0.000s 5.01e-06s 1 19 Elemwise{true_div,no_inplace}(HostFromGpu.0, shared_batch_size)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.1% 98.3% 0.000s 4.05e-06s 1 20 Elemwise{true_div,no_inplace}(shared_train_cost, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
0.9% 99.1% 0.000s 3.10e-06s 1 15 Elemwise{true_div,no_inplace}(shared_batch_size, HostFromGpu.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 1 Apply instances account for 0.86%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 21 Apply account for 100B/100B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 1 call to Function.__call__: 4.670620e-04s
Time in Function.fn.__call__: 3.008842e-04s (64.421%)
Time in thunks: 1.330376e-04s (28.484%)
Total compile time: 7.051143e+00s
Number of Apply nodes: 5
Theano Optimizer time: 3.080988e-02s
Theano validate time: 2.636909e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 9.856939e-03s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.801s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.66e-05s C 5 5 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.000s 2.66e-05s C 5 5 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
57.2% 57.2% 0.000s 7.61e-05s 1 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
16.5% 73.7% 0.000s 2.19e-05s 1 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
15.1% 88.7% 0.000s 2.00e-05s 1 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
8.2% 97.0% 0.000s 1.10e-05s 1 3 DeepCopyOp(TensorConstant{0})
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
3.0% 100.0% 0.000s 4.05e-06s 1 4 DeepCopyOp(TensorConstant{0.0})
input 0: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 5 Apply account for 28B/28B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 call to Function.__call__: 1.440048e-04s
Time in Function.fn.__call__: 1.199245e-04s (83.278%)
Time in thunks: 3.504753e-05s (24.338%)
Total compile time: 5.531962e+00s
Number of Apply nodes: 3
Theano Optimizer time: 2.350092e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 4.795074e-03s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.802s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.4% 71.4% 0.000s 2.50e-05s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
17.0% 88.4% 0.000s 5.96e-06s C 1 1 theano.compile.ops.DeepCopyOp
11.6% 100.0% 0.000s 4.05e-06s C 1 1 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.4% 71.4% 0.000s 2.50e-05s C 1 1 HostFromGpu
17.0% 88.4% 0.000s 5.96e-06s C 1 1 DeepCopyOp
11.6% 100.0% 0.000s 4.05e-06s C 1 1 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
71.4% 71.4% 0.000s 2.50e-05s 1 1 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
17.0% 88.4% 0.000s 5.96e-06s 1 0 DeepCopyOp(shared_batch_size)
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
11.6% 100.0% 0.000s 4.05e-06s 1 2 Elemwise{true_div,no_inplace}(shared_mean_total_reward, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 3 Apply account for 20B/20B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286 | |
Time in 61 calls to Function.__call__: 1.151932e+01s | |
Time in Function.fn.__call__: 1.151220e+01s (99.938%) | |
Time in thunks: 1.112233e+01s (96.554%) | |
Total compile time: 6.020690e+01s | |
Number of Apply nodes: 284 | |
Theano Optimizer time: 6.218818e+00s | |
Theano validate time: 2.867708e-01s | |
Theano Linker time (includes C, CUDA code generation/compiling): 4.509264e+01s | |
Import time 3.776977e+00s | |
Time in all call to theano.grad() 2.823545e+00s | |
Time since theano import 830.803s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
76.0% 76.0% 8.452s 1.39e-01s Py 61 1 lvsr.ops.EditDistanceOp | |
23.0% 99.0% 2.554s 2.09e-02s Py 122 2 theano.scan_module.scan_op.Scan | |
0.2% 99.1% 0.021s 2.85e-06s C 7320 120 theano.tensor.elemwise.Elemwise | |
0.2% 99.3% 0.020s 6.64e-05s C 305 5 theano.sandbox.cuda.blas.GpuDot22 | |
0.1% 99.4% 0.013s 3.00e-05s C 427 7 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.1% 99.5% 0.008s 3.37e-05s C 244 4 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.1% 99.6% 0.007s 1.22e-04s C 61 1 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.1% 99.6% 0.007s 2.16e-05s C 305 5 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.0% 99.7% 0.005s 2.85e-06s C 1586 26 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.0% 99.7% 0.004s 2.92e-06s C 1464 24 theano.compile.ops.Shape_i | |
0.0% 99.8% 0.004s 2.23e-05s C 183 3 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.0% 99.8% 0.003s 5.64e-05s C 61 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
0.0% 99.8% 0.003s 3.31e-06s C 1037 17 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.0% 99.8% 0.003s 2.76e-05s C 122 2 theano.compile.ops.DeepCopyOp | |
0.0% 99.9% 0.003s 2.74e-06s C 1098 18 theano.tensor.opt.MakeVector | |
0.0% 99.9% 0.002s 2.32e-06s C 1037 17 theano.tensor.basic.ScalarFromTensor | |
0.0% 99.9% 0.002s 7.52e-06s Py 305 3 theano.ifelse.IfElse | |
0.0% 99.9% 0.002s 4.18e-06s C 549 9 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.0% 100.0% 0.002s 5.26e-06s C 305 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty | |
0.0% 100.0% 0.001s 6.62e-06s Py 183 3 theano.compile.ops.Rebroadcast | |
... (remaining 8 Classes account for 0.04%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
76.0% 76.0% 8.452s 1.39e-01s Py 61 1 EditDistanceOp | |
18.2% 94.2% 2.026s 3.32e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan} | |
4.8% 99.0% 0.528s 8.66e-03s Py 61 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
0.2% 99.1% 0.020s 6.64e-05s C 305 5 GpuDot22 | |
0.1% 99.2% 0.010s 3.33e-05s C 305 5 GpuElemwise{Add}[(0, 0)] | |
0.1% 99.3% 0.007s 1.22e-04s C 61 1 GpuJoin | |
0.1% 99.4% 0.007s 3.74e-05s C 183 3 GpuAlloc | |
0.1% 99.4% 0.007s 2.16e-05s C 305 5 GpuIncSubtensor{InplaceSet;:int64:} | |
0.0% 99.5% 0.004s 2.23e-05s C 183 3 HostFromGpu | |
0.0% 99.5% 0.003s 5.64e-05s C 61 1 GpuAdvancedSubtensor1 | |
0.0% 99.5% 0.003s 2.76e-05s C 122 2 DeepCopyOp | |
0.0% 99.5% 0.003s 2.74e-06s C 1098 18 MakeVector{dtype='int64'} | |
0.0% 99.6% 0.002s 2.32e-06s C 1037 17 ScalarFromTensor | |
0.0% 99.6% 0.002s 2.80e-06s C 793 13 Shape_i{0} | |
0.0% 99.6% 0.002s 3.21e-06s C 671 11 GpuReshape{2} | |
0.0% 99.6% 0.002s 3.06e-06s C 671 11 Shape_i{1} | |
0.0% 99.6% 0.002s 2.64e-06s C 671 11 Elemwise{add,no_inplace} | |
0.0% 99.7% 0.002s 2.66e-06s C 610 10 Elemwise{sub,no_inplace} | |
0.0% 99.7% 0.002s 5.26e-06s C 305 5 GpuAllocEmpty | |
0.0% 99.7% 0.002s 8.29e-06s Py 183 2 if{inplace} | |
... (remaining 76 Ops account for 0.32%(0.04s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
76.0% 76.0% 8.452s 1.39e-01s 61 279 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask) | |
input 0: dtype=int64, shape=(15, 75), strides=c | |
input 1: dtype=float32, shape=(15, 75), strides=c | |
input 2: dtype=int64, shape=(12, 75), strides=c | |
input 3: dtype=float32, shape=(12, 75), strides=c | |
output 0: dtype=int64, shape=(15, 75, 1), strides=c | |
18.2% 94.2% 2.026s 3.32e-02s 61 268 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=float32, shape=(100, 44), strides=c | |
input 6: dtype=float32, shape=(200, 44), strides=c | |
input 7: dtype=float32, shape=(100, 200), strides=c | |
input 8: dtype=float32, shape=(200, 200), strides=c | |
input 9: dtype=float32, shape=(45, 100), strides=c | |
input 10: dtype=float32, shape=(100, 200), strides=c | |
input 11: dtype=float32, shape=(100, 100), strides=c | |
input 12: dtype=float32, shape=(200, 100), strides=c | |
input 13: dtype=float32, shape=(100, 100), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 16: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 17: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 18: dtype=int64, shape=(1,), strides=c | |
input 19: dtype=float32, shape=(12, 75), strides=(75, 1) | |
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 21: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 22: dtype=int8, shape=(75,), strides=c | |
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
output 3: dtype=int64, shape=(15, 75), strides=c | |
4.8% 99.0% 0.528s 8.66e-03s 61 254 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 4: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1) | |
input 5: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1) | |
input 6: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0) | |
input 7: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 8: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 9: dtype=float32, shape=(100, 200), strides=c | |
input 10: dtype=float32, shape=(100, 100), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.1% 99.0% 0.007s 1.22e-04s 61 262 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 2: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1) | |
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
0.0% 99.1% 0.005s 7.75e-05s 61 148 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
0.0% 99.1% 0.005s 7.65e-05s 61 150 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
0.0% 99.1% 0.004s 7.11e-05s 61 265 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(200, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.0% 99.2% 0.003s 5.64e-05s 61 72 GpuAdvancedSubtensor1(W, Reshape{1}.0) | |
input 0: dtype=float32, shape=(44, 100), strides=c | |
input 1: dtype=int64, shape=(900,), strides=c | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.0% 99.2% 0.003s 5.48e-05s 61 147 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.0% 99.2% 0.003s 5.23e-05s 61 149 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.0% 99.3% 0.002s 3.99e-05s 61 53 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
0.0% 99.3% 0.002s 3.81e-05s 61 178 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
0.0% 99.3% 0.002s 3.76e-05s 61 180 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
0.0% 99.3% 0.002s 3.63e-05s 61 65 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
0.0% 99.3% 0.002s 3.61e-05s 61 116 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, generator_generate_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
0.0% 99.4% 0.002s 3.24e-05s 61 4 DeepCopyOp(CudaNdarrayConstant{1.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
0.0% 99.4% 0.002s 3.13e-05s 61 177 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.0% 99.4% 0.002s 3.02e-05s 61 267 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.0% 99.4% 0.002s 2.95e-05s 61 179 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.0% 99.4% 0.002s 2.76e-05s 61 0 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
... (remaining 264 Apply instances account for 0.58%(0.06s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 18KB (18KB) | |
GPU: 3168KB (3653KB) | |
CPU + GPU: 3185KB (3671KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 18KB (18KB) | |
GPU: 3519KB (4327KB) | |
CPU + GPU: 3537KB (4345KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 37KB | |
GPU: 5180KB | |
CPU + GPU: 5217KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
720000B [(900, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1}) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
720000B [(12, 75, 200)] v GpuDimShuffle{0,1,2}(GpuJoin.0) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>) | |
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0) | |
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0) | |
360000B [(900, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
360000B [(900, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
... (remaining 264 Apply nodes account for 8196678B/20973438B (39.08%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
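The per-node byte totals in the memory profile above are just element counts times dtype widths. A quick sanity check in plain Python for the largest node, the generator_generate_scan output set (assuming the first three outputs are float32 and the (15, 75) output holds int64 sample indices; that mix is what makes the arithmetic land exactly on 836280B):

    import numpy as np

    # Output shapes of the forall_inplace,gpu,generator_generate_scan} node,
    # with assumed dtypes; the (15, 75) output is taken to be int64 indices.
    outputs = [((1, 75, 100), np.float32),
               ((1, 75, 200), np.float32),
               ((2, 92160), np.float32),
               ((15, 75), np.int64)]

    total = sum(int(np.prod(shape)) * np.dtype(dtype).itemsize
                for shape, dtype in outputs)
    print(total)  # 836280, matching the profiler's 836280B

The same rule gives 720000B for every (12, 75, 200) float32 output above: 12 * 75 * 200 * 4.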
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 61 calls of the op (for a total of 732 steps) 5.235906e-01s | |
Total time spent in calling the VM 5.032728e-01s (96.120%) | |
Total overhead (computing slices...) 2.031779e-02s (3.880%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
54.6% 54.6% 0.153s 5.23e-05s C 2928 4 theano.sandbox.cuda.blas.GpuGemm | |
42.1% 96.7% 0.118s 2.02e-05s C 5856 8 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.3% 100.0% 0.009s 3.15e-06s C 2928 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
54.6% 54.6% 0.153s 5.23e-05s C 2928 4 GpuGemm{no_inplace} | |
11.6% 66.2% 0.033s 2.22e-05s C 1464 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace} | |
10.4% 76.6% 0.029s 2.00e-05s C 1464 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
10.1% 86.7% 0.028s 1.93e-05s C 1464 2 GpuElemwise{mul,no_inplace} | |
10.0% 96.7% 0.028s 1.92e-05s C 1464 2 GpuElemwise{sub,no_inplace} | |
1.8% 98.5% 0.005s 3.36e-06s C 1464 2 GpuSubtensor{::, :int64:} | |
1.5% 100.0% 0.004s 2.93e-06s C 1464 2 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
13.8% 13.8% 0.039s 5.31e-05s 732 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
13.6% 27.5% 0.038s 5.22e-05s 732 3 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
13.6% 41.0% 0.038s 5.20e-05s 732 12 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
13.6% 54.6% 0.038s 5.20e-05s 732 13 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
5.9% 60.5% 0.016s 2.25e-05s 732 14 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(75, 1), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(75, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(75, 1), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
5.7% 66.2% 0.016s 2.20e-05s 732 15 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(75, 1), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(75, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(75, 1), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
5.3% 71.5% 0.015s 2.02e-05s 732 4 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
5.2% 76.7% 0.015s 2.00e-05s 732 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(75, 1), strides=c | |
output 0: dtype=float32, shape=(75, 1), strides=c | |
5.2% 81.9% 0.015s 1.98e-05s 732 5 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
5.0% 86.9% 0.014s 1.93e-05s 732 10 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
5.0% 91.9% 0.014s 1.93e-05s 732 11 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
4.8% 96.7% 0.013s 1.83e-05s 732 2 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(75, 1), strides=c | |
output 0: dtype=float32, shape=(75, 1), strides=c | |
0.9% 97.6% 0.003s 3.44e-06s 732 8 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.9% 98.5% 0.002s 3.29e-06s 732 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.8% 99.3% 0.002s 3.11e-06s 732 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.7% 100.0% 0.002s 2.75e-06s 732 9 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 147KB (206KB) | |
CPU + GPU: 147KB (206KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 147KB (206KB) | |
CPU + GPU: 147KB (206KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 294KB | |
CPU + GPU: 294KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
... (remaining 2 Apply nodes account for 600B/540600B (0.11%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
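The scan body profiled here is a standard gated-recurrent (GRU) step, evidently run once per direction of the bidirectional encoder (hence the paired ...0/...1 node names from the two merged scans). A minimal NumPy sketch of one step, assuming the usual GatedRecurrent formulation from Blocks; names are illustrative, and x_gates / x_in stand for the precomputed gatedrecurrent_apply_gate_inputs / inputs slices:

    import numpy as np

    def gru_step(h, x_gates, x_in, W_gates, W_state, mask):
        """One step of the profiled scan: dim = 100, batch = 75."""
        # GpuGemm + GpuElemwise{ScalarSigmoid}: (75, 200) gate activations
        gates = 1.0 / (1.0 + np.exp(-(x_gates + h.dot(W_gates))))
        # The two GpuSubtensor ops: split at column 100
        update, reset = gates[:, :100], gates[:, 100:]
        # GpuElemwise{mul} + the second GpuGemm, then tanh inside the Composite
        candidate = np.tanh(x_in + (h * reset).dot(W_state))
        # Convex combination of candidate and previous state
        h_new = candidate * update + h * (1.0 - update)
        return mask * h_new + (1.0 - mask) * h  # mask: (75, 1), zero on padding

The last line is exactly the ((i0 * (...)) + (i5 * i3)) Composite elemwise above: wherever the mask is zero, the previous state is carried through unchanged.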
Scan Op profiling ( generator_generate_scan ) | |
================== | |
Message: None | |
Time in 61 calls of the op (for a total of 915 steps) 2.016554e+00s | |
Total time spent in calling the VM 1.933907e+00s (95.902%) | |
Total overhead (computing slices...) 8.264709e-02s (4.098%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
27.2% 27.2% 0.275s 2.31e-05s C 11895 13 theano.sandbox.cuda.basic_ops.GpuElemwise | |
20.7% 47.9% 0.209s 4.58e-05s C 4575 5 theano.sandbox.cuda.blas.GpuDot22 | |
20.3% 68.3% 0.205s 4.49e-05s C 4575 5 theano.sandbox.cuda.blas.GpuGemm | |
10.4% 78.7% 0.105s 2.29e-05s C 4575 5 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
4.0% 82.7% 0.041s 4.47e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
3.8% 86.5% 0.038s 2.09e-05s C 1830 2 theano.sandbox.cuda.basic_ops.HostFromGpu | |
3.4% 89.9% 0.034s 3.75e-05s C 915 1 theano.sandbox.rng_mrg.GPU_mrg_uniform | |
2.3% 92.2% 0.024s 2.58e-05s C 915 1 theano.tensor.basic.MaxAndArgmax | |
1.4% 93.7% 0.014s 2.26e-06s C 6405 7 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.3% 95.0% 0.014s 1.48e-05s C 915 1 theano.sandbox.multinomial.MultinomialFromUniform | |
1.2% 96.2% 0.012s 1.35e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuFromHost | |
1.0% 97.2% 0.010s 2.21e-06s C 4575 5 theano.compile.ops.Shape_i | |
0.9% 98.1% 0.009s 3.13e-06s C 2745 3 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.7% 98.7% 0.007s 1.84e-06s C 3660 4 theano.tensor.opt.MakeVector | |
0.6% 99.3% 0.006s 3.25e-06s C 1830 2 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.4% 99.7% 0.004s 2.12e-06s C 1830 2 theano.tensor.elemwise.Elemwise | |
0.3% 100.0% 0.003s 3.20e-06s C 915 1 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
20.7% 20.7% 0.209s 4.58e-05s C 4575 5 GpuDot22 | |
20.3% 41.1% 0.205s 4.49e-05s C 4575 5 GpuGemm{inplace} | |
5.4% 46.5% 0.055s 3.01e-05s C 1830 2 GpuElemwise{mul,no_inplace} | |
4.0% 50.6% 0.041s 4.47e-05s C 915 1 GpuAdvancedSubtensor1 | |
3.8% 54.3% 0.038s 2.09e-05s C 1830 2 HostFromGpu | |
3.4% 57.7% 0.034s 3.75e-05s C 915 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace} | |
2.7% 60.4% 0.027s 2.98e-05s C 915 1 GpuElemwise{add,no_inplace} | |
2.6% 63.0% 0.026s 2.83e-05s C 915 1 GpuCAReduce{add}{1,0,0} | |
2.4% 65.4% 0.024s 2.62e-05s C 915 1 GpuCAReduce{maximum}{1,0} | |
2.3% 67.7% 0.024s 2.58e-05s C 915 1 MaxAndArgmax | |
2.3% 70.0% 0.023s 2.57e-05s C 915 1 GpuElemwise{Tanh}[(0, 0)] | |
2.1% 72.2% 0.022s 2.36e-05s C 915 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace} | |
2.0% 74.2% 0.021s 2.25e-05s C 915 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
1.9% 76.1% 0.019s 2.06e-05s C 915 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
1.9% 77.9% 0.019s 2.05e-05s C 915 1 GpuCAReduce{maximum}{0,1} | |
1.8% 79.8% 0.019s 2.03e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace} | |
1.8% 81.6% 0.018s 2.00e-05s C 915 1 GpuElemwise{TrueDiv}[(0, 0)] | |
1.8% 83.4% 0.018s 2.00e-05s C 915 1 GpuCAReduce{add}{1,0} | |
1.8% 85.2% 0.018s 1.99e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)] | |
1.8% 87.0% 0.018s 1.99e-05s C 915 1 GpuElemwise{Add}[(0, 1)] | |
... (remaining 21 Ops account for 13.00%(0.13s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
6.7% 6.7% 0.067s 7.36e-05s 915 47 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(900, 100), strides=c | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(900, 1), strides=c | |
4.4% 11.1% 0.045s 4.87e-05s 915 39 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
4.4% 15.4% 0.044s 4.83e-05s 915 11 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
4.3% 19.8% 0.044s 4.76e-05s 915 9 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 44), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 200), strides=c | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 44), strides=c | |
4.0% 23.8% 0.041s 4.47e-05s 915 30 GpuAdvancedSubtensor1(W_copy[cuda], argmax) | |
input 0: dtype=float32, shape=(45, 100), strides=c | |
input 1: dtype=int64, shape=(75,), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.8% 27.6% 0.039s 4.24e-05s 915 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(75, 44), strides=c | |
3.6% 31.3% 0.037s 4.01e-05s 915 40 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.6% 34.9% 0.037s 4.00e-05s 915 57 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda]) | |
input 0: dtype=float32, shape=(12, 75, 1), strides=c | |
input 1: dtype=float32, shape=(12, 75, 200), strides=c | |
output 0: dtype=float32, shape=(12, 75, 200), strides=c | |
3.6% 38.5% 0.036s 3.99e-05s 915 33 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
3.5% 42.0% 0.035s 3.84e-05s 915 6 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
3.4% 45.4% 0.034s 3.75e-05s 915 14 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(92160,), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=float32, shape=(92160,), strides=c | |
output 1: dtype=float32, shape=(75,), strides=c | |
3.4% 48.8% 0.034s 3.73e-05s 915 38 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.4% 52.1% 0.034s 3.73e-05s 915 42 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
2.7% 54.8% 0.027s 2.98e-05s 915 44 GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=c | |
input 1: dtype=float32, shape=(1, 75, 100), strides=c | |
output 0: dtype=float32, shape=(12, 75, 100), strides=c | |
2.6% 57.4% 0.026s 2.83e-05s 915 58 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
input 0: dtype=float32, shape=(12, 75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
2.4% 59.8% 0.024s 2.62e-05s 915 49 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 75), strides=c | |
output 0: dtype=float32, shape=(75,), strides=c | |
2.3% 62.1% 0.024s 2.58e-05s 915 28 MaxAndArgmax(MultinomialFromUniform{int64}.0, TensorConstant{(1,) of 1}) | |
input 0: dtype=int64, shape=(75, 44), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=int64, shape=(75,), strides=c | |
output 1: dtype=int64, shape=(75,), strides=c | |
2.3% 64.4% 0.023s 2.57e-05s 915 46 GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=c | |
output 0: dtype=float32, shape=(900, 100), strides=c | |
2.1% 66.6% 0.022s 2.36e-05s 915 26 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(75, 44), strides=c | |
output 0: dtype=float32, shape=(75, 44), strides=c | |
2.1% 68.7% 0.022s 2.36e-05s 915 41 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(75, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
... (remaining 39 Apply instances account for 31.29%(0.32s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 39KB (39KB) | |
GPU: 1151KB (1151KB) | |
CPU + GPU: 1190KB (1190KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 39KB (39KB) | |
GPU: 1151KB (1151KB) | |
CPU + GPU: 1190KB (1190KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 41KB | |
GPU: 1709KB | |
CPU + GPU: 1750KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
720000B [(12, 75, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda]) | |
368940B [(92160,), (75,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
360000B [(12, 75, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace[cuda]) | |
360000B [(900, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
360000B [(900, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
360000B [(12, 75, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
60000B [(75, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda]) | |
60000B [(75, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
60000B [(75, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
60000B [(75, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
60000B [(75, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0) | |
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100}) | |
30000B [(75, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda]) | |
30000B [(75, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
30000B [(75, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100}) | |
30000B [(75, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax) | |
30000B [(75, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda]) | |
... (remaining 39 Apply nodes account for 188879B/3287819B (5.74%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
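Past the recurrent transition, the elemwise and reduction nodes in this scan spell out the attention read: a numerically stable softmax over the 12 attended positions (the GpuCAReduce{maximum}, exp(i0 - i1) and TrueDiv nodes), followed by a weighted average of the attended states (the mul and GpuCAReduce{add}{1,0,0} nodes). A NumPy sketch with the shapes from the profile; names are illustrative:

    import numpy as np

    def attend(energies, attended):
        """energies: (12, 75) scores; attended: (12, 75, 200) encoder states."""
        e = energies - energies.max(axis=0)  # subtract the max for stability
        w = np.exp(e)
        w = w / w.sum(axis=0)                # softmax weights over the 12 positions
        # Weighted average: broadcast-multiply, then reduce over axis 0
        return (w[:, :, None] * attended).sum(axis=0)  # (75, 200)

The MultinomialFromUniform / MaxAndArgmax pair in the listing is the per-step sampling: a multinomial draw over the 44-way output distribution, with the argmax recovering the sampled index from the one-hot draw.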
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:103 | |
Time in 11 calls to Function.__call__: 1.403611e-01s | |
Time in Function.fn.__call__: 1.400502e-01s (99.779%) | |
Time in thunks: 9.480190e-02s (67.541%) | |
Total compile time: 6.756872e+01s | |
Number of Apply nodes: 190 | |
Theano Optimizer time: 4.246896e+00s | |
Theano validate time: 1.580198e-01s | |
Theano Linker time (includes C, CUDA code generation/compiling): 5.792800e+01s | |
Import time 1.193612e-01s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.896s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
89.0% 89.0% 0.084s 3.84e-03s Py 22 2 theano.scan_module.scan_op.Scan | |
3.0% 92.0% 0.003s 2.59e-06s C 1089 99 theano.tensor.elemwise.Elemwise | |
1.9% 93.9% 0.002s 4.05e-05s C 44 4 theano.sandbox.cuda.blas.GpuDot22 | |
0.9% 94.8% 0.001s 1.99e-05s C 44 4 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.7% 95.5% 0.001s 6.24e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.6% 96.1% 0.001s 2.50e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.5% 96.7% 0.001s 4.63e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
0.5% 97.1% 0.000s 2.08e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.4% 97.6% 0.000s 3.18e-06s C 132 12 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.4% 98.0% 0.000s 2.77e-06s C 143 13 theano.compile.ops.Shape_i | |
0.4% 98.4% 0.000s 2.77e-06s C 143 13 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.3% 98.8% 0.000s 2.48e-06s C 132 12 theano.tensor.opt.MakeVector | |
0.3% 99.1% 0.000s 2.21e-06s C 132 12 theano.tensor.basic.ScalarFromTensor | |
0.3% 99.3% 0.000s 2.39e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.3% 99.6% 0.000s 3.71e-06s C 66 6 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.2% 99.8% 0.000s 6.49e-06s Py 22 2 theano.compile.ops.Rebroadcast | |
0.1% 99.9% 0.000s 6.40e-06s C 22 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty | |
0.1% 100.0% 0.000s 5.25e-06s C 11 1 theano.tensor.basic.Alloc | |
0.0% 100.0% 0.000s 3.21e-06s C 11 1 theano.tensor.basic.Reshape | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
89.0% 89.0% 0.084s 3.84e-03s Py 22 2 forall_inplace,gpu,gatedrecurrent_apply_scan} | |
1.9% 90.9% 0.002s 4.05e-05s C 44 4 GpuDot22 | |
0.9% 91.8% 0.001s 1.99e-05s C 44 4 GpuElemwise{Add}[(0, 0)] | |
0.7% 92.6% 0.001s 6.24e-05s C 11 1 GpuJoin | |
0.6% 93.1% 0.001s 2.50e-05s C 22 2 GpuAlloc | |
0.5% 93.7% 0.001s 4.63e-05s C 11 1 GpuAdvancedSubtensor1 | |
0.5% 94.2% 0.000s 2.08e-05s C 22 2 GpuIncSubtensor{InplaceSet;:int64:} | |
0.3% 94.5% 0.000s 2.48e-06s C 132 12 MakeVector{dtype='int64'} | |
0.3% 94.8% 0.000s 2.21e-06s C 132 12 ScalarFromTensor | |
0.3% 95.1% 0.000s 2.39e-05s C 11 1 HostFromGpu | |
0.3% 95.4% 0.000s 2.57e-06s C 99 9 Elemwise{add,no_inplace} | |
0.3% 95.6% 0.000s 3.19e-06s C 77 7 GpuReshape{2} | |
0.2% 95.9% 0.000s 2.97e-06s C 77 7 Shape_i{0} | |
0.2% 96.1% 0.000s 2.43e-06s C 88 8 Elemwise{le,no_inplace} | |
0.2% 96.3% 0.000s 2.83e-06s C 66 6 GpuDimShuffle{x,x,0} | |
0.2% 96.5% 0.000s 3.16e-06s C 55 5 GpuReshape{3} | |
0.2% 96.6% 0.000s 2.55e-06s C 66 6 Shape_i{1} | |
0.2% 96.8% 0.000s 2.50e-06s C 66 6 Elemwise{sub,no_inplace} | |
0.2% 97.0% 0.000s 6.49e-06s Py 22 2 Rebroadcast{0} | |
0.1% 97.1% 0.000s 6.40e-06s C 22 2 GpuAllocEmpty | |
... (remaining 56 Ops account for 2.88%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
45.0% 45.0% 0.043s 3.88e-03s 11 140 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
44.0% 89.0% 0.042s 3.79e-03s 11 182 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.7% 89.8% 0.001s 6.24e-05s 11 188 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
0.5% 90.3% 0.001s 4.63e-05s 11 31 GpuAdvancedSubtensor1(W, Reshape{1}.0) | |
input 0: dtype=float32, shape=(44, 100), strides=c | |
input 1: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.5% 90.8% 0.000s 4.20e-05s 11 58 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(12, 200), strides=(200, 1) | |
0.5% 91.2% 0.000s 4.02e-05s 11 55 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.5% 91.7% 0.000s 4.00e-05s 11 57 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.5% 92.2% 0.000s 3.98e-05s 11 56 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(12, 200), strides=(200, 1) | |
0.3% 92.5% 0.000s 2.64e-05s 11 103 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
0.3% 92.8% 0.000s 2.39e-05s 11 189 HostFromGpu(GpuJoin.0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=c | |
0.3% 93.0% 0.000s 2.36e-05s 11 71 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
0.2% 93.3% 0.000s 2.14e-05s 11 137 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 93.5% 0.000s 2.01e-05s 11 167 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 93.7% 0.000s 2.01e-05s 11 78 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 94.0% 0.000s 1.99e-05s 11 79 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
0.2% 94.2% 0.000s 1.99e-05s 11 80 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 94.4% 0.000s 1.97e-05s 11 81 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
0.1% 94.5% 0.000s 7.91e-06s 11 124 GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.1% 94.6% 0.000s 6.59e-06s 11 132 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
0.1% 94.7% 0.000s 6.39e-06s 11 98 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
... (remaining 170 Apply instances account for 5.32%(0.01s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 9KB (9KB) | |
GPU: 28KB (34KB) | |
CPU + GPU: 38KB (43KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 9KB (9KB) | |
GPU: 33KB (38KB) | |
CPU + GPU: 42KB (48KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 10KB | |
GPU: 52KB | |
CPU + GPU: 63KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
80000B [(100, 200)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0) | |
80000B [(100, 200)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0) | |
80000B [(100, 200)] v GpuDimShuffle{0,1}(W) | |
80000B [(100, 200)] v GpuDimShuffle{0,1}(W) | |
40000B [(100, 100)] v GpuDimShuffle{0,1}(W) | |
40000B [(100, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0) | |
40000B [(100, 100)] v GpuDimShuffle{0,1}(W) | |
40000B [(100, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0) | |
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
9600B [(12, 1, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}) | |
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1}) | |
9600B [(12, 1, 200)] c HostFromGpu(GpuJoin.0) | |
4800B [(12, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0) | |
4800B [(12, 1, 100)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}) | |
... (remaining 170 Apply nodes account for 94077B/679677B (13.84%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
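89% of this beam-search helper's runtime sits in the two forall_inplace,gpu,gatedrecurrent_apply_scan} nodes. The negative strides on the second scan's sequence inputs and the closing GpuJoin along axis 2 are the signature of a bidirectional encoder: one pass over the 12-step sequence, one over its reverse, with the two 100-dimensional state sequences concatenated into the (12, 1, 200) result. Schematically, reusing the gru_step sketch from earlier (a single weight set is shared below only for brevity; in the real graph each direction has its own parameters):

    import numpy as np
    # gru_step as defined in the earlier NumPy sketch

    def encode_bidirectional(xs_gates, xs_in, W_gates, W_state, h0):
        """xs_*: per-step input contributions, time-major, e.g. (12, 1, dim)."""
        def run(gates_seq, in_seq):
            h, out = h0, []
            for xg, xi in zip(gates_seq, in_seq):  # the scan loop over 12 steps
                h = gru_step(h, xg, xi, W_gates, W_state,
                             mask=np.ones((h.shape[0], 1), dtype=h.dtype))
                out.append(h)
            return np.stack(out)                   # (12, 1, 100) state sequence
        fwd = run(xs_gates, xs_in)
        bwd = run(xs_gates[::-1], xs_in[::-1])[::-1]  # reversed pass, realigned
        return np.concatenate([fwd, bwd], axis=2)     # the GpuJoin: (12, 1, 200)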
Scan Op profiling ( gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 11 calls of the op (for a total of 132 steps) 4.214001e-02s | |
Total time spent in calling the VM 4.016280e-02s (95.308%) | |
Total overhead (computing slices...) 1.977205e-03s (4.692%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
61.2% 61.2% 0.013s 4.91e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm | |
34.9% 96.2% 0.007s 1.87e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.8% 100.0% 0.001s 3.07e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
61.2% 61.2% 0.013s 4.91e-05s C 264 2 GpuGemm{no_inplace} | |
11.9% 73.1% 0.003s 1.90e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace} | |
11.6% 84.7% 0.002s 1.86e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
11.5% 96.2% 0.002s 1.84e-05s C 132 1 GpuElemwise{mul,no_inplace} | |
2.1% 98.3% 0.000s 3.34e-06s C 132 1 GpuSubtensor{::, :int64:} | |
1.7% 100.0% 0.000s 2.80e-06s C 132 1 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
31.4% 31.4% 0.007s 5.04e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
29.8% 61.2% 0.006s 4.79e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
11.9% 73.1% 0.003s 1.90e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(1, 100), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=c | |
input 3: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
11.6% 84.7% 0.002s 1.86e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
11.5% 96.2% 0.002s 1.84e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(1, 100), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
2.1% 98.3% 0.000s 3.34e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
1.7% 100.0% 0.000s 2.80e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 2KB (2KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 2KB (2KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 2KB | |
CPU + GPU: 2KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 7 Apply nodes account for 3600B/3600B (100.00%) of the Apply nodes with dense output sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
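With batch size 1 the matrices shrink to (1, 100), yet the two GpuGemm{no_inplace} nodes still take over 60% of the step. Theano's gemm folds the addition of the precomputed input term into the BLAS call: gemm(z, alpha, x, y, beta) evaluates alpha * dot(x, y) + beta * z, which is why no separate add node appears in this listing. A NumPy equivalent, as a sketch (no_inplace means z is copied rather than overwritten):

    import numpy as np

    def gemm_no_inplace(z, alpha, x, y, beta):
        # alpha * (x @ y) + beta * z, as the GpuGemm{no_inplace} nodes compute
        return alpha * x.dot(y) + beta * z

    h = np.zeros((1, 100), dtype=np.float32)            # previous state
    W_gates = np.zeros((100, 200), dtype=np.float32)    # state_to_gates
    gate_inputs = np.zeros((1, 200), dtype=np.float32)  # precomputed input term
    pre_gates = gemm_no_inplace(gate_inputs, 1.0, h, W_gates, 1.0)  # (1, 200)

At this size each call is latency-bound, so the per-call times above (around 5e-05s) reflect kernel-launch overhead more than arithmetic.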
Scan Op profiling ( gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 11 calls of the op (for a total of 132 steps) 4.123449e-02s | |
Total time spent in calling the VM 3.931022e-02s (95.333%) | |
Total overhead (computing slices...) 1.924276e-03s (4.667%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
61.1% 61.1% 0.013s 4.84e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm | |
35.1% 96.2% 0.007s 1.85e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.8% 100.0% 0.001s 3.01e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
61.1% 61.1% 0.013s 4.84e-05s C 264 2 GpuGemm{no_inplace} | |
12.1% 73.2% 0.003s 1.92e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace} | |
11.6% 84.8% 0.002s 1.84e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
11.4% 96.2% 0.002s 1.80e-05s C 132 1 GpuElemwise{mul,no_inplace} | |
2.1% 98.3% 0.000s 3.31e-06s C 132 1 GpuSubtensor{::, :int64:} | |
1.7% 100.0% 0.000s 2.72e-06s C 132 1 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
31.3% 31.3% 0.007s 4.96e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
29.8% 61.1% 0.006s 4.72e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
12.1% 73.2% 0.003s 1.92e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(1, 100), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=c | |
input 3: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
11.6% 84.8% 0.002s 1.84e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
11.4% 96.2% 0.002s 1.80e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(1, 100), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
2.1% 98.3% 0.000s 3.31e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
1.7% 100.0% 0.000s 2.72e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 2KB (2KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 2KB (2KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 2KB | |
CPU + GPU: 2KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 7 Apply nodes account for 3600B/3600B (100.00%) of the Apply nodes with dense output sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
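These two scan profiles are near-duplicates because they are the forward and backward halves of the same bidirectional pass, timed separately. The per-scan "Total overhead" line is simply the gap between the op's wall time and the time spent inside the inner VM; for the section directly above, in plain Python:

    time_in_op = 4.123449e-02  # Time in 11 calls of the op
    vm_time = 3.931022e-02     # Total time spent in calling the VM
    print((time_in_op - vm_time) / time_in_op)  # ~0.04667, the reported 4.667%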
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:111 | |
Time in 11 calls to Function.__call__: 2.407074e-03s | |
Time in Function.fn.__call__: 2.131939e-03s (88.570%) | |
Time in thunks: 4.451275e-04s (18.492%) | |
Total compile time: 2.637064e+01s | |
Number of Apply nodes: 8 | |
Theano Optimizer time: 8.200908e-02s | |
Theano validate time: 1.047134e-03s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.873722e+01s | |
Import time 9.109974e-03s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.952s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
45.8% 45.8% 0.000s 1.85e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu | |
19.9% 65.7% 0.000s 4.02e-06s C 22 2 theano.tensor.basic.Alloc | |
15.8% 81.5% 0.000s 3.20e-06s C 22 2 theano.compile.ops.Shape_i | |
6.9% 88.3% 0.000s 2.77e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuReshape | |
6.2% 94.5% 0.000s 2.49e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
5.5% 100.0% 0.000s 2.23e-06s C 11 1 theano.tensor.opt.MakeVector | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
45.8% 45.8% 0.000s 1.85e-05s C 11 1 HostFromGpu | |
19.9% 65.7% 0.000s 4.02e-06s C 22 2 Alloc | |
15.8% 81.5% 0.000s 3.20e-06s C 22 2 Shape_i{0} | |
6.9% 88.3% 0.000s 2.77e-06s C 11 1 GpuReshape{2} | |
6.2% 94.5% 0.000s 2.49e-06s C 11 1 GpuDimShuffle{x,x,0} | |
5.5% 100.0% 0.000s 2.23e-06s C 11 1 MakeVector{dtype='int64'} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
45.8% 45.8% 0.000s 1.85e-05s 11 7 HostFromGpu(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
output 0: dtype=float32, shape=(1, 100), strides=c | |
10.6% 56.4% 0.000s 4.29e-06s 11 4 Alloc(TensorConstant{0.0}, TensorConstant{1}, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 12), strides=c | |
9.4% 65.8% 0.000s 3.79e-06s 11 0 Shape_i{0}(generator_generate_attended) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
9.3% 75.0% 0.000s 3.75e-06s 11 1 Alloc(TensorConstant{0.0}, TensorConstant{1}, TensorConstant{200}) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int16, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
6.9% 81.9% 0.000s 2.77e-06s 11 6 GpuReshape{2}(GpuDimShuffle{x,x,0}.0, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
6.4% 88.3% 0.000s 2.60e-06s 11 2 Shape_i{0}(initial_state) | |
input 0: dtype=float32, shape=(100,), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
6.2% 94.5% 0.000s 2.49e-06s 11 3 GpuDimShuffle{x,x,0}(initial_state) | |
input 0: dtype=float32, shape=(100,), strides=c | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
5.5% 100.0% 0.000s 2.23e-06s 11 5 MakeVector{dtype='int64'}(TensorConstant{1}, Shape_i{0}.0) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=int64, shape=(), strides=c | |
output 0: dtype=int64, shape=(2,), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 1KB (1KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 1KB (1KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 1KB (1KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 1KB (1KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 1KB | |
GPU: 0KB | |
CPU + GPU: 1KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 8 Apply nodes account for 2080B/2080B (100.00%) of the Apply nodes with dense output sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
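Nearly half of this small initializer function's thunk time is a single HostFromGpu, i.e. the device-to-host copy of one (1, 100) array. The 45.8% figure follows directly from the numbers above (the listed per-call time is truncated, hence the slight rounding):

    time_in_thunks = 4.451275e-04          # total over 11 calls
    host_from_gpu = 1.85e-05 * 11          # per-call time x number of calls
    print(host_from_gpu / time_in_thunks)  # ~0.457, the profiler's 45.8%

At this scale the transfer, not the arithmetic, sets the floor; the usual remedies are keeping intermediate results on the GPU or batching work into fewer, larger calls.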
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:126 | |
Time in 176 calls to Function.__call__: 4.303689e-01s | |
Time in Function.fn.__call__: 4.239194e-01s (98.501%) | |
Time in thunks: 1.613367e-01s (37.488%) | |
Total compile time: 9.262143e+00s | |
Number of Apply nodes: 79 | |
Theano Optimizer time: 5.638268e-01s | |
Theano validate time: 2.706265e-02s | |
Theano Linker time (includes C, CUDA code generation/compiling): 2.979231e-01s | |
Import time 1.633863e-01s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.954s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
21.9% 21.9% 0.035s 4.02e-05s C 880 5 theano.sandbox.cuda.blas.GpuDot22 | |
20.7% 42.7% 0.033s 1.90e-05s C 1760 10 theano.sandbox.cuda.basic_ops.GpuElemwise | |
17.9% 60.6% 0.029s 4.10e-05s C 704 4 theano.sandbox.cuda.blas.GpuGemm | |
7.3% 67.8% 0.012s 1.66e-05s C 704 4 theano.sandbox.cuda.basic_ops.HostFromGpu | |
7.2% 75.0% 0.012s 2.19e-05s C 528 3 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
7.1% 82.1% 0.012s 1.31e-05s C 880 5 theano.sandbox.cuda.basic_ops.GpuFromHost | |
4.8% 86.9% 0.008s 4.41e-05s C 176 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
3.5% 90.4% 0.006s 2.68e-06s C 2112 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
2.2% 92.7% 0.004s 2.29e-06s C 1584 9 theano.tensor.elemwise.Elemwise | |
2.2% 94.9% 0.004s 2.94e-06s C 1232 7 theano.sandbox.cuda.basic_ops.GpuReshape | |
2.1% 97.1% 0.003s 2.45e-06s C 1408 8 theano.compile.ops.Shape_i | |
1.7% 98.8% 0.003s 2.28e-06s C 1232 7 theano.tensor.opt.MakeVector | |
0.7% 99.5% 0.001s 3.15e-06s C 352 2 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.3% 99.8% 0.000s 2.45e-06s C 176 1 theano.tensor.elemwise.All | |
0.2% 100.0% 0.000s 2.15e-06s C 176 1 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
21.9% 21.9% 0.035s 4.02e-05s C 880 5 GpuDot22 | |
17.9% 39.8% 0.029s 4.10e-05s C 704 4 GpuGemm{inplace} | |
7.3% 47.1% 0.012s 1.66e-05s C 704 4 HostFromGpu | |
7.1% 54.2% 0.012s 1.31e-05s C 880 5 GpuFromHost | |
4.8% 59.0% 0.008s 4.41e-05s C 176 1 GpuAdvancedSubtensor1 | |
2.6% 61.7% 0.004s 2.42e-05s C 176 1 GpuCAReduce{maximum}{1,0} | |
2.3% 64.0% 0.004s 2.12e-05s C 176 1 GpuCAReduce{add}{1,0,0} | |
2.2% 66.2% 0.004s 2.02e-05s C 176 1 GpuCAReduce{add}{1,0} | |
2.2% 68.4% 0.003s 1.98e-05s C 176 1 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)] | |
2.1% 70.5% 0.003s 1.96e-05s C 176 1 GpuElemwise{mul,no_inplace} | |
2.1% 72.6% 0.003s 1.94e-05s C 176 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)] | |
2.1% 74.7% 0.003s 1.93e-05s C 176 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
2.1% 76.8% 0.003s 1.92e-05s C 176 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
2.1% 78.9% 0.003s 1.92e-05s C 176 1 GpuElemwise{Mul}[(0, 1)] | |
2.0% 80.9% 0.003s 1.86e-05s C 176 1 GpuElemwise{Add}[(0, 0)] | |
2.0% 83.0% 0.003s 1.86e-05s C 176 1 GpuElemwise{TrueDiv}[(0, 0)] | |
2.0% 85.0% 0.003s 1.83e-05s C 176 1 GpuElemwise{Sub}[(0, 1)] | |
2.0% 86.9% 0.003s 1.82e-05s C 176 1 GpuElemwise{Tanh}[(0, 0)] | |
1.9% 88.8% 0.003s 2.89e-06s C 1056 6 GpuReshape{2} | |
1.7% 90.6% 0.003s 2.28e-06s C 1232 7 MakeVector{dtype='int64'} | |
... (remaining 23 Ops account for 9.43%(0.02s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
5.4% 5.4% 0.009s 4.94e-05s 176 37 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(200, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
4.9% 10.3% 0.008s 4.54e-05s 176 29 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
4.9% 15.2% 0.008s 4.49e-05s 176 48 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
4.8% 20.1% 0.008s 4.41e-05s 176 12 GpuAdvancedSubtensor1(W, readout_sample_samples) | |
input 0: dtype=float32, shape=(45, 100), strides=c | |
input 1: dtype=int64, shape=(1,), strides=(16,) | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
4.3% 24.3% 0.007s 3.92e-05s 176 24 GpuDot22(GpuFromHost.0, state_to_gates) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
4.1% 28.4% 0.007s 3.77e-05s 176 47 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
4.1% 32.6% 0.007s 3.77e-05s 176 57 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
4.0% 36.6% 0.007s 3.70e-05s 176 51 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)].0, W) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
4.0% 40.6% 0.006s 3.69e-05s 176 49 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
4.0% 44.6% 0.006s 3.68e-05s 176 33 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
2.6% 47.3% 0.004s 2.42e-05s 176 59 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(1,), strides=(0,) | |
2.3% 49.6% 0.004s 2.12e-05s 176 77 GpuCAReduce{add}{1,0,0}(GpuElemwise{Mul}[(0, 1)].0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
output 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
2.2% 51.8% 0.004s 2.02e-05s 176 63 GpuCAReduce{add}{1,0}(GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(1,), strides=(0,) | |
2.2% 53.9% 0.003s 1.98e-05s 176 54 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 2: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
2.1% 56.1% 0.003s 1.96e-05s 176 46 GpuElemwise{mul,no_inplace}(GpuFromHost.0, GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 100), strides=(0, 1) | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
2.1% 58.2% 0.003s 1.94e-05s 176 50 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, GpuFromHost.0, CudaNdarrayConstant{[[ 1.]]}) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 2: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 3: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
2.1% 60.3% 0.003s 1.93e-05s 176 38 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=float32, shape=(1, 200), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=c | |
2.1% 62.4% 0.003s 1.92e-05s 176 61 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)](GpuReshape{2}.0, GpuDimShuffle{x,0}.0, GpuFromHost.0) | |
input 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
input 1: dtype=float32, shape=(1, 1), strides=(0, 0) | |
input 2: dtype=float32, shape=(12, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
2.1% 64.5% 0.003s 1.92e-05s 176 76 GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0) | |
input 0: dtype=float32, shape=(12, 1, 1), strides=(1, 0, 0) | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
2.0% 66.5% 0.003s 1.86e-05s 176 72 GpuElemwise{TrueDiv}[(0, 0)](GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0, GpuElemwise{Add}[(0, 0)].0) | |
input 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
input 1: dtype=float32, shape=(1, 1), strides=(0, 0) | |
output 0: dtype=float32, shape=(12, 1), strides=(1, 0) | |
... (remaining 59 Apply instances account for 33.48%(0.05s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 1KB (1KB) | |
GPU: 14KB (16KB) | |
CPU + GPU: 15KB (18KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 1KB (1KB) | |
GPU: 14KB (16KB) | |
CPU + GPU: 15KB (18KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 1KB | |
GPU: 18KB | |
CPU + GPU: 20KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
80000B [(200, 100)] v GpuDimShuffle{0,1}(W) | |
80000B [(200, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0) | |
9600B [(12, 1, 200)] v GpuDimShuffle{0,1,2}(GpuFromHost.0) | |
9600B [(12, 1, 200)] i GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0) | |
9600B [(12, 1, 200)] c GpuFromHost(generator_generate_attended) | |
9600B [(12, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0) | |
4800B [(12, 1, 100)] i GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0) | |
4800B [(12, 1, 100)] v GpuDimShuffle{0,1,2}(GpuReshape{3}.0) | |
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
4800B [(12, 100)] v GpuReshape{2}(GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)].0, MakeVector{dtype='int64'}.0) | |
4800B [(12, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
... (remaining 67 Apply account for 13955B/241155B (5.79%) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
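Two things stand out in the beam_search.py:126 profile above: the per-call matrices are tiny (1x100 and 1x200), so GpuFromHost/HostFromGpu transfers plus kernel-launch overhead make up a sizable share of the runtime (about 14% combined), and the fused GpuElemwise composites are the gated-recurrent-unit arithmetic. A sketch in plain numpy of what the dominant composite kernel computes -- argument names are illustrative, not taken from the graph:

    import numpy as np

    def gru_state_update(h_prev, z, preact, bias):
        # What GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}
        # evaluates per step, with i0=bias, i1=preactivation, i2=update gate z,
        # i3=previous state, i4=1: blend a tanh candidate state with the
        # previous state, weighted by the gate.
        h_tilde = np.tanh(bias + preact)         # candidate state
        return h_tilde * z + h_prev * (1.0 - z)  # convex combination via z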
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:137 | |
Time in 176 calls to Function.__call__: 1.020806e-01s | |
Time in Function.fn.__call__: 9.711361e-02s (95.134%) | |
Time in thunks: 4.424906e-02s (43.347%) | |
Total compile time: 5.741991e+00s | |
Number of Apply nodes: 14 | |
Theano Optimizer time: 1.551719e-01s | |
Theano validate time: 3.836393e-03s | |
Theano Linker time (includes C, CUDA code generation/compiling): 6.299686e-02s | |
Import time 3.151894e-02s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.967s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
29.7% 29.7% 0.013s 1.87e-05s C 704 4 theano.sandbox.cuda.basic_ops.GpuElemwise | |
18.0% 47.7% 0.008s 4.51e-05s C 176 1 theano.sandbox.cuda.blas.GpuGemm | |
16.2% 63.9% 0.007s 2.04e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
15.7% 79.6% 0.007s 3.94e-05s C 176 1 theano.sandbox.cuda.blas.GpuDot22 | |
10.4% 90.0% 0.005s 1.31e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuFromHost | |
6.8% 96.8% 0.003s 1.71e-05s C 176 1 theano.sandbox.cuda.basic_ops.HostFromGpu | |
3.2% 100.0% 0.001s 2.67e-06s C 528 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
18.0% 18.0% 0.008s 4.51e-05s C 176 1 GpuGemm{inplace} | |
15.7% 33.6% 0.007s 3.94e-05s C 176 1 GpuDot22 | |
10.4% 44.0% 0.005s 1.31e-05s C 352 2 GpuFromHost | |
8.4% 52.4% 0.004s 2.10e-05s C 176 1 GpuCAReduce{maximum}{0,1} | |
8.0% 60.4% 0.004s 2.00e-05s C 176 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace} | |
7.9% 68.2% 0.003s 1.98e-05s C 176 1 GpuCAReduce{add}{0,1} | |
7.4% 75.6% 0.003s 1.86e-05s C 176 1 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)] | |
7.4% 83.0% 0.003s 1.86e-05s C 176 1 GpuElemwise{Add}[(0, 1)] | |
7.0% 90.0% 0.003s 1.76e-05s C 176 1 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)] | |
6.8% 96.8% 0.003s 1.71e-05s C 176 1 HostFromGpu | |
2.0% 98.8% 0.001s 2.55e-06s C 352 2 GpuDimShuffle{0,x} | |
1.2% 100.0% 0.001s 2.90e-06s C 176 1 GpuDimShuffle{x,0} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
18.0% 18.0% 0.008s 4.51e-05s 176 4 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
15.7% 33.6% 0.007s 3.94e-05s 176 3 GpuDot22(GpuFromHost.0, W) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
8.4% 42.0% 0.004s 2.10e-05s 176 6 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1,), strides=(0,) | |
8.0% 49.9% 0.004s 2.00e-05s 176 8 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}(GpuElemwise{Add}[(0, 1)].0, GpuDimShuffle{0,x}.0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=(0, 0) | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
7.9% 57.8% 0.003s 1.98e-05s 176 9 GpuCAReduce{add}{0,1}(GpuElemwise{Composite{exp((i0 - i1))},no_inplace}.0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1,), strides=(0,) | |
7.4% 65.2% 0.003s 1.86e-05s 176 11 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)](GpuDimShuffle{0,x}.0, GpuDimShuffle{0,x}.0) | |
input 0: dtype=float32, shape=(1, 1), strides=(0, 0) | |
input 1: dtype=float32, shape=(1, 1), strides=(0, 0) | |
output 0: dtype=float32, shape=(1, 1), strides=(0, 0) | |
7.4% 72.6% 0.003s 1.86e-05s 176 5 GpuElemwise{Add}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
7.0% 79.6% 0.003s 1.76e-05s 176 12 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)](GpuElemwise{Add}[(0, 1)].0, GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=(0, 0) | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
6.8% 86.4% 0.003s 1.71e-05s 176 13 HostFromGpu(GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1, 44), strides=c | |
6.0% 92.4% 0.003s 1.51e-05s 176 0 GpuFromHost(generator_generate_weighted_averages) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
4.4% 96.8% 0.002s 1.11e-05s 176 1 GpuFromHost(generator_generate_states) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
1.2% 98.0% 0.001s 2.90e-06s 176 2 GpuDimShuffle{x,0}(b) | |
input 0: dtype=float32, shape=(44,), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
1.1% 99.0% 0.000s 2.65e-06s 176 7 GpuDimShuffle{0,x}(GpuCAReduce{maximum}{0,1}.0) | |
input 0: dtype=float32, shape=(1,), strides=(0,) | |
output 0: dtype=float32, shape=(1, 1), strides=(0, 0) | |
1.0% 100.0% 0.000s 2.46e-06s 176 10 GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0) | |
input 0: dtype=float32, shape=(1,), strides=(0,) | |
output 0: dtype=float32, shape=(1, 1), strides=(0, 0) | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 1KB (1KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 1KB (1KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 2KB | |
CPU + GPU: 2KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 14 Apply account for 2452B/2452B (100.00%) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
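The op chain in the beam_search.py:137 profile above -- max-reduce, exp(i0 - i1), add-reduce, (i0 + log(i1)), then -(i0 - i1) -- is the standard numerically stable log-softmax, negated to give per-symbol costs. A sketch of the same computation in numpy:

    import numpy as np

    def neg_log_softmax(scores):
        m = scores.max(axis=-1, keepdims=True)  # GpuCAReduce{maximum}
        e = np.exp(scores - m)                  # Composite{exp((i0 - i1))}
        s = e.sum(axis=-1, keepdims=True)       # GpuCAReduce{add}
        logsumexp = m + np.log(s)               # Composite{(i0 + log(i1))}
        return -(scores - logsumexp)            # Composite{(-(i0 - i1))}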
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 1 calls to Function.__call__: 1.502037e-05s | |
Time in Function.fn.__call__: 6.198883e-06s (41.270%) | |
Total compile time: 5.506201e+00s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 1.379991e-02s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.969337e-04s | |
Import time 0.000000e+00s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.970s | |
No execution time accumulated (hint: try config profiling.time_thunks=1) | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286 | |
Time in 6075 calls to Function.__call__: 3.570089e-01s | |
Time in Function.fn.__call__: 2.095068e-01s (58.684%) | |
Time in thunks: 3.889871e-02s (10.896%) | |
Total compile time: 7.101128e+00s | |
Number of Apply nodes: 2 | |
Theano Optimizer time: 1.691389e-02s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 2.875090e-03s | |
Import time 0.000000e+00s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.970s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
100.0% 100.0% 0.039s 3.20e-06s C 12150 2 theano.compile.ops.DeepCopyOp | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
100.0% 100.0% 0.039s 3.20e-06s C 12150 2 DeepCopyOp | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
59.5% 59.5% 0.023s 3.81e-06s 6075 0 DeepCopyOp(labels) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
40.5% 100.0% 0.016s 2.59e-06s 6075 1 DeepCopyOp(inputs) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 2 Apply account for 192B/192B (100.00%) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
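The evaluators.py:286 graph above consists of nothing but two DeepCopyOp nodes: when a compiled function returns an input unchanged, Theano inserts a DeepCopyOp so the returned array does not alias caller-owned storage. A minimal sketch that reproduces such a graph -- the labels/inputs names mirror the profile above, the rest is illustrative:

    import numpy as np
    import theano
    import theano.tensor as T

    labels = T.lvector('labels')   # dtype=int64, as in the profile
    inputs = T.lvector('inputs')
    # Returning the inputs themselves compiles to one DeepCopyOp per output.
    f = theano.function([labels, inputs], [labels, inputs])
    f(np.arange(12), np.arange(12))
    theano.printing.debugprint(f)  # shows DeepCopyOp(labels), DeepCopyOp(inputs)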
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/algorithms/__init__.py:253 | |
Time in 100 calls to Function.__call__: 9.018593e+01s | |
Time in Function.fn.__call__: 8.999730e+01s (99.791%) | |
Time in thunks: 3.194728e+01s (35.424%) | |
Total compile time: 3.881262e+02s | |
Number of Apply nodes: 3574 | |
Theano Optimizer time: 2.013044e+02s | |
Theano validate time: 4.291104e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.755457e+02s | |
Import time 1.107465e+01s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 830.971s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
82.1% 82.1% 26.232s 3.75e-02s Py 700 7 theano.scan_module.scan_op.Scan | |
5.7% 87.8% 1.812s 2.16e-05s C 83700 837 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.1% 90.9% 1.002s 1.00e-02s Py 100 1 lvsr.ops.EditDistanceOp | |
2.4% 93.3% 0.761s 3.08e-05s C 24700 247 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
1.0% 94.3% 0.330s 4.65e-05s C 7100 71 theano.sandbox.cuda.blas.GpuDot22 | |
1.0% 95.3% 0.313s 3.64e-06s C 86000 860 theano.tensor.elemwise.Elemwise | |
0.9% 96.2% 0.291s 1.82e-05s C 16000 160 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.5% 96.8% 0.173s 2.54e-05s C 6800 68 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.5% 97.3% 0.166s 2.30e-05s Py 7200 48 theano.ifelse.IfElse | |
0.4% 97.7% 0.142s 2.57e-05s C 5500 55 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.4% 98.2% 0.139s 3.32e-06s C 41800 418 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.4% 98.6% 0.129s 7.69e-06s C 16800 168 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.2% 98.8% 0.063s 2.11e-05s C 3000 30 theano.compile.ops.DeepCopyOp | |
0.1% 98.9% 0.048s 4.30e-06s C 11100 111 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.1% 99.1% 0.048s 3.69e-06s C 12900 129 theano.tensor.opt.MakeVector | |
0.1% 99.2% 0.039s 1.68e-05s C 2300 23 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.1% 99.3% 0.036s 3.41e-06s C 10600 106 theano.compile.ops.Shape_i | |
0.1% 99.4% 0.030s 9.88e-05s Py 300 3 theano.sandbox.cuda.basic_ops.GpuSplit | |
0.1% 99.5% 0.026s 6.61e-05s C 400 4 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.1% 99.5% 0.025s 2.81e-06s C 8800 88 theano.tensor.basic.ScalarFromTensor | |
... (remaining 24 Classes account for 0.45%(0.14s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
29.2% 29.2% 9.321s 9.32e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan} | |
19.3% 48.5% 6.165s 6.16e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan&generator_generate_scan} | |
14.4% 62.9% 4.615s 2.31e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan} | |
11.5% 74.4% 3.680s 3.68e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan} | |
5.0% 79.4% 1.599s 1.60e-02s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
3.1% 82.6% 1.002s 1.00e-02s Py 100 1 EditDistanceOp | |
2.7% 85.2% 0.851s 8.51e-03s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
1.0% 86.3% 0.330s 4.65e-05s C 7100 71 GpuDot22 | |
1.0% 87.3% 0.319s 3.80e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1} | |
0.9% 88.2% 0.291s 1.82e-05s C 16000 160 HostFromGpu | |
0.6% 88.8% 0.207s 2.13e-05s C 9700 97 GpuElemwise{add,no_inplace} | |
0.5% 89.4% 0.172s 2.20e-05s C 7800 78 GpuElemwise{sub,no_inplace} | |
0.5% 89.9% 0.171s 3.57e-05s C 4800 48 GpuCAReduce{add}{1,1} | |
0.5% 90.4% 0.155s 2.38e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace} | |
0.5% 90.9% 0.154s 2.49e-05s Py 6200 39 if{gpu} | |
0.5% 91.3% 0.150s 2.34e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace} | |
0.4% 91.8% 0.134s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)] | |
0.4% 92.2% 0.133s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)] | |
0.4% 92.6% 0.133s 2.29e-05s C 5800 58 GpuElemwise{Switch,no_inplace} | |
0.4% 93.0% 0.131s 1.99e-05s C 6600 66 GpuElemwise{Mul}[(0, 0)] | |
... (remaining 262 Ops account for 6.99%(2.23s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
29.2% 29.2% 9.321s 9.32e-02s 100 2406 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuS | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 12), strides=c | |
input 2: dtype=float32, shape=(15, 10, 200), strides=c | |
input 3: dtype=float32, shape=(15, 10, 100), strides=c | |
input 4: dtype=float32, shape=(15, 10, 100), strides=c | |
input 5: dtype=float32, shape=(15, 10, 100), strides=c | |
input 6: dtype=float32, shape=(15, 10, 1), strides=c | |
input 7: dtype=float32, shape=(15, 10, 200), strides=c | |
input 8: dtype=float32, shape=(15, 10, 12), strides=c | |
input 9: dtype=float32, shape=(15, 10, 200), strides=c | |
input 10: dtype=float32, shape=(15, 10, 100), strides=c | |
input 11: dtype=float32, shape=(15, 10, 100), strides=c | |
input 12: dtype=float32, shape=(15, 10, 100), strides=c | |
input 13: dtype=float32, shape=(15, 10, 200), strides=c | |
input 14: dtype=float32, shape=(16, 10, 100), strides=c | |
input 15: dtype=float32, shape=(16, 10, 200), strides=c | |
input 16: dtype=float32, shape=(16, 10, 12), strides=c | |
input 17: dtype=float32, shape=(16, 10, 100), strides=c | |
input 18: dtype=float32, shape=(16, 10, 200), strides=c | |
input 19: dtype=float32, shape=(16, 10, 12), strides=c | |
input 20: dtype=float32, shape=(2, 100, 1), strides=c | |
input 21: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
input 22: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
input 23: dtype=float32, shape=(2, 100, 1), strides=c | |
input 24: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
input 25: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
input 26: dtype=int64, shape=(), strides=c | |
input 27: dtype=int64, shape=(), strides=c | |
input 28: dtype=int64, shape=(), strides=c | |
input 29: dtype=int64, shape=(), strides=c | |
input 30: dtype=int64, shape=(), strides=c | |
input 31: dtype=int64, shape=(), strides=c | |
input 32: dtype=int64, shape=(), strides=c | |
input 33: dtype=int64, shape=(), strides=c | |
input 34: dtype=float32, shape=(100, 200), strides=c | |
input 35: dtype=float32, shape=(200, 200), strides=c | |
input 36: dtype=float32, shape=(100, 100), strides=c | |
input 37: dtype=float32, shape=(200, 100), strides=c | |
input 38: dtype=float32, shape=(100, 100), strides=c | |
input 39: dtype=float32, shape=(200, 200), strides=c | |
input 40: dtype=float32, shape=(200, 100), strides=c | |
input 41: dtype=float32, shape=(100, 100), strides=c | |
input 42: dtype=float32, shape=(100, 200), strides=c | |
input 43: dtype=float32, shape=(100, 100), strides=c | |
input 44: dtype=int64, shape=(2,), strides=c | |
input 45: dtype=float32, shape=(12, 10, 100), strides=c | |
input 46: dtype=int64, shape=(1,), strides=c | |
input 47: dtype=float32, shape=(12, 10), strides=c | |
input 48: dtype=float32, shape=(12, 10, 200), strides=c | |
input 49: dtype=float32, shape=(100, 1), strides=c | |
input 50: dtype=int8, shape=(10,), strides=c | |
input 51: dtype=float32, shape=(1, 100), strides=c | |
input 52: dtype=float32, shape=(100, 200), strides=c | |
input 53: dtype=float32, shape=(200, 200), strides=c | |
input 54: dtype=float32, shape=(100, 100), strides=c | |
input 55: dtype=float32, shape=(200, 100), strides=c | |
input 56: dtype=float32, shape=(100, 100), strides=c | |
input 57: dtype=float32, shape=(200, 200), strides=c | |
input 58: dtype=float32, shape=(200, 100), strides=c | |
input 59: dtype=float32, shape=(100, 100), strides=c | |
input 60: dtype=float32, shape=(100, 200), strides=c | |
input 61: dtype=float32, shape=(100, 100), strides=c | |
input 62: dtype=int64, shape=(2,), strides=c | |
input 63: dtype=float32, shape=(12, 10, 100), strides=c | |
input 64: dtype=int64, shape=(1,), strides=c | |
input 65: dtype=float32, shape=(12, 10), strides=c | |
input 66: dtype=float32, shape=(12, 10, 200), strides=c | |
input 67: dtype=float32, shape=(100, 1), strides=c | |
input 68: dtype=int8, shape=(10,), strides=c | |
input 69: dtype=float32, shape=(1, 100), strides=c | |
output 0: dtype=float32, shape=(16, 10, 100), strides=c | |
output 1: dtype=float32, shape=(16, 10, 200), strides=c | |
output 2: dtype=float32, shape=(16, 10, 12), strides=c | |
output 3: dtype=float32, shape=(16, 10, 100), strides=c | |
output 4: dtype=float32, shape=(16, 10, 200), strides=c | |
output 5: dtype=float32, shape=(16, 10, 12), strides=c | |
output 6: dtype=float32, shape=(2, 100, 1), strides=c | |
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
output 9: dtype=float32, shape=(2, 100, 1), strides=c | |
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
output 12: dtype=float32, shape=(15, 10, 100), strides=c | |
output 13: dtype=float32, shape=(15, 10, 200), strides=c | |
output 14: dtype=float32, shape=(15, 10, 100), strides=c | |
output 15: dtype=float32, shape=(15, 100, 10), strides=c | |
output 16: dtype=float32, shape=(15, 10, 100), strides=c | |
output 17: dtype=float32, shape=(15, 10, 200), strides=c | |
output 18: dtype=float32, shape=(15, 10, 100), strides=c | |
output 19: dtype=float32, shape=(15, 100, 10), strides=c | |
19.3% 48.5% 6.165s 6.16e-02s 100 1795 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuD | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=c | |
input 2: dtype=float32, shape=(1, 10, 200), strides=c | |
input 3: dtype=float32, shape=(1, 92160), strides=c | |
input 4: dtype=float32, shape=(1, 10, 100), strides=c | |
input 5: dtype=float32, shape=(1, 10, 200), strides=c | |
input 6: dtype=float32, shape=(2, 92160), strides=c | |
input 7: dtype=int64, shape=(), strides=c | |
input 8: dtype=int64, shape=(), strides=c | |
input 9: dtype=float32, shape=(100, 44), strides=c | |
input 10: dtype=float32, shape=(200, 44), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(200, 200), strides=c | |
input 13: dtype=float32, shape=(45, 100), strides=c | |
input 14: dtype=float32, shape=(100, 200), strides=c | |
input 15: dtype=float32, shape=(100, 100), strides=c | |
input 16: dtype=float32, shape=(200, 100), strides=c | |
input 17: dtype=float32, shape=(100, 100), strides=c | |
input 18: dtype=float32, shape=(100, 100), strides=c | |
input 19: dtype=float32, shape=(1, 44), strides=c | |
input 20: dtype=float32, shape=(1, 200), strides=c | |
input 21: dtype=float32, shape=(1, 100), strides=c | |
input 22: dtype=int64, shape=(1,), strides=c | |
input 23: dtype=float32, shape=(12, 10), strides=c | |
input 24: dtype=float32, shape=(12, 10, 200), strides=c | |
input 25: dtype=float32, shape=(100, 1), strides=c | |
input 26: dtype=int8, shape=(10,), strides=c | |
input 27: dtype=float32, shape=(12, 10, 100), strides=c | |
input 28: dtype=float32, shape=(12, 10, 200), strides=c | |
input 29: dtype=float32, shape=(12, 10, 100), strides=c | |
output 0: dtype=float32, shape=(1, 10, 100), strides=c | |
output 1: dtype=float32, shape=(1, 10, 200), strides=c | |
output 2: dtype=float32, shape=(1, 92160), strides=c | |
output 3: dtype=float32, shape=(1, 10, 100), strides=c | |
output 4: dtype=float32, shape=(1, 10, 200), strides=c | |
output 5: dtype=float32, shape=(2, 92160), strides=c | |
output 6: dtype=int64, shape=(15, 10), strides=c | |
output 7: dtype=int64, shape=(15, 10), strides=c | |
11.5% 60.0% 3.680s 3.68e-02s 100 2157 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64 | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 200), strides=c | |
input 2: dtype=float32, shape=(15, 10, 100), strides=c | |
input 3: dtype=float32, shape=(15, 10, 1), strides=c | |
input 4: dtype=float32, shape=(15, 10, 200), strides=c | |
input 5: dtype=float32, shape=(15, 10, 100), strides=c | |
input 6: dtype=float32, shape=(16, 10, 100), strides=c | |
input 7: dtype=float32, shape=(16, 10, 200), strides=c | |
input 8: dtype=float32, shape=(16, 10, 12), strides=c | |
input 9: dtype=float32, shape=(16, 10, 100), strides=c | |
input 10: dtype=float32, shape=(16, 10, 200), strides=c | |
input 11: dtype=float32, shape=(16, 10, 12), strides=c | |
input 12: dtype=float32, shape=(100, 200), strides=c | |
input 13: dtype=float32, shape=(200, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(200, 100), strides=c | |
input 16: dtype=float32, shape=(100, 100), strides=c | |
input 17: dtype=float32, shape=(12, 10), strides=c | |
input 18: dtype=float32, shape=(12, 10, 100), strides=c | |
input 19: dtype=int64, shape=(1,), strides=c | |
input 20: dtype=float32, shape=(12, 10, 200), strides=c | |
input 21: dtype=int8, shape=(10,), strides=c | |
input 22: dtype=float32, shape=(100, 1), strides=c | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(200, 200), strides=c | |
input 25: dtype=float32, shape=(100, 100), strides=c | |
input 26: dtype=float32, shape=(200, 100), strides=c | |
input 27: dtype=float32, shape=(100, 100), strides=c | |
input 28: dtype=float32, shape=(12, 10), strides=c | |
input 29: dtype=float32, shape=(12, 10, 100), strides=c | |
input 30: dtype=int64, shape=(1,), strides=c | |
input 31: dtype=float32, shape=(12, 10, 200), strides=c | |
input 32: dtype=int8, shape=(10,), strides=c | |
input 33: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(16, 10, 100), strides=c | |
output 1: dtype=float32, shape=(16, 10, 200), strides=c | |
output 2: dtype=float32, shape=(16, 10, 12), strides=c | |
output 3: dtype=float32, shape=(16, 10, 100), strides=c | |
output 4: dtype=float32, shape=(16, 10, 200), strides=c | |
output 5: dtype=float32, shape=(16, 10, 12), strides=c | |
7.2% 67.2% 2.311s 2.31e-02s 100 2602 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 100), strides=c | |
input 4: dtype=float32, shape=(12, 10, 1), strides=c | |
input 5: dtype=float32, shape=(12, 10, 200), strides=c | |
input 6: dtype=float32, shape=(12, 10, 100), strides=c | |
input 7: dtype=float32, shape=(12, 10, 100), strides=c | |
input 8: dtype=float32, shape=(12, 10, 1), strides=c | |
input 9: dtype=float32, shape=(13, 10, 100), strides=c | |
input 10: dtype=float32, shape=(13, 10, 100), strides=c | |
input 11: dtype=int64, shape=(), strides=c | |
input 12: dtype=int64, shape=(), strides=c | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=float32, shape=(100, 200), strides=c | |
input 18: dtype=float32, shape=(100, 100), strides=c | |
input 19: dtype=float32, shape=(200, 100), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(100, 200), strides=c | |
input 22: dtype=float32, shape=(100, 100), strides=c | |
input 23: dtype=float32, shape=(200, 100), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
output 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 3: dtype=float32, shape=(12, 10, 200), strides=c | |
output 4: dtype=float32, shape=(12, 100, 10), strides=c | |
output 5: dtype=float32, shape=(12, 10, 100), strides=c | |
output 6: dtype=float32, shape=(12, 10, 200), strides=c | |
output 7: dtype=float32, shape=(12, 100, 10), strides=c | |
7.2% 74.4% 2.305s 2.30e-02s 100 2603 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0 | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 100), strides=c | |
input 4: dtype=float32, shape=(12, 10, 1), strides=c | |
input 5: dtype=float32, shape=(12, 10, 200), strides=c | |
input 6: dtype=float32, shape=(12, 10, 100), strides=c | |
input 7: dtype=float32, shape=(12, 10, 100), strides=c | |
input 8: dtype=float32, shape=(12, 10, 1), strides=c | |
input 9: dtype=float32, shape=(13, 10, 100), strides=c | |
input 10: dtype=float32, shape=(13, 10, 100), strides=c | |
input 11: dtype=int64, shape=(), strides=c | |
input 12: dtype=int64, shape=(), strides=c | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=float32, shape=(100, 200), strides=c | |
input 18: dtype=float32, shape=(100, 100), strides=c | |
input 19: dtype=float32, shape=(200, 100), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(100, 200), strides=c | |
input 22: dtype=float32, shape=(100, 100), strides=c | |
input 23: dtype=float32, shape=(200, 100), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
output 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 3: dtype=float32, shape=(12, 10, 200), strides=c | |
output 4: dtype=float32, shape=(12, 100, 10), strides=c | |
output 5: dtype=float32, shape=(12, 10, 100), strides=c | |
output 6: dtype=float32, shape=(12, 10, 200), strides=c | |
output 7: dtype=float32, shape=(12, 100, 10), strides=c | |
5.0% 79.4% 1.599s 1.60e-02s 100 1601 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncS | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 1), strides=c | |
input 4: dtype=float32, shape=(12, 10, 200), strides=c | |
input 5: dtype=float32, shape=(12, 10, 100), strides=c | |
input 6: dtype=float32, shape=(12, 10, 1), strides=c | |
input 7: dtype=float32, shape=(12, 10, 100), strides=c | |
input 8: dtype=float32, shape=(13, 10, 100), strides=c | |
input 9: dtype=float32, shape=(12, 10, 100), strides=c | |
input 10: dtype=float32, shape=(13, 10, 100), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
output 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 3: dtype=float32, shape=(13, 10, 100), strides=c | |
3.1% 82.6% 1.002s 1.00e-02s 100 1861 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask10) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
input 1: dtype=float32, shape=(15, 10), strides=c | |
input 2: dtype=int64, shape=(12, 10), strides=c | |
input 3: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(15, 10, 1), strides=c | |
2.7% 85.2% 0.851s 8.51e-03s 100 1611 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 1), strides=c | |
input 4: dtype=float32, shape=(12, 10, 200), strides=c | |
input 5: dtype=float32, shape=(12, 10, 100), strides=c | |
input 6: dtype=float32, shape=(12, 10, 1), strides=c | |
input 7: dtype=float32, shape=(13, 10, 100), strides=c | |
input 8: dtype=float32, shape=(13, 10, 100), strides=c | |
input 9: dtype=float32, shape=(100, 200), strides=c | |
input 10: dtype=float32, shape=(100, 100), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
0.0% 85.3% 0.011s 1.11e-04s 100 2572 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
output 1: dtype=float32, shape=(12, 10, 100), strides=c | |
0.0% 85.3% 0.010s 1.05e-04s 100 2573 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
output 1: dtype=float32, shape=(12, 10, 100), strides=c | |
0.0% 85.3% 0.008s 8.06e-05s 100 2356 GpuSplit{2}(GpuElemwise{mul,no_inplace}.0, TensorConstant{0}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(15, 10), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(14, 10), strides=c | |
output 1: dtype=float32, shape=(1, 10), strides=c | |
0.0% 85.4% 0.007s 7.49e-05s 100 1739 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 100), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
0.0% 85.4% 0.007s 7.41e-05s 100 1731 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 100), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
0.0% 85.4% 0.007s 7.37e-05s 100 1682 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 100), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
0.0% 85.4% 0.007s 7.11e-05s 100 2477 GpuCAReduce{pre=sqr,red=add}{1,1}(Assert{msg='Theano Assert failed!'}.0) | |
input 0: dtype=float32, shape=(200, 200), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
0.0% 85.5% 0.007s 6.96e-05s 100 3110 GpuCAReduce{add}{1,1}(GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0) | |
input 0: dtype=float32, shape=(200, 200), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
0.0% 85.5% 0.007s 6.84e-05s 100 2488 GpuCAReduce{pre=sqr,red=add}{1,1}(Assert{msg='Theano Assert failed!'}.0) | |
input 0: dtype=float32, shape=(200, 200), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
0.0% 85.5% 0.007s 6.74e-05s 100 3370 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuElemwise{Switch,no_inplace}.0) | |
input 0: dtype=float32, shape=(200, 200), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
0.0% 85.5% 0.007s 6.71e-05s 100 3367 GpuCAReduce{add}{1,1}(GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0) | |
input 0: dtype=float32, shape=(200, 200), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
0.0% 85.5% 0.007s 6.63e-05s 100 3565 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuElemwise{Switch,no_inplace}.0) | |
input 0: dtype=float32, shape=(200, 200), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
... (remaining 3554 Apply instances account for 14.46%(4.62s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 57KB (61KB) | |
GPU: 4979KB (6661KB) | |
CPU + GPU: 5035KB (6721KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 56KB (61KB) | |
GPU: 6160KB (7107KB) | |
CPU + GPU: 6216KB (7167KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 115KB | |
GPU: 16958KB | |
CPU + GPU: 17073KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0) | |
1132320B [(1, 10, 100), (1, 10, 200), (1, 92160), (1, 10, 100), (1, 10, 200), (2, 92160), (15, 10), (15, 10)] i i i i i i c c forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0, GpuJoin.0, GpuElemwise{Add}[(0, 0)].0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
399360B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12)] i i i i i i forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0) | |
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan&generator_generate_scan}.5, ScalarFromTensor.0) | |
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>) | |
368640B [(1, 92160)] c GpuAllocEmpty(TensorConstant{1}, Shape_i{0}.0) | |
368640B [(1, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, GpuDimShuffle{x,0}.0, Constant{1}) | |
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0) | |
200000B [(12, 10, 100), (13, 10, 100), (12, 10, 100), (13, 10, 100)] i i i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5)) - i3)}}.0, Elemwise{sub,no_inplace}.0, Elemwise{switch,no_inplace}.0, Elemwise{add,no_inplace}.0) | |
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5)) - i3)}}.0, max_attended_length, generator_generate_batch_size, Elemwise{add,no_inplace}.0) | |
160000B [(200, 200)] v GpuDimShuffle{1,0}(W) | |
160000B [(200, 200)] i GpuElemwise{Mul}[(0, 0)](Assert{msg='Theano Assert failed!'}.0, GpuDimShuffle{x,x}.0) | |
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0) | |
160000B [(200, 200)] v GpuDimShuffle{1,0}(W) | |
160000B [(200, 200)] v Assert{msg='Theano Assert failed!'}(GpuDot22.0, Elemwise{eq,no_inplace}.0, Elemwise{Composite{EQ(i0, Switch(i1, (i2 // (-(i3 * i4 * i5))), i3))}}.0) | |
... (remaining 3554 Apply nodes account for 51935459B/60721859B (85.53%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
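Profiles like the function and memory breakdowns above come from Theano's built-in profiler. A minimal sketch of how to produce them (attribute and flag names as of the Theano 0.8 era, so treat them as assumptions for other versions):

    import numpy as np
    import theano
    import theano.tensor as T

    # Per-function profiling; the same tables can also be produced globally
    # with THEANO_FLAGS=profile=True,profile_memory=True in the environment.
    x = T.matrix('x')
    W = theano.shared(np.random.randn(100, 200).astype('float32'), name='W')
    f = theano.function([x], T.tanh(x.dot(W)), profile=True)

    f(np.random.randn(10, 100).astype('float32'))
    # Prints Function/Class/Ops/Apply tables like the ones in this log;
    # the Memory Profile sections additionally need profile_memory=True.
    f.profile.summary()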
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1200 steps) 1.585470e+00s | |
Total time spent in calling the VM 1.543819e+00s (97.373%) | |
Total overhead (computing slices, etc.) 4.165101e-02s (2.627%)
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
57.3% 57.3% 0.500s 5.21e-05s C 9600 8 theano.sandbox.cuda.blas.GpuGemm | |
39.2% 96.5% 0.343s 2.04e-05s C 16800 14 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.5% 100.0% 0.031s 3.19e-06s C 9600 8 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
57.3% 57.3% 0.500s 5.21e-05s C 9600 8 GpuGemm{no_inplace} | |
12.5% 69.7% 0.109s 2.27e-05s C 4800 4 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace} | |
10.8% 80.5% 0.094s 1.96e-05s C 4800 4 GpuElemwise{mul,no_inplace} | |
10.6% 91.1% 0.093s 1.93e-05s C 4800 4 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
5.4% 96.5% 0.047s 1.96e-05s C 2400 2 GpuElemwise{sub,no_inplace} | |
1.9% 98.4% 0.017s 3.49e-06s C 4800 4 GpuSubtensor{::, :int64:} | |
1.6% 100.0% 0.014s 2.89e-06s C 4800 4 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
7.3% 7.3% 0.063s 5.28e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
7.2% 14.5% 0.063s 5.25e-05s 1200 4 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
7.2% 21.7% 0.063s 5.25e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
7.2% 28.8% 0.063s 5.21e-05s 1200 5 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
7.1% 36.0% 0.062s 5.18e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.1% 43.1% 0.062s 5.17e-05s 1200 23 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.1% 50.2% 0.062s 5.16e-05s 1200 25 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.1% 57.3% 0.062s 5.16e-05s 1200 24 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.2% 60.4% 0.028s 2.31e-05s 1200 26 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]23[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.1% 63.5% 0.027s 2.27e-05s 1200 27 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]23[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.1% 66.7% 0.027s 2.26e-05s 1200 28 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.1% 69.7% 0.027s 2.24e-05s 1200 29 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.9% 72.6% 0.025s 2.09e-05s 1200 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 1), strides=c | |
2.7% 75.3% 0.024s 1.97e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.7% 78.0% 0.024s 1.96e-05s 1200 21 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.7% 80.7% 0.023s 1.96e-05s 1200 20 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.7% 83.4% 0.023s 1.95e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.7% 86.1% 0.023s 1.95e-05s 1200 6 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.7% 88.7% 0.023s 1.93e-05s 1200 8 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.6% 91.3% 0.023s 1.92e-05s 1200 7 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
... (remaining 10 Apply instances account for 8.66% (0.08s) of the runtime)
Memory Profile | |
(Sparse variables are ignored) | |
(Values in brackets are for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 27KB (51KB) | |
CPU + GPU: 27KB (51KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 27KB (51KB) | |
CPU + GPU: 27KB (51KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB | |
GPU: 78KB | |
CPU + GPU: 78KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
... (remaining 10 Apply nodes account for 32080B/144080B (22.27%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
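For orientation, the ops in this scan body form a masked GRU step: the GpuGemm pairs are the gate and candidate-state products against state_to_gates and state_to_state, and the large Composite elemwise is the update-gate/mask interpolation. A self-contained sketch of such a step under theano.scan (illustrative only; the actual graph is built by Blocks, and the variable names below are stand-ins):

    import numpy as np
    import theano
    import theano.tensor as T

    dim = 100
    state_to_gates = theano.shared(
        np.random.randn(dim, 2 * dim).astype('float32'), name='state_to_gates')
    state_to_state = theano.shared(
        np.random.randn(dim, dim).astype('float32'), name='state_to_state')

    def gru_step(mask, inputs, gate_inputs, h_prev):
        # Two GEMMs per step, like the GpuGemm{no_inplace} pairs above:
        # (batch, dim) x (dim, 2*dim) for the gates,
        # (batch, dim) x (dim, dim) for the candidate state.
        gates = T.nnet.sigmoid(gate_inputs + h_prev.dot(state_to_gates))
        update = gates[:, :dim]             # GpuSubtensor{::, :int64:}
        reset = gates[:, dim:]              # GpuSubtensor{::, int64::}
        candidate = T.tanh(inputs + (reset * h_prev).dot(state_to_state))
        h = update * candidate + (1. - update) * h_prev
        # Where mask == 0, carry the previous state through unchanged
        # (the outer Composite elemwise in the Apply listing).
        m = mask.dimshuffle(0, 'x')
        return m * h + (1. - m) * h_prev

    mask = T.matrix('mask')                 # (time, batch)
    inputs = T.tensor3('inputs')            # (time, batch, dim)
    gate_inputs = T.tensor3('gate_inputs')  # (time, batch, 2*dim)
    h0 = T.alloc(np.float32(0.), inputs.shape[1], dim)
    states, _ = theano.scan(gru_step,
                            sequences=[mask, inputs, gate_inputs],
                            outputs_info=[h0])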
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1200 steps) 8.424067e-01s | |
Total time spent in calling the VM 8.255837e-01s (98.003%) | |
Total overhead (computing slices, etc.) 1.682305e-02s (1.997%)
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
54.5% 54.5% 0.248s 5.17e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm | |
42.3% 96.7% 0.193s 2.01e-05s C 9600 8 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.3% 100.0% 0.015s 3.10e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
54.5% 54.5% 0.248s 5.17e-05s C 4800 4 GpuGemm{no_inplace} | |
11.7% 66.2% 0.054s 2.23e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace} | |
10.3% 76.5% 0.047s 1.95e-05s C 2400 2 GpuElemwise{mul,no_inplace} | |
10.2% 86.7% 0.047s 1.94e-05s C 2400 2 GpuElemwise{sub,no_inplace} | |
10.0% 96.7% 0.046s 1.91e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
1.8% 98.5% 0.008s 3.35e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
1.5% 100.0% 0.007s 2.85e-06s C 2400 2 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
13.8% 13.8% 0.063s 5.24e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
13.6% 27.4% 0.062s 5.17e-05s 1200 3 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
13.6% 40.9% 0.062s 5.15e-05s 1200 13 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
13.5% 54.5% 0.062s 5.14e-05s 1200 12 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
5.9% 60.4% 0.027s 2.25e-05s 1200 14 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
5.8% 66.2% 0.027s 2.22e-05s 1200 15 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
5.4% 71.6% 0.025s 2.04e-05s 1200 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 1), strides=c | |
5.1% 76.7% 0.023s 1.95e-05s 1200 10 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
5.1% 81.9% 0.023s 1.95e-05s 1200 11 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
5.0% 86.9% 0.023s 1.91e-05s 1200 4 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
5.0% 91.9% 0.023s 1.90e-05s 1200 5 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
4.8% 96.7% 0.022s 1.84e-05s 1200 2 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 1), strides=c | |
0.9% 97.6% 0.004s 3.39e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
0.9% 98.5% 0.004s 3.31e-06s 1200 8 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
0.8% 99.3% 0.004s 2.97e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
0.7% 100.0% 0.003s 2.74e-06s 1200 9 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(Values in brackets are for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 20KB (27KB) | |
CPU + GPU: 20KB (27KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 20KB (27KB) | |
CPU + GPU: 20KB (27KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB | |
GPU: 39KB | |
CPU + GPU: 39KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
... (remaining 2 Apply nodes account for 80B/72080B (0.11%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
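Each step of this scan issues two essentially identical GpuGemm calls on tiny (10, 100)-by-(100, 200) operands at roughly 5e-05s apiece, so kernel-launch overhead rather than arithmetic dominates. One standard mitigation, sketched below with made-up variable names, is to stack the two states along the batch axis so the pair collapses into a single larger GEMM:

    import numpy as np
    import theano
    import theano.tensor as T

    W = theano.shared(np.random.randn(100, 200).astype('float32'), name='W')
    h_a = T.matrix('h_a')   # (10, 100): state of the first merged scan
    h_b = T.matrix('h_b')   # (10, 100): state of the second merged scan

    # Before: two separate small GEMMs, i.e. two kernel launches per step.
    out_a, out_b = h_a.dot(W), h_b.dot(W)

    # After: one (20, 100) x (100, 200) GEMM, split again afterwards.
    stacked = T.concatenate([h_a, h_b], axis=0).dot(W)
    out_a2, out_b2 = stacked[:10], stacked[10:]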
Scan Op profiling ( generator_generate_scan&generator_generate_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1500 steps) 6.135115e+00s | |
Total time spent in calling the VM 5.882160e+00s (95.877%) | |
Total overhead (computing slices, etc.) 2.529552e-01s (4.123%)
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
26.5% 26.5% 0.788s 2.02e-05s C 39000 26 theano.sandbox.cuda.basic_ops.GpuElemwise | |
22.6% 49.0% 0.672s 4.48e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm | |
19.1% 68.1% 0.569s 3.79e-05s C 15000 10 theano.sandbox.cuda.blas.GpuDot22 | |
10.9% 79.0% 0.325s 2.16e-05s C 15000 10 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
4.5% 83.6% 0.135s 4.51e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
4.0% 87.5% 0.118s 3.93e-05s C 3000 2 theano.sandbox.rng_mrg.GPU_mrg_uniform | |
3.8% 91.3% 0.113s 1.89e-05s C 6000 4 theano.sandbox.cuda.basic_ops.HostFromGpu | |
1.8% 93.1% 0.052s 2.49e-06s C 21000 14 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.7% 94.8% 0.050s 1.67e-05s C 3000 2 theano.tensor.basic.MaxAndArgmax | |
1.1% 95.9% 0.034s 2.25e-06s C 15000 10 theano.compile.ops.Shape_i | |
1.0% 96.9% 0.029s 3.17e-06s C 9000 6 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.8% 97.7% 0.023s 1.55e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.8% 98.4% 0.023s 1.90e-06s C 12000 8 theano.tensor.opt.MakeVector | |
0.7% 99.1% 0.020s 3.31e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.4% 99.5% 0.013s 4.27e-06s C 3000 2 theano.sandbox.multinomial.MultinomialFromUniform | |
0.3% 99.8% 0.009s 2.11e-06s C 4500 3 theano.tensor.elemwise.Elemwise | |
0.2% 100.0% 0.005s 3.28e-06s C 1500 1 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
22.6% 22.6% 0.672s 4.48e-05s C 15000 10 GpuGemm{inplace} | |
19.1% 41.7% 0.569s 3.79e-05s C 15000 10 GpuDot22 | |
4.5% 46.2% 0.135s 4.51e-05s C 3000 2 GpuAdvancedSubtensor1 | |
4.2% 50.4% 0.124s 2.07e-05s C 6000 4 GpuElemwise{mul,no_inplace} | |
4.0% 54.3% 0.118s 3.93e-05s C 3000 2 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace} | |
3.8% 58.1% 0.113s 1.89e-05s C 6000 4 HostFromGpu | |
2.6% 60.7% 0.076s 2.54e-05s C 3000 2 GpuCAReduce{maximum}{1,0} | |
2.5% 63.2% 0.076s 2.52e-05s C 3000 2 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace} | |
2.3% 65.6% 0.070s 2.33e-05s C 3000 2 GpuCAReduce{add}{1,0,0} | |
2.1% 67.7% 0.063s 2.10e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
2.1% 69.8% 0.062s 2.07e-05s C 3000 2 GpuElemwise{add,no_inplace} | |
2.1% 71.8% 0.061s 2.04e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
2.0% 73.9% 0.061s 2.02e-05s C 3000 2 GpuCAReduce{maximum}{0,1} | |
2.0% 75.9% 0.060s 1.99e-05s C 3000 2 GpuCAReduce{add}{1,0} | |
2.0% 77.9% 0.059s 1.96e-05s C 3000 2 GpuElemwise{Composite{exp((i0 - i1))},no_inplace} | |
2.0% 79.8% 0.059s 1.96e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)] | |
1.9% 81.8% 0.058s 1.93e-05s C 3000 2 GpuCAReduce{add}{0,1} | |
1.9% 83.7% 0.057s 1.91e-05s C 3000 2 GpuElemwise{Add}[(0, 1)] | |
1.9% 85.6% 0.057s 1.91e-05s C 3000 2 GpuElemwise{Add}[(0, 0)] | |
1.9% 87.5% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)] | |
... (remaining 21 Ops account for 12.45% (0.37s) of the runtime)
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
2.4% 2.4% 0.073s 4.84e-05s 1500 20 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.4% 4.9% 0.072s 4.82e-05s 1500 21 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.4% 7.3% 0.072s 4.81e-05s 1500 75 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 9.7% 0.072s 4.80e-05s 1500 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 12.1% 0.071s 4.75e-05s 1500 16 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 44), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
2.4% 14.5% 0.071s 4.74e-05s 1500 18 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 44), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
2.3% 16.8% 0.069s 4.59e-05s 1500 57 GpuAdvancedSubtensor1(W_copy01[cuda], argmax) | |
input 0: dtype=float32, shape=(45, 100), strides=c | |
input 1: dtype=int64, shape=(10,), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.2% 19.0% 0.067s 4.44e-05s 1500 59 GpuAdvancedSubtensor1(W_copy01[cuda], argmax) | |
input 0: dtype=float32, shape=(45, 100), strides=c | |
input 1: dtype=int64, shape=(10,), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.1% 21.2% 0.064s 4.25e-05s 1500 1 GpuDot22(generator_initial_states_states[t-1]01[cuda], W_copy01[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
2.0% 23.2% 0.061s 4.05e-05s 1500 64 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.0% 25.3% 0.061s 4.04e-05s 1500 63 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.0% 27.3% 0.059s 3.96e-05s 1500 77 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.0% 29.2% 0.059s 3.95e-05s 1500 78 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.0% 31.2% 0.059s 3.94e-05s 1500 28 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(92160,), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=float32, shape=(92160,), strides=c | |
output 1: dtype=float32, shape=(10,), strides=c | |
2.0% 33.2% 0.059s 3.92e-05s 1500 26 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(92160,), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=float32, shape=(92160,), strides=c | |
output 1: dtype=float32, shape=(10,), strides=c | |
1.9% 35.1% 0.058s 3.84e-05s 1500 8 GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.9% 37.1% 0.057s 3.83e-05s 1500 3 GpuDot22(generator_initial_states_states[t-1]01[cuda], W_copy01[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
1.9% 39.0% 0.057s 3.82e-05s 1500 9 GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.9% 40.9% 0.056s 3.72e-05s 1500 82 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy01[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.9% 42.7% 0.056s 3.72e-05s 1500 81 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy01[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
... (remaining 95 Apply instances account for 57.27% (1.71s) of the runtime)
Memory Profile | |
(Sparse variables are ignored) | |
(Values in brackets are for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 9KB (9KB) | |
GPU: 837KB (923KB) | |
CPU + GPU: 846KB (932KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 9KB (9KB) | |
GPU: 837KB (923KB) | |
CPU + GPU: 846KB (932KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 11KB | |
GPU: 1080KB | |
CPU + GPU: 1091KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace01[cuda]) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace01[cuda]) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace01[cuda]) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace01[cuda]) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda]) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda]) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
... (remaining 95 Apply nodes account for 138458B/1515818B (9.13%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
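The GPU_mrg_uniform / MultinomialFromUniform / MaxAndArgmax trio in this generate scan is the per-step categorical sampling of the next output token. A minimal sketch of that pattern with MRG_RandomStreams (shapes chosen to match the (10, 44) readout above; everything else is a stand-in):

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.sandbox.rng_mrg import MRG_RandomStreams

    rng = MRG_RandomStreams(seed=1234)

    energies = T.matrix('energies')      # (batch, vocab) readout scores
    probs = T.nnet.softmax(energies)

    # multinomial() compiles to the GPU_mrg_uniform + MultinomialFromUniform
    # pair in the Ops table; argmax of the one-hot sample is MaxAndArgmax.
    one_hot = rng.multinomial(pvals=probs)
    sample = T.argmax(one_hot, axis=1)

    f = theano.function([energies], sample)
    f(np.random.randn(10, 44).astype('float32'))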
Scan Op profiling ( attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1500 steps) 3.657264e+00s | |
Total time spent in calling the VM 3.536357e+00s (96.694%) | |
Total overhead (computing slices, etc.) 1.209071e-01s (3.306%)
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
34.1% 34.1% 0.573s 2.01e-05s C 28500 19 theano.sandbox.cuda.basic_ops.GpuElemwise | |
26.7% 60.8% 0.448s 3.73e-05s C 12000 8 theano.sandbox.cuda.blas.GpuDot22 | |
17.2% 77.9% 0.289s 4.81e-05s C 6000 4 theano.sandbox.cuda.blas.GpuGemm | |
11.4% 89.3% 0.191s 2.13e-05s C 9000 6 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
2.7% 92.1% 0.046s 1.53e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuFromHost | |
2.5% 94.6% 0.042s 2.33e-06s C 18000 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.2% 95.7% 0.020s 2.20e-06s C 9000 6 theano.compile.ops.Shape_i | |
1.1% 96.9% 0.019s 3.20e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
1.1% 98.0% 0.019s 3.11e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.8% 98.8% 0.013s 2.15e-06s C 6000 4 theano.tensor.elemwise.Elemwise | |
0.7% 99.4% 0.011s 1.84e-06s C 6000 4 theano.tensor.opt.MakeVector | |
0.6% 100.0% 0.010s 3.21e-06s C 3000 2 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
26.7% 26.7% 0.448s 3.73e-05s C 12000 8 GpuDot22 | |
17.2% 43.8% 0.289s 4.81e-05s C 6000 4 GpuGemm{inplace} | |
7.3% 51.1% 0.122s 2.04e-05s C 6000 4 GpuElemwise{mul,no_inplace} | |
4.3% 55.4% 0.072s 2.42e-05s C 3000 2 GpuCAReduce{maximum}{1,0} | |
4.2% 59.7% 0.071s 2.36e-05s C 3000 2 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace} | |
3.6% 63.3% 0.061s 2.02e-05s C 3000 2 GpuElemwise{add,no_inplace} | |
3.6% 66.9% 0.060s 2.01e-05s C 3000 2 GpuCAReduce{add}{1,0,0} | |
3.6% 70.4% 0.060s 2.00e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
3.5% 74.0% 0.059s 1.97e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
3.5% 77.4% 0.059s 1.96e-05s C 3000 2 GpuCAReduce{add}{1,0} | |
3.4% 80.9% 0.058s 1.92e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)] | |
3.3% 84.2% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Add}[(0, 0)] | |
3.3% 87.5% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Tanh}[(0, 0)] | |
2.7% 90.3% 0.046s 1.53e-05s C 3000 2 GpuFromHost | |
1.8% 92.1% 0.030s 2.03e-05s C 1500 1 GpuElemwise{sub,no_inplace} | |
1.1% 93.2% 0.019s 3.11e-06s C 6000 4 GpuReshape{2} | |
0.8% 94.0% 0.014s 2.35e-06s C 6000 4 GpuDimShuffle{x,0} | |
0.7% 94.7% 0.011s 1.84e-06s C 6000 4 MakeVector{dtype='int64'} | |
0.6% 95.3% 0.011s 3.54e-06s C 3000 2 GpuSubtensor{::, :int64:} | |
0.6% 95.9% 0.010s 3.21e-06s C 3000 2 InplaceDimShuffle{x,0} | |
... (remaining 10 Ops account for 4.11% (0.07s) of the runtime)
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
4.3% 4.3% 0.073s 4.86e-05s 1500 14 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
4.3% 8.6% 0.072s 4.82e-05s 1500 17 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
4.3% 12.9% 0.072s 4.78e-05s 1500 35 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
4.3% 17.2% 0.072s 4.78e-05s 1500 36 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.5% 20.6% 0.058s 3.87e-05s 1500 11 GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
3.4% 24.0% 0.057s 3.82e-05s 1500 5 GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
3.3% 27.4% 0.056s 3.74e-05s 1500 39 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.3% 30.7% 0.056s 3.71e-05s 1500 33 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.3% 34.0% 0.056s 3.70e-05s 1500 34 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.3% 37.3% 0.055s 3.69e-05s 1500 40 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.3% 40.6% 0.055s 3.67e-05s 1500 50 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(120, 100), strides=c | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(120, 1), strides=c | |
3.3% 43.8% 0.055s 3.66e-05s 1500 49 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(120, 100), strides=c | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(120, 1), strides=c | |
2.2% 46.0% 0.036s 2.43e-05s 1500 53 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=float32, shape=(10,), strides=c | |
2.1% 48.2% 0.036s 2.40e-05s 1500 54 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=float32, shape=(10,), strides=c | |
2.1% 50.3% 0.036s 2.37e-05s 1500 37 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 100), strides=c | |
input 5: dtype=float32, shape=(1, 1), strides=c | |
input 6: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.1% 52.4% 0.035s 2.35e-05s 1500 38 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace0[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 100), strides=c | |
input 5: dtype=float32, shape=(1, 1), strides=c | |
input 6: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.9% 54.3% 0.032s 2.13e-05s 1500 71 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace1[cuda]) | |
input 0: dtype=float32, shape=(12, 10, 1), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
1.9% 56.2% 0.032s 2.11e-05s 1500 72 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda]) | |
input 0: dtype=float32, shape=(12, 10, 1), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
1.8% 58.0% 0.031s 2.04e-05s 1500 43 GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 10, 100), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
1.8% 59.8% 0.030s 2.03e-05s 1500 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 1), strides=c | |
... (remaining 55 Apply instances account for 40.20% (0.68s) of the runtime)
Memory Profile | |
(Sparse variables are ignored) | |
(Values in brackets are for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 118KB (204KB) | |
CPU + GPU: 118KB (204KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 118KB (204KB) | |
CPU + GPU: 118KB (204KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB | |
GPU: 345KB | |
CPU + GPU: 345KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda]) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace1[cuda]) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace1[cuda]) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace0[cuda]) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda]) | |
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda]) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace0[cuda], GpuGemm{inplace}.0) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace1[cuda], GpuGemm{inplace}.0) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0) | |
... (remaining 55 Apply nodes account for 62508B/710508B (8.80%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
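The per-function sections in this file (call counts, Class/Ops/Apply breakdowns, Memory Profile) are produced by Theano's built-in profiler. As a minimal sketch of how to obtain this kind of output — assuming a Theano of this vintage (the old theano.sandbox.cuda backend) and a toy function in place of the real Blocks graph:

    import numpy
    import theano
    import theano.tensor as T

    # Equivalent to running with THEANO_FLAGS=profile=True,profile_memory=True.
    # profile_memory adds the "Memory Profile" sections; a single function can
    # also be profiled with theano.function(..., profile=True).
    theano.config.profile = True
    theano.config.profile_memory = True

    x = T.matrix('x')
    W = theano.shared(numpy.random.randn(100, 100).astype('float32'), name='W')
    f = theano.function([x], T.tanh(x.dot(W)))

    f(numpy.ones((10, 100), dtype='float32'))
    # A profile for every compiled function is printed at interpreter exit,
    # followed by a "Sum of all printed profiles at exit" summary like the
    # one at the bottom of this file.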
Scan Op profiling ( grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls to the op (for a total of 1500 steps) 9.275899e+00s | |
Total time spent in calling the VM 9.022675e+00s (97.270%) | |
Total overhead (computing slices...) 2.532237e-01s (2.730%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
40.7% 40.7% 1.753s 1.98e-05s C 88500 59 theano.sandbox.cuda.basic_ops.GpuElemwise | |
20.9% 61.6% 0.901s 3.76e-05s C 24000 16 theano.sandbox.cuda.blas.GpuDot22 | |
17.1% 78.7% 0.735s 4.90e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm | |
8.7% 87.4% 0.377s 2.09e-05s C 18000 12 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
2.8% 90.2% 0.122s 2.03e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
2.3% 92.6% 0.099s 2.37e-06s C 42000 28 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
2.0% 94.5% 0.085s 1.41e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuFromHost | |
1.3% 95.8% 0.057s 1.90e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
1.1% 97.0% 0.049s 3.25e-06s C 15000 10 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.8% 97.8% 0.036s 2.00e-06s C 18000 12 theano.compile.ops.Shape_i | |
0.7% 98.5% 0.028s 2.36e-06s C 12000 8 theano.tensor.elemwise.Elemwise | |
0.6% 99.0% 0.024s 1.99e-06s C 12000 8 theano.tensor.opt.MakeVector | |
0.5% 99.5% 0.022s 3.62e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.5% 100.0% 0.020s 3.38e-06s C 6000 4 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
20.9% 20.9% 0.901s 3.76e-05s C 24000 16 GpuDot22 | |
13.2% 34.1% 0.568s 4.74e-05s C 12000 8 GpuGemm{inplace} | |
6.9% 41.1% 0.299s 1.99e-05s C 15000 10 GpuElemwise{mul,no_inplace} | |
4.1% 45.2% 0.176s 1.96e-05s C 9000 6 GpuCAReduce{add}{1,0} | |
3.9% 49.0% 0.168s 1.86e-05s C 9000 6 GpuElemwise{Add}[(0, 0)] | |
3.9% 52.9% 0.167s 5.55e-05s C 3000 2 GpuGemm{no_inplace} | |
2.8% 55.7% 0.119s 1.98e-05s C 6000 4 GpuElemwise{Add}[(0, 1)] | |
2.6% 58.3% 0.114s 1.90e-05s C 6000 4 GpuElemwise{add,no_inplace} | |
2.0% 60.3% 0.085s 1.41e-05s C 6000 4 GpuFromHost | |
1.8% 62.1% 0.078s 2.61e-05s C 3000 2 GpuCAReduce{maximum}{1,0} | |
1.6% 63.7% 0.070s 2.32e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace} | |
1.5% 65.3% 0.067s 2.23e-05s C 3000 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)] | |
1.5% 66.8% 0.067s 2.22e-05s C 3000 2 GpuElemwise{Composite{((((i0 / i1) + i2) * i3) * i4)}}[(0, 0)] | |
1.5% 68.3% 0.064s 2.14e-05s C 3000 2 GpuElemwise{Composite{tanh((i0 + i1))},no_inplace} | |
1.5% 69.8% 0.063s 2.11e-05s C 3000 2 GpuIncSubtensor{InplaceInc;::, int64::} | |
1.4% 71.2% 0.062s 2.07e-05s C 3000 2 GpuElemwise{Composite{((-(i0 * i1)) / i2)},no_inplace} | |
1.4% 72.6% 0.061s 2.05e-05s C 3000 2 GpuCAReduce{add}{0,0,1} | |
1.4% 74.1% 0.061s 2.03e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace} | |
1.4% 75.5% 0.061s 2.02e-05s C 3000 2 GpuCAReduce{add}{1,0,0} | |
1.4% 76.9% 0.060s 2.02e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)] | |
... (remaining 33 Ops account for 23.13%(1.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
2.0% 2.0% 0.085s 5.69e-05s 1500 151 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states1[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.9% 3.9% 0.081s 5.41e-05s 1500 152 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states0[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.8% 5.6% 0.077s 5.10e-05s 1500 172 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.8% 7.4% 0.076s 5.09e-05s 1500 174 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.7% 9.1% 0.073s 4.86e-05s 1500 83 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.7% 10.8% 0.073s 4.85e-05s 1500 27 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.7% 12.5% 0.073s 4.84e-05s 1500 36 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.7% 14.2% 0.072s 4.80e-05s 1500 81 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.6% 15.8% 0.070s 4.70e-05s 1500 171 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(200, 200), strides=(1, 200) | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.6% 17.4% 0.070s 4.69e-05s 1500 173 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(200, 200), strides=(1, 200) | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.5% 18.9% 0.063s 4.20e-05s 1500 132 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 120), strides=(1, 100) | |
input 1: dtype=float32, shape=(120, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(100, 1), strides=(1, 0) | |
1.5% 20.3% 0.063s 4.19e-05s 1500 2 GpuDot22(transition_apply_states_replace1[cuda], state_to_gates_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.5% 21.8% 0.063s 4.18e-05s 1500 177 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.5% 23.3% 0.063s 4.17e-05s 1500 175 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
1.4% 24.6% 0.059s 3.95e-05s 1500 79 GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(120, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(120, 1), strides=(1, 0) | |
1.4% 26.0% 0.058s 3.90e-05s 1500 160 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.4% 27.3% 0.058s 3.90e-05s 1500 159 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.3% 28.7% 0.058s 3.87e-05s 1500 9 GpuDot22(transform_states_apply_input__replace1[cuda], W_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.3% 30.0% 0.058s 3.87e-05s 1500 22 GpuDot22(transform_states_apply_input__replace0[cuda], W_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
1.3% 31.4% 0.057s 3.83e-05s 1500 16 GpuDot22(transition_apply_states_replace0[cuda], state_to_gates_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
... (remaining 161 Apply instances account for 68.63%(2.96s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 369KB (376KB) | |
CPU + GPU: 369KB (377KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 393KB (377KB) | |
CPU + GPU: 393KB (378KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 796KB | |
CPU + GPU: 796KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace1[cuda]) | |
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace0[cuda]) | |
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuReshape{3}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0) | |
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuReshape{2}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0) | |
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)](GpuDimShuffle{0,1,2}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0) | |
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0) | |
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)](GpuDimShuffle{0,1,2}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0) | |
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>) | |
48000B [(12, 10, 100)] c GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0) | |
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuReshape{2}.0) | |
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>) | |
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
... (remaining 161 Apply nodes account for 443232B/1595232B (27.78%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
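Each Memory Profile above also reports "Max peak memory if allow_gc=False": an estimate of the cost of disabling Theano's garbage collection of intermediate results, which trades GPU memory for speed. A minimal sketch of trying it, using only the standard Theano config flags:

    import theano

    # Keep intermediate buffers allocated between calls instead of freeing
    # them after each use (raises peak memory, as estimated above).
    theano.config.allow_gc = False
    # Scan has its own switch, relevant to the Scan Op profiles in this file.
    theano.config.scan.allow_gc = False
    # The same can be set without code changes, e.g.:
    #   THEANO_FLAGS="allow_gc=False,scan.allow_gc=False" python <your script>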
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls to the op (for a total of 1200 steps) 2.294225e+00s | |
Total time spent in calling the VM 2.171460e+00s (94.649%) | |
Total overhead (computing slices...) 1.227655e-01s (5.351%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
44.1% 44.1% 0.505s 1.91e-05s C 26400 22 theano.sandbox.cuda.basic_ops.GpuElemwise | |
32.8% 76.9% 0.375s 5.21e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm | |
8.3% 85.2% 0.095s 1.97e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
8.1% 93.3% 0.093s 3.88e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22 | |
3.9% 97.2% 0.044s 1.84e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
1.4% 98.6% 0.016s 3.41e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.9% 99.5% 0.010s 2.16e-06s C 4800 4 theano.compile.ops.Shape_i | |
0.5% 100.0% 0.005s 2.19e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
22.1% 22.1% 0.253s 5.28e-05s C 4800 4 GpuGemm{no_inplace} | |
12.2% 34.4% 0.140s 1.94e-05s C 7200 6 GpuElemwise{mul,no_inplace} | |
10.7% 45.0% 0.122s 5.09e-05s C 2400 2 GpuGemm{inplace} | |
8.1% 53.2% 0.093s 3.88e-05s C 2400 2 GpuDot22 | |
4.4% 57.6% 0.051s 2.11e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)] | |
4.3% 61.8% 0.049s 2.03e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::} | |
4.2% 66.1% 0.048s 2.02e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace} | |
4.1% 70.1% 0.047s 1.94e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
4.0% 74.2% 0.046s 1.92e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:} | |
3.9% 78.1% 0.045s 1.88e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)] | |
3.9% 82.0% 0.045s 1.86e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace} | |
3.9% 85.9% 0.044s 1.84e-05s C 2400 2 GpuAlloc{memset_0=True} | |
3.8% 89.7% 0.044s 1.82e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)] | |
3.8% 93.5% 0.044s 1.82e-05s C 2400 2 GpuElemwise{sub,no_inplace} | |
3.7% 97.2% 0.042s 1.77e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)] | |
0.7% 97.9% 0.008s 3.51e-06s C 2400 2 GpuSubtensor{::, int64::} | |
0.7% 98.6% 0.008s 3.31e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
0.5% 99.1% 0.006s 2.39e-06s C 2400 2 Shape_i{1} | |
0.5% 99.6% 0.005s 2.19e-06s C 2400 2 GpuDimShuffle{1,0} | |
0.4% 100.0% 0.005s 1.94e-06s C 2400 2 Shape_i{0} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
5.8% 5.8% 0.066s 5.50e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
5.5% 11.3% 0.063s 5.26e-05s 1200 7 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
5.4% 16.7% 0.062s 5.19e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
5.4% 22.1% 0.062s 5.15e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
5.3% 27.5% 0.061s 5.09e-05s 1200 42 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
5.3% 32.8% 0.061s 5.08e-05s 1200 43 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.1% 36.9% 0.047s 3.89e-05s 1200 30 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.1% 40.9% 0.046s 3.87e-05s 1200 31 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.2% 43.1% 0.025s 2.12e-05s 1200 44 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=(1, 0) | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.2% 45.3% 0.025s 2.10e-05s 1200 45 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=(1, 0) | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 47.5% 0.025s 2.05e-05s 1200 36 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.1% 49.6% 0.025s 2.05e-05s 1200 26 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 51.8% 0.024s 2.02e-05s 1200 37 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.1% 53.8% 0.024s 1.99e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 55.9% 0.024s 1.99e-05s 1200 27 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 58.0% 0.024s 1.98e-05s 1200 11 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.1% 60.1% 0.024s 1.98e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.0% 62.1% 0.023s 1.95e-05s 1200 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.0% 64.1% 0.023s 1.93e-05s 1200 38 GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.0% 66.1% 0.023s 1.92e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
... (remaining 26 Apply instances account for 33.85%(0.39s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 55KB (78KB) | |
CPU + GPU: 55KB (78KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 66KB (86KB) | |
CPU + GPU: 66KB (86KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 94KB | |
CPU + GPU: 94KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
4000B [(10, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuGemm{no_inplace}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
4000B [(100, 10)] v GpuDimShuffle{1,0}(GpuElemwise{mul,no_inplace}.0) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
... (remaining 26 Apply nodes account for 80112B/208112B (38.49%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
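The Memory Profiles also compare peak memory against a compilation with the Theano flag optimizer_excluding=inplace, i.e. with the inplace optimizations disabled. As a sketch of the two usual ways to reproduce that row (the commented function call is a placeholder, not taken from this profile):

    import theano

    # 1) Globally, via the environment, with no code changes:
    #      THEANO_FLAGS="optimizer_excluding=inplace" python <your script>
    # 2) Per function, via a compilation mode that excludes the inplace pass:
    mode = theano.compile.get_default_mode().excluding('inplace')
    # f = theano.function(inputs, outputs, mode=mode)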
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls to the op (for a total of 1200 steps) 2.288652e+00s | |
Total time spent in calling the VM 2.166391e+00s (94.658%) | |
Total overhead (computing slices...) 1.222615e-01s (5.342%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
44.0% 44.0% 0.503s 1.91e-05s C 26400 22 theano.sandbox.cuda.basic_ops.GpuElemwise | |
32.9% 76.9% 0.376s 5.22e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm | |
8.3% 85.2% 0.094s 1.97e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
8.2% 93.4% 0.093s 3.89e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22 | |
3.9% 97.2% 0.044s 1.84e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
1.4% 98.7% 0.016s 3.43e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.9% 99.6% 0.010s 2.12e-06s C 4800 4 theano.compile.ops.Shape_i | |
0.4% 100.0% 0.005s 2.11e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
22.2% 22.2% 0.254s 5.28e-05s C 4800 4 GpuGemm{no_inplace} | |
12.2% 34.4% 0.140s 1.94e-05s C 7200 6 GpuElemwise{mul,no_inplace} | |
10.7% 45.1% 0.122s 5.09e-05s C 2400 2 GpuGemm{inplace} | |
8.2% 53.3% 0.093s 3.89e-05s C 2400 2 GpuDot22 | |
4.4% 57.7% 0.051s 2.11e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)] | |
4.3% 62.0% 0.049s 2.03e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::} | |
4.2% 66.2% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace} | |
4.0% 70.3% 0.046s 1.93e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
4.0% 74.3% 0.046s 1.91e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:} | |
3.9% 78.2% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace} | |
3.9% 82.0% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)] | |
3.9% 85.9% 0.044s 1.84e-05s C 2400 2 GpuAlloc{memset_0=True} | |
3.8% 89.7% 0.044s 1.82e-05s C 2400 2 GpuElemwise{sub,no_inplace} | |
3.8% 93.5% 0.044s 1.82e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)] | |
3.7% 97.2% 0.042s 1.75e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)] | |
0.7% 98.0% 0.008s 3.54e-06s C 2400 2 GpuSubtensor{::, int64::} | |
0.7% 98.7% 0.008s 3.32e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
0.5% 99.2% 0.006s 2.38e-06s C 2400 2 Shape_i{1} | |
0.4% 99.6% 0.005s 2.11e-06s C 2400 2 GpuDimShuffle{1,0} | |
0.4% 100.0% 0.004s 1.87e-06s C 2400 2 Shape_i{0} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
5.8% 5.8% 0.066s 5.52e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
5.5% 11.3% 0.063s 5.27e-05s 1200 7 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
5.4% 16.8% 0.062s 5.19e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
5.4% 22.2% 0.062s 5.16e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
5.4% 27.5% 0.061s 5.10e-05s 1200 43 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
5.3% 32.9% 0.061s 5.09e-05s 1200 42 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.1% 37.0% 0.047s 3.89e-05s 1200 30 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.1% 41.1% 0.047s 3.88e-05s 1200 31 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.2% 43.3% 0.025s 2.12e-05s 1200 44 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=(1, 0) | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.2% 45.5% 0.025s 2.11e-05s 1200 45 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=(1, 0) | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.2% 47.6% 0.025s 2.05e-05s 1200 36 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.1% 49.8% 0.024s 2.04e-05s 1200 26 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 51.9% 0.024s 2.00e-05s 1200 37 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.1% 54.0% 0.024s 1.99e-05s 1200 27 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 56.1% 0.024s 1.97e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 58.1% 0.024s 1.97e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.1% 60.2% 0.023s 1.95e-05s 1200 11 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.0% 62.2% 0.023s 1.93e-05s 1200 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.0% 64.2% 0.023s 1.93e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.0% 66.3% 0.023s 1.92e-05s 1200 38 GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
... (remaining 26 Apply instances account for 33.75%(0.39s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 55KB (78KB) | |
CPU + GPU: 55KB (78KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 66KB (86KB) | |
CPU + GPU: 66KB (86KB) | |
Max peak memory if allow_gc=False (the linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 94KB | |
CPU + GPU: 94KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>) | |
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)](GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, gatedrecurrent_apply_states_replace0[cuda]) | |
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
... (remaining 26 Apply nodes account for 80112B/208112B (38.49%) of the Apply nodes with dense output sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
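The two grad_of_gatedrecurrent_apply_scan profiles above are separate instances of the same scan, not a duplicated printout (the exit summary below counts 2 apply nodes for that op); each Scan Op gets its own section, keyed by the scan's name. For a hand-written scan the same per-op profile can be requested through the documented name and profile arguments of theano.scan — a toy sketch, unrelated to the Blocks graphs profiled here:

    import theano
    import theano.tensor as T

    x = T.matrix('x')
    # Running sum over the rows of x; one scan step per row.
    outputs, updates = theano.scan(
        fn=lambda row, acc: acc + row,
        sequences=x,
        outputs_info=T.zeros_like(x[0]),
        name='running_sum_scan',  # appears in the "Scan Op profiling ( ... )" header
        profile=True)             # gives this scan its own profile section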
Function profiling | |
================== | |
Message: Sum of all (17) printed profiles at exit, excluding Scan op profiles. | |
Time in 6938 calls to Function.__call__: 1.028157e+02s | |
Time in Function.fn.__call__: 1.024500e+02s (99.644%) | |
Time in thunks: 4.343875e+01s (42.249%) | |
Total compile time: 6.253434e+02s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 2.134617e+02s | |
Theano validate time: 4.772263e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 2.980593e+02s | |
Import time 1.529284e+01s | |
Time in all calls to theano.grad() 2.823545e+00s | |
Time since theano import 834.193s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
66.5% 66.5% 28.871s 3.42e-02s Py 844 11 theano.scan_module.scan_op.Scan | |
21.8% 88.2% 9.454s 5.87e-02s Py 161 2 lvsr.ops.EditDistanceOp | |
4.3% 92.5% 1.874s 2.16e-05s C 86731 877 theano.sandbox.cuda.basic_ops.GpuElemwise | |
1.8% 94.3% 0.779s 3.05e-05s C 25580 252 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
0.9% 95.2% 0.395s 4.64e-05s C 8505 86 theano.sandbox.cuda.blas.GpuDot22 | |
0.8% 96.0% 0.340s 3.54e-06s C 96048 1098 theano.tensor.elemwise.Elemwise | |
0.7% 96.7% 0.313s 1.81e-05s C 17247 197 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.4% 97.2% 0.180s 2.53e-05s C 7127 75 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.4% 97.6% 0.168s 2.24e-05s Py 7505 51 theano.ifelse.IfElse | |
0.3% 97.9% 0.151s 3.26e-06s C 46180 473 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.3% 98.2% 0.150s 2.61e-05s C 5766 61 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.3% 98.6% 0.137s 7.11e-06s C 19212 205 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.3% 98.9% 0.127s 7.95e-06s C 16013 116 theano.compile.ops.DeepCopyOp | |
0.1% 99.0% 0.056s 4.38e-05s C 1280 9 theano.sandbox.cuda.blas.GpuGemm | |
0.1% 99.1% 0.056s 1.55e-05s C 3593 31 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.1% 99.2% 0.054s 3.50e-06s C 15373 167 theano.tensor.opt.MakeVector | |
0.1% 99.4% 0.051s 4.25e-06s C 12067 128 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.1% 99.5% 0.046s 3.25e-06s C 14041 157 theano.compile.ops.Shape_i | |
0.1% 99.5% 0.035s 7.33e-05s C 472 6 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.1% 99.6% 0.033s 5.02e-05s C 648 7 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
... (remaining 24 Classes account for 0.39%(0.17s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
21.8% 21.8% 9.454s 5.87e-02s Py 161 2 EditDistanceOp | |
21.5% 43.2% 9.321s 9.32e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan} | |
14.2% 57.4% 6.165s 6.16e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan&generator_generate_scan} | |
10.6% 68.0% 4.615s 2.31e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan} | |
8.5% 76.5% 3.680s 3.68e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan} | |
4.7% 81.2% 2.026s 3.32e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan} | |
3.7% 84.9% 1.599s 1.60e-02s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
3.2% 88.0% 1.380s 8.57e-03s Py 161 2 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
0.9% 88.9% 0.395s 4.64e-05s C 8505 86 GpuDot22 | |
0.7% 89.7% 0.319s 3.80e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1} | |
0.7% 90.4% 0.313s 1.81e-05s C 17247 197 HostFromGpu | |
0.5% 90.9% 0.207s 2.13e-05s C 9700 97 GpuElemwise{add,no_inplace} | |
0.4% 91.3% 0.173s 2.20e-05s C 7861 79 GpuElemwise{sub,no_inplace} | |
0.4% 91.7% 0.171s 3.57e-05s C 4800 48 GpuCAReduce{add}{1,1} | |
0.4% 92.0% 0.155s 2.38e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace} | |
0.4% 92.4% 0.154s 2.49e-05s Py 6200 39 if{gpu} | |
0.3% 92.7% 0.150s 2.34e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace} | |
0.3% 93.0% 0.134s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)] | |
0.3% 93.3% 0.133s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)] | |
0.3% 93.6% 0.133s 2.29e-05s C 5800 58 GpuElemwise{Switch,no_inplace} | |
... (remaining 328 Ops account for 6.36%(2.76s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
21.5% 21.5% 9.321s 9.32e-02s 100 2406 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuS | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 12), strides=c | |
input 2: dtype=float32, shape=(15, 10, 200), strides=c | |
input 3: dtype=float32, shape=(15, 10, 100), strides=c | |
input 4: dtype=float32, shape=(15, 10, 100), strides=c | |
input 5: dtype=float32, shape=(15, 10, 100), strides=c | |
input 6: dtype=float32, shape=(15, 10, 1), strides=c | |
input 7: dtype=float32, shape=(15, 10, 200), strides=c | |
input 8: dtype=float32, shape=(15, 10, 12), strides=c | |
input 9: dtype=float32, shape=(15, 10, 200), strides=c | |
input 10: dtype=float32, shape=(15, 10, 100), strides=c | |
input 11: dtype=float32, shape=(15, 10, 100), strides=c | |
input 12: dtype=float32, shape=(15, 10, 100), strides=c | |
input 13: dtype=float32, shape=(15, 10, 200), strides=c | |
input 14: dtype=float32, shape=(16, 10, 100), strides=c | |
input 15: dtype=float32, shape=(16, 10, 200), strides=c | |
input 16: dtype=float32, shape=(16, 10, 12), strides=c | |
input 17: dtype=float32, shape=(16, 10, 100), strides=c | |
input 18: dtype=float32, shape=(16, 10, 200), strides=c | |
input 19: dtype=float32, shape=(16, 10, 12), strides=c | |
input 20: dtype=float32, shape=(2, 100, 1), strides=c | |
input 21: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
input 22: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
input 23: dtype=float32, shape=(2, 100, 1), strides=c | |
input 24: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
input 25: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
input 26: dtype=int64, shape=(), strides=c | |
input 27: dtype=int64, shape=(), strides=c | |
input 28: dtype=int64, shape=(), strides=c | |
input 29: dtype=int64, shape=(), strides=c | |
input 30: dtype=int64, shape=(), strides=c | |
input 31: dtype=int64, shape=(), strides=c | |
input 32: dtype=int64, shape=(), strides=c | |
input 33: dtype=int64, shape=(), strides=c | |
input 34: dtype=float32, shape=(100, 200), strides=c | |
input 35: dtype=float32, shape=(200, 200), strides=c | |
input 36: dtype=float32, shape=(100, 100), strides=c | |
input 37: dtype=float32, shape=(200, 100), strides=c | |
input 38: dtype=float32, shape=(100, 100), strides=c | |
input 39: dtype=float32, shape=(200, 200), strides=c | |
input 40: dtype=float32, shape=(200, 100), strides=c | |
input 41: dtype=float32, shape=(100, 100), strides=c | |
input 42: dtype=float32, shape=(100, 200), strides=c | |
input 43: dtype=float32, shape=(100, 100), strides=c | |
input 44: dtype=int64, shape=(2,), strides=c | |
input 45: dtype=float32, shape=(12, 10, 100), strides=c | |
input 46: dtype=int64, shape=(1,), strides=c | |
input 47: dtype=float32, shape=(12, 10), strides=c | |
input 48: dtype=float32, shape=(12, 10, 200), strides=c | |
input 49: dtype=float32, shape=(100, 1), strides=c | |
input 50: dtype=int8, shape=(10,), strides=c | |
input 51: dtype=float32, shape=(1, 100), strides=c | |
input 52: dtype=float32, shape=(100, 200), strides=c | |
input 53: dtype=float32, shape=(200, 200), strides=c | |
input 54: dtype=float32, shape=(100, 100), strides=c | |
input 55: dtype=float32, shape=(200, 100), strides=c | |
input 56: dtype=float32, shape=(100, 100), strides=c | |
input 57: dtype=float32, shape=(200, 200), strides=c | |
input 58: dtype=float32, shape=(200, 100), strides=c | |
input 59: dtype=float32, shape=(100, 100), strides=c | |
input 60: dtype=float32, shape=(100, 200), strides=c | |
input 61: dtype=float32, shape=(100, 100), strides=c | |
input 62: dtype=int64, shape=(2,), strides=c | |
input 63: dtype=float32, shape=(12, 10, 100), strides=c | |
input 64: dtype=int64, shape=(1,), strides=c | |
input 65: dtype=float32, shape=(12, 10), strides=c | |
input 66: dtype=float32, shape=(12, 10, 200), strides=c | |
input 67: dtype=float32, shape=(100, 1), strides=c | |
input 68: dtype=int8, shape=(10,), strides=c | |
input 69: dtype=float32, shape=(1, 100), strides=c | |
output 0: dtype=float32, shape=(16, 10, 100), strides=c | |
output 1: dtype=float32, shape=(16, 10, 200), strides=c | |
output 2: dtype=float32, shape=(16, 10, 12), strides=c | |
output 3: dtype=float32, shape=(16, 10, 100), strides=c | |
output 4: dtype=float32, shape=(16, 10, 200), strides=c | |
output 5: dtype=float32, shape=(16, 10, 12), strides=c | |
output 6: dtype=float32, shape=(2, 100, 1), strides=c | |
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
output 9: dtype=float32, shape=(2, 100, 1), strides=c | |
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=c | |
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=c | |
output 12: dtype=float32, shape=(15, 10, 100), strides=c | |
output 13: dtype=float32, shape=(15, 10, 200), strides=c | |
output 14: dtype=float32, shape=(15, 10, 100), strides=c | |
output 15: dtype=float32, shape=(15, 100, 10), strides=c | |
output 16: dtype=float32, shape=(15, 10, 100), strides=c | |
output 17: dtype=float32, shape=(15, 10, 200), strides=c | |
output 18: dtype=float32, shape=(15, 10, 100), strides=c | |
output 19: dtype=float32, shape=(15, 100, 10), strides=c | |
19.5% 40.9% 8.452s 1.39e-01s 61 279 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask) | |
input 0: dtype=int64, shape=(15, 75), strides=c | |
input 1: dtype=float32, shape=(15, 75), strides=c | |
input 2: dtype=int64, shape=(12, 75), strides=c | |
input 3: dtype=float32, shape=(12, 75), strides=c | |
output 0: dtype=int64, shape=(15, 75, 1), strides=c | |
14.2% 55.1% 6.165s 6.16e-02s 100 1795 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuD | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=c | |
input 2: dtype=float32, shape=(1, 10, 200), strides=c | |
input 3: dtype=float32, shape=(1, 92160), strides=c | |
input 4: dtype=float32, shape=(1, 10, 100), strides=c | |
input 5: dtype=float32, shape=(1, 10, 200), strides=c | |
input 6: dtype=float32, shape=(2, 92160), strides=c | |
input 7: dtype=int64, shape=(), strides=c | |
input 8: dtype=int64, shape=(), strides=c | |
input 9: dtype=float32, shape=(100, 44), strides=c | |
input 10: dtype=float32, shape=(200, 44), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(200, 200), strides=c | |
input 13: dtype=float32, shape=(45, 100), strides=c | |
input 14: dtype=float32, shape=(100, 200), strides=c | |
input 15: dtype=float32, shape=(100, 100), strides=c | |
input 16: dtype=float32, shape=(200, 100), strides=c | |
input 17: dtype=float32, shape=(100, 100), strides=c | |
input 18: dtype=float32, shape=(100, 100), strides=c | |
input 19: dtype=float32, shape=(1, 44), strides=c | |
input 20: dtype=float32, shape=(1, 200), strides=c | |
input 21: dtype=float32, shape=(1, 100), strides=c | |
input 22: dtype=int64, shape=(1,), strides=c | |
input 23: dtype=float32, shape=(12, 10), strides=c | |
input 24: dtype=float32, shape=(12, 10, 200), strides=c | |
input 25: dtype=float32, shape=(100, 1), strides=c | |
input 26: dtype=int8, shape=(10,), strides=c | |
input 27: dtype=float32, shape=(12, 10, 100), strides=c | |
input 28: dtype=float32, shape=(12, 10, 200), strides=c | |
input 29: dtype=float32, shape=(12, 10, 100), strides=c | |
output 0: dtype=float32, shape=(1, 10, 100), strides=c | |
output 1: dtype=float32, shape=(1, 10, 200), strides=c | |
output 2: dtype=float32, shape=(1, 92160), strides=c | |
output 3: dtype=float32, shape=(1, 10, 100), strides=c | |
output 4: dtype=float32, shape=(1, 10, 200), strides=c | |
output 5: dtype=float32, shape=(2, 92160), strides=c | |
output 6: dtype=int64, shape=(15, 10), strides=c | |
output 7: dtype=int64, shape=(15, 10), strides=c | |
8.5% 63.6% 3.680s 3.68e-02s 100 2157 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64 | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 200), strides=c | |
input 2: dtype=float32, shape=(15, 10, 100), strides=c | |
input 3: dtype=float32, shape=(15, 10, 1), strides=c | |
input 4: dtype=float32, shape=(15, 10, 200), strides=c | |
input 5: dtype=float32, shape=(15, 10, 100), strides=c | |
input 6: dtype=float32, shape=(16, 10, 100), strides=c | |
input 7: dtype=float32, shape=(16, 10, 200), strides=c | |
input 8: dtype=float32, shape=(16, 10, 12), strides=c | |
input 9: dtype=float32, shape=(16, 10, 100), strides=c | |
input 10: dtype=float32, shape=(16, 10, 200), strides=c | |
input 11: dtype=float32, shape=(16, 10, 12), strides=c | |
input 12: dtype=float32, shape=(100, 200), strides=c | |
input 13: dtype=float32, shape=(200, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(200, 100), strides=c | |
input 16: dtype=float32, shape=(100, 100), strides=c | |
input 17: dtype=float32, shape=(12, 10), strides=c | |
input 18: dtype=float32, shape=(12, 10, 100), strides=c | |
input 19: dtype=int64, shape=(1,), strides=c | |
input 20: dtype=float32, shape=(12, 10, 200), strides=c | |
input 21: dtype=int8, shape=(10,), strides=c | |
input 22: dtype=float32, shape=(100, 1), strides=c | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(200, 200), strides=c | |
input 25: dtype=float32, shape=(100, 100), strides=c | |
input 26: dtype=float32, shape=(200, 100), strides=c | |
input 27: dtype=float32, shape=(100, 100), strides=c | |
input 28: dtype=float32, shape=(12, 10), strides=c | |
input 29: dtype=float32, shape=(12, 10, 100), strides=c | |
input 30: dtype=int64, shape=(1,), strides=c | |
input 31: dtype=float32, shape=(12, 10, 200), strides=c | |
input 32: dtype=int8, shape=(10,), strides=c | |
input 33: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(16, 10, 100), strides=c | |
output 1: dtype=float32, shape=(16, 10, 200), strides=c | |
output 2: dtype=float32, shape=(16, 10, 12), strides=c | |
output 3: dtype=float32, shape=(16, 10, 100), strides=c | |
output 4: dtype=float32, shape=(16, 10, 200), strides=c | |
output 5: dtype=float32, shape=(16, 10, 12), strides=c | |
5.3% 68.9% 2.311s 2.31e-02s 100 2602 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 100), strides=c | |
input 4: dtype=float32, shape=(12, 10, 1), strides=c | |
input 5: dtype=float32, shape=(12, 10, 200), strides=c | |
input 6: dtype=float32, shape=(12, 10, 100), strides=c | |
input 7: dtype=float32, shape=(12, 10, 100), strides=c | |
input 8: dtype=float32, shape=(12, 10, 1), strides=c | |
input 9: dtype=float32, shape=(13, 10, 100), strides=c | |
input 10: dtype=float32, shape=(13, 10, 100), strides=c | |
input 11: dtype=int64, shape=(), strides=c | |
input 12: dtype=int64, shape=(), strides=c | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=float32, shape=(100, 200), strides=c | |
input 18: dtype=float32, shape=(100, 100), strides=c | |
input 19: dtype=float32, shape=(200, 100), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(100, 200), strides=c | |
input 22: dtype=float32, shape=(100, 100), strides=c | |
input 23: dtype=float32, shape=(200, 100), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
output 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 3: dtype=float32, shape=(12, 10, 200), strides=c | |
output 4: dtype=float32, shape=(12, 100, 10), strides=c | |
output 5: dtype=float32, shape=(12, 10, 100), strides=c | |
output 6: dtype=float32, shape=(12, 10, 200), strides=c | |
output 7: dtype=float32, shape=(12, 100, 10), strides=c | |
5.3% 74.2% 2.305s 2.30e-02s 100 2603 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0 | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 100), strides=c | |
input 4: dtype=float32, shape=(12, 10, 1), strides=c | |
input 5: dtype=float32, shape=(12, 10, 200), strides=c | |
input 6: dtype=float32, shape=(12, 10, 100), strides=c | |
input 7: dtype=float32, shape=(12, 10, 100), strides=c | |
input 8: dtype=float32, shape=(12, 10, 1), strides=c | |
input 9: dtype=float32, shape=(13, 10, 100), strides=c | |
input 10: dtype=float32, shape=(13, 10, 100), strides=c | |
input 11: dtype=int64, shape=(), strides=c | |
input 12: dtype=int64, shape=(), strides=c | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=float32, shape=(100, 200), strides=c | |
input 18: dtype=float32, shape=(100, 100), strides=c | |
input 19: dtype=float32, shape=(200, 100), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(100, 200), strides=c | |
input 22: dtype=float32, shape=(100, 100), strides=c | |
input 23: dtype=float32, shape=(200, 100), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
output 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 3: dtype=float32, shape=(12, 10, 200), strides=c | |
output 4: dtype=float32, shape=(12, 100, 10), strides=c | |
output 5: dtype=float32, shape=(12, 10, 100), strides=c | |
output 6: dtype=float32, shape=(12, 10, 200), strides=c | |
output 7: dtype=float32, shape=(12, 100, 10), strides=c | |
4.7% 78.9% 2.026s 3.32e-02s 61 268 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=float32, shape=(100, 44), strides=c | |
input 6: dtype=float32, shape=(200, 44), strides=c | |
input 7: dtype=float32, shape=(100, 200), strides=c | |
input 8: dtype=float32, shape=(200, 200), strides=c | |
input 9: dtype=float32, shape=(45, 100), strides=c | |
input 10: dtype=float32, shape=(100, 200), strides=c | |
input 11: dtype=float32, shape=(100, 100), strides=c | |
input 12: dtype=float32, shape=(200, 100), strides=c | |
input 13: dtype=float32, shape=(100, 100), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 16: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 17: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 18: dtype=int64, shape=(1,), strides=c | |
input 19: dtype=float32, shape=(12, 75), strides=(75, 1) | |
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 21: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 22: dtype=int8, shape=(75,), strides=c | |
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
output 3: dtype=int64, shape=(15, 75), strides=c | |
3.7% 82.5% 1.599s 1.60e-02s 100 1601 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncS | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 1), strides=c | |
input 4: dtype=float32, shape=(12, 10, 200), strides=c | |
input 5: dtype=float32, shape=(12, 10, 100), strides=c | |
input 6: dtype=float32, shape=(12, 10, 1), strides=c | |
input 7: dtype=float32, shape=(12, 10, 100), strides=c | |
input 8: dtype=float32, shape=(13, 10, 100), strides=c | |
input 9: dtype=float32, shape=(12, 10, 100), strides=c | |
input 10: dtype=float32, shape=(13, 10, 100), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
output 2: dtype=float32, shape=(12, 10, 100), strides=c | |
output 3: dtype=float32, shape=(13, 10, 100), strides=c | |
2.3% 84.9% 1.002s 1.00e-02s 100 1861 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask10) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
input 1: dtype=float32, shape=(15, 10), strides=c | |
input 2: dtype=int64, shape=(12, 10), strides=c | |
input 3: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(15, 10, 1), strides=c | |
2.0% 86.8% 0.851s 8.51e-03s 100 1611 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
input 2: dtype=float32, shape=(12, 10, 100), strides=c | |
input 3: dtype=float32, shape=(12, 10, 1), strides=c | |
input 4: dtype=float32, shape=(12, 10, 200), strides=c | |
input 5: dtype=float32, shape=(12, 10, 100), strides=c | |
input 6: dtype=float32, shape=(12, 10, 1), strides=c | |
input 7: dtype=float32, shape=(13, 10, 100), strides=c | |
input 8: dtype=float32, shape=(13, 10, 100), strides=c | |
input 9: dtype=float32, shape=(100, 200), strides=c | |
input 10: dtype=float32, shape=(100, 100), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=c | |
output 1: dtype=float32, shape=(13, 10, 100), strides=c | |
1.2% 88.0% 0.528s 8.66e-03s 61 254 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 4: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1) | |
input 5: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1) | |
input 6: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0) | |
input 7: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 8: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 9: dtype=float32, shape=(100, 200), strides=c | |
input 10: dtype=float32, shape=(100, 100), strides=c | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.1% 88.1% 0.043s 3.88e-03s 11 140 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.1% 88.2% 0.042s 3.79e-03s 11 182 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.1% 88.3% 0.023s 3.81e-06s 6075 0 DeepCopyOp(labels) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
0.0% 88.3% 0.016s 2.59e-06s 6075 1 DeepCopyOp(inputs) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
0.0% 88.3% 0.011s 1.11e-04s 100 2572 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
output 1: dtype=float32, shape=(12, 10, 100), strides=c | |
0.0% 88.4% 0.010s 1.05e-04s 100 2573 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
output 1: dtype=float32, shape=(12, 10, 100), strides=c | |
0.0% 88.4% 0.010s 9.82e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
output 0: dtype=int64, shape=(15, 10), strides=c | |
0.0% 88.4% 0.009s 4.94e-05s 176 37 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(200, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.0% 88.4% 0.008s 8.06e-05s 100 2356 GpuSplit{2}(GpuElemwise{mul,no_inplace}.0, TensorConstant{0}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(15, 10), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(14, 10), strides=c | |
output 1: dtype=float32, shape=(1, 10), strides=c | |
... (remaining 4271 Apply instances account for 11.57% (5.03s) of the runtime) | |
Memory Profile (the max between all functions in that profile) | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 57KB (61KB) | |
GPU: 4979KB (6661KB) | |
CPU + GPU: 5035KB (6721KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 56KB (61KB) | |
GPU: 6160KB (7107KB) | |
CPU + GPU: 6216KB (7167KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 115KB | |
GPU: 16958KB | |
CPU + GPU: 17073KB | |
--- | |
This list is based on all functions in the profile | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0) | |
1132320B [(1, 10, 100), (1, 10, 200), (1, 92160), (1, 10, 100), (1, 10, 200), (2, 92160), (15, 10), (15, 10)] i i i i i i c c forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0, GpuJoin.0, GpuElemwise{Add}[(0, 0)].0) | |
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0) | |
720000B [(12, 75, 200)] v GpuDimShuffle{0,1,2}(GpuJoin.0) | |
720000B [(900, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1}) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
... (remaining 4271 Apply account for 67003141B/82625821B (81.09%) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. |