Old profile
===========
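For context, a minimal sketch of how a Theano function profile like the reports below is usually produced. The graph, names, and shapes here are illustrative only and not taken from the run above:

    import numpy
    import theano
    import theano.tensor as T

    # Profiling can also be enabled globally with
    # THEANO_FLAGS="profile=True,profile_memory=True".
    x = T.matrix('x')
    w = theano.shared(numpy.random.randn(10, 5).astype('float32'), name='w')
    y = T.dot(x, w).sum()

    # profile=True attaches a profiler to this function; Theano prints the
    # per-function and aggregated summaries (like the report below) when the
    # process exits.
    f = theano.function([x], y, profile=True)
    f(numpy.random.randn(4, 10).astype('float32'))
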
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.984119e-03s
Time in Function.fn.__call__: 8.468628e-04s (42.682%)
Total compile time: 5.483155e+00s
Number of Apply nodes: 0
Theano Optimizer time: 1.670289e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.310276e-04s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.781s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
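The report above accumulated no per-thunk execution time; its own hint points at the profiling.time_thunks config flag. A minimal, hedged way to turn it on, assuming it is set before the profiled functions are compiled:

    import theano
    # Equivalent to THEANO_FLAGS="profiling.time_thunks=True"; records the
    # time spent in each thunk so the Class/Ops/Apply tables get filled in.
    theano.config.profiling.time_thunks = True
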
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 11 calls to Function.__call__: 2.355814e-02s
Time in Function.fn.__call__: 2.024937e-02s (85.955%)
Time in thunks: 9.337664e-03s (39.637%)
Total compile time: 6.343132e+00s
Number of Apply nodes: 43
Theano Optimizer time: 3.600280e-01s
Theano validate time: 2.064705e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.223059e-01s
Import time 3.409195e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.781s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.009s 1.97e-05s C 473 43 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.009s 1.97e-05s C 473 43 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.8% 4.8% 0.000s 4.09e-05s 11 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.8% 7.6% 0.000s 2.36e-05s 11 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.8% 10.4% 0.000s 2.34e-05s 11 25 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.7% 13.0% 0.000s 2.27e-05s 11 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 15.7% 0.000s 2.23e-05s 11 27 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 18.3% 0.000s 2.21e-05s 11 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 20.9% 0.000s 2.21e-05s 11 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 23.5% 0.000s 2.19e-05s 11 32 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 26.0% 0.000s 2.19e-05s 11 17 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 28.6% 0.000s 2.17e-05s 11 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 31.1% 0.000s 2.16e-05s 11 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 33.7% 0.000s 2.15e-05s 11 31 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 36.2% 0.000s 2.15e-05s 11 29 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 38.7% 0.000s 2.14e-05s 11 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 41.2% 0.000s 2.11e-05s 11 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 43.7% 0.000s 2.10e-05s 11 28 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 46.2% 0.000s 2.10e-05s 11 36 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 48.6% 0.000s 2.09e-05s 11 33 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 51.1% 0.000s 2.09e-05s 11 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 53.5% 0.000s 2.09e-05s 11 35 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 23 Apply instances account for 46.46%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 43 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
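The memory profile above compares the current settings against the Theano flags optimizer_excluding=inplace and allow_gc=False. A minimal sketch of trying the allow_gc variant; it is an assumption that this is set before compilation, and it can equally be passed through THEANO_FLAGS:

    import theano
    # Equivalent to THEANO_FLAGS="allow_gc=False"; keeps intermediate
    # results allocated between calls, trading memory for speed.
    theano.config.allow_gc = False
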
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 10 calls to Function.__call__: 1.226211e-02s
Time in Function.fn.__call__: 1.183033e-02s (96.479%)
Time in thunks: 4.946470e-03s (40.339%)
Total compile time: 6.681131e+00s
Number of Apply nodes: 29
Theano Optimizer time: 1.198421e-01s
Theano validate time: 2.441406e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.311059e-01s
Import time 6.275487e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.787s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
52.6% 52.6% 0.003s 1.73e-05s C 150 15 theano.sandbox.cuda.basic_ops.HostFromGpu
44.3% 96.8% 0.002s 2.43e-05s C 90 9 theano.sandbox.cuda.basic_ops.GpuElemwise
3.2% 100.0% 0.000s 3.13e-06s C 50 5 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
52.6% 52.6% 0.003s 1.73e-05s C 150 15 HostFromGpu
44.3% 96.8% 0.002s 2.43e-05s C 90 9 GpuElemwise{true_div,no_inplace}
3.2% 100.0% 0.000s 3.13e-06s C 50 5 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
10.8% 10.8% 0.001s 5.32e-05s 10 0 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
6.2% 17.0% 0.000s 3.09e-05s 10 15 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
4.4% 21.4% 0.000s 2.20e-05s 10 13 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 25.8% 0.000s 2.14e-05s 10 1 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_critic_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 29.9% 0.000s 2.06e-05s 10 12 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 34.1% 0.000s 2.05e-05s 10 2 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 38.2% 0.000s 2.05e-05s 10 4 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean2_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 42.3% 0.000s 2.03e-05s 10 5 GpuElemwise{true_div,no_inplace}(shared_mean_last_character_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 46.4% 0.000s 2.03e-05s 10 3 GpuElemwise{true_div,no_inplace}(shared_readout_costs_max_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 50.5% 0.000s 2.01e-05s 10 6 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_expected_reward, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 54.1% 0.000s 1.77e-05s 10 19 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.5% 57.5% 0.000s 1.72e-05s 10 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 61.0% 0.000s 1.70e-05s 10 7 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 64.4% 0.000s 1.67e-05s 10 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 67.7% 0.000s 1.67e-05s 10 8 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 71.0% 0.000s 1.64e-05s 10 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 74.4% 0.000s 1.64e-05s 10 26 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 77.7% 0.000s 1.64e-05s 10 21 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 81.0% 0.000s 1.63e-05s 10 27 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 84.2% 0.000s 1.62e-05s 10 20 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
... (remaining 9 Apply instances account for 15.76%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 29 Apply account for 136B/136B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 101 calls to Function.__call__: 1.714706e-02s
Time in Function.fn.__call__: 1.415157e-02s (82.531%)
Time in thunks: 2.484560e-03s (14.490%)
Total compile time: 6.216795e+00s
Number of Apply nodes: 6
Theano Optimizer time: 4.745817e-02s
Theano validate time: 1.499653e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 2.376604e-02s
Import time 1.632404e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.791s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.5% 54.5% 0.001s 3.35e-06s C 404 4 theano.compile.ops.Shape_i
45.5% 100.0% 0.001s 5.60e-06s C 202 2 theano.tensor.basic.Alloc
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.5% 45.5% 0.001s 5.60e-06s C 202 2 Alloc
31.1% 76.6% 0.001s 3.82e-06s C 202 2 Shape_i{1}
23.4% 100.0% 0.001s 2.88e-06s C 202 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
28.2% 28.2% 0.001s 6.94e-06s 101 4 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
19.2% 47.4% 0.000s 4.71e-06s 101 0 Shape_i{1}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
17.3% 64.7% 0.000s 4.27e-06s 101 5 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
12.8% 77.5% 0.000s 3.16e-06s 101 1 Shape_i{0}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.9% 89.4% 0.000s 2.92e-06s 101 2 Shape_i{1}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
10.6% 100.0% 0.000s 2.60e-06s 101 3 Shape_i{0}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
... (remaining 5 Apply account for 992B/2192B ((45.26%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.662898e-02s
Time in Function.fn.__call__: 1.507092e-02s (90.630%)
Time in thunks: 1.027775e-02s (61.806%)
Total compile time: 5.965592e+00s
Number of Apply nodes: 2
Theano Optimizer time: 1.966500e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.714872e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.793s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.010s 5.14e-05s C 200 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.010s 5.14e-05s C 200 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
95.6% 95.6% 0.010s 9.82e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
4.4% 100.0% 0.000s 4.54e-06s 100 1 DeepCopyOp(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c DeepCopyOp(shared_recognizer_costs_prediction)
... (remaining 1 Apply account for 960B/2160B ((44.44%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 2 calls to Function.__call__: 5.192757e-03s
Time in Function.fn.__call__: 4.395008e-03s (84.637%)
Time in thunks: 1.830101e-03s (35.243%)
Total compile time: 5.798583e+00s
Number of Apply nodes: 31
Theano Optimizer time: 1.590829e-01s
Theano validate time: 1.525164e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 4.815388e-02s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.794s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.002s 2.95e-05s C 62 31 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.002s 2.95e-05s C 62 31 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.6% 4.6% 0.000s 4.20e-05s 2 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.3% 8.9% 0.000s 3.96e-05s 2 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.1% 13.1% 0.000s 3.79e-05s 2 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.1% 17.1% 0.000s 3.74e-05s 2 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.9% 21.0% 0.000s 3.55e-05s 2 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 24.7% 0.000s 3.40e-05s 2 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 28.5% 0.000s 3.40e-05s 2 14 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 32.2% 0.000s 3.40e-05s 2 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 35.8% 0.000s 3.30e-05s 2 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 39.3% 0.000s 3.25e-05s 2 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 42.9% 0.000s 3.25e-05s 2 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 46.4% 0.000s 3.21e-05s 2 15 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 49.9% 0.000s 3.21e-05s 2 9 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 53.3% 0.000s 3.16e-05s 2 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 56.8% 0.000s 3.16e-05s 2 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 60.2% 0.000s 3.15e-05s 2 22 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 63.7% 0.000s 3.15e-05s 2 20 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 67.1% 0.000s 3.15e-05s 2 19 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 70.6% 0.000s 3.15e-05s 2 18 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 74.0% 0.000s 3.11e-05s 2 17 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 11 Apply instances account for 26.04%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 31 Apply account for 140B/140B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 8.800030e-04s
Time in Function.fn.__call__: 8.380413e-04s (95.232%)
Time in thunks: 3.595352e-04s (40.856%)
Total compile time: 6.387939e+00s
Number of Apply nodes: 21
Theano Optimizer time: 8.277297e-02s
Theano validate time: 1.749992e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.883909e-02s
Import time 4.663944e-03s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.798s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.4% 53.4% 0.000s 1.75e-05s C 11 11 theano.sandbox.cuda.basic_ops.HostFromGpu
42.3% 95.8% 0.000s 2.54e-05s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.2% 100.0% 0.000s 3.81e-06s C 4 4 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.4% 53.4% 0.000s 1.75e-05s C 11 11 HostFromGpu
42.3% 95.8% 0.000s 2.54e-05s C 6 6 GpuElemwise{true_div,no_inplace}
4.2% 100.0% 0.000s 3.81e-06s C 4 4 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
11.7% 11.7% 0.000s 4.20e-05s 1 8 GpuElemwise{true_div,no_inplace}(shared_weights_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.4% 18.0% 0.000s 2.29e-05s 1 1 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.2% 24.2% 0.000s 2.22e-05s 1 3 GpuElemwise{true_div,no_inplace}(shared_mask_density, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 30.3% 0.000s 2.19e-05s 1 7 GpuElemwise{true_div,no_inplace}(shared_mean_attended, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 36.4% 0.000s 2.19e-05s 1 2 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 42.5% 0.000s 2.19e-05s 1 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.9% 48.4% 0.000s 2.12e-05s 1 6 GpuElemwise{true_div,no_inplace}(shared_mean_bottom_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 53.4% 0.000s 1.81e-05s 1 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 58.5% 0.000s 1.81e-05s 1 12 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 63.5% 0.000s 1.79e-05s 1 11 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 68.2% 0.000s 1.72e-05s 1 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 73.0% 0.000s 1.72e-05s 1 13 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 77.7% 0.000s 1.69e-05s 1 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 82.4% 0.000s 1.69e-05s 1 5 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 86.9% 0.000s 1.60e-05s 1 10 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 91.3% 0.000s 1.60e-05s 1 9 HostFromGpu(shared_weights_penalty)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 95.8% 0.000s 1.60e-05s 1 4 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
1.4% 97.1% 0.000s 5.01e-06s 1 19 Elemwise{true_div,no_inplace}(HostFromGpu.0, shared_batch_size)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.1% 98.3% 0.000s 4.05e-06s 1 20 Elemwise{true_div,no_inplace}(shared_train_cost, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
0.9% 99.1% 0.000s 3.10e-06s 1 15 Elemwise{true_div,no_inplace}(shared_batch_size, HostFromGpu.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 1 Apply instances account for 0.86%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 21 Apply account for 100B/100B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 1 calls to Function.__call__: 4.670620e-04s
Time in Function.fn.__call__: 3.008842e-04s (64.421%)
Time in thunks: 1.330376e-04s (28.484%)
Total compile time: 7.051143e+00s
Number of Apply nodes: 5
Theano Optimizer time: 3.080988e-02s
Theano validate time: 2.636909e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 9.856939e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.801s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.66e-05s C 5 5 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.000s 2.66e-05s C 5 5 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
57.2% 57.2% 0.000s 7.61e-05s 1 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
16.5% 73.7% 0.000s 2.19e-05s 1 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
15.1% 88.7% 0.000s 2.00e-05s 1 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
8.2% 97.0% 0.000s 1.10e-05s 1 3 DeepCopyOp(TensorConstant{0})
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
3.0% 100.0% 0.000s 4.05e-06s 1 4 DeepCopyOp(TensorConstant{0.0})
input 0: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 5 Apply account for 28B/28B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 1.440048e-04s
Time in Function.fn.__call__: 1.199245e-04s (83.278%)
Time in thunks: 3.504753e-05s (24.338%)
Total compile time: 5.531962e+00s
Number of Apply nodes: 3
Theano Optimizer time: 2.350092e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 4.795074e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.802s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.4% 71.4% 0.000s 2.50e-05s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
17.0% 88.4% 0.000s 5.96e-06s C 1 1 theano.compile.ops.DeepCopyOp
11.6% 100.0% 0.000s 4.05e-06s C 1 1 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.4% 71.4% 0.000s 2.50e-05s C 1 1 HostFromGpu
17.0% 88.4% 0.000s 5.96e-06s C 1 1 DeepCopyOp
11.6% 100.0% 0.000s 4.05e-06s C 1 1 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
71.4% 71.4% 0.000s 2.50e-05s 1 1 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
17.0% 88.4% 0.000s 5.96e-06s 1 0 DeepCopyOp(shared_batch_size)
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
11.6% 100.0% 0.000s 4.05e-06s 1 2 Elemwise{true_div,no_inplace}(shared_mean_total_reward, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 3 Apply account for 20B/20B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 61 calls to Function.__call__: 1.151932e+01s
Time in Function.fn.__call__: 1.151220e+01s (99.938%)
Time in thunks: 1.112233e+01s (96.554%)
Total compile time: 6.020690e+01s
Number of Apply nodes: 284
Theano Optimizer time: 6.218818e+00s
Theano validate time: 2.867708e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 4.509264e+01s
Import time 3.776977e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.803s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
76.0% 76.0% 8.452s 1.39e-01s Py 61 1 lvsr.ops.EditDistanceOp
23.0% 99.0% 2.554s 2.09e-02s Py 122 2 theano.scan_module.scan_op.Scan
0.2% 99.1% 0.021s 2.85e-06s C 7320 120 theano.tensor.elemwise.Elemwise
0.2% 99.3% 0.020s 6.64e-05s C 305 5 theano.sandbox.cuda.blas.GpuDot22
0.1% 99.4% 0.013s 3.00e-05s C 427 7 theano.sandbox.cuda.basic_ops.GpuElemwise
0.1% 99.5% 0.008s 3.37e-05s C 244 4 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 99.6% 0.007s 1.22e-04s C 61 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.6% 0.007s 2.16e-05s C 305 5 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 99.7% 0.005s 2.85e-06s C 1586 26 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 99.7% 0.004s 2.92e-06s C 1464 24 theano.compile.ops.Shape_i
0.0% 99.8% 0.004s 2.23e-05s C 183 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 99.8% 0.003s 5.64e-05s C 61 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.0% 99.8% 0.003s 3.31e-06s C 1037 17 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 99.8% 0.003s 2.76e-05s C 122 2 theano.compile.ops.DeepCopyOp
0.0% 99.9% 0.003s 2.74e-06s C 1098 18 theano.tensor.opt.MakeVector
0.0% 99.9% 0.002s 2.32e-06s C 1037 17 theano.tensor.basic.ScalarFromTensor
0.0% 99.9% 0.002s 7.52e-06s Py 305 3 theano.ifelse.IfElse
0.0% 99.9% 0.002s 4.18e-06s C 549 9 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.002s 5.26e-06s C 305 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.001s 6.62e-06s Py 183 3 theano.compile.ops.Rebroadcast
... (remaining 8 Classes account for 0.04%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
76.0% 76.0% 8.452s 1.39e-01s Py 61 1 EditDistanceOp
18.2% 94.2% 2.026s 3.32e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan}
4.8% 99.0% 0.528s 8.66e-03s Py 61 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
0.2% 99.1% 0.020s 6.64e-05s C 305 5 GpuDot22
0.1% 99.2% 0.010s 3.33e-05s C 305 5 GpuElemwise{Add}[(0, 0)]
0.1% 99.3% 0.007s 1.22e-04s C 61 1 GpuJoin
0.1% 99.4% 0.007s 3.74e-05s C 183 3 GpuAlloc
0.1% 99.4% 0.007s 2.16e-05s C 305 5 GpuIncSubtensor{InplaceSet;:int64:}
0.0% 99.5% 0.004s 2.23e-05s C 183 3 HostFromGpu
0.0% 99.5% 0.003s 5.64e-05s C 61 1 GpuAdvancedSubtensor1
0.0% 99.5% 0.003s 2.76e-05s C 122 2 DeepCopyOp
0.0% 99.5% 0.003s 2.74e-06s C 1098 18 MakeVector{dtype='int64'}
0.0% 99.6% 0.002s 2.32e-06s C 1037 17 ScalarFromTensor
0.0% 99.6% 0.002s 2.80e-06s C 793 13 Shape_i{0}
0.0% 99.6% 0.002s 3.21e-06s C 671 11 GpuReshape{2}
0.0% 99.6% 0.002s 3.06e-06s C 671 11 Shape_i{1}
0.0% 99.6% 0.002s 2.64e-06s C 671 11 Elemwise{add,no_inplace}
0.0% 99.7% 0.002s 2.66e-06s C 610 10 Elemwise{sub,no_inplace}
0.0% 99.7% 0.002s 5.26e-06s C 305 5 GpuAllocEmpty
0.0% 99.7% 0.002s 8.29e-06s Py 183 2 if{inplace}
... (remaining 76 Ops account for 0.32%(0.04s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
76.0% 76.0% 8.452s 1.39e-01s 61 279 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
18.2% 94.2% 2.026s 3.32e-02s 61 268 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
4.8% 99.0% 0.528s 8.66e-03s 61 254 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 5: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 6: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 7: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 8: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.1% 99.0% 0.007s 1.22e-04s 61 262 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.1% 0.005s 7.75e-05s 61 148 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 99.1% 0.005s 7.65e-05s 61 150 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 99.1% 0.004s 7.11e-05s 61 265 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.003s 5.64e-05s 61 72 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(900,), strides=c
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.003s 5.48e-05s 61 147 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.003s 5.23e-05s 61 149 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.3% 0.002s 3.99e-05s 61 53 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.3% 0.002s 3.81e-05s 61 178 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.002s 3.76e-05s 61 180 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.002s 3.63e-05s 61 65 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.3% 0.002s 3.61e-05s 61 116 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, generator_generate_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.24e-05s 61 4 DeepCopyOp(CudaNdarrayConstant{1.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
0.0% 99.4% 0.002s 3.13e-05s 61 177 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 3.02e-05s 61 267 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 2.95e-05s 61 179 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 2.76e-05s 61 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 264 Apply instances account for 0.58%(0.06s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 18KB (18KB)
GPU: 3168KB (3653KB)
CPU + GPU: 3185KB (3671KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 18KB (18KB)
GPU: 3519KB (4327KB)
CPU + GPU: 3537KB (4345KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 37KB
GPU: 5180KB
CPU + GPU: 5217KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(900, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 200)] v GpuDimShuffle{0,1,2}(GpuJoin.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0)
360000B [(900, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
360000B [(900, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
... (remaining 264 Apply account for 8196678B/20973438B ((39.08%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 732 steps) 5.235906e-01s
Total time spent in calling the VM 5.032728e-01s (96.120%)
Total overhead (computing slices..) 2.031779e-02s (3.880%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.6% 54.6% 0.153s 5.23e-05s C 2928 4 theano.sandbox.cuda.blas.GpuGemm
42.1% 96.7% 0.118s 2.02e-05s C 5856 8 theano.sandbox.cuda.basic_ops.GpuElemwise
3.3% 100.0% 0.009s 3.15e-06s C 2928 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
54.6% 54.6% 0.153s 5.23e-05s C 2928 4 GpuGemm{no_inplace}
11.6% 66.2% 0.033s 2.22e-05s C 1464 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
10.4% 76.6% 0.029s 2.00e-05s C 1464 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
10.1% 86.7% 0.028s 1.93e-05s C 1464 2 GpuElemwise{mul,no_inplace}
10.0% 96.7% 0.028s 1.92e-05s C 1464 2 GpuElemwise{sub,no_inplace}
1.8% 98.5% 0.005s 3.36e-06s C 1464 2 GpuSubtensor{::, :int64:}
1.5% 100.0% 0.004s 2.93e-06s C 1464 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
13.8% 13.8% 0.039s 5.31e-05s 732 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
13.6% 27.5% 0.038s 5.22e-05s 732 3 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
13.6% 41.0% 0.038s 5.20e-05s 732 12 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
13.6% 54.6% 0.038s 5.20e-05s 732 13 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.9% 60.5% 0.016s 2.25e-05s 732 14 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.7% 66.2% 0.016s 2.20e-05s 732 15 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.3% 71.5% 0.015s 2.02e-05s 732 4 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
5.2% 76.7% 0.015s 2.00e-05s 732 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 1), strides=c
5.2% 81.9% 0.015s 1.98e-05s 732 5 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
5.0% 86.9% 0.014s 1.93e-05s 732 10 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.0% 91.9% 0.014s 1.93e-05s 732 11 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.8% 96.7% 0.013s 1.83e-05s 732 2 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 1), strides=c
0.9% 97.6% 0.003s 3.44e-06s 732 8 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.9% 98.5% 0.002s 3.29e-06s 732 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.8% 99.3% 0.002s 3.11e-06s 732 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.7% 100.0% 0.002s 2.75e-06s 732 9 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
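A reading of the dominant Composite elemwise above, inferred only from the printed expression, the shapes, and the sub nodes (not from the model code): with m the (75, 1) mask column, z the gate slice produced by GpuSubtensor{::, :int64:}, g the GpuGemm output feeding the tanh, and h the previous state,

    h_t = m * (z * tanh(g) + h * (1 - z)) + (1 - m) * h

i.e. the usual gated-recurrent state update, with masked-out rows carrying the previous state forward unchanged.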
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 147KB (206KB)
CPU + GPU: 147KB (206KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 147KB (206KB)
CPU + GPU: 147KB (206KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 0KB
GPU: 294KB
CPU + GPU: 294KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
... (remaining 2 Apply account for 600B/540600B ((0.11%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
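The alternative peak-memory figures in the memory profile above refer to standard Theano configuration flags; a minimal sketch of how they would be set (flag names as printed in the profile, everything else illustrative):

    # via the environment, before Python starts:
    #   THEANO_FLAGS='allow_gc=False,optimizer_excluding=inplace' python script.py
    # or programmatically, before compiling any function:
    import theano
    theano.config.allow_gc = False                 # keep intermediates allocated between calls
    theano.config.optimizer_excluding = 'inplace'  # skip in-place optimizations
    theano.config.mode = 'DebugMode'               # the DebugMode referred to in the note above

Disabling the inplace optimizations and using DebugMode trade speed for diagnostics; allow_gc=False trades memory for speed, which is what the last set of numbers estimates.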
Scan Op profiling ( generator_generate_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 915 steps) 2.016554e+00s
Total time spent in calling the VM 1.933907e+00s (95.902%)
  Total overhead (computing slices...) 8.264709e-02s (4.098%)
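For scale, from the header above: 2.016554 s over 915 steps is about 2.2 ms per generation step, roughly three times the ~0.7 ms per step of the gatedrecurrent_apply_scan profiled earlier (5.235906e-01 s / 732 steps).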
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
27.2% 27.2% 0.275s 2.31e-05s C 11895 13 theano.sandbox.cuda.basic_ops.GpuElemwise
20.7% 47.9% 0.209s 4.58e-05s C 4575 5 theano.sandbox.cuda.blas.GpuDot22
20.3% 68.3% 0.205s 4.49e-05s C 4575 5 theano.sandbox.cuda.blas.GpuGemm
10.4% 78.7% 0.105s 2.29e-05s C 4575 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
4.0% 82.7% 0.041s 4.47e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.8% 86.5% 0.038s 2.09e-05s C 1830 2 theano.sandbox.cuda.basic_ops.HostFromGpu
3.4% 89.9% 0.034s 3.75e-05s C 915 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
2.3% 92.2% 0.024s 2.58e-05s C 915 1 theano.tensor.basic.MaxAndArgmax
1.4% 93.7% 0.014s 2.26e-06s C 6405 7 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.3% 95.0% 0.014s 1.48e-05s C 915 1 theano.sandbox.multinomial.MultinomialFromUniform
1.2% 96.2% 0.012s 1.35e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuFromHost
1.0% 97.2% 0.010s 2.21e-06s C 4575 5 theano.compile.ops.Shape_i
0.9% 98.1% 0.009s 3.13e-06s C 2745 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.7% 98.7% 0.007s 1.84e-06s C 3660 4 theano.tensor.opt.MakeVector
0.6% 99.3% 0.006s 3.25e-06s C 1830 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.4% 99.7% 0.004s 2.12e-06s C 1830 2 theano.tensor.elemwise.Elemwise
0.3% 100.0% 0.003s 3.20e-06s C 915 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
20.7% 20.7% 0.209s 4.58e-05s C 4575 5 GpuDot22
20.3% 41.1% 0.205s 4.49e-05s C 4575 5 GpuGemm{inplace}
5.4% 46.5% 0.055s 3.01e-05s C 1830 2 GpuElemwise{mul,no_inplace}
4.0% 50.6% 0.041s 4.47e-05s C 915 1 GpuAdvancedSubtensor1
3.8% 54.3% 0.038s 2.09e-05s C 1830 2 HostFromGpu
3.4% 57.7% 0.034s 3.75e-05s C 915 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
2.7% 60.4% 0.027s 2.98e-05s C 915 1 GpuElemwise{add,no_inplace}
2.6% 63.0% 0.026s 2.83e-05s C 915 1 GpuCAReduce{add}{1,0,0}
2.4% 65.4% 0.024s 2.62e-05s C 915 1 GpuCAReduce{maximum}{1,0}
2.3% 67.7% 0.024s 2.58e-05s C 915 1 MaxAndArgmax
2.3% 70.0% 0.023s 2.57e-05s C 915 1 GpuElemwise{Tanh}[(0, 0)]
2.1% 72.2% 0.022s 2.36e-05s C 915 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
2.0% 74.2% 0.021s 2.25e-05s C 915 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
1.9% 76.1% 0.019s 2.06e-05s C 915 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
1.9% 77.9% 0.019s 2.05e-05s C 915 1 GpuCAReduce{maximum}{0,1}
1.8% 79.8% 0.019s 2.03e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
1.8% 81.6% 0.018s 2.00e-05s C 915 1 GpuElemwise{TrueDiv}[(0, 0)]
1.8% 83.4% 0.018s 2.00e-05s C 915 1 GpuCAReduce{add}{1,0}
1.8% 85.2% 0.018s 1.99e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)]
1.8% 87.0% 0.018s 1.99e-05s C 915 1 GpuElemwise{Add}[(0, 1)]
... (remaining 21 Ops account for 13.00%(0.13s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
6.7% 6.7% 0.067s 7.36e-05s 915 47 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(900, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(900, 1), strides=c
4.4% 11.1% 0.045s 4.87e-05s 915 39 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.4% 15.4% 0.044s 4.83e-05s 915 11 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
4.3% 19.8% 0.044s 4.76e-05s 915 9 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
4.0% 23.8% 0.041s 4.47e-05s 915 30 GpuAdvancedSubtensor1(W_copy[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(75,), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.8% 27.6% 0.039s 4.24e-05s 915 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
3.6% 31.3% 0.037s 4.01e-05s 915 40 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.6% 34.9% 0.037s 4.00e-05s 915 57 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
input 0: dtype=float32, shape=(12, 75, 1), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(12, 75, 200), strides=c
3.6% 38.5% 0.036s 3.99e-05s 915 33 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
3.5% 42.0% 0.035s 3.84e-05s 915 6 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
3.4% 45.4% 0.034s 3.75e-05s 915 14 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(75,), strides=c
3.4% 48.8% 0.034s 3.73e-05s 915 38 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.4% 52.1% 0.034s 3.73e-05s 915 42 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
2.7% 54.8% 0.027s 2.98e-05s 915 44 GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=c
2.6% 57.4% 0.026s 2.83e-05s 915 58 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
2.4% 59.8% 0.024s 2.62e-05s 915 49 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=float32, shape=(75,), strides=c
2.3% 62.1% 0.024s 2.58e-05s 915 28 MaxAndArgmax(MultinomialFromUniform{int64}.0, TensorConstant{(1,) of 1})
input 0: dtype=int64, shape=(75, 44), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=int64, shape=(75,), strides=c
output 1: dtype=int64, shape=(75,), strides=c
2.3% 64.4% 0.023s 2.57e-05s 915 46 GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=c
output 0: dtype=float32, shape=(900, 100), strides=c
2.1% 66.6% 0.022s 2.36e-05s 915 26 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(75, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
2.1% 68.7% 0.022s 2.36e-05s 915 41 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
... (remaining 39 Apply instances account for 31.29%(0.32s) of the runtime)
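Several of the ops above combine into a numerically stable masked softmax over the 12 attended positions followed by a weighted average; a sketch inferred from the op names and shapes only: with e_i the energies (the (900, 1) GpuDot22 output reshaped to (12, 75)), m_i a per-position weighting (the third input of the exp Composite) and A_i the (12, 75, 200) attended sequence,

    alpha_i = exp(e_i - max_j e_j) * m_i / sum_j (exp(e_j - max_j e_j) * m_j)
    w       = sum_i alpha_i * A_i

which is what the GpuCAReduce{maximum}, Composite{exp(i0 - i1) * i2}, GpuCAReduce{add}, TrueDiv, GpuElemwise{mul,no_inplace} and GpuCAReduce{add}{1,0,0} nodes implement step by step.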
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 41KB
GPU: 1709KB
CPU + GPU: 1750KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
720000B [(12, 75, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
368940B [(92160,), (75,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
360000B [(12, 75, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace[cuda])
360000B [(900, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
360000B [(900, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
360000B [(12, 75, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
60000B [(75, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
60000B [(75, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
60000B [(75, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
60000B [(75, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
60000B [(75, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
30000B [(75, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
30000B [(75, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax)
30000B [(75, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
... (remaining 39 Apply account for 188879B/3287819B ((5.74%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:103
Time in 11 calls to Function.__call__: 1.403611e-01s
Time in Function.fn.__call__: 1.400502e-01s (99.779%)
Time in thunks: 9.480190e-02s (67.541%)
Total compile time: 6.756872e+01s
Number of Apply nodes: 190
Theano Optimizer time: 4.246896e+00s
Theano validate time: 1.580198e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 5.792800e+01s
Import time 1.193612e-01s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.896s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
89.0% 89.0% 0.084s 3.84e-03s Py 22 2 theano.scan_module.scan_op.Scan
3.0% 92.0% 0.003s 2.59e-06s C 1089 99 theano.tensor.elemwise.Elemwise
1.9% 93.9% 0.002s 4.05e-05s C 44 4 theano.sandbox.cuda.blas.GpuDot22
0.9% 94.8% 0.001s 1.99e-05s C 44 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.7% 95.5% 0.001s 6.24e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.6% 96.1% 0.001s 2.50e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.5% 96.7% 0.001s 4.63e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.5% 97.1% 0.000s 2.08e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.6% 0.000s 3.18e-06s C 132 12 theano.sandbox.cuda.basic_ops.GpuReshape
0.4% 98.0% 0.000s 2.77e-06s C 143 13 theano.compile.ops.Shape_i
0.4% 98.4% 0.000s 2.77e-06s C 143 13 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 98.8% 0.000s 2.48e-06s C 132 12 theano.tensor.opt.MakeVector
0.3% 99.1% 0.000s 2.21e-06s C 132 12 theano.tensor.basic.ScalarFromTensor
0.3% 99.3% 0.000s 2.39e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.3% 99.6% 0.000s 3.71e-06s C 66 6 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.2% 99.8% 0.000s 6.49e-06s Py 22 2 theano.compile.ops.Rebroadcast
0.1% 99.9% 0.000s 6.40e-06s C 22 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.1% 100.0% 0.000s 5.25e-06s C 11 1 theano.tensor.basic.Alloc
0.0% 100.0% 0.000s 3.21e-06s C 11 1 theano.tensor.basic.Reshape
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
89.0% 89.0% 0.084s 3.84e-03s Py 22 2 forall_inplace,gpu,gatedrecurrent_apply_scan}
1.9% 90.9% 0.002s 4.05e-05s C 44 4 GpuDot22
0.9% 91.8% 0.001s 1.99e-05s C 44 4 GpuElemwise{Add}[(0, 0)]
0.7% 92.6% 0.001s 6.24e-05s C 11 1 GpuJoin
0.6% 93.1% 0.001s 2.50e-05s C 22 2 GpuAlloc
0.5% 93.7% 0.001s 4.63e-05s C 11 1 GpuAdvancedSubtensor1
0.5% 94.2% 0.000s 2.08e-05s C 22 2 GpuIncSubtensor{InplaceSet;:int64:}
0.3% 94.5% 0.000s 2.48e-06s C 132 12 MakeVector{dtype='int64'}
0.3% 94.8% 0.000s 2.21e-06s C 132 12 ScalarFromTensor
0.3% 95.1% 0.000s 2.39e-05s C 11 1 HostFromGpu
0.3% 95.4% 0.000s 2.57e-06s C 99 9 Elemwise{add,no_inplace}
0.3% 95.6% 0.000s 3.19e-06s C 77 7 GpuReshape{2}
0.2% 95.9% 0.000s 2.97e-06s C 77 7 Shape_i{0}
0.2% 96.1% 0.000s 2.43e-06s C 88 8 Elemwise{le,no_inplace}
0.2% 96.3% 0.000s 2.83e-06s C 66 6 GpuDimShuffle{x,x,0}
0.2% 96.5% 0.000s 3.16e-06s C 55 5 GpuReshape{3}
0.2% 96.6% 0.000s 2.55e-06s C 66 6 Shape_i{1}
0.2% 96.8% 0.000s 2.50e-06s C 66 6 Elemwise{sub,no_inplace}
0.2% 97.0% 0.000s 6.49e-06s Py 22 2 Rebroadcast{0}
0.1% 97.1% 0.000s 6.40e-06s C 22 2 GpuAllocEmpty
... (remaining 56 Ops account for 2.88%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
45.0% 45.0% 0.043s 3.88e-03s 11 140 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
44.0% 89.0% 0.042s 3.79e-03s 11 182 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.7% 89.8% 0.001s 6.24e-05s 11 188 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.5% 90.3% 0.001s 4.63e-05s 11 31 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(12,), strides=c
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.5% 90.8% 0.000s 4.20e-05s 11 58 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.5% 91.2% 0.000s 4.02e-05s 11 55 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.5% 91.7% 0.000s 4.00e-05s 11 57 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.5% 92.2% 0.000s 3.98e-05s 11 56 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.3% 92.5% 0.000s 2.64e-05s 11 103 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.3% 92.8% 0.000s 2.39e-05s 11 189 HostFromGpu(GpuJoin.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=c
0.3% 93.0% 0.000s 2.36e-05s 11 71 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.2% 93.3% 0.000s 2.14e-05s 11 137 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.5% 0.000s 2.01e-05s 11 167 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.7% 0.000s 2.01e-05s 11 78 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 94.0% 0.000s 1.99e-05s 11 79 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.2% 94.2% 0.000s 1.99e-05s 11 80 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 94.4% 0.000s 1.97e-05s 11 81 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.1% 94.5% 0.000s 7.91e-06s 11 124 GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 94.6% 0.000s 6.59e-06s 11 132 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.1% 94.7% 0.000s 6.39e-06s 11 98 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
... (remaining 170 Apply instances account for 5.32%(0.01s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 9KB (9KB)
GPU: 28KB (34KB)
CPU + GPU: 38KB (43KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 9KB (9KB)
GPU: 33KB (38KB)
CPU + GPU: 42KB (48KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 10KB
GPU: 52KB
CPU + GPU: 63KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(100, 200)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuDimShuffle{0,1}(W)
80000B [(100, 200)] v GpuDimShuffle{0,1}(W)
40000B [(100, 100)] v GpuDimShuffle{0,1}(W)
40000B [(100, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuDimShuffle{0,1}(W)
40000B [(100, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
9600B [(12, 1, 200)] c HostFromGpu(GpuJoin.0)
4800B [(12, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
... (remaining 170 Apply account for 94077B/679677B ((13.84%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 4.214001e-02s
Total time spent in calling the VM 4.016280e-02s (95.308%)
  Total overhead (computing slices...) 1.977205e-03s (4.692%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
61.2% 61.2% 0.013s 4.91e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
34.9% 96.2% 0.007s 1.87e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
3.8% 100.0% 0.001s 3.07e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
61.2% 61.2% 0.013s 4.91e-05s C 264 2 GpuGemm{no_inplace}
11.9% 73.1% 0.003s 1.90e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
11.6% 84.7% 0.002s 1.86e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
11.5% 96.2% 0.002s 1.84e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.1% 98.3% 0.000s 3.34e-06s C 132 1 GpuSubtensor{::, :int64:}
1.7% 100.0% 0.000s 2.80e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
31.4% 31.4% 0.007s 5.04e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
29.8% 61.2% 0.006s 4.79e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
11.9% 73.1% 0.003s 1.90e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
11.6% 84.7% 0.002s 1.86e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
11.5% 96.2% 0.002s 1.84e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.1% 98.3% 0.000s 3.34e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
1.7% 100.0% 0.000s 2.80e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 4.123449e-02s
Total time spent in calling the VM 3.931022e-02s (95.333%)
  Total overhead (computing slices...) 1.924276e-03s (4.667%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
61.1% 61.1% 0.013s 4.84e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
35.1% 96.2% 0.007s 1.85e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
3.8% 100.0% 0.001s 3.01e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
61.1% 61.1% 0.013s 4.84e-05s C 264 2 GpuGemm{no_inplace}
12.1% 73.2% 0.003s 1.92e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
11.6% 84.8% 0.002s 1.84e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
11.4% 96.2% 0.002s 1.80e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.1% 98.3% 0.000s 3.31e-06s C 132 1 GpuSubtensor{::, :int64:}
1.7% 100.0% 0.000s 2.72e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
31.3% 31.3% 0.007s 4.96e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
29.8% 61.1% 0.006s 4.72e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
12.1% 73.2% 0.003s 1.92e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
11.6% 84.8% 0.002s 1.84e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
11.4% 96.2% 0.002s 1.80e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.1% 98.3% 0.000s 3.31e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
1.7% 100.0% 0.000s 2.72e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:111
Time in 11 calls to Function.__call__: 2.407074e-03s
Time in Function.fn.__call__: 2.131939e-03s (88.570%)
Time in thunks: 4.451275e-04s (18.492%)
Total compile time: 2.637064e+01s
Number of Apply nodes: 8
Theano Optimizer time: 8.200908e-02s
Theano validate time: 1.047134e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.873722e+01s
Import time 9.109974e-03s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.952s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.8% 45.8% 0.000s 1.85e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
19.9% 65.7% 0.000s 4.02e-06s C 22 2 theano.tensor.basic.Alloc
15.8% 81.5% 0.000s 3.20e-06s C 22 2 theano.compile.ops.Shape_i
6.9% 88.3% 0.000s 2.77e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuReshape
6.2% 94.5% 0.000s 2.49e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
5.5% 100.0% 0.000s 2.23e-06s C 11 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.8% 45.8% 0.000s 1.85e-05s C 11 1 HostFromGpu
19.9% 65.7% 0.000s 4.02e-06s C 22 2 Alloc
15.8% 81.5% 0.000s 3.20e-06s C 22 2 Shape_i{0}
6.9% 88.3% 0.000s 2.77e-06s C 11 1 GpuReshape{2}
6.2% 94.5% 0.000s 2.49e-06s C 11 1 GpuDimShuffle{x,x,0}
5.5% 100.0% 0.000s 2.23e-06s C 11 1 MakeVector{dtype='int64'}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
45.8% 45.8% 0.000s 1.85e-05s 11 7 HostFromGpu(GpuReshape{2}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=c
10.6% 56.4% 0.000s 4.29e-06s 11 4 Alloc(TensorConstant{0.0}, TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 12), strides=c
9.4% 65.8% 0.000s 3.79e-06s 11 0 Shape_i{0}(generator_generate_attended)
input 0: dtype=float32, shape=(12, 1, 200), strides=c
output 0: dtype=int64, shape=(), strides=c
9.3% 75.0% 0.000s 3.75e-06s 11 1 Alloc(TensorConstant{0.0}, TensorConstant{1}, TensorConstant{200})
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int16, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
6.9% 81.9% 0.000s 2.77e-06s 11 6 GpuReshape{2}(GpuDimShuffle{x,x,0}.0, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
6.4% 88.3% 0.000s 2.60e-06s 11 2 Shape_i{0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=int64, shape=(), strides=c
6.2% 94.5% 0.000s 2.49e-06s 11 3 GpuDimShuffle{x,x,0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
5.5% 100.0% 0.000s 2.23e-06s 11 5 MakeVector{dtype='int64'}(TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(2,), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 1KB
GPU: 0KB
CPU + GPU: 1KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 8 Apply account for 2080B/2080B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:126
Time in 176 calls to Function.__call__: 4.303689e-01s
Time in Function.fn.__call__: 4.239194e-01s (98.501%)
Time in thunks: 1.613367e-01s (37.488%)
Total compile time: 9.262143e+00s
Number of Apply nodes: 79
Theano Optimizer time: 5.638268e-01s
Theano validate time: 2.706265e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 2.979231e-01s
Import time 1.633863e-01s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.954s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
21.9% 21.9% 0.035s 4.02e-05s C 880 5 theano.sandbox.cuda.blas.GpuDot22
20.7% 42.7% 0.033s 1.90e-05s C 1760 10 theano.sandbox.cuda.basic_ops.GpuElemwise
17.9% 60.6% 0.029s 4.10e-05s C 704 4 theano.sandbox.cuda.blas.GpuGemm
7.3% 67.8% 0.012s 1.66e-05s C 704 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.2% 75.0% 0.012s 2.19e-05s C 528 3 theano.sandbox.cuda.basic_ops.GpuCAReduce
7.1% 82.1% 0.012s 1.31e-05s C 880 5 theano.sandbox.cuda.basic_ops.GpuFromHost
4.8% 86.9% 0.008s 4.41e-05s C 176 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.5% 90.4% 0.006s 2.68e-06s C 2112 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.2% 92.7% 0.004s 2.29e-06s C 1584 9 theano.tensor.elemwise.Elemwise
2.2% 94.9% 0.004s 2.94e-06s C 1232 7 theano.sandbox.cuda.basic_ops.GpuReshape
2.1% 97.1% 0.003s 2.45e-06s C 1408 8 theano.compile.ops.Shape_i
1.7% 98.8% 0.003s 2.28e-06s C 1232 7 theano.tensor.opt.MakeVector
0.7% 99.5% 0.001s 3.15e-06s C 352 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.8% 0.000s 2.45e-06s C 176 1 theano.tensor.elemwise.All
0.2% 100.0% 0.000s 2.15e-06s C 176 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
21.9% 21.9% 0.035s 4.02e-05s C 880 5 GpuDot22
17.9% 39.8% 0.029s 4.10e-05s C 704 4 GpuGemm{inplace}
7.3% 47.1% 0.012s 1.66e-05s C 704 4 HostFromGpu
7.1% 54.2% 0.012s 1.31e-05s C 880 5 GpuFromHost
4.8% 59.0% 0.008s 4.41e-05s C 176 1 GpuAdvancedSubtensor1
2.6% 61.7% 0.004s 2.42e-05s C 176 1 GpuCAReduce{maximum}{1,0}
2.3% 64.0% 0.004s 2.12e-05s C 176 1 GpuCAReduce{add}{1,0,0}
2.2% 66.2% 0.004s 2.02e-05s C 176 1 GpuCAReduce{add}{1,0}
2.2% 68.4% 0.003s 1.98e-05s C 176 1 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)]
2.1% 70.5% 0.003s 1.96e-05s C 176 1 GpuElemwise{mul,no_inplace}
2.1% 72.6% 0.003s 1.94e-05s C 176 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)]
2.1% 74.7% 0.003s 1.93e-05s C 176 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.1% 76.8% 0.003s 1.92e-05s C 176 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.1% 78.9% 0.003s 1.92e-05s C 176 1 GpuElemwise{Mul}[(0, 1)]
2.0% 80.9% 0.003s 1.86e-05s C 176 1 GpuElemwise{Add}[(0, 0)]
2.0% 83.0% 0.003s 1.86e-05s C 176 1 GpuElemwise{TrueDiv}[(0, 0)]
2.0% 85.0% 0.003s 1.83e-05s C 176 1 GpuElemwise{Sub}[(0, 1)]
2.0% 86.9% 0.003s 1.82e-05s C 176 1 GpuElemwise{Tanh}[(0, 0)]
1.9% 88.8% 0.003s 2.89e-06s C 1056 6 GpuReshape{2}
1.7% 90.6% 0.003s 2.28e-06s C 1232 7 MakeVector{dtype='int64'}
... (remaining 23 Ops account for 9.43%(0.02s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.4% 5.4% 0.009s 4.94e-05s 176 37 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
4.9% 10.3% 0.008s 4.54e-05s 176 29 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.9% 15.2% 0.008s 4.49e-05s 176 48 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.8% 20.1% 0.008s 4.41e-05s 176 12 GpuAdvancedSubtensor1(W, readout_sample_samples)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(1,), strides=(16,)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.3% 24.3% 0.007s 3.92e-05s 176 24 GpuDot22(GpuFromHost.0, state_to_gates)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.1% 28.4% 0.007s 3.77e-05s 176 47 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.1% 32.6% 0.007s 3.77e-05s 176 57 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
4.0% 36.6% 0.007s 3.70e-05s 176 51 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)].0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.0% 40.6% 0.006s 3.69e-05s 176 49 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.0% 44.6% 0.006s 3.68e-05s 176 33 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
2.6% 47.3% 0.004s 2.42e-05s 176 59 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.3% 49.6% 0.004s 2.12e-05s 176 77 GpuCAReduce{add}{1,0,0}(GpuElemwise{Mul}[(0, 1)].0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.2% 51.8% 0.004s 2.02e-05s 176 63 GpuCAReduce{add}{1,0}(GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.2% 53.9% 0.003s 1.98e-05s 176 54 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
2.1% 56.1% 0.003s 1.96e-05s 176 46 GpuElemwise{mul,no_inplace}(GpuFromHost.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.1% 58.2% 0.003s 1.94e-05s 176 50 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, GpuFromHost.0, CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(1, 100), strides=(0, 1)
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.1% 60.3% 0.003s 1.93e-05s 176 38 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
2.1% 62.4% 0.003s 1.92e-05s 176 61 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)](GpuReshape{2}.0, GpuDimShuffle{x,0}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
input 2: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
2.1% 64.5% 0.003s 1.92e-05s 176 76 GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1, 1), strides=(1, 0, 0)
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
2.0% 66.5% 0.003s 1.86e-05s 176 72 GpuElemwise{TrueDiv}[(0, 0)](GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0, GpuElemwise{Add}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
... (remaining 59 Apply instances account for 33.48%(0.05s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 18KB
CPU + GPU: 20KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(200, 100)] v GpuDimShuffle{0,1}(W)
80000B [(200, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] v GpuDimShuffle{0,1,2}(GpuFromHost.0)
9600B [(12, 1, 200)] i GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
9600B [(12, 1, 200)] c GpuFromHost(generator_generate_attended)
9600B [(12, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] i GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
4800B [(12, 1, 100)] v GpuDimShuffle{0,1,2}(GpuReshape{3}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)].0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
... (remaining 67 Apply account for 13955B/241155B ((5.79%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
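
A rough reading of the hot Apply nodes above, based only on the expressions the profiler prints (not on the project source): ids 33, 47, 49 and 51 are the GEMMs of a single decoder step, id 50's Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))} appears to be the gated-recurrent state update h_new = z * tanh(candidate) + (1 - z) * h_prev, and ids 54-77 score the 12 attended positions and collapse them into one 200-dimensional context vector with a masked softmax. A minimal NumPy sketch of that attention tail, with e (12, 1) the energies, mask (12, 1) the attended mask and attended (12, 1, 200) the encoded sequence (all three names are assumptions):

import numpy as np

def masked_softmax_context(e, mask, attended):
    # GpuCAReduce{maximum} + Composite{(exp((i0 - i1)) * i2)}: stable, masked exponentials
    w = np.exp(e - e.max(axis=0, keepdims=True)) * mask
    # GpuCAReduce{add} + GpuElemwise{TrueDiv}: normalise into attention weights
    w = w / w.sum(axis=0, keepdims=True)
    # GpuElemwise{Mul} + GpuCAReduce{add}{1,0,0}: weighted average of the attended states
    return (w[:, :, None] * attended).sum(axis=0)   # shape (1, 200)
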
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:137
Time in 176 calls to Function.__call__: 1.020806e-01s
Time in Function.fn.__call__: 9.711361e-02s (95.134%)
Time in thunks: 4.424906e-02s (43.347%)
Total compile time: 5.741991e+00s
Number of Apply nodes: 14
Theano Optimizer time: 1.551719e-01s
Theano validate time: 3.836393e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 6.299686e-02s
Import time 3.151894e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.967s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
29.7% 29.7% 0.013s 1.87e-05s C 704 4 theano.sandbox.cuda.basic_ops.GpuElemwise
18.0% 47.7% 0.008s 4.51e-05s C 176 1 theano.sandbox.cuda.blas.GpuGemm
16.2% 63.9% 0.007s 2.04e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuCAReduce
15.7% 79.6% 0.007s 3.94e-05s C 176 1 theano.sandbox.cuda.blas.GpuDot22
10.4% 90.0% 0.005s 1.31e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuFromHost
6.8% 96.8% 0.003s 1.71e-05s C 176 1 theano.sandbox.cuda.basic_ops.HostFromGpu
3.2% 100.0% 0.001s 2.67e-06s C 528 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
18.0% 18.0% 0.008s 4.51e-05s C 176 1 GpuGemm{inplace}
15.7% 33.6% 0.007s 3.94e-05s C 176 1 GpuDot22
10.4% 44.0% 0.005s 1.31e-05s C 352 2 GpuFromHost
8.4% 52.4% 0.004s 2.10e-05s C 176 1 GpuCAReduce{maximum}{0,1}
8.0% 60.4% 0.004s 2.00e-05s C 176 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
7.9% 68.2% 0.003s 1.98e-05s C 176 1 GpuCAReduce{add}{0,1}
7.4% 75.6% 0.003s 1.86e-05s C 176 1 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
7.4% 83.0% 0.003s 1.86e-05s C 176 1 GpuElemwise{Add}[(0, 1)]
7.0% 90.0% 0.003s 1.76e-05s C 176 1 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)]
6.8% 96.8% 0.003s 1.71e-05s C 176 1 HostFromGpu
2.0% 98.8% 0.001s 2.55e-06s C 352 2 GpuDimShuffle{0,x}
1.2% 100.0% 0.001s 2.90e-06s C 176 1 GpuDimShuffle{x,0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
18.0% 18.0% 0.008s 4.51e-05s 176 4 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
15.7% 33.6% 0.007s 3.94e-05s 176 3 GpuDot22(GpuFromHost.0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
8.4% 42.0% 0.004s 2.10e-05s 176 6 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1,), strides=(0,)
8.0% 49.9% 0.004s 2.00e-05s 176 8 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}(GpuElemwise{Add}[(0, 1)].0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.9% 57.8% 0.003s 1.98e-05s 176 9 GpuCAReduce{add}{0,1}(GpuElemwise{Composite{exp((i0 - i1))},no_inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1,), strides=(0,)
7.4% 65.2% 0.003s 1.86e-05s 176 11 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)](GpuDimShuffle{0,x}.0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=(0, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(1, 1), strides=(0, 0)
7.4% 72.6% 0.003s 1.86e-05s 176 5 GpuElemwise{Add}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.0% 79.6% 0.003s 1.76e-05s 176 12 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)](GpuElemwise{Add}[(0, 1)].0, GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
6.8% 86.4% 0.003s 1.71e-05s 176 13 HostFromGpu(GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=c
6.0% 92.4% 0.003s 1.51e-05s 176 0 GpuFromHost(generator_generate_weighted_averages)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.4% 96.8% 0.002s 1.11e-05s 176 1 GpuFromHost(generator_generate_states)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
1.2% 98.0% 0.001s 2.90e-06s 176 2 GpuDimShuffle{x,0}(b)
input 0: dtype=float32, shape=(44,), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
1.1% 99.0% 0.000s 2.65e-06s 176 7 GpuDimShuffle{0,x}(GpuCAReduce{maximum}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=(0,)
output 0: dtype=float32, shape=(1, 1), strides=(0, 0)
1.0% 100.0% 0.000s 2.46e-06s 176 10 GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=(0,)
output 0: dtype=float32, shape=(1, 1), strides=(0, 0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 14 Apply account for 2452B/2452B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
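
The 14 Apply nodes of this beam-search step read end to end: GpuDot22(states, W), the GpuGemm on the weighted averages and the bias b (id 5) produce the (1, 44) readout energies, and ids 6-12 turn them into per-symbol costs with the numerically stable log-softmax, cost_i = -(x_i - max_j x_j - log(sum_j exp(x_j - max_j x_j))), i.e. the negative log-probabilities that id 13 copies back to the host for the beam search. A NumPy sketch of that tail (the function name is an assumption):

import numpy as np

def neg_log_softmax(x):                              # x: (1, 44) readout energies
    m = x.max(axis=1, keepdims=True)                 # GpuCAReduce{maximum}{0,1}
    s = np.exp(x - m).sum(axis=1, keepdims=True)     # Composite{exp((i0 - i1))} + GpuCAReduce{add}{0,1}
    return -(x - (m + np.log(s)))                    # Composite{(i0 + log(i1))} + Composite{(-(i0 - i1))}
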
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 1.502037e-05s
Time in Function.fn.__call__: 6.198883e-06s (41.270%)
Total compile time: 5.506201e+00s
Number of Apply nodes: 0
Theano Optimizer time: 1.379991e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.969337e-04s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.970s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
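
This function compiled down to zero Apply nodes, so there is nothing for the thunk timer to accumulate; the profiling.time_thunks hint above is the profiler's generic advice rather than a problem with this graph. For functions that do contain Apply nodes, per-node timing is controlled by the Theano flag profiling.time_thunks (e.g. THEANO_FLAGS='profile=True,profiling.time_thunks=True'); a sketch of the in-process equivalent, assuming the flags are still settable at that point in the run:

import theano

# Collect per-Apply thunk timings in subsequently compiled/called functions
# (same effect as THEANO_FLAGS=profile=True,profiling.time_thunks=True).
theano.config.profile = True
theano.config.profiling.time_thunks = True
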
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 6075 calls to Function.__call__: 3.570089e-01s
Time in Function.fn.__call__: 2.095068e-01s (58.684%)
Time in thunks: 3.889871e-02s (10.896%)
Total compile time: 7.101128e+00s
Number of Apply nodes: 2
Theano Optimizer time: 1.691389e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.875090e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.970s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.039s 3.20e-06s C 12150 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.039s 3.20e-06s C 12150 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
59.5% 59.5% 0.023s 3.81e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
40.5% 100.0% 0.016s 2.59e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 2 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
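
Simple arithmetic on the figures above shows where this function's time goes: 3.57e-01s over 6075 calls is roughly 59us per call, of which only about 6.4us (the 10.9% "Time in thunks") is spent in the two DeepCopyOp nodes, each copying a 12-element int64 vector (96B). The remaining ~90% is Python-side Function.__call__ overhead, which is expected for a graph this small rather than a GPU or graph-optimisation problem.
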
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/algorithms/__init__.py:253
Time in 100 calls to Function.__call__: 9.018593e+01s
Time in Function.fn.__call__: 8.999730e+01s (99.791%)
Time in thunks: 3.194728e+01s (35.424%)
Total compile time: 3.881262e+02s
Number of Apply nodes: 3574
Theano Optimizer time: 2.013044e+02s
Theano validate time: 4.291104e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.755457e+02s
Import time 1.107465e+01s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.971s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
82.1% 82.1% 26.232s 3.75e-02s Py 700 7 theano.scan_module.scan_op.Scan
5.7% 87.8% 1.812s 2.16e-05s C 83700 837 theano.sandbox.cuda.basic_ops.GpuElemwise
3.1% 90.9% 1.002s 1.00e-02s Py 100 1 lvsr.ops.EditDistanceOp
2.4% 93.3% 0.761s 3.08e-05s C 24700 247 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.0% 94.3% 0.330s 4.65e-05s C 7100 71 theano.sandbox.cuda.blas.GpuDot22
1.0% 95.3% 0.313s 3.64e-06s C 86000 860 theano.tensor.elemwise.Elemwise
0.9% 96.2% 0.291s 1.82e-05s C 16000 160 theano.sandbox.cuda.basic_ops.HostFromGpu
0.5% 96.8% 0.173s 2.54e-05s C 6800 68 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.5% 97.3% 0.166s 2.30e-05s Py 7200 48 theano.ifelse.IfElse
0.4% 97.7% 0.142s 2.57e-05s C 5500 55 theano.sandbox.cuda.basic_ops.GpuAlloc
0.4% 98.2% 0.139s 3.32e-06s C 41800 418 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.4% 98.6% 0.129s 7.69e-06s C 16800 168 theano.sandbox.cuda.basic_ops.GpuReshape
0.2% 98.8% 0.063s 2.11e-05s C 3000 30 theano.compile.ops.DeepCopyOp
0.1% 98.9% 0.048s 4.30e-06s C 11100 111 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.1% 0.048s 3.69e-06s C 12900 129 theano.tensor.opt.MakeVector
0.1% 99.2% 0.039s 1.68e-05s C 2300 23 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.3% 0.036s 3.41e-06s C 10600 106 theano.compile.ops.Shape_i
0.1% 99.4% 0.030s 9.88e-05s Py 300 3 theano.sandbox.cuda.basic_ops.GpuSplit
0.1% 99.5% 0.026s 6.61e-05s C 400 4 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.5% 0.025s 2.81e-06s C 8800 88 theano.tensor.basic.ScalarFromTensor
... (remaining 24 Classes account for 0.45%(0.14s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
29.2% 29.2% 9.321s 9.32e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
19.3% 48.5% 6.165s 6.16e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}
14.4% 62.9% 4.615s 2.31e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
11.5% 74.4% 3.680s 3.68e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
5.0% 79.4% 1.599s 1.60e-02s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
3.1% 82.6% 1.002s 1.00e-02s Py 100 1 EditDistanceOp
2.7% 85.2% 0.851s 8.51e-03s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
1.0% 86.3% 0.330s 4.65e-05s C 7100 71 GpuDot22
1.0% 87.3% 0.319s 3.80e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.9% 88.2% 0.291s 1.82e-05s C 16000 160 HostFromGpu
0.6% 88.8% 0.207s 2.13e-05s C 9700 97 GpuElemwise{add,no_inplace}
0.5% 89.4% 0.172s 2.20e-05s C 7800 78 GpuElemwise{sub,no_inplace}
0.5% 89.9% 0.171s 3.57e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.5% 90.4% 0.155s 2.38e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.5% 90.9% 0.154s 2.49e-05s Py 6200 39 if{gpu}
0.5% 91.3% 0.150s 2.34e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.4% 91.8% 0.134s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.4% 92.2% 0.133s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.4% 92.6% 0.133s 2.29e-05s C 5800 58 GpuElemwise{Switch,no_inplace}
0.4% 93.0% 0.131s 1.99e-05s C 6600 66 GpuElemwise{Mul}[(0, 0)]
... (remaining 262 Ops account for 6.99%(2.23s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.2% 29.2% 9.321s 9.32e-02s 100 2406 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=c
input 2: dtype=float32, shape=(15, 10, 200), strides=c
input 3: dtype=float32, shape=(15, 10, 100), strides=c
input 4: dtype=float32, shape=(15, 10, 100), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(15, 10, 1), strides=c
input 7: dtype=float32, shape=(15, 10, 200), strides=c
input 8: dtype=float32, shape=(15, 10, 12), strides=c
input 9: dtype=float32, shape=(15, 10, 200), strides=c
input 10: dtype=float32, shape=(15, 10, 100), strides=c
input 11: dtype=float32, shape=(15, 10, 100), strides=c
input 12: dtype=float32, shape=(15, 10, 100), strides=c
input 13: dtype=float32, shape=(15, 10, 200), strides=c
input 14: dtype=float32, shape=(16, 10, 100), strides=c
input 15: dtype=float32, shape=(16, 10, 200), strides=c
input 16: dtype=float32, shape=(16, 10, 12), strides=c
input 17: dtype=float32, shape=(16, 10, 100), strides=c
input 18: dtype=float32, shape=(16, 10, 200), strides=c
input 19: dtype=float32, shape=(16, 10, 12), strides=c
input 20: dtype=float32, shape=(2, 100, 1), strides=c
input 21: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 22: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 23: dtype=float32, shape=(2, 100, 1), strides=c
input 24: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 25: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 26: dtype=int64, shape=(), strides=c
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=float32, shape=(100, 200), strides=c
input 35: dtype=float32, shape=(200, 200), strides=c
input 36: dtype=float32, shape=(100, 100), strides=c
input 37: dtype=float32, shape=(200, 100), strides=c
input 38: dtype=float32, shape=(100, 100), strides=c
input 39: dtype=float32, shape=(200, 200), strides=c
input 40: dtype=float32, shape=(200, 100), strides=c
input 41: dtype=float32, shape=(100, 100), strides=c
input 42: dtype=float32, shape=(100, 200), strides=c
input 43: dtype=float32, shape=(100, 100), strides=c
input 44: dtype=int64, shape=(2,), strides=c
input 45: dtype=float32, shape=(12, 10, 100), strides=c
input 46: dtype=int64, shape=(1,), strides=c
input 47: dtype=float32, shape=(12, 10), strides=c
input 48: dtype=float32, shape=(12, 10, 200), strides=c
input 49: dtype=float32, shape=(100, 1), strides=c
input 50: dtype=int8, shape=(10,), strides=c
input 51: dtype=float32, shape=(1, 100), strides=c
input 52: dtype=float32, shape=(100, 200), strides=c
input 53: dtype=float32, shape=(200, 200), strides=c
input 54: dtype=float32, shape=(100, 100), strides=c
input 55: dtype=float32, shape=(200, 100), strides=c
input 56: dtype=float32, shape=(100, 100), strides=c
input 57: dtype=float32, shape=(200, 200), strides=c
input 58: dtype=float32, shape=(200, 100), strides=c
input 59: dtype=float32, shape=(100, 100), strides=c
input 60: dtype=float32, shape=(100, 200), strides=c
input 61: dtype=float32, shape=(100, 100), strides=c
input 62: dtype=int64, shape=(2,), strides=c
input 63: dtype=float32, shape=(12, 10, 100), strides=c
input 64: dtype=int64, shape=(1,), strides=c
input 65: dtype=float32, shape=(12, 10), strides=c
input 66: dtype=float32, shape=(12, 10, 200), strides=c
input 67: dtype=float32, shape=(100, 1), strides=c
input 68: dtype=int8, shape=(10,), strides=c
input 69: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
output 6: dtype=float32, shape=(2, 100, 1), strides=c
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 9: dtype=float32, shape=(2, 100, 1), strides=c
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 12: dtype=float32, shape=(15, 10, 100), strides=c
output 13: dtype=float32, shape=(15, 10, 200), strides=c
output 14: dtype=float32, shape=(15, 10, 100), strides=c
output 15: dtype=float32, shape=(15, 100, 10), strides=c
output 16: dtype=float32, shape=(15, 10, 100), strides=c
output 17: dtype=float32, shape=(15, 10, 200), strides=c
output 18: dtype=float32, shape=(15, 10, 100), strides=c
output 19: dtype=float32, shape=(15, 100, 10), strides=c
19.3% 48.5% 6.165s 6.16e-02s 100 1795 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuD
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
input 2: dtype=float32, shape=(1, 10, 200), strides=c
input 3: dtype=float32, shape=(1, 92160), strides=c
input 4: dtype=float32, shape=(1, 10, 100), strides=c
input 5: dtype=float32, shape=(1, 10, 200), strides=c
input 6: dtype=float32, shape=(2, 92160), strides=c
input 7: dtype=int64, shape=(), strides=c
input 8: dtype=int64, shape=(), strides=c
input 9: dtype=float32, shape=(100, 44), strides=c
input 10: dtype=float32, shape=(200, 44), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(200, 200), strides=c
input 13: dtype=float32, shape=(45, 100), strides=c
input 14: dtype=float32, shape=(100, 200), strides=c
input 15: dtype=float32, shape=(100, 100), strides=c
input 16: dtype=float32, shape=(200, 100), strides=c
input 17: dtype=float32, shape=(100, 100), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(1, 44), strides=c
input 20: dtype=float32, shape=(1, 200), strides=c
input 21: dtype=float32, shape=(1, 100), strides=c
input 22: dtype=int64, shape=(1,), strides=c
input 23: dtype=float32, shape=(12, 10), strides=c
input 24: dtype=float32, shape=(12, 10, 200), strides=c
input 25: dtype=float32, shape=(100, 1), strides=c
input 26: dtype=int8, shape=(10,), strides=c
input 27: dtype=float32, shape=(12, 10, 100), strides=c
input 28: dtype=float32, shape=(12, 10, 200), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(1, 10, 100), strides=c
output 1: dtype=float32, shape=(1, 10, 200), strides=c
output 2: dtype=float32, shape=(1, 92160), strides=c
output 3: dtype=float32, shape=(1, 10, 100), strides=c
output 4: dtype=float32, shape=(1, 10, 200), strides=c
output 5: dtype=float32, shape=(2, 92160), strides=c
output 6: dtype=int64, shape=(15, 10), strides=c
output 7: dtype=int64, shape=(15, 10), strides=c
11.5% 60.0% 3.680s 3.68e-02s 100 2157 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=c
input 2: dtype=float32, shape=(15, 10, 100), strides=c
input 3: dtype=float32, shape=(15, 10, 1), strides=c
input 4: dtype=float32, shape=(15, 10, 200), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(16, 10, 100), strides=c
input 7: dtype=float32, shape=(16, 10, 200), strides=c
input 8: dtype=float32, shape=(16, 10, 12), strides=c
input 9: dtype=float32, shape=(16, 10, 100), strides=c
input 10: dtype=float32, shape=(16, 10, 200), strides=c
input 11: dtype=float32, shape=(16, 10, 12), strides=c
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=c
input 18: dtype=float32, shape=(12, 10, 100), strides=c
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=c
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=c
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=c
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
7.2% 67.2% 2.311s 2.31e-02s 100 2602 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
7.2% 74.4% 2.305s 2.30e-02s 100 2603 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
5.0% 79.4% 1.599s 1.60e-02s 100 1601 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(12, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(13, 10, 100), strides=c
3.1% 82.6% 1.002s 1.00e-02s 100 1861 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask10)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
2.7% 85.2% 0.851s 8.51e-03s 100 1611 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(13, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
0.0% 85.3% 0.011s 1.11e-04s 100 2572 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 85.3% 0.010s 1.05e-04s 100 2573 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 85.3% 0.008s 8.06e-05s 100 2356 GpuSplit{2}(GpuElemwise{mul,no_inplace}.0, TensorConstant{0}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(15, 10), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(14, 10), strides=c
output 1: dtype=float32, shape=(1, 10), strides=c
0.0% 85.4% 0.007s 7.49e-05s 100 1739 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 100), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
0.0% 85.4% 0.007s 7.41e-05s 100 1731 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 100), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
0.0% 85.4% 0.007s 7.37e-05s 100 1682 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 100), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
0.0% 85.4% 0.007s 7.11e-05s 100 2477 GpuCAReduce{pre=sqr,red=add}{1,1}(Assert{msg='Theano Assert failed!'}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.96e-05s 100 3110 GpuCAReduce{add}{1,1}(GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.84e-05s 100 2488 GpuCAReduce{pre=sqr,red=add}{1,1}(Assert{msg='Theano Assert failed!'}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.74e-05s 100 3370 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuElemwise{Switch,no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.71e-05s 100 3367 GpuCAReduce{add}{1,1}(GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.63e-05s 100 3565 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuElemwise{Switch,no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 3554 Apply instances account for 14.46%(4.62s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 57KB (61KB)
GPU: 4979KB (6661KB)
CPU + GPU: 5035KB (6721KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 56KB (61KB)
GPU: 6160KB (7107KB)
CPU + GPU: 6216KB (7167KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 115KB
GPU: 16958KB
CPU + GPU: 17073KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
1132320B [(1, 10, 100), (1, 10, 200), (1, 92160), (1, 10, 100), (1, 10, 200), (2, 92160), (15, 10), (15, 10)] i i i i i i c c forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0, GpuJoin.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
399360B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12)] i i i i i i forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan&generator_generate_scan}.5, ScalarFromTensor.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(1, 92160)] c GpuAllocEmpty(TensorConstant{1}, Shape_i{0}.0)
368640B [(1, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, GpuDimShuffle{x,0}.0, Constant{1})
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
200000B [(12, 10, 100), (13, 10, 100), (12, 10, 100), (13, 10, 100)] i i i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5)) - i3)}}.0, Elemwise{sub,no_inplace}.0, Elemwise{switch,no_inplace}.0, Elemwise{add,no_inplace}.0)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5)) - i3)}}.0, max_attended_length, generator_generate_batch_size, Elemwise{add,no_inplace}.0)
160000B [(200, 200)] v GpuDimShuffle{1,0}(W)
160000B [(200, 200)] i GpuElemwise{Mul}[(0, 0)](Assert{msg='Theano Assert failed!'}.0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] v GpuDimShuffle{1,0}(W)
160000B [(200, 200)] v Assert{msg='Theano Assert failed!'}(GpuDot22.0, Elemwise{eq,no_inplace}.0, Elemwise{Composite{EQ(i0, Switch(i1, (i2 // (-(i3 * i4 * i5))), i3))}}.0)
... (remaining 3554 Apply account for 51935459B/60721859B ((85.53%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
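
Two things stand out in this training-step profile. First, 82.1% of the thunk time is in the seven Scan ops, which run as Python ("Py" in the Class table): what their names suggest are the forward and gradient loops of the attention recurrent transition and of the encoder GRUs, plus the generation (sampling) loop. Each scan step dispatches its inner graph through the Python VM, and with per-step operands of only (10, 100) or (10, 200) float32 the GPU kernels are too small to hide that overhead; the per-step breakdown follows in the Scan Op profiling sections below. Second, the per-parameter elemwise composites in the Ops table are consistent with an Adam-style update; a NumPy sketch of the rule they appear to implement (the names lr, beta1, beta2, eps, t and the exact pairing of composite inputs are assumptions):

import numpy as np

def adam_step(p, g, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                           # Composite{((i0 * i1) + (i2 * i3))}
    v = beta2 * v + (1 - beta2) * np.square(g)                # Composite{((i0 * sqr(i1)) + (i2 * i3))}
    lr_t = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)    # Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))}
    p = p - lr_t * m / (np.sqrt(v) + eps)                     # Composite{((i0 * i1) / (sqrt(i2) + i3))}
    return p, m, v

The memory table also quantifies the usual allow_gc trade-off for this graph: disabling garbage collection (Theano flag allow_gc=False) would raise the GPU peak from about 4979KB to about 16958KB in exchange for fewer per-call allocations.
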
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 1.585470e+00s
Total time spent in calling the VM 1.543819e+00s (97.373%)
Total overhead (computing slices..) 4.165101e-02s (2.627%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
57.3% 57.3% 0.500s 5.21e-05s C 9600 8 theano.sandbox.cuda.blas.GpuGemm
39.2% 96.5% 0.343s 2.04e-05s C 16800 14 theano.sandbox.cuda.basic_ops.GpuElemwise
3.5% 100.0% 0.031s 3.19e-06s C 9600 8 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
57.3% 57.3% 0.500s 5.21e-05s C 9600 8 GpuGemm{no_inplace}
12.5% 69.7% 0.109s 2.27e-05s C 4800 4 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
10.8% 80.5% 0.094s 1.96e-05s C 4800 4 GpuElemwise{mul,no_inplace}
10.6% 91.1% 0.093s 1.93e-05s C 4800 4 GpuElemwise{ScalarSigmoid}[(0, 0)]
5.4% 96.5% 0.047s 1.96e-05s C 2400 2 GpuElemwise{sub,no_inplace}
1.9% 98.4% 0.017s 3.49e-06s C 4800 4 GpuSubtensor{::, :int64:}
1.6% 100.0% 0.014s 2.89e-06s C 4800 4 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
7.3% 7.3% 0.063s 5.28e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.2% 14.5% 0.063s 5.25e-05s 1200 4 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.2% 21.7% 0.063s 5.25e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.2% 28.8% 0.063s 5.21e-05s 1200 5 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.1% 36.0% 0.062s 5.18e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.1% 43.1% 0.062s 5.17e-05s 1200 23 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.1% 50.2% 0.062s 5.16e-05s 1200 25 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.1% 57.3% 0.062s 5.16e-05s 1200 24 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.2% 60.4% 0.028s 2.31e-05s 1200 26 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]23[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.1% 63.5% 0.027s 2.27e-05s 1200 27 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]23[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.1% 66.7% 0.027s 2.26e-05s 1200 28 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.1% 69.7% 0.027s 2.24e-05s 1200 29 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.9% 72.6% 0.025s 2.09e-05s 1200 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
2.7% 75.3% 0.024s 1.97e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 78.0% 0.024s 1.96e-05s 1200 21 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 80.7% 0.023s 1.96e-05s 1200 20 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 83.4% 0.023s 1.95e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 86.1% 0.023s 1.95e-05s 1200 6 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.7% 88.7% 0.023s 1.93e-05s 1200 8 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.6% 91.3% 0.023s 1.92e-05s 1200 7 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
... (remaining 10 Apply instances account for 8.66%(0.08s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 27KB (51KB)
CPU + GPU: 27KB (51KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 27KB (51KB)
CPU + GPU: 27KB (51KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 78KB
CPU + GPU: 78KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
... (remaining 10 Apply account for 32080B/144080B ((22.27%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
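The memory profiles in this dump quote two Theano settings, allow_gc=False and optimizer_excluding=inplace, when estimating alternative peak-memory figures. As a minimal sketch (assuming the profiled functions are compiled with theano.function in the usual way; the particular values below are illustrative, not the ones used for this run), these settings can be changed before compilation, or passed through the THEANO_FLAGS environment variable:

    import theano

    # Enable the per-function profiles that produced this output, and try the
    # alternative settings the memory profile refers to. These are standard
    # Theano config options; the chosen values are only an example.
    theano.config.profile = True     # print Function/Scan profiles like the ones above
    theano.config.allow_gc = False   # corresponds to the "if allow_gc=False" peak-memory lines
    # The "optimizer_excluding=inplace" estimate is most easily reproduced via the
    # environment, e.g. THEANO_FLAGS='profile=True,optimizer_excluding=inplace'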
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 8.424067e-01s
Total time spent in calling the VM 8.255837e-01s (98.003%)
Total overhead (computing slices..) 1.682305e-02s (1.997%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.5% 54.5% 0.248s 5.17e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm
42.3% 96.7% 0.193s 2.01e-05s C 9600 8 theano.sandbox.cuda.basic_ops.GpuElemwise
3.3% 100.0% 0.015s 3.10e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
54.5% 54.5% 0.248s 5.17e-05s C 4800 4 GpuGemm{no_inplace}
11.7% 66.2% 0.054s 2.23e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
10.3% 76.5% 0.047s 1.95e-05s C 2400 2 GpuElemwise{mul,no_inplace}
10.2% 86.7% 0.047s 1.94e-05s C 2400 2 GpuElemwise{sub,no_inplace}
10.0% 96.7% 0.046s 1.91e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
1.8% 98.5% 0.008s 3.35e-06s C 2400 2 GpuSubtensor{::, :int64:}
1.5% 100.0% 0.007s 2.85e-06s C 2400 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
13.8% 13.8% 0.063s 5.24e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.6% 27.4% 0.062s 5.17e-05s 1200 3 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.6% 40.9% 0.062s 5.15e-05s 1200 13 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
13.5% 54.5% 0.062s 5.14e-05s 1200 12 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.9% 60.4% 0.027s 2.25e-05s 1200 14 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.8% 66.2% 0.027s 2.22e-05s 1200 15 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.4% 71.6% 0.025s 2.04e-05s 1200 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
5.1% 76.7% 0.023s 1.95e-05s 1200 10 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.1% 81.9% 0.023s 1.95e-05s 1200 11 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.0% 86.9% 0.023s 1.91e-05s 1200 4 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
5.0% 91.9% 0.023s 1.90e-05s 1200 5 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.8% 96.7% 0.022s 1.84e-05s 1200 2 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
0.9% 97.6% 0.004s 3.39e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
0.9% 98.5% 0.004s 3.31e-06s 1200 8 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
0.8% 99.3% 0.004s 2.97e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
0.7% 100.0% 0.003s 2.74e-06s 1200 9 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 39KB
CPU + GPU: 39KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
... (remaining 2 Apply account for 80B/72080B ((0.11%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
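Reading the gatedrecurrent_apply_scan body above: the GpuGemm{no_inplace} pairs, the ScalarSigmoid, the GpuSubtensor slices and the fused Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))} together form one masked GRU step per direction. A minimal NumPy sketch of the same arithmetic, under that reading (the names and the dim=100 split are inferred from the printed shapes, not taken from the model code):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(inputs, gate_inputs, h_prev, W_state, W_gates, mask, dim=100):
        # GpuGemm + ScalarSigmoid: gates = sigmoid(gate_inputs + h_prev . state_to_gates)
        gates = sigmoid(gate_inputs + h_prev.dot(W_gates))            # (batch, 2*dim)
        update, reset = gates[:, :dim], gates[:, dim:]                # the GpuSubtensor slices
        # GpuElemwise{mul} + GpuGemm: candidate = tanh(inputs + (h_prev * reset) . state_to_state)
        candidate = np.tanh(inputs + (h_prev * reset).dot(W_state))   # tanh(i1) in the Composite
        h_new = candidate * update + h_prev * (1.0 - update)          # (tanh(i1)*i2 + i3*(1 - i2))
        # i0 / i5 in the Composite: a per-sequence mask column keeps h_prev where mask == 0
        return mask * h_new + (1.0 - mask) * h_prev

With batch size 10, dim 100 and a (10, 1) mask this reproduces the (10, 100) and (10, 200) shapes printed for the Apply nodes above.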
Scan Op profiling ( generator_generate_scan&generator_generate_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 6.135115e+00s
Total time spent in calling the VM 5.882160e+00s (95.877%)
Total overhead (computing slices..) 2.529552e-01s (4.123%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
26.5% 26.5% 0.788s 2.02e-05s C 39000 26 theano.sandbox.cuda.basic_ops.GpuElemwise
22.6% 49.0% 0.672s 4.48e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm
19.1% 68.1% 0.569s 3.79e-05s C 15000 10 theano.sandbox.cuda.blas.GpuDot22
10.9% 79.0% 0.325s 2.16e-05s C 15000 10 theano.sandbox.cuda.basic_ops.GpuCAReduce
4.5% 83.6% 0.135s 4.51e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
4.0% 87.5% 0.118s 3.93e-05s C 3000 2 theano.sandbox.rng_mrg.GPU_mrg_uniform
3.8% 91.3% 0.113s 1.89e-05s C 6000 4 theano.sandbox.cuda.basic_ops.HostFromGpu
1.8% 93.1% 0.052s 2.49e-06s C 21000 14 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.7% 94.8% 0.050s 1.67e-05s C 3000 2 theano.tensor.basic.MaxAndArgmax
1.1% 95.9% 0.034s 2.25e-06s C 15000 10 theano.compile.ops.Shape_i
1.0% 96.9% 0.029s 3.17e-06s C 9000 6 theano.sandbox.cuda.basic_ops.GpuReshape
0.8% 97.7% 0.023s 1.55e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuFromHost
0.8% 98.4% 0.023s 1.90e-06s C 12000 8 theano.tensor.opt.MakeVector
0.7% 99.1% 0.020s 3.31e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.4% 99.5% 0.013s 4.27e-06s C 3000 2 theano.sandbox.multinomial.MultinomialFromUniform
0.3% 99.8% 0.009s 2.11e-06s C 4500 3 theano.tensor.elemwise.Elemwise
0.2% 100.0% 0.005s 3.28e-06s C 1500 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.6% 22.6% 0.672s 4.48e-05s C 15000 10 GpuGemm{inplace}
19.1% 41.7% 0.569s 3.79e-05s C 15000 10 GpuDot22
4.5% 46.2% 0.135s 4.51e-05s C 3000 2 GpuAdvancedSubtensor1
4.2% 50.4% 0.124s 2.07e-05s C 6000 4 GpuElemwise{mul,no_inplace}
4.0% 54.3% 0.118s 3.93e-05s C 3000 2 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
3.8% 58.1% 0.113s 1.89e-05s C 6000 4 HostFromGpu
2.6% 60.7% 0.076s 2.54e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
2.5% 63.2% 0.076s 2.52e-05s C 3000 2 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
2.3% 65.6% 0.070s 2.33e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
2.1% 67.7% 0.063s 2.10e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.1% 69.8% 0.062s 2.07e-05s C 3000 2 GpuElemwise{add,no_inplace}
2.1% 71.8% 0.061s 2.04e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.0% 73.9% 0.061s 2.02e-05s C 3000 2 GpuCAReduce{maximum}{0,1}
2.0% 75.9% 0.060s 1.99e-05s C 3000 2 GpuCAReduce{add}{1,0}
2.0% 77.9% 0.059s 1.96e-05s C 3000 2 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
2.0% 79.8% 0.059s 1.96e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
1.9% 81.8% 0.058s 1.93e-05s C 3000 2 GpuCAReduce{add}{0,1}
1.9% 83.7% 0.057s 1.91e-05s C 3000 2 GpuElemwise{Add}[(0, 1)]
1.9% 85.6% 0.057s 1.91e-05s C 3000 2 GpuElemwise{Add}[(0, 0)]
1.9% 87.5% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
... (remaining 21 Ops account for 12.45%(0.37s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
2.4% 2.4% 0.073s 4.84e-05s 1500 20 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.4% 4.9% 0.072s 4.82e-05s 1500 21 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.4% 7.3% 0.072s 4.81e-05s 1500 75 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 9.7% 0.072s 4.80e-05s 1500 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 12.1% 0.071s 4.75e-05s 1500 16 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.4% 14.5% 0.071s 4.74e-05s 1500 18 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.3% 16.8% 0.069s 4.59e-05s 1500 57 GpuAdvancedSubtensor1(W_copy01[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(10,), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.2% 19.0% 0.067s 4.44e-05s 1500 59 GpuAdvancedSubtensor1(W_copy01[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(10,), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.1% 21.2% 0.064s 4.25e-05s 1500 1 GpuDot22(generator_initial_states_states[t-1]01[cuda], W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.0% 23.2% 0.061s 4.05e-05s 1500 64 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.0% 25.3% 0.061s 4.04e-05s 1500 63 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.0% 27.3% 0.059s 3.96e-05s 1500 77 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.0% 29.2% 0.059s 3.95e-05s 1500 78 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.0% 31.2% 0.059s 3.94e-05s 1500 28 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(10,), strides=c
2.0% 33.2% 0.059s 3.92e-05s 1500 26 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(10,), strides=c
1.9% 35.1% 0.058s 3.84e-05s 1500 8 GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.9% 37.1% 0.057s 3.83e-05s 1500 3 GpuDot22(generator_initial_states_states[t-1]01[cuda], W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
1.9% 39.0% 0.057s 3.82e-05s 1500 9 GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.9% 40.9% 0.056s 3.72e-05s 1500 82 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.9% 42.7% 0.056s 3.72e-05s 1500 81 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 95 Apply instances account for 57.27%(1.71s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 9KB (9KB)
GPU: 837KB (923KB)
CPU + GPU: 846KB (932KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 9KB (9KB)
GPU: 837KB (923KB)
CPU + GPU: 846KB (932KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 11KB
GPU: 1080KB
CPU + GPU: 1091KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace01[cuda])
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace01[cuda])
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace01[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace01[cuda])
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
... (remaining 95 Apply account for 138458B/1515818B ((9.13%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
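The generator_generate_scan above mixes the readout with sampling machinery: GpuCAReduce{maximum} / Composite{exp((i0 - i1))} / Composite{(i0 + log(i1))} are the pieces of a numerically stable softmax and log-probability, GPU_mrg_uniform draws uniform numbers, and MultinomialFromUniform turns them into one-hot samples whose indices feed the GpuAdvancedSubtensor1 embedding lookup. A minimal NumPy sketch of that sampling path, under that reading (function and argument names are illustrative only):

    import numpy as np

    def sample_outputs(energies, u):
        # stable softmax over the class axis (the maximum / exp(i0 - i1) / add kernels)
        shifted = energies - energies.max(axis=1, keepdims=True)
        expd = np.exp(shifted)
        probs = expd / expd.sum(axis=1, keepdims=True)
        # MultinomialFromUniform: one uniform per row picks a class from the CDF
        cdf = np.cumsum(probs, axis=1)
        idx = (u[:, None] >= cdf).sum(axis=1)          # index of the sampled class
        onehot = np.zeros_like(probs)
        onehot[np.arange(len(u)), idx] = 1.0
        return onehot, idx                             # idx then indexes the embedding matrix

With a (10, 44) readout and 10 uniforms per step this matches the shapes printed for the multinomial and GpuAdvancedSubtensor1 nodes.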
Scan Op profiling ( attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 3.657264e+00s
Total time spent in calling the VM 3.536357e+00s (96.694%)
Total overhead (computing slices..) 1.209071e-01s (3.306%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
34.1% 34.1% 0.573s 2.01e-05s C 28500 19 theano.sandbox.cuda.basic_ops.GpuElemwise
26.7% 60.8% 0.448s 3.73e-05s C 12000 8 theano.sandbox.cuda.blas.GpuDot22
17.2% 77.9% 0.289s 4.81e-05s C 6000 4 theano.sandbox.cuda.blas.GpuGemm
11.4% 89.3% 0.191s 2.13e-05s C 9000 6 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.7% 92.1% 0.046s 1.53e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuFromHost
2.5% 94.6% 0.042s 2.33e-06s C 18000 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.2% 95.7% 0.020s 2.20e-06s C 9000 6 theano.compile.ops.Shape_i
1.1% 96.9% 0.019s 3.20e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.1% 98.0% 0.019s 3.11e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuReshape
0.8% 98.8% 0.013s 2.15e-06s C 6000 4 theano.tensor.elemwise.Elemwise
0.7% 99.4% 0.011s 1.84e-06s C 6000 4 theano.tensor.opt.MakeVector
0.6% 100.0% 0.010s 3.21e-06s C 3000 2 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
26.7% 26.7% 0.448s 3.73e-05s C 12000 8 GpuDot22
17.2% 43.8% 0.289s 4.81e-05s C 6000 4 GpuGemm{inplace}
7.3% 51.1% 0.122s 2.04e-05s C 6000 4 GpuElemwise{mul,no_inplace}
4.3% 55.4% 0.072s 2.42e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
4.2% 59.7% 0.071s 2.36e-05s C 3000 2 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}
3.6% 63.3% 0.061s 2.02e-05s C 3000 2 GpuElemwise{add,no_inplace}
3.6% 66.9% 0.060s 2.01e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
3.6% 70.4% 0.060s 2.00e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
3.5% 74.0% 0.059s 1.97e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
3.5% 77.4% 0.059s 1.96e-05s C 3000 2 GpuCAReduce{add}{1,0}
3.4% 80.9% 0.058s 1.92e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
3.3% 84.2% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Add}[(0, 0)]
3.3% 87.5% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Tanh}[(0, 0)]
2.7% 90.3% 0.046s 1.53e-05s C 3000 2 GpuFromHost
1.8% 92.1% 0.030s 2.03e-05s C 1500 1 GpuElemwise{sub,no_inplace}
1.1% 93.2% 0.019s 3.11e-06s C 6000 4 GpuReshape{2}
0.8% 94.0% 0.014s 2.35e-06s C 6000 4 GpuDimShuffle{x,0}
0.7% 94.7% 0.011s 1.84e-06s C 6000 4 MakeVector{dtype='int64'}
0.6% 95.3% 0.011s 3.54e-06s C 3000 2 GpuSubtensor{::, :int64:}
0.6% 95.9% 0.010s 3.21e-06s C 3000 2 InplaceDimShuffle{x,0}
... (remaining 10 Ops account for 4.11%(0.07s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.3% 4.3% 0.073s 4.86e-05s 1500 14 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.3% 8.6% 0.072s 4.82e-05s 1500 17 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.3% 12.9% 0.072s 4.78e-05s 1500 35 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.3% 17.2% 0.072s 4.78e-05s 1500 36 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.5% 20.6% 0.058s 3.87e-05s 1500 11 GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.4% 24.0% 0.057s 3.82e-05s 1500 5 GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.3% 27.4% 0.056s 3.74e-05s 1500 39 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 30.7% 0.056s 3.71e-05s 1500 33 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 34.0% 0.056s 3.70e-05s 1500 34 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 37.3% 0.055s 3.69e-05s 1500 40 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 40.6% 0.055s 3.67e-05s 1500 50 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=c
3.3% 43.8% 0.055s 3.66e-05s 1500 49 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=c
2.2% 46.0% 0.036s 2.43e-05s 1500 53 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=float32, shape=(10,), strides=c
2.1% 48.2% 0.036s 2.40e-05s 1500 54 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=float32, shape=(10,), strides=c
2.1% 50.3% 0.036s 2.37e-05s 1500 37 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.1% 52.4% 0.035s 2.35e-05s 1500 38 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace0[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.9% 54.3% 0.032s 2.13e-05s 1500 71 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.9% 56.2% 0.032s 2.11e-05s 1500 72 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.8% 58.0% 0.031s 2.04e-05s 1500 43 GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
1.8% 59.8% 0.030s 2.03e-05s 1500 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
... (remaining 55 Apply instances account for 40.20%(0.68s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 118KB (204KB)
CPU + GPU: 118KB (204KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 118KB (204KB)
CPU + GPU: 118KB (204KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 345KB
CPU + GPU: 345KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace1[cuda])
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace0[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace0[cuda], GpuGemm{inplace}.0)
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace1[cuda], GpuGemm{inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
... (remaining 55 Apply account for 62508B/710508B ((8.80%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
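In the attentionrecurrent_do_apply_scan above, the GpuElemwise{add} / Tanh / GpuDot22 chain followed by GpuCAReduce{maximum}{1,0}, the exp and TrueDiv kernels, and GpuCAReduce{add}{1,0,0} is a content-based attention step: score every source position against the current state, softmax over time, then take a weighted average of the attended sequence. A minimal NumPy sketch of that step, under that reading (the (T=12, B=10) shapes come from the printout; the function itself is illustrative):

    import numpy as np

    def attention_step(preprocessed_attended, attended, state_proj, v):
        # preprocessed_attended: (T, B, dim), attended: (T, B, attended_dim),
        # state_proj: (B, dim) from the GpuGemm/GpuDot22 nodes, v: (dim,)
        energies = np.tanh(preprocessed_attended + state_proj[None]).dot(v)   # (T, B)
        energies -= energies.max(axis=0)          # GpuCAReduce{maximum}{1,0}: stable softmax
        weights = np.exp(energies)
        weights /= weights.sum(axis=0)            # the TrueDiv kernel
        # GpuElemwise{mul} + GpuCAReduce{add}{1,0,0}: weighted average over time
        return (weights[:, :, None] * attended).sum(axis=0)                   # (B, attended_dim)

With T=12, B=10, dim=100 and attended_dim=200 this reproduces the (12, 10, 200) and (10, 200) shapes in the memory profile.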
Scan Op profiling ( grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 9.275899e+00s
Total time spent in calling the VM 9.022675e+00s (97.270%)
Total overhead (computing slices..) 2.532237e-01s (2.730%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
40.7% 40.7% 1.753s 1.98e-05s C 88500 59 theano.sandbox.cuda.basic_ops.GpuElemwise
20.9% 61.6% 0.901s 3.76e-05s C 24000 16 theano.sandbox.cuda.blas.GpuDot22
17.1% 78.7% 0.735s 4.90e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm
8.7% 87.4% 0.377s 2.09e-05s C 18000 12 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.8% 90.2% 0.122s 2.03e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
2.3% 92.6% 0.099s 2.37e-06s C 42000 28 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.0% 94.5% 0.085s 1.41e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuFromHost
1.3% 95.8% 0.057s 1.90e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.1% 97.0% 0.049s 3.25e-06s C 15000 10 theano.sandbox.cuda.basic_ops.GpuReshape
0.8% 97.8% 0.036s 2.00e-06s C 18000 12 theano.compile.ops.Shape_i
0.7% 98.5% 0.028s 2.36e-06s C 12000 8 theano.tensor.elemwise.Elemwise
0.6% 99.0% 0.024s 1.99e-06s C 12000 8 theano.tensor.opt.MakeVector
0.5% 99.5% 0.022s 3.62e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.5% 100.0% 0.020s 3.38e-06s C 6000 4 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
20.9% 20.9% 0.901s 3.76e-05s C 24000 16 GpuDot22
13.2% 34.1% 0.568s 4.74e-05s C 12000 8 GpuGemm{inplace}
6.9% 41.1% 0.299s 1.99e-05s C 15000 10 GpuElemwise{mul,no_inplace}
4.1% 45.2% 0.176s 1.96e-05s C 9000 6 GpuCAReduce{add}{1,0}
3.9% 49.0% 0.168s 1.86e-05s C 9000 6 GpuElemwise{Add}[(0, 0)]
3.9% 52.9% 0.167s 5.55e-05s C 3000 2 GpuGemm{no_inplace}
2.8% 55.7% 0.119s 1.98e-05s C 6000 4 GpuElemwise{Add}[(0, 1)]
2.6% 58.3% 0.114s 1.90e-05s C 6000 4 GpuElemwise{add,no_inplace}
2.0% 60.3% 0.085s 1.41e-05s C 6000 4 GpuFromHost
1.8% 62.1% 0.078s 2.61e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
1.6% 63.7% 0.070s 2.32e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
1.5% 65.3% 0.067s 2.23e-05s C 3000 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
1.5% 66.8% 0.067s 2.22e-05s C 3000 2 GpuElemwise{Composite{((((i0 / i1) + i2) * i3) * i4)}}[(0, 0)]
1.5% 68.3% 0.064s 2.14e-05s C 3000 2 GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}
1.5% 69.8% 0.063s 2.11e-05s C 3000 2 GpuIncSubtensor{InplaceInc;::, int64::}
1.4% 71.2% 0.062s 2.07e-05s C 3000 2 GpuElemwise{Composite{((-(i0 * i1)) / i2)},no_inplace}
1.4% 72.6% 0.061s 2.05e-05s C 3000 2 GpuCAReduce{add}{0,0,1}
1.4% 74.1% 0.061s 2.03e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
1.4% 75.5% 0.061s 2.02e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
1.4% 76.9% 0.060s 2.02e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 33 Ops account for 23.13%(1.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
2.0% 2.0% 0.085s 5.69e-05s 1500 151 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states1[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.9% 3.9% 0.081s 5.41e-05s 1500 152 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states0[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.8% 5.6% 0.077s 5.10e-05s 1500 172 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.8% 7.4% 0.076s 5.09e-05s 1500 174 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.7% 9.1% 0.073s 4.86e-05s 1500 83 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.7% 10.8% 0.073s 4.85e-05s 1500 27 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.7% 12.5% 0.073s 4.84e-05s 1500 36 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.7% 14.2% 0.072s 4.80e-05s 1500 81 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.6% 15.8% 0.070s 4.70e-05s 1500 171 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.6% 17.4% 0.070s 4.69e-05s 1500 173 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.5% 18.9% 0.063s 4.20e-05s 1500 132 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=(1, 100)
input 1: dtype=float32, shape=(120, 1), strides=(1, 0)
output 0: dtype=float32, shape=(100, 1), strides=(1, 0)
1.5% 20.3% 0.063s 4.19e-05s 1500 2 GpuDot22(transition_apply_states_replace1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.5% 21.8% 0.063s 4.18e-05s 1500 177 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.5% 23.3% 0.063s 4.17e-05s 1500 175 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.4% 24.6% 0.059s 3.95e-05s 1500 79 GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=(1, 0)
1.4% 26.0% 0.058s 3.90e-05s 1500 160 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.4% 27.3% 0.058s 3.90e-05s 1500 159 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.3% 28.7% 0.058s 3.87e-05s 1500 9 GpuDot22(transform_states_apply_input__replace1[cuda], W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.3% 30.0% 0.058s 3.87e-05s 1500 22 GpuDot22(transform_states_apply_input__replace0[cuda], W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.3% 31.4% 0.057s 3.83e-05s 1500 16 GpuDot22(transition_apply_states_replace0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
... (remaining 161 Apply instances account for 68.63%(2.96s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 369KB (376KB)
CPU + GPU: 369KB (377KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 393KB (377KB)
CPU + GPU: 393KB (378KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 796KB
CPU + GPU: 796KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuReshape{3}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuReshape{2}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)](GpuDimShuffle{0,1,2}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)](GpuDimShuffle{0,1,2}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] c GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] c GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuReshape{2}.0)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
... (remaining 161 Apply account for 443232B/1595232B ((27.78%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
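
For reference, a "Function profiling" block like the one above is what Theano emits when a compiled function has profiling enabled. Below is a minimal sketch of producing one; the toy graph, variable names and shapes are illustrative only and have nothing to do with the actual model profiled in this log.

import numpy
import theano
import theano.tensor as T

# Toy graph standing in for the real model (illustrative only).
x = T.fmatrix('x')
W = theano.shared(numpy.zeros((100, 200), dtype='float32'), name='W')
y = T.tanh(T.dot(x, W)).sum()

# profile=True attaches a ProfileStats object to the compiled function;
# the Class / Ops / Apply tables above are printed from its summary.
f = theano.function([x], y, profile=True)
f(numpy.ones((10, 100), dtype='float32'))
f.profile.summary()  # prints a "Function profiling" report like the blocks in this log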
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.294225e+00s
Total time spent in calling the VM 2.171460e+00s (94.649%)
Total overhead (computing slices..) 1.227655e-01s (5.351%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
44.1% 44.1% 0.505s 1.91e-05s C 26400 22 theano.sandbox.cuda.basic_ops.GpuElemwise
32.8% 76.9% 0.375s 5.21e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
8.3% 85.2% 0.095s 1.97e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
8.1% 93.3% 0.093s 3.88e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
3.9% 97.2% 0.044s 1.84e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.4% 98.6% 0.016s 3.41e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.9% 99.5% 0.010s 2.16e-06s C 4800 4 theano.compile.ops.Shape_i
0.5% 100.0% 0.005s 2.19e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.1% 22.1% 0.253s 5.28e-05s C 4800 4 GpuGemm{no_inplace}
12.2% 34.4% 0.140s 1.94e-05s C 7200 6 GpuElemwise{mul,no_inplace}
10.7% 45.0% 0.122s 5.09e-05s C 2400 2 GpuGemm{inplace}
8.1% 53.2% 0.093s 3.88e-05s C 2400 2 GpuDot22
4.4% 57.6% 0.051s 2.11e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
4.3% 61.8% 0.049s 2.03e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
4.2% 66.1% 0.048s 2.02e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.1% 70.1% 0.047s 1.94e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.0% 74.2% 0.046s 1.92e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
3.9% 78.1% 0.045s 1.88e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
3.9% 82.0% 0.045s 1.86e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
3.9% 85.9% 0.044s 1.84e-05s C 2400 2 GpuAlloc{memset_0=True}
3.8% 89.7% 0.044s 1.82e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
3.8% 93.5% 0.044s 1.82e-05s C 2400 2 GpuElemwise{sub,no_inplace}
3.7% 97.2% 0.042s 1.77e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
0.7% 97.9% 0.008s 3.51e-06s C 2400 2 GpuSubtensor{::, int64::}
0.7% 98.6% 0.008s 3.31e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.5% 99.1% 0.006s 2.39e-06s C 2400 2 Shape_i{1}
0.5% 99.6% 0.005s 2.19e-06s C 2400 2 GpuDimShuffle{1,0}
0.4% 100.0% 0.005s 1.94e-06s C 2400 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.8% 5.8% 0.066s 5.50e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.5% 11.3% 0.063s 5.26e-05s 1200 7 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.4% 16.7% 0.062s 5.19e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.4% 22.1% 0.062s 5.15e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.3% 27.5% 0.061s 5.09e-05s 1200 42 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.3% 32.8% 0.061s 5.08e-05s 1200 43 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 36.9% 0.047s 3.89e-05s 1200 30 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 40.9% 0.046s 3.87e-05s 1200 31 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 43.1% 0.025s 2.12e-05s 1200 44 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 45.3% 0.025s 2.10e-05s 1200 45 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 47.5% 0.025s 2.05e-05s 1200 36 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 49.6% 0.025s 2.05e-05s 1200 26 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 51.8% 0.024s 2.02e-05s 1200 37 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 53.8% 0.024s 1.99e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 55.9% 0.024s 1.99e-05s 1200 27 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 58.0% 0.024s 1.98e-05s 1200 11 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 60.1% 0.024s 1.98e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 62.1% 0.023s 1.95e-05s 1200 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 64.1% 0.023s 1.93e-05s 1200 38 GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.0% 66.1% 0.023s 1.92e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
... (remaining 26 Apply instances account for 33.85%(0.39s) of the runtime)
Memory Profile
(Sparse variables are ignored)
    (For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
    Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
4000B [(10, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuGemm{no_inplace}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
4000B [(100, 10)] v GpuDimShuffle{1,0}(GpuElemwise{mul,no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
... (remaining 26 Apply account for 80112B/208112B ((38.49%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
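
The "Scan Op profiling" blocks (this one and the near-identical one below) report the time spent inside the virtual machine that executes each step of a Scan node. Below is a minimal sketch of requesting such a per-scan profile; it assumes the usual name= and profile=True arguments of theano.scan, and the step function is a toy stand-in, not the actual gatedrecurrent_apply step.

import theano
import theano.tensor as T

xs = T.fmatrix('xs')   # (n_steps, dim) input sequence
h0 = T.fvector('h0')   # initial state
W = T.fmatrix('W')     # recurrent weights

def step(x_t, h_tm1, W):
    # Toy recurrence standing in for the real gated-recurrent step.
    return T.tanh(x_t + T.dot(h_tm1, W))

# name= labels the "Scan Op profiling ( ... )" header; profile=True asks
# the Scan op to time its inner VM, as in the report above.
hs, _ = theano.scan(step,
                    sequences=xs,
                    outputs_info=h0,
                    non_sequences=W,
                    name='gatedrecurrent_apply_scan',
                    profile=True)
f = theano.function([xs, h0, W], hs, profile=True)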
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.288652e+00s
Total time spent in calling the VM 2.166391e+00s (94.658%)
Total overhead (computing slices..) 1.222615e-01s (5.342%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
44.0% 44.0% 0.503s 1.91e-05s C 26400 22 theano.sandbox.cuda.basic_ops.GpuElemwise
32.9% 76.9% 0.376s 5.22e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
8.3% 85.2% 0.094s 1.97e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
8.2% 93.4% 0.093s 3.89e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
3.9% 97.2% 0.044s 1.84e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.4% 98.7% 0.016s 3.43e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.9% 99.6% 0.010s 2.12e-06s C 4800 4 theano.compile.ops.Shape_i
0.4% 100.0% 0.005s 2.11e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.2% 22.2% 0.254s 5.28e-05s C 4800 4 GpuGemm{no_inplace}
12.2% 34.4% 0.140s 1.94e-05s C 7200 6 GpuElemwise{mul,no_inplace}
10.7% 45.1% 0.122s 5.09e-05s C 2400 2 GpuGemm{inplace}
8.2% 53.3% 0.093s 3.89e-05s C 2400 2 GpuDot22
4.4% 57.7% 0.051s 2.11e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
4.3% 62.0% 0.049s 2.03e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
4.2% 66.2% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.0% 70.3% 0.046s 1.93e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.0% 74.3% 0.046s 1.91e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
3.9% 78.2% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
3.9% 82.0% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
3.9% 85.9% 0.044s 1.84e-05s C 2400 2 GpuAlloc{memset_0=True}
3.8% 89.7% 0.044s 1.82e-05s C 2400 2 GpuElemwise{sub,no_inplace}
3.8% 93.5% 0.044s 1.82e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
3.7% 97.2% 0.042s 1.75e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
0.7% 98.0% 0.008s 3.54e-06s C 2400 2 GpuSubtensor{::, int64::}
0.7% 98.7% 0.008s 3.32e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.5% 99.2% 0.006s 2.38e-06s C 2400 2 Shape_i{1}
0.4% 99.6% 0.005s 2.11e-06s C 2400 2 GpuDimShuffle{1,0}
0.4% 100.0% 0.004s 1.87e-06s C 2400 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.8% 5.8% 0.066s 5.52e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.5% 11.3% 0.063s 5.27e-05s 1200 7 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.4% 16.8% 0.062s 5.19e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.4% 22.2% 0.062s 5.16e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.4% 27.5% 0.061s 5.10e-05s 1200 43 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.3% 32.9% 0.061s 5.09e-05s 1200 42 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 37.0% 0.047s 3.89e-05s 1200 30 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 41.1% 0.047s 3.88e-05s 1200 31 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 43.3% 0.025s 2.12e-05s 1200 44 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 45.5% 0.025s 2.11e-05s 1200 45 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 47.6% 0.025s 2.05e-05s 1200 36 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 49.8% 0.024s 2.04e-05s 1200 26 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 51.9% 0.024s 2.00e-05s 1200 37 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 54.0% 0.024s 1.99e-05s 1200 27 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 56.1% 0.024s 1.97e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 58.1% 0.024s 1.97e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 60.2% 0.023s 1.95e-05s 1200 11 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.0% 62.2% 0.023s 1.93e-05s 1200 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 64.2% 0.023s 1.93e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 66.3% 0.023s 1.92e-05s 1200 38 GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
... (remaining 26 Apply instances account for 33.75%(0.39s) of the runtime)
Memory Profile
(Sparse variables are ignored)
    (For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
    Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)](GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, gatedrecurrent_apply_states_replace0[cuda])
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
... (remaining 26 Apply account for 80112B/208112B ((38.49%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
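
The memory profiles above compare the current configuration against optimizer_excluding=inplace and allow_gc=False, and the block that follows sums every profile printed at process exit. These behaviours are controlled by Theano configuration flags; the sketch below shows one way to set them before import. The flag names are standard Theano config, but the particular combination is only an example, not necessarily the configuration used for this log.

import os

# Flags must be set before theano is imported to take effect.
os.environ['THEANO_FLAGS'] = ','.join([
    'profile=True',                 # print "Function profiling" blocks at exit
    'profile_memory=True',          # include the "Memory Profile" sections
    'allow_gc=False',               # keep intermediates alive; compare the peak-memory line above
    'optimizer_excluding=inplace',  # the variant measured in the memory tables
])

import theano
print(theano.config.allow_gc)  # -> False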
Function profiling
==================
Message: Sum of all(17) printed profiles at exit excluding Scan op profile.
Time in 6938 calls to Function.__call__: 1.028157e+02s
Time in Function.fn.__call__: 1.024500e+02s (99.644%)
Time in thunks: 4.343875e+01s (42.249%)
Total compile time: 6.253434e+02s
Number of Apply nodes: 0
Theano Optimizer time: 2.134617e+02s
Theano validate time: 4.772263e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.980593e+02s
Import time 1.529284e+01s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 834.193s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
66.5% 66.5% 28.871s 3.42e-02s Py 844 11 theano.scan_module.scan_op.Scan
21.8% 88.2% 9.454s 5.87e-02s Py 161 2 lvsr.ops.EditDistanceOp
4.3% 92.5% 1.874s 2.16e-05s C 86731 877 theano.sandbox.cuda.basic_ops.GpuElemwise
1.8% 94.3% 0.779s 3.05e-05s C 25580 252 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.9% 95.2% 0.395s 4.64e-05s C 8505 86 theano.sandbox.cuda.blas.GpuDot22
0.8% 96.0% 0.340s 3.54e-06s C 96048 1098 theano.tensor.elemwise.Elemwise
0.7% 96.7% 0.313s 1.81e-05s C 17247 197 theano.sandbox.cuda.basic_ops.HostFromGpu
0.4% 97.2% 0.180s 2.53e-05s C 7127 75 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.6% 0.168s 2.24e-05s Py 7505 51 theano.ifelse.IfElse
0.3% 97.9% 0.151s 3.26e-06s C 46180 473 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 98.2% 0.150s 2.61e-05s C 5766 61 theano.sandbox.cuda.basic_ops.GpuAlloc
0.3% 98.6% 0.137s 7.11e-06s C 19212 205 theano.sandbox.cuda.basic_ops.GpuReshape
0.3% 98.9% 0.127s 7.95e-06s C 16013 116 theano.compile.ops.DeepCopyOp
0.1% 99.0% 0.056s 4.38e-05s C 1280 9 theano.sandbox.cuda.blas.GpuGemm
0.1% 99.1% 0.056s 1.55e-05s C 3593 31 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.2% 0.054s 3.50e-06s C 15373 167 theano.tensor.opt.MakeVector
0.1% 99.4% 0.051s 4.25e-06s C 12067 128 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.5% 0.046s 3.25e-06s C 14041 157 theano.compile.ops.Shape_i
0.1% 99.5% 0.035s 7.33e-05s C 472 6 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.6% 0.033s 5.02e-05s C 648 7 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
... (remaining 24 Classes account for 0.39%(0.17s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
21.8% 21.8% 9.454s 5.87e-02s Py 161 2 EditDistanceOp
21.5% 43.2% 9.321s 9.32e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
14.2% 57.4% 6.165s 6.16e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}
10.6% 68.0% 4.615s 2.31e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
8.5% 76.5% 3.680s 3.68e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
4.7% 81.2% 2.026s 3.32e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan}
3.7% 84.9% 1.599s 1.60e-02s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
3.2% 88.0% 1.380s 8.57e-03s Py 161 2 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
0.9% 88.9% 0.395s 4.64e-05s C 8505 86 GpuDot22
0.7% 89.7% 0.319s 3.80e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.7% 90.4% 0.313s 1.81e-05s C 17247 197 HostFromGpu
0.5% 90.9% 0.207s 2.13e-05s C 9700 97 GpuElemwise{add,no_inplace}
0.4% 91.3% 0.173s 2.20e-05s C 7861 79 GpuElemwise{sub,no_inplace}
0.4% 91.7% 0.171s 3.57e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.4% 92.0% 0.155s 2.38e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.4% 92.4% 0.154s 2.49e-05s Py 6200 39 if{gpu}
0.3% 92.7% 0.150s 2.34e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.3% 93.0% 0.134s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.3% 93.3% 0.133s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.3% 93.6% 0.133s 2.29e-05s C 5800 58 GpuElemwise{Switch,no_inplace}
... (remaining 328 Ops account for 6.36%(2.76s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
21.5% 21.5% 9.321s 9.32e-02s 100 2406 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=c
input 2: dtype=float32, shape=(15, 10, 200), strides=c
input 3: dtype=float32, shape=(15, 10, 100), strides=c
input 4: dtype=float32, shape=(15, 10, 100), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(15, 10, 1), strides=c
input 7: dtype=float32, shape=(15, 10, 200), strides=c
input 8: dtype=float32, shape=(15, 10, 12), strides=c
input 9: dtype=float32, shape=(15, 10, 200), strides=c
input 10: dtype=float32, shape=(15, 10, 100), strides=c
input 11: dtype=float32, shape=(15, 10, 100), strides=c
input 12: dtype=float32, shape=(15, 10, 100), strides=c
input 13: dtype=float32, shape=(15, 10, 200), strides=c
input 14: dtype=float32, shape=(16, 10, 100), strides=c
input 15: dtype=float32, shape=(16, 10, 200), strides=c
input 16: dtype=float32, shape=(16, 10, 12), strides=c
input 17: dtype=float32, shape=(16, 10, 100), strides=c
input 18: dtype=float32, shape=(16, 10, 200), strides=c
input 19: dtype=float32, shape=(16, 10, 12), strides=c
input 20: dtype=float32, shape=(2, 100, 1), strides=c
input 21: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 22: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 23: dtype=float32, shape=(2, 100, 1), strides=c
input 24: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 25: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 26: dtype=int64, shape=(), strides=c
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=float32, shape=(100, 200), strides=c
input 35: dtype=float32, shape=(200, 200), strides=c
input 36: dtype=float32, shape=(100, 100), strides=c
input 37: dtype=float32, shape=(200, 100), strides=c
input 38: dtype=float32, shape=(100, 100), strides=c
input 39: dtype=float32, shape=(200, 200), strides=c
input 40: dtype=float32, shape=(200, 100), strides=c
input 41: dtype=float32, shape=(100, 100), strides=c
input 42: dtype=float32, shape=(100, 200), strides=c
input 43: dtype=float32, shape=(100, 100), strides=c
input 44: dtype=int64, shape=(2,), strides=c
input 45: dtype=float32, shape=(12, 10, 100), strides=c
input 46: dtype=int64, shape=(1,), strides=c
input 47: dtype=float32, shape=(12, 10), strides=c
input 48: dtype=float32, shape=(12, 10, 200), strides=c
input 49: dtype=float32, shape=(100, 1), strides=c
input 50: dtype=int8, shape=(10,), strides=c
input 51: dtype=float32, shape=(1, 100), strides=c
input 52: dtype=float32, shape=(100, 200), strides=c
input 53: dtype=float32, shape=(200, 200), strides=c
input 54: dtype=float32, shape=(100, 100), strides=c
input 55: dtype=float32, shape=(200, 100), strides=c
input 56: dtype=float32, shape=(100, 100), strides=c
input 57: dtype=float32, shape=(200, 200), strides=c
input 58: dtype=float32, shape=(200, 100), strides=c
input 59: dtype=float32, shape=(100, 100), strides=c
input 60: dtype=float32, shape=(100, 200), strides=c
input 61: dtype=float32, shape=(100, 100), strides=c
input 62: dtype=int64, shape=(2,), strides=c
input 63: dtype=float32, shape=(12, 10, 100), strides=c
input 64: dtype=int64, shape=(1,), strides=c
input 65: dtype=float32, shape=(12, 10), strides=c
input 66: dtype=float32, shape=(12, 10, 200), strides=c
input 67: dtype=float32, shape=(100, 1), strides=c
input 68: dtype=int8, shape=(10,), strides=c
input 69: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
output 6: dtype=float32, shape=(2, 100, 1), strides=c
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 9: dtype=float32, shape=(2, 100, 1), strides=c
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 12: dtype=float32, shape=(15, 10, 100), strides=c
output 13: dtype=float32, shape=(15, 10, 200), strides=c
output 14: dtype=float32, shape=(15, 10, 100), strides=c
output 15: dtype=float32, shape=(15, 100, 10), strides=c
output 16: dtype=float32, shape=(15, 10, 100), strides=c
output 17: dtype=float32, shape=(15, 10, 200), strides=c
output 18: dtype=float32, shape=(15, 10, 100), strides=c
output 19: dtype=float32, shape=(15, 100, 10), strides=c
19.5% 40.9% 8.452s 1.39e-01s 61 279 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
14.2% 55.1% 6.165s 6.16e-02s 100 1795 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuD
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
input 2: dtype=float32, shape=(1, 10, 200), strides=c
input 3: dtype=float32, shape=(1, 92160), strides=c
input 4: dtype=float32, shape=(1, 10, 100), strides=c
input 5: dtype=float32, shape=(1, 10, 200), strides=c
input 6: dtype=float32, shape=(2, 92160), strides=c
input 7: dtype=int64, shape=(), strides=c
input 8: dtype=int64, shape=(), strides=c
input 9: dtype=float32, shape=(100, 44), strides=c
input 10: dtype=float32, shape=(200, 44), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(200, 200), strides=c
input 13: dtype=float32, shape=(45, 100), strides=c
input 14: dtype=float32, shape=(100, 200), strides=c
input 15: dtype=float32, shape=(100, 100), strides=c
input 16: dtype=float32, shape=(200, 100), strides=c
input 17: dtype=float32, shape=(100, 100), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(1, 44), strides=c
input 20: dtype=float32, shape=(1, 200), strides=c
input 21: dtype=float32, shape=(1, 100), strides=c
input 22: dtype=int64, shape=(1,), strides=c
input 23: dtype=float32, shape=(12, 10), strides=c
input 24: dtype=float32, shape=(12, 10, 200), strides=c
input 25: dtype=float32, shape=(100, 1), strides=c
input 26: dtype=int8, shape=(10,), strides=c
input 27: dtype=float32, shape=(12, 10, 100), strides=c
input 28: dtype=float32, shape=(12, 10, 200), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(1, 10, 100), strides=c
output 1: dtype=float32, shape=(1, 10, 200), strides=c
output 2: dtype=float32, shape=(1, 92160), strides=c
output 3: dtype=float32, shape=(1, 10, 100), strides=c
output 4: dtype=float32, shape=(1, 10, 200), strides=c
output 5: dtype=float32, shape=(2, 92160), strides=c
output 6: dtype=int64, shape=(15, 10), strides=c
output 7: dtype=int64, shape=(15, 10), strides=c
8.5% 63.6% 3.680s 3.68e-02s 100 2157 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=c
input 2: dtype=float32, shape=(15, 10, 100), strides=c
input 3: dtype=float32, shape=(15, 10, 1), strides=c
input 4: dtype=float32, shape=(15, 10, 200), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(16, 10, 100), strides=c
input 7: dtype=float32, shape=(16, 10, 200), strides=c
input 8: dtype=float32, shape=(16, 10, 12), strides=c
input 9: dtype=float32, shape=(16, 10, 100), strides=c
input 10: dtype=float32, shape=(16, 10, 200), strides=c
input 11: dtype=float32, shape=(16, 10, 12), strides=c
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=c
input 18: dtype=float32, shape=(12, 10, 100), strides=c
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=c
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=c
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=c
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
5.3% 68.9% 2.311s 2.31e-02s 100 2602 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
5.3% 74.2% 2.305s 2.30e-02s 100 2603 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
4.7% 78.9% 2.026s 3.32e-02s 61 268 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
3.7% 82.5% 1.599s 1.60e-02s 100 1601 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(12, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(13, 10, 100), strides=c
2.3% 84.9% 1.002s 1.00e-02s 100 1861 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask10)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
2.0% 86.8% 0.851s 8.51e-03s 100 1611 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(13, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
1.2% 88.0% 0.528s 8.66e-03s 61 254 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 5: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 6: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 7: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 8: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.1% 88.1% 0.043s 3.88e-03s 11 140 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 88.2% 0.042s 3.79e-03s 11 182 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 88.3% 0.023s 3.81e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 88.3% 0.016s 2.59e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 88.3% 0.011s 1.11e-04s 100 2572 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 88.4% 0.010s 1.05e-04s 100 2573 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 88.4% 0.010s 9.82e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
0.0% 88.4% 0.009s 4.94e-05s 176 37 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.0% 88.4% 0.008s 8.06e-05s 100 2356 GpuSplit{2}(GpuElemwise{mul,no_inplace}.0, TensorConstant{0}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(15, 10), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(14, 10), strides=c
output 1: dtype=float32, shape=(1, 10), strides=c
... (remaining 4271 Apply instances account for 11.57%(5.03s) of the runtime)
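The Apply table above is dominated by a few scan nodes (generator_generate_scan and the fused gatedrecurrent_apply_scan ops). Below is a minimal, hypothetical sketch of how such a hot scan can be given its own timing report; it assumes a plain Theano setup and uses a toy recurrence, not the recognizer graph profiled here.

# Minimal, hypothetical sketch: toy recurrence, not the recognizer profiled above.
# When scan nodes dominate the Apply table like this, passing profile=True to
# theano.scan gives the scan's inner function its own profile, printed together
# with the outer one.
import numpy
import theano
import theano.tensor as tt

floatX = theano.config.floatX
x = tt.tensor3('x')                                    # (n_steps, batch, dim)
w = theano.shared(numpy.eye(100, dtype=floatX), 'w')   # toy recurrent weights

def step(x_t, h_tm1, w):
    # simplified transition; the real graph is a GRU stack with attention
    return tt.tanh(x_t + tt.dot(h_tm1, w))

h0 = tt.zeros((x.shape[1], 100), dtype=floatX)
hs, _ = theano.scan(step, sequences=x, outputs_info=h0,
                    non_sequences=w, profile=True)     # profile the inner graph

f = theano.function([x], hs[-1].sum(), profile=True)   # profile the outer graph
f(numpy.zeros((12, 10, 100), dtype=floatX))
f.profile.summary()                                    # same report format as above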
Memory Profile (the max between all functions in that profile)
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 57KB (61KB)
GPU: 4979KB (6661KB)
CPU + GPU: 5035KB (6721KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 56KB (61KB)
GPU: 6160KB (7107KB)
CPU + GPU: 6216KB (7167KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 115KB
GPU: 16958KB
CPU + GPU: 17073KB
---
This list is based on all functions in the profile
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
1132320B [(1, 10, 100), (1, 10, 200), (1, 92160), (1, 10, 100), (1, 10, 200), (2, 92160), (15, 10), (15, 10)] i i i i i i c c forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0, GpuJoin.0, GpuElemwise{Add}[(0, 0)].0)
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
720000B [(12, 75, 200)] v GpuDimShuffle{0,1,2}(GpuJoin.0)
720000B [(900, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
   ... (remaining 4271 Apply nodes account for 67003141B/82625821B (81.09%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
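For reference, here is a minimal sketch of how a report like this one, including the memory comparisons against allow_gc=False and optimizer_excluding=inplace above, can be produced. The tiny graph is illustrative and unrelated to the recognizer; the flags named are standard Theano configuration options.

# Minimal sketch, assuming a vanilla Theano install; the graph below is a stand-in
# for the real recognizer. Profiling is usually enabled via flags before startup:
#   THEANO_FLAGS='profile=True,profile_memory=True' python run.py
# profile_memory adds the "Memory Profile" section; the report then also estimates
# the peak under allow_gc=False and optimizer_excluding=inplace, which can be
# measured for real by recompiling with those flags set.
import numpy
import theano
import theano.tensor as tt

x = tt.matrix('x')
w = theano.shared(numpy.ones((100, 100), dtype=theano.config.floatX), 'w')

# profile=True on theano.function collects per-Apply timings for this function only
f = theano.function([x], tt.dot(x, w).sum(), profile=True)
f(numpy.ones((10, 100), dtype=theano.config.floatX))

f.profile.summary()   # prints Class/Ops/Apply tables in the same format as this gist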
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.