@rizar
Created May 6, 2016 15:30
New profile
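What follows is raw Theano profiler output, captured while running the monitoring functions compiled by the Blocks evaluators (libs/blocks/blocks/monitoring/evaluators.py) in fully-neural-lvsr. The gist does not show how profiling was switched on for this run; as a hedged sketch, reports in this format are normally obtained either by running the whole script with THEANO_FLAGS=profile=True,profile_memory=True (a report per compiled function is printed at exit) or by compiling an individual function with profile=True, as in the illustrative snippet below. The snippet assumes a 2016-era Theano and uses made-up variables.

import numpy as np
import theano
import theano.tensor as T

# Illustrative only: compile a throwaway function with profiling enabled.
x = T.matrix('x')
y = T.nnet.sigmoid(x.sum(axis=1))
f = theano.function([x], y, profile=True)

f(np.random.rand(75, 100).astype(theano.config.floatX))

# Prints a "Function profiling" report in the same format as the dump below.
f.profile.summary()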
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 2.154827e-03s
Time in Function.fn.__call__: 9.248257e-04s (42.919%)
Total compile time: 4.125585e+00s
Number of Apply nodes: 0
Theano Optimizer time: 6.079912e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 9.608269e-05s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.132s
No execution time accumulated (hint: try config profiling.time_thunks=1)
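The block above shows no accumulated execution time because per-thunk timing was disabled for this function. The hint's flag can be turned on as sketched below; the flag name is taken from the hint itself, and how the original run was configured is not shown in the gist.

# Enable per-thunk timing before any function is compiled, e.g. via
#   THEANO_FLAGS=profile=True,profiling.time_thunks=True python <script>.py
# or programmatically (the environment-variable route is the safest):
import theano
theano.config.profile = True
theano.config.profiling.time_thunks = True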
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 11 calls to Function.__call__: 2.018499e-02s
Time in Function.fn.__call__: 1.745415e-02s (86.471%)
Time in thunks: 7.772207e-03s (38.505%)
Total compile time: 4.343552e+00s
Number of Apply nodes: 43
Theano Optimizer time: 1.791000e-01s
Theano validate time: 1.072645e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 6.402516e-02s
Import time 4.774094e-03s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.132s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.008s 1.64e-05s C 473 43 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.008s 1.64e-05s C 473 43 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.7% 4.7% 0.000s 3.34e-05s 11 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.8% 7.5% 0.000s 1.99e-05s 11 31 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.7% 10.3% 0.000s 1.93e-05s 11 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.6% 12.9% 0.000s 1.84e-05s 11 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.6% 15.4% 0.000s 1.81e-05s 11 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 18.0% 0.000s 1.81e-05s 11 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 20.5% 0.000s 1.80e-05s 11 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 23.1% 0.000s 1.80e-05s 11 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 25.6% 0.000s 1.80e-05s 11 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 28.2% 0.000s 1.79e-05s 11 27 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 30.7% 0.000s 1.78e-05s 11 25 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 33.2% 0.000s 1.78e-05s 11 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 35.7% 0.000s 1.78e-05s 11 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 38.2% 0.000s 1.77e-05s 11 12 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 40.7% 0.000s 1.77e-05s 11 6 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 43.2% 0.000s 1.76e-05s 11 29 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 45.7% 0.000s 1.75e-05s 11 11 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 48.2% 0.000s 1.75e-05s 11 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 50.6% 0.000s 1.75e-05s 11 32 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 53.1% 0.000s 1.75e-05s 11 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 23 Apply instances account for 46.88%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 43 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
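The memory figures above are estimates for three scenarios rather than measurements from separate runs. A hedged sketch of actually running under each scenario, using only the flag names printed in the report (exact behaviour depends on the Theano version):

# current settings:
#   THEANO_FLAGS=profile=True,profile_memory=True python <script>.py
# without in-place optimizations:
#   THEANO_FLAGS=profile=True,profile_memory=True,optimizer_excluding=inplace python <script>.py
# keeping intermediates alive between calls:
#   THEANO_FLAGS=profile=True,profile_memory=True,allow_gc=False python <script>.py
#
# The footnote's extra warnings for 'inplace'/'view' nodes require DebugMode,
# which is far slower and intended for debugging only:
import theano
import theano.tensor as T

x = T.vector('x')
f = theano.function([x], 2 * x, mode='DebugMode')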
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 10 calls to Function.__call__: 1.222110e-02s
Time in Function.fn.__call__: 1.176500e-02s (96.268%)
Time in thunks: 4.612923e-03s (37.746%)
Total compile time: 4.154817e+00s
Number of Apply nodes: 29
Theano Optimizer time: 5.256701e-02s
Theano validate time: 1.211166e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.951882e-02s
Import time 1.188660e-02s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.137s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
52.9% 52.9% 0.002s 1.63e-05s C 150 15 theano.sandbox.cuda.basic_ops.HostFromGpu
43.7% 96.6% 0.002s 2.24e-05s C 90 9 theano.sandbox.cuda.basic_ops.GpuElemwise
3.4% 100.0% 0.000s 3.16e-06s C 50 5 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
52.9% 52.9% 0.002s 1.63e-05s C 150 15 HostFromGpu
43.7% 96.6% 0.002s 2.24e-05s C 90 9 GpuElemwise{true_div,no_inplace}
3.4% 100.0% 0.000s 3.16e-06s C 50 5 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
10.0% 10.0% 0.000s 4.61e-05s 10 0 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
5.8% 15.8% 0.000s 2.68e-05s 10 15 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
4.4% 20.2% 0.000s 2.03e-05s 10 1 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_critic_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 24.5% 0.000s 1.98e-05s 10 12 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 28.7% 0.000s 1.96e-05s 10 2 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 32.9% 0.000s 1.93e-05s 10 13 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.2% 37.1% 0.000s 1.93e-05s 10 4 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean2_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 41.3% 0.000s 1.92e-05s 10 6 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_expected_reward, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 45.4% 0.000s 1.91e-05s 10 3 GpuElemwise{true_div,no_inplace}(shared_readout_costs_max_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 49.5% 0.000s 1.90e-05s 10 5 GpuElemwise{true_div,no_inplace}(shared_mean_last_character_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 53.1% 0.000s 1.65e-05s 10 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.5% 56.6% 0.000s 1.63e-05s 10 26 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 60.1% 0.000s 1.59e-05s 10 20 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 63.5% 0.000s 1.58e-05s 10 19 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 66.9% 0.000s 1.57e-05s 10 7 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 70.3% 0.000s 1.56e-05s 10 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 73.7% 0.000s 1.56e-05s 10 27 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 77.1% 0.000s 1.56e-05s 10 8 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 80.4% 0.000s 1.54e-05s 10 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 83.7% 0.000s 1.52e-05s 10 14 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 9 Apply instances account for 16.31%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 29 Apply account for 136B/136B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 101 calls to Function.__call__: 1.747441e-02s
Time in Function.fn.__call__: 1.434040e-02s (82.065%)
Time in thunks: 2.486944e-03s (14.232%)
Total compile time: 4.068843e+00s
Number of Apply nodes: 6
Theano Optimizer time: 1.878691e-02s
Theano validate time: 5.388260e-05s
Theano Linker time (includes C, CUDA code generation/compiling): 1.104212e-02s
Import time 7.761240e-03s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.140s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.7% 54.7% 0.001s 3.37e-06s C 404 4 theano.compile.ops.Shape_i
45.3% 100.0% 0.001s 5.58e-06s C 202 2 theano.tensor.basic.Alloc
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.3% 45.3% 0.001s 5.58e-06s C 202 2 Alloc
30.7% 76.0% 0.001s 3.78e-06s C 202 2 Shape_i{1}
24.0% 100.0% 0.001s 2.95e-06s C 202 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
28.4% 28.4% 0.001s 7.00e-06s 101 4 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
19.0% 47.5% 0.000s 4.69e-06s 101 0 Shape_i{1}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
16.9% 64.3% 0.000s 4.16e-06s 101 5 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
12.9% 77.2% 0.000s 3.17e-06s 101 1 Shape_i{0}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.7% 88.9% 0.000s 2.88e-06s 101 2 Shape_i{1}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.1% 100.0% 0.000s 2.73e-06s 101 3 Shape_i{0}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
... (remaining 5 Apply account for 992B/2192B ((45.26%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.629472e-02s
Time in Function.fn.__call__: 1.466155e-02s (89.977%)
Time in thunks: 9.594440e-03s (58.881%)
Total compile time: 4.084757e+00s
Number of Apply nodes: 2
Theano Optimizer time: 7.371902e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.080990e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.141s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.010s 4.80e-05s C 200 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.010s 4.80e-05s C 200 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
95.0% 95.0% 0.009s 9.11e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
5.0% 100.0% 0.000s 4.83e-06s 100 1 DeepCopyOp(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c DeepCopyOp(shared_recognizer_costs_prediction)
... (remaining 1 Apply account for 960B/2160B ((44.44%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 2 calls to Function.__call__: 2.764940e-03s
Time in Function.fn.__call__: 2.352715e-03s (85.091%)
Time in thunks: 1.017094e-03s (36.785%)
Total compile time: 4.452709e+00s
Number of Apply nodes: 31
Theano Optimizer time: 9.523201e-02s
Theano validate time: 7.679462e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.307699e-02s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.142s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.001s 1.64e-05s C 62 31 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.001s 1.64e-05s C 62 31 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.7% 4.7% 0.000s 2.41e-05s 2 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.8% 8.6% 0.000s 1.94e-05s 2 6 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.8% 12.3% 0.000s 1.91e-05s 2 14 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 16.0% 0.000s 1.90e-05s 2 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.7% 19.7% 0.000s 1.88e-05s 2 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.7% 23.4% 0.000s 1.86e-05s 2 12 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 27.0% 0.000s 1.85e-05s 2 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 30.7% 0.000s 1.85e-05s 2 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 34.2% 0.000s 1.81e-05s 2 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 37.8% 0.000s 1.81e-05s 2 9 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 41.4% 0.000s 1.81e-05s 2 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 44.9% 0.000s 1.81e-05s 2 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 48.5% 0.000s 1.81e-05s 2 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.5% 52.0% 0.000s 1.80e-05s 2 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 55.6% 0.000s 1.80e-05s 2 19 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 59.1% 0.000s 1.80e-05s 2 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 62.6% 0.000s 1.79e-05s 2 11 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 66.1% 0.000s 1.79e-05s 2 10 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 69.6% 0.000s 1.75e-05s 2 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 73.0% 0.000s 1.75e-05s 2 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 11 Apply instances account for 26.98%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 31 Apply account for 140B/140B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 8.559227e-04s
Time in Function.fn.__call__: 8.108616e-04s (94.735%)
Time in thunks: 3.142357e-04s (36.713%)
Total compile time: 4.539160e+00s
Number of Apply nodes: 21
Theano Optimizer time: 3.893209e-02s
Theano validate time: 8.273125e-05s
Theano Linker time (includes C, CUDA code generation/compiling): 2.924204e-02s
Import time 2.619028e-03s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.146s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
58.3% 58.3% 0.000s 1.66e-05s C 11 11 theano.sandbox.cuda.basic_ops.HostFromGpu
36.9% 95.1% 0.000s 1.93e-05s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.9% 100.0% 0.000s 3.81e-06s C 4 4 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
58.3% 58.3% 0.000s 1.66e-05s C 11 11 HostFromGpu
36.9% 95.1% 0.000s 1.93e-05s C 6 6 GpuElemwise{true_div,no_inplace}
4.9% 100.0% 0.000s 3.81e-06s C 4 4 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
7.0% 7.0% 0.000s 2.19e-05s 1 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.7% 13.7% 0.000s 2.10e-05s 1 1 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.4% 20.0% 0.000s 2.00e-05s 1 3 GpuElemwise{true_div,no_inplace}(shared_mask_density, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 26.1% 0.000s 1.91e-05s 1 7 GpuElemwise{true_div,no_inplace}(shared_mean_attended, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.0% 32.1% 0.000s 1.88e-05s 1 8 GpuElemwise{true_div,no_inplace}(shared_weights_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.0% 38.1% 0.000s 1.88e-05s 1 6 GpuElemwise{true_div,no_inplace}(shared_mean_bottom_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.8% 43.9% 0.000s 1.81e-05s 1 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.8% 49.6% 0.000s 1.81e-05s 1 2 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.7% 55.3% 0.000s 1.79e-05s 1 12 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.5% 60.8% 0.000s 1.72e-05s 1 11 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.2% 65.9% 0.000s 1.62e-05s 1 13 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.1% 71.0% 0.000s 1.60e-05s 1 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.1% 76.1% 0.000s 1.60e-05s 1 5 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 80.9% 0.000s 1.50e-05s 1 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 85.7% 0.000s 1.50e-05s 1 10 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 90.4% 0.000s 1.50e-05s 1 4 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 95.1% 0.000s 1.48e-05s 1 9 HostFromGpu(shared_weights_penalty)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
1.6% 96.7% 0.000s 5.01e-06s 1 19 Elemwise{true_div,no_inplace}(HostFromGpu.0, shared_batch_size)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.3% 98.0% 0.000s 4.05e-06s 1 15 Elemwise{true_div,no_inplace}(shared_batch_size, HostFromGpu.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.0% 99.0% 0.000s 3.10e-06s 1 20 Elemwise{true_div,no_inplace}(shared_train_cost, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 1 Apply instances account for 0.99%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 21 Apply account for 100B/100B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 1 calls to Function.__call__: 4.639626e-04s
Time in Function.fn.__call__: 2.970695e-04s (64.029%)
Time in thunks: 1.273155e-04s (27.441%)
Total compile time: 4.479136e+00s
Number of Apply nodes: 5
Theano Optimizer time: 1.386118e-02s
Theano validate time: 1.111031e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 6.145954e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.148s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.55e-05s C 5 5 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.000s 2.55e-05s C 5 5 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
59.7% 59.7% 0.000s 7.61e-05s 1 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
15.7% 75.5% 0.000s 2.00e-05s 1 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
13.5% 89.0% 0.000s 1.72e-05s 1 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
7.9% 96.8% 0.000s 1.00e-05s 1 3 DeepCopyOp(TensorConstant{0})
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
3.2% 100.0% 0.000s 4.05e-06s 1 4 DeepCopyOp(TensorConstant{0.0})
input 0: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 5 Apply account for 28B/28B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 1.401901e-04s
Time in Function.fn.__call__: 1.139641e-04s (81.293%)
Time in thunks: 3.004074e-05s (21.429%)
Total compile time: 4.912266e+00s
Number of Apply nodes: 3
Theano Optimizer time: 1.049495e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.658844e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.149s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
66.7% 66.7% 0.000s 2.00e-05s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
19.8% 86.5% 0.000s 5.96e-06s C 1 1 theano.compile.ops.DeepCopyOp
13.5% 100.0% 0.000s 4.05e-06s C 1 1 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
66.7% 66.7% 0.000s 2.00e-05s C 1 1 HostFromGpu
19.8% 86.5% 0.000s 5.96e-06s C 1 1 DeepCopyOp
13.5% 100.0% 0.000s 4.05e-06s C 1 1 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
66.7% 66.7% 0.000s 2.00e-05s 1 1 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
19.8% 86.5% 0.000s 5.96e-06s 1 0 DeepCopyOp(shared_batch_size)
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
13.5% 100.0% 0.000s 4.05e-06s 1 2 Elemwise{true_div,no_inplace}(shared_mean_total_reward, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 3 Apply account for 20B/20B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 61 calls to Function.__call__: 1.211181e+01s
Time in Function.fn.__call__: 1.210473e+01s (99.942%)
Time in thunks: 1.171248e+01s (96.703%)
Total compile time: 1.925457e+01s
Number of Apply nodes: 274
Theano Optimizer time: 5.967708e+00s
Theano validate time: 2.864373e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 9.222651e+00s
Import time 3.308520e-01s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.150s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
74.1% 74.1% 8.684s 1.42e-01s Py 61 1 lvsr.ops.EditDistanceOp
24.4% 98.5% 2.853s 2.34e-02s Py 122 2 theano.scan_module.scan_op.Scan
0.5% 99.0% 0.064s 2.10e-04s C 305 5 theano.sandbox.cuda.blas.GpuDot22
0.2% 99.2% 0.023s 4.16e-05s C 549 9 theano.sandbox.cuda.basic_ops.GpuElemwise
0.2% 99.4% 0.021s 2.93e-06s C 7259 119 theano.tensor.elemwise.Elemwise
0.1% 99.5% 0.012s 1.93e-04s C 61 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.6% 0.008s 3.45e-05s C 244 4 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 99.7% 0.007s 2.32e-05s C 305 5 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 99.7% 0.005s 8.45e-05s C 61 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.0% 99.7% 0.004s 2.97e-06s C 1464 24 theano.compile.ops.Shape_i
0.0% 99.8% 0.004s 2.14e-05s C 183 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 99.8% 0.004s 3.71e-06s C 976 16 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 99.8% 0.003s 2.92e-06s C 1098 18 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 99.9% 0.003s 2.92e-06s C 1037 17 theano.tensor.opt.MakeVector
0.0% 99.9% 0.003s 2.41e-05s C 122 2 theano.compile.ops.DeepCopyOp
0.0% 99.9% 0.002s 2.38e-06s C 1037 17 theano.tensor.basic.ScalarFromTensor
0.0% 99.9% 0.002s 7.78e-06s Py 305 3 theano.ifelse.IfElse
0.0% 99.9% 0.002s 4.07e-06s C 549 9 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.002s 5.39e-06s C 305 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.001s 6.56e-06s Py 183 3 theano.compile.ops.Rebroadcast
... (remaining 8 Classes account for 0.03%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
74.1% 74.1% 8.684s 1.42e-01s Py 61 1 EditDistanceOp
19.5% 93.7% 2.286s 3.75e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan}
4.8% 98.5% 0.567s 9.29e-03s Py 61 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
0.5% 99.0% 0.064s 2.10e-04s C 305 5 GpuDot22
0.2% 99.2% 0.018s 5.83e-05s C 305 5 GpuElemwise{Add}[(0, 0)]
0.1% 99.3% 0.012s 1.93e-04s C 61 1 GpuJoin
0.1% 99.4% 0.007s 2.32e-05s C 305 5 GpuIncSubtensor{InplaceSet;:int64:}
0.1% 99.4% 0.007s 3.83e-05s C 183 3 GpuAlloc
0.0% 99.5% 0.005s 8.45e-05s C 61 1 GpuAdvancedSubtensor1
0.0% 99.5% 0.004s 2.14e-05s C 183 3 HostFromGpu
0.0% 99.5% 0.004s 2.10e-05s C 183 3 GpuElemwise{sub,no_inplace}
0.0% 99.6% 0.003s 2.92e-06s C 1037 17 MakeVector{dtype='int64'}
0.0% 99.6% 0.003s 2.41e-05s C 122 2 DeepCopyOp
0.0% 99.6% 0.002s 3.71e-06s C 671 11 GpuReshape{2}
0.0% 99.6% 0.002s 2.38e-06s C 1037 17 ScalarFromTensor
0.0% 99.6% 0.002s 2.81e-06s C 793 13 Shape_i{0}
0.0% 99.7% 0.002s 3.16e-06s C 671 11 Shape_i{1}
0.0% 99.7% 0.002s 2.76e-06s C 671 11 Elemwise{add,no_inplace}
0.0% 99.7% 0.002s 2.72e-06s C 610 10 Elemwise{sub,no_inplace}
0.0% 99.7% 0.002s 5.39e-06s C 305 5 GpuAllocEmpty
... (remaining 72 Ops account for 0.30%(0.03s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
74.1% 74.1% 8.684s 1.42e-01s 61 269 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
19.5% 93.7% 2.286s 3.75e-02s 61 260 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
4.8% 98.5% 0.567s 9.29e-03s 61 247 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 5: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 6: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 7: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 8: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 9: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 10: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.2% 98.7% 0.019s 3.10e-04s 61 140 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.2% 98.8% 0.018s 3.03e-04s 61 142 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.1% 98.9% 0.012s 1.93e-04s 61 255 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.1% 99.0% 0.011s 1.85e-04s 61 257 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.1% 99.1% 0.008s 1.27e-04s 61 139 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.1% 99.1% 0.008s 1.24e-04s 61 141 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.005s 8.45e-05s 61 65 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(900,), strides=c
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.005s 7.52e-05s 61 170 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.005s 7.49e-05s 61 172 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.003s 4.81e-05s 61 169 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.3% 0.003s 4.72e-05s 61 259 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.3% 0.003s 4.63e-05s 61 171 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 4.08e-05s 61 47 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.73e-05s 61 107 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, generator_generate_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.67e-05s 61 59 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.37e-05s 61 160 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 1: dtype=float32, shape=(1, 92160), strides=(0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(2, 92160), strides=(92160, 1)
0.0% 99.4% 0.002s 2.63e-05s 61 4 DeepCopyOp(CudaNdarrayConstant{1.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
... (remaining 254 Apply instances account for 0.56%(0.07s) of the runtime)
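Nearly all of the runtime in this block is attributed to nodes whose type column reads Py: lvsr.ops.EditDistanceOp (74.1%) and the two Scan nodes (24.4%). Py ops execute their perform() method in plain Python on the host on every call, which is typically why they dominate such profiles. For illustration only, a minimal perform-based op looks roughly like the toy sketch below; it is a stand-in, not the real lvsr.ops.EditDistanceOp.

import numpy as np
import theano
import theano.tensor as T


class ZeroDistanceOp(theano.Op):
    # Toy op returning a zero 'distance' per batch column, to show the shape
    # of a perform-based ('Py') op. Not the real edit-distance computation.
    __props__ = ()

    def make_node(self, predictions, labels):
        predictions = T.as_tensor_variable(predictions)
        labels = T.as_tensor_variable(labels)
        return theano.Apply(self, [predictions, labels], [T.lvector()])

    def perform(self, node, inputs, output_storage):
        predictions, labels = inputs
        # Placeholder body; a real op would loop over the batch here, and that
        # Python-side loop is exactly the cost the profile reports under 'Py'.
        output_storage[0][0] = np.zeros(predictions.shape[1], dtype='int64')


preds = T.lmatrix('preds')
labs = T.lmatrix('labs')
f = theano.function([preds, labs], ZeroDistanceOp()(preds, labs))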
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 22KB (22KB)
GPU: 3175KB (3660KB)
CPU + GPU: 3197KB (3682KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 22KB (22KB)
GPU: 3526KB (4334KB)
CPU + GPU: 3548KB (4356KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 36KB
GPU: 5187KB
CPU + GPU: 5223KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(900, 200)] v GpuReshape{2}(GpuJoin.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0)
360000B [(12, 75, 100)] c GpuAllocEmpty(Elemwise{add,no_inplace}.0, Elemwise{Switch}[(0, 1)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
360000B [(900, 100)] c GpuAdvancedSubtensor1(W, Reshape{1}.0)
360000B [(12, 75, 100)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
... (remaining 254 Apply account for 6802854B/19219614B ((35.40%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 732 steps) 5.621994e-01s
Total time spent in calling the VM 5.386684e-01s (95.814%)
Total overhead (computing slices..) 2.353096e-02s (4.186%)
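The per-step breakdown that follows is available because the Scan node itself carries a profile. A hedged sketch of requesting this explicitly for a loop of one's own, using the standard theano.scan interface and illustrative names:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
outputs, updates = theano.scan(
    fn=lambda row, acc: acc + row,      # running sum over the rows of x
    sequences=x,
    outputs_info=T.zeros_like(x[0]),
    profile='toy_scan',                 # True also works; a string names the report
)

f = theano.function([x], outputs[-1], updates=updates, profile=True)
f(np.ones((10, 4), dtype=theano.config.floatX))

# The report should then contain a separate "Scan Op profiling" section for this loop.
f.profile.summary()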
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
68.8% 68.8% 0.229s 7.80e-05s C 2928 4 theano.sandbox.cuda.blas.GpuGemm
28.4% 97.1% 0.094s 2.15e-05s C 4392 6 theano.sandbox.cuda.basic_ops.GpuElemwise
2.9% 100.0% 0.010s 3.25e-06s C 2928 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
68.8% 68.8% 0.229s 7.80e-05s C 2928 4 GpuGemm{no_inplace}
10.5% 79.3% 0.035s 2.38e-05s C 1464 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
9.2% 88.4% 0.030s 2.08e-05s C 1464 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
8.7% 97.1% 0.029s 1.98e-05s C 1464 2 GpuElemwise{mul,no_inplace}
1.5% 98.7% 0.005s 3.44e-06s C 1464 2 GpuSubtensor{::, :int64:}
1.3% 100.0% 0.004s 3.06e-06s C 1464 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
23.0% 23.0% 0.076s 1.04e-04s 732 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
22.5% 45.5% 0.075s 1.02e-04s 732 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
11.7% 57.1% 0.039s 5.30e-05s 732 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
11.6% 68.8% 0.039s 5.27e-05s 732 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.3% 74.0% 0.018s 2.40e-05s 732 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.2% 79.3% 0.017s 2.36e-05s 732 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.6% 83.9% 0.015s 2.09e-05s 732 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
4.6% 88.4% 0.015s 2.07e-05s 732 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
4.4% 92.8% 0.015s 2.00e-05s 732 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.3% 97.1% 0.014s 1.96e-05s 732 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.8% 97.9% 0.003s 3.46e-06s 732 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.8% 98.7% 0.002s 3.41e-06s 732 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.7% 99.3% 0.002s 3.10e-06s 732 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.7% 100.0% 0.002s 3.01e-06s 732 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 146KB (205KB)
CPU + GPU: 146KB (205KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 146KB (205KB)
CPU + GPU: 146KB (205KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 293KB
CPU + GPU: 293KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
... (remaining 0 Apply account for 0B/540000B ((0.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
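
As a sanity check on the scan above, the fused kernel GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))}} together with the GpuGemm calls can be read back as the usual masked GRU step. This is an interpretation of the graph, not something the profiler states; m_t is the sequence mask, z_t and r_t are the two halves of the ScalarSigmoid output, and W_{zr}, W_{ss} stand for state_to_gates and state_to_state:

    z_t,\; r_t \;=\; \mathrm{split}\!\big(\sigma(h_{t-1} W_{zr} + x^{\mathrm{gates}}_t)\big)
    \tilde h_t \;=\; \tanh\!\big((r_t \odot h_{t-1})\, W_{ss} + x_t\big)
    h_t \;=\; m_t \odot \big(z_t \odot \tilde h_t + (1 - z_t) \odot h_{t-1}\big) + (1 - m_t) \odot h_{t-1}

The four GpuGemm{no_inplace} nodes (75x100 times 100x200 and 75x100 times 100x100, one pair per recurrence fused into this scan) are the matrix products in the first two lines and account for roughly 69% of the scan's runtime.
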
Scan Op profiling ( generator_generate_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 915 steps) 2.276112e+00s
Total time spent in calling the VM 2.183355e+00s (95.925%)
Total overhead (computing slices...) 9.275723e-02s (4.075%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
27.2% 27.2% 0.343s 7.49e-05s C 4575 5 theano.sandbox.cuda.blas.GpuGemm
25.6% 52.8% 0.322s 2.70e-05s C 11895 13 theano.sandbox.cuda.basic_ops.GpuElemwise
21.5% 74.3% 0.271s 5.92e-05s C 4575 5 theano.sandbox.cuda.blas.GpuDot22
8.2% 82.5% 0.103s 2.25e-05s C 4575 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
3.2% 85.7% 0.041s 4.44e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.1% 88.8% 0.039s 4.23e-05s C 915 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
2.9% 91.7% 0.037s 2.02e-05s C 1830 2 theano.sandbox.cuda.basic_ops.HostFromGpu
1.9% 93.6% 0.024s 2.64e-05s C 915 1 theano.tensor.basic.MaxAndArgmax
1.1% 94.7% 0.014s 1.51e-05s C 915 1 theano.sandbox.multinomial.MultinomialFromUniform
1.1% 95.8% 0.013s 2.43e-06s C 5490 6 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.0% 96.8% 0.013s 1.43e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuFromHost
0.8% 97.7% 0.011s 2.31e-06s C 4575 5 theano.compile.ops.Shape_i
0.7% 98.4% 0.009s 3.28e-06s C 2745 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.6% 98.9% 0.007s 1.93e-06s C 3660 4 theano.tensor.opt.MakeVector
0.5% 99.4% 0.006s 3.39e-06s C 1830 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.8% 0.004s 2.30e-06s C 1830 2 theano.tensor.elemwise.Elemwise
0.2% 100.0% 0.003s 3.25e-06s C 915 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
27.2% 27.2% 0.343s 7.49e-05s C 4575 5 GpuGemm{inplace}
21.5% 48.8% 0.271s 5.92e-05s C 4575 5 GpuDot22
7.0% 55.7% 0.088s 4.80e-05s C 1830 2 GpuElemwise{mul,no_inplace}
3.5% 59.2% 0.043s 4.75e-05s C 915 1 GpuElemwise{add,no_inplace}
3.2% 62.4% 0.041s 4.44e-05s C 915 1 GpuAdvancedSubtensor1
3.1% 65.5% 0.039s 4.23e-05s C 915 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
2.9% 68.4% 0.037s 2.02e-05s C 1830 2 HostFromGpu
2.2% 70.6% 0.028s 3.01e-05s C 915 1 GpuCAReduce{add}{1,0,0}
2.2% 72.8% 0.027s 2.98e-05s C 915 1 GpuElemwise{Tanh}[(0, 0)]
1.9% 74.7% 0.024s 2.64e-05s C 915 1 MaxAndArgmax
1.9% 76.5% 0.023s 2.56e-05s C 915 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
1.8% 78.3% 0.022s 2.42e-05s C 915 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
1.7% 80.0% 0.021s 2.28e-05s C 915 1 GpuCAReduce{maximum}{0,1}
1.6% 81.6% 0.020s 2.24e-05s C 915 1 GpuCAReduce{maximum}{1,0}
1.4% 83.0% 0.018s 1.92e-05s C 915 1 GpuElemwise{Add}[(0, 1)]
1.4% 84.4% 0.018s 1.92e-05s C 915 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
1.4% 85.7% 0.017s 1.89e-05s C 915 1 GpuCAReduce{add}{0,1}
1.4% 87.1% 0.017s 1.89e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
1.4% 88.5% 0.017s 1.87e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)]
1.3% 89.8% 0.017s 1.85e-05s C 915 1 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 20 Ops account for 10.18%(0.13s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
9.7% 9.7% 0.122s 1.34e-04s 915 10 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
7.4% 17.1% 0.093s 1.01e-04s 915 5 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
6.6% 23.7% 0.084s 9.14e-05s 915 32 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
5.6% 29.3% 0.071s 7.73e-05s 915 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(900, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(900, 1), strides=c
5.5% 34.8% 0.069s 7.57e-05s 915 56 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
input 0: dtype=float32, shape=(12, 75, 1), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(12, 75, 200), strides=c
4.7% 39.5% 0.059s 6.45e-05s 915 38 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.5% 43.0% 0.043s 4.75e-05s 915 43 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=c
3.2% 46.2% 0.041s 4.44e-05s 915 29 GpuAdvancedSubtensor1(W_copy[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(75,), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.2% 49.4% 0.040s 4.35e-05s 915 8 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
3.1% 52.5% 0.039s 4.25e-05s 915 37 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.1% 55.5% 0.039s 4.23e-05s 915 13 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(75,), strides=c
3.0% 58.6% 0.038s 4.17e-05s 915 41 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.0% 61.6% 0.038s 4.14e-05s 915 39 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
2.4% 64.0% 0.031s 3.35e-05s 915 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
2.2% 66.2% 0.028s 3.01e-05s 915 57 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
2.2% 68.4% 0.027s 2.98e-05s 915 45 GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=c
output 0: dtype=float32, shape=(900, 100), strides=c
1.9% 70.3% 0.024s 2.64e-05s 915 27 MaxAndArgmax(MultinomialFromUniform{int64}.0, TensorConstant{(1,) of 1})
input 0: dtype=int64, shape=(75, 44), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=int64, shape=(75,), strides=c
output 1: dtype=int64, shape=(75,), strides=c
1.9% 72.1% 0.023s 2.56e-05s 915 33 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
1.8% 73.9% 0.022s 2.42e-05s 915 40 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
1.7% 75.6% 0.021s 2.29e-05s 915 25 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(75, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
... (remaining 38 Apply instances account for 24.45%(0.31s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 41KB
GPU: 1709KB
CPU + GPU: 1750KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
720000B [(12, 75, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
368940B [(92160,), (75,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
360000B [(900, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
360000B [(12, 75, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
360000B [(900, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
60000B [(75, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
60000B [(75, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
60000B [(75, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax)
30000B [(1, 75, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
30000B [(75, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
30000B [(75, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
... (remaining 38 Apply account for 158879B/2927819B ((5.43%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
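
The attention-related kernels in the generator scan above (the broadcasted add with the preprocessed attended sequence, Tanh, the GpuDot22 against a (100, 1) vector, the maximum/exp/sum/TrueDiv chain, and the final GpuElemwise{mul} + GpuCAReduce{add}{1,0,0}) follow the familiar content-based attention pattern. Read as math (an interpretation; P_j is the preprocessed attended vector at position j, a_j the attended vector, s the current state, W the state projection, and v the (100, 1) scoring vector):

    e_j \;=\; v^{\top} \tanh\!\big(P_j + W s\big), \qquad j = 1, \dots, 12
    \alpha_j \;=\; \frac{\exp(e_j - \max_k e_k)}{\sum_k \exp(e_k - \max_k e_k)}
    c \;=\; \sum_j \alpha_j\, a_j

The subtraction of the per-column maximum before the exponential is the standard numerically stable softmax, which is why GpuCAReduce{maximum} shows up next to the exp kernels.
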
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:103
Time in 11 calls to Function.__call__: 1.319449e-01s
Time in Function.fn.__call__: 1.316185e-01s (99.753%)
Time in thunks: 8.657598e-02s (65.615%)
Total compile time: 1.813622e+01s
Number of Apply nodes: 183
Theano Optimizer time: 4.002905e+00s
Theano validate time: 1.576922e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 1.015641e+01s
Import time 6.427932e-02s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.235s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
88.6% 88.6% 0.077s 3.49e-03s Py 22 2 theano.scan_module.scan_op.Scan
3.5% 92.1% 0.003s 2.82e-06s C 1089 99 theano.tensor.elemwise.Elemwise
1.5% 93.6% 0.001s 2.96e-05s C 44 4 theano.sandbox.cuda.blas.GpuDot22
1.0% 94.6% 0.001s 1.91e-05s C 44 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.7% 95.3% 0.001s 5.69e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.6% 95.9% 0.000s 2.26e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.6% 96.5% 0.000s 4.40e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.5% 97.0% 0.000s 3.59e-06s C 121 11 theano.sandbox.cuda.basic_ops.GpuReshape
0.5% 97.5% 0.000s 1.97e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.5% 98.0% 0.000s 2.91e-06s C 143 13 theano.compile.ops.Shape_i
0.4% 98.4% 0.000s 2.84e-06s C 121 11 theano.tensor.opt.MakeVector
0.4% 98.7% 0.000s 2.31e-06s C 132 12 theano.tensor.basic.ScalarFromTensor
0.3% 99.0% 0.000s 4.05e-06s C 66 6 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.3% 0.000s 2.84e-06s C 88 8 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 99.6% 0.000s 2.26e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.2% 99.8% 0.000s 6.34e-06s Py 22 2 theano.compile.ops.Rebroadcast
0.1% 99.9% 0.000s 5.42e-06s C 22 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.1% 100.0% 0.000s 5.35e-06s C 11 1 theano.tensor.basic.Alloc
0.0% 100.0% 0.000s 3.19e-06s C 11 1 theano.tensor.basic.Reshape
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
88.6% 88.6% 0.077s 3.49e-03s Py 22 2 forall_inplace,gpu,gatedrecurrent_apply_scan}
1.5% 90.1% 0.001s 2.96e-05s C 44 4 GpuDot22
1.0% 91.1% 0.001s 1.91e-05s C 44 4 GpuElemwise{Add}[(0, 0)]
0.7% 91.8% 0.001s 5.69e-05s C 11 1 GpuJoin
0.6% 92.4% 0.000s 2.26e-05s C 22 2 GpuAlloc
0.6% 92.9% 0.000s 4.40e-05s C 11 1 GpuAdvancedSubtensor1
0.5% 93.4% 0.000s 1.97e-05s C 22 2 GpuIncSubtensor{InplaceSet;:int64:}
0.4% 93.8% 0.000s 2.84e-06s C 121 11 MakeVector{dtype='int64'}
0.4% 94.2% 0.000s 2.31e-06s C 132 12 ScalarFromTensor
0.3% 94.5% 0.000s 3.66e-06s C 77 7 GpuReshape{2}
0.3% 94.8% 0.000s 2.80e-06s C 99 9 Elemwise{add,no_inplace}
0.3% 95.1% 0.000s 2.26e-05s C 11 1 HostFromGpu
0.3% 95.4% 0.000s 3.05e-06s C 77 7 Shape_i{0}
0.3% 95.6% 0.000s 2.60e-06s C 88 8 Elemwise{le,no_inplace}
0.2% 95.9% 0.000s 2.94e-06s C 66 6 GpuDimShuffle{x,x,0}
0.2% 96.1% 0.000s 2.75e-06s C 66 6 Shape_i{1}
0.2% 96.3% 0.000s 2.53e-06s C 66 6 Elemwise{sub,no_inplace}
0.2% 96.5% 0.000s 2.98e-06s C 55 5 Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)]
0.2% 96.6% 0.000s 3.46e-06s C 44 4 GpuReshape{3}
0.2% 96.8% 0.000s 2.60e-06s C 55 5 Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}}
... (remaining 54 Ops account for 3.19%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
44.7% 44.7% 0.039s 3.52e-03s 11 133 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
43.9% 88.6% 0.038s 3.46e-03s 11 175 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.7% 89.3% 0.001s 5.69e-05s 11 181 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.6% 89.9% 0.000s 4.40e-05s 11 26 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(12,), strides=c
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.4% 90.3% 0.000s 3.07e-05s 11 51 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.4% 90.7% 0.000s 3.06e-05s 11 49 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.4% 91.0% 0.000s 2.92e-05s 11 50 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.4% 91.4% 0.000s 2.80e-05s 11 48 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.3% 91.7% 0.000s 2.40e-05s 11 96 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.3% 92.0% 0.000s 2.26e-05s 11 182 HostFromGpu(GpuJoin.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=c
0.3% 92.2% 0.000s 2.12e-05s 11 64 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.3% 92.5% 0.000s 2.05e-05s 11 130 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.3% 92.8% 0.000s 2.02e-05s 11 71 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.0% 0.000s 1.92e-05s 11 73 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.2% 0.000s 1.89e-05s 11 160 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.5% 0.000s 1.87e-05s 11 72 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.2% 93.7% 0.000s 1.85e-05s 11 74 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.1% 93.8% 0.000s 6.39e-06s 11 125 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.1% 93.9% 0.000s 6.35e-06s 11 159 Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(GE(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), (i2 + i3 + i4 + i5), i0)}((Composite{((Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), (i2 - i3), i0)}(Composite{((i0 - (Switch(LT(i1, i2), i2, i1) - i3)) - i4)}(i0, Composite{(((i0 - i1) // i2) + i3)}(i1, i2, i3, i4), i5, i6, i7),
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int8, shape=(), strides=c
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=int8, shape=(), strides=c
input 6: dtype=int8, shape=(), strides=c
input 7: dtype=int8, shape=(), strides=c
input 8: dtype=int8, shape=(), strides=c
input 9: dtype=int8, shape=(), strides=c
input 10: dtype=int64, shape=(), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int8, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
0.1% 94.0% 0.000s 6.29e-06s 11 91 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
... (remaining 163 Apply instances account for 6.04%(0.01s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 9KB (9KB)
GPU: 28KB (34KB)
CPU + GPU: 38KB (43KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 9KB (9KB)
GPU: 33KB (38KB)
CPU + GPU: 42KB (48KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 10KB
GPU: 52KB
CPU + GPU: 63KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(100, 200)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
9600B [(12, 1, 200)] c HostFromGpu(GpuJoin.0)
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
4800B [(12, 1, 100)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
4800B [(12, 1, 100)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
... (remaining 163 Apply account for 65253B/430053B ((15.17%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
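
The two what-if blocks in the memory profile above refer to ordinary Theano flags (optimizer_excluding=inplace and allow_gc=False). A minimal sketch of trying them, assuming the flags are applied process-wide via the environment (they are read when theano is imported):

    import os
    # allow_gc=False keeps intermediate buffers alive between calls
    # (more memory, fewer (de)allocations); optimizer_excluding=inplace
    # turns off the inplace rewrites that the second what-if measures.
    os.environ['THEANO_FLAGS'] = 'allow_gc=False,optimizer_excluding=inplace'
    import theano  # must come after the flags are set

As the tips note says, such changes should be timed first; neither flag is guaranteed to help.
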
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 3.813338e-02s
Total time spent in calling the VM 3.587055e-02s (94.066%)
Total overhead (computing slices...) 2.262831e-03s (5.934%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
55.7% 55.7% 0.009s 3.56e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
39.3% 95.0% 0.007s 1.67e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
5.0% 100.0% 0.001s 3.22e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
55.7% 55.7% 0.009s 3.56e-05s C 264 2 GpuGemm{no_inplace}
13.4% 69.1% 0.002s 1.71e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
13.0% 82.1% 0.002s 1.66e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
12.9% 95.0% 0.002s 1.64e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.7% 97.6% 0.000s 3.42e-06s C 132 1 GpuSubtensor{::, :int64:}
2.4% 100.0% 0.000s 3.02e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.8% 29.8% 0.005s 3.80e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
25.9% 55.7% 0.004s 3.31e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
13.4% 69.1% 0.002s 1.71e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
13.0% 82.1% 0.002s 1.66e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
12.9% 95.0% 0.002s 1.64e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.7% 97.6% 0.000s 3.42e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.4% 100.0% 0.000s 3.02e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 3.749466e-02s
Total time spent in calling the VM 3.560066e-02s (94.949%)
Total overhead (computing slices...) 1.893997e-03s (5.051%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
55.9% 55.9% 0.009s 3.55e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
39.2% 95.0% 0.007s 1.66e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
5.0% 100.0% 0.001s 3.18e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
55.9% 55.9% 0.009s 3.55e-05s C 264 2 GpuGemm{no_inplace}
13.3% 69.1% 0.002s 1.69e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
13.0% 82.1% 0.002s 1.65e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
12.9% 95.0% 0.002s 1.65e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.6% 97.6% 0.000s 3.31e-06s C 132 1 GpuSubtensor{::, :int64:}
2.4% 100.0% 0.000s 3.04e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.8% 29.8% 0.005s 3.79e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
26.1% 55.9% 0.004s 3.32e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
13.3% 69.1% 0.002s 1.69e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
13.0% 82.1% 0.002s 1.65e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
12.9% 95.0% 0.002s 1.65e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.6% 97.6% 0.000s 3.31e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.4% 100.0% 0.000s 3.04e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:111
Time in 11 calls to Function.__call__: 2.414465e-03s
Time in Function.fn.__call__: 2.146721e-03s (88.911%)
Time in thunks: 4.596710e-04s (19.038%)
Total compile time: 5.729262e+00s
Number of Apply nodes: 8
Theano Optimizer time: 3.657293e-02s
Theano validate time: 4.487038e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.374197e-02s
Import time 5.259037e-03s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.290s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
43.0% 43.0% 0.000s 1.80e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
20.6% 63.6% 0.000s 4.30e-06s C 22 2 theano.tensor.basic.Alloc
14.9% 78.5% 0.000s 3.12e-06s C 22 2 theano.compile.ops.Shape_i
7.2% 85.7% 0.000s 3.01e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
7.2% 92.9% 0.000s 2.99e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuReshape
7.1% 100.0% 0.000s 2.97e-06s C 11 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
43.0% 43.0% 0.000s 1.80e-05s C 11 1 HostFromGpu
20.6% 63.6% 0.000s 4.30e-06s C 22 2 Alloc
14.9% 78.5% 0.000s 3.12e-06s C 22 2 Shape_i{0}
7.2% 85.7% 0.000s 3.01e-06s C 11 1 GpuDimShuffle{x,x,0}
7.2% 92.9% 0.000s 2.99e-06s C 11 1 GpuReshape{2}
7.1% 100.0% 0.000s 2.97e-06s C 11 1 MakeVector{dtype='int64'}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
43.0% 43.0% 0.000s 1.80e-05s 11 7 HostFromGpu(GpuReshape{2}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
10.8% 53.8% 0.000s 4.53e-06s 11 4 Alloc(TensorConstant{0.0}, TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 12), strides=c
9.8% 63.6% 0.000s 4.07e-06s 11 1 Alloc(TensorConstant{0.0}, TensorConstant{1}, TensorConstant{200})
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int16, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
8.5% 72.0% 0.000s 3.53e-06s 11 0 Shape_i{0}(generator_generate_attended)
input 0: dtype=float32, shape=(12, 1, 200), strides=c
output 0: dtype=int64, shape=(), strides=c
7.2% 79.3% 0.000s 3.01e-06s 11 3 GpuDimShuffle{x,x,0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
7.2% 86.4% 0.000s 2.99e-06s 11 6 GpuReshape{2}(GpuDimShuffle{x,x,0}.0, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
7.1% 93.5% 0.000s 2.97e-06s 11 5 MakeVector{dtype='int64'}(TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(2,), strides=c
6.5% 100.0% 0.000s 2.71e-06s 11 2 Shape_i{0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 0KB
CPU + GPU: 1KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 8 Apply account for 2080B/2080B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
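
For completeness, function reports like the ones in this dump are what Theano prints when profiling is enabled. A minimal, self-contained sketch (toy graph and shapes are placeholders, not the profiled model):

    import os
    # Must be set before theano is imported; profile_memory adds the
    # "Memory Profile" blocks seen above.
    os.environ['THEANO_FLAGS'] = 'profile=True,profile_memory=True'
    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    f = theano.function([x], T.nnet.sigmoid(x).sum())
    f(numpy.random.rand(75, 100).astype('float32'))  # stats accumulate per call
    f.profile.summary()  # print this function's report on demand
                         # (a summary is also printed automatically at exit)

The per-scan sections (e.g. "Scan Op profiling ( gatedrecurrent_apply_scan )") are tied to the scan being created with theano.scan(..., profile=True), which attaches a profiler to the inner function.
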
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:126
Time in 176 calls to Function.__call__: 4.031258e-01s
Time in Function.fn.__call__: 3.963535e-01s (98.320%)
Time in thunks: 1.376257e-01s (34.140%)
Total compile time: 6.464948e+00s
Number of Apply nodes: 75
Theano Optimizer time: 4.475892e-01s
Theano validate time: 2.268028e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 1.257081e-01s
Import time 3.001761e-02s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.292s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
22.7% 22.7% 0.031s 1.77e-05s C 1760 10 theano.sandbox.cuda.basic_ops.GpuElemwise
17.7% 40.4% 0.024s 2.77e-05s C 880 5 theano.sandbox.cuda.blas.GpuDot22
14.8% 55.3% 0.020s 2.90e-05s C 704 4 theano.sandbox.cuda.blas.GpuGemm
8.6% 63.9% 0.012s 1.34e-05s C 880 5 theano.sandbox.cuda.basic_ops.GpuFromHost
8.1% 72.0% 0.011s 1.58e-05s C 704 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.6% 79.6% 0.011s 1.99e-05s C 528 3 theano.sandbox.cuda.basic_ops.GpuCAReduce
5.3% 84.9% 0.007s 4.17e-05s C 176 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.0% 87.9% 0.004s 2.90e-06s C 1408 8 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.9% 90.8% 0.004s 3.28e-06s C 1232 7 theano.sandbox.cuda.basic_ops.GpuReshape
2.8% 93.6% 0.004s 2.43e-06s C 1584 9 theano.tensor.elemwise.Elemwise
2.6% 96.2% 0.004s 2.54e-06s C 1408 8 theano.compile.ops.Shape_i
2.3% 98.5% 0.003s 2.52e-06s C 1232 7 theano.tensor.opt.MakeVector
0.9% 99.4% 0.001s 3.37e-06s C 352 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.7% 0.000s 2.54e-06s C 176 1 theano.tensor.elemwise.All
0.3% 100.0% 0.000s 2.43e-06s C 176 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.7% 17.7% 0.024s 2.77e-05s C 880 5 GpuDot22
14.8% 32.6% 0.020s 2.90e-05s C 704 4 GpuGemm{inplace}
8.6% 41.2% 0.012s 1.34e-05s C 880 5 GpuFromHost
8.1% 49.3% 0.011s 1.58e-05s C 704 4 HostFromGpu
5.3% 54.6% 0.007s 4.17e-05s C 176 1 GpuAdvancedSubtensor1
2.9% 57.5% 0.004s 2.23e-05s C 176 1 GpuCAReduce{maximum}{1,0}
2.5% 59.9% 0.003s 3.22e-06s C 1056 6 GpuReshape{2}
2.4% 62.4% 0.003s 1.91e-05s C 176 1 GpuCAReduce{add}{1,0,0}
2.4% 64.8% 0.003s 1.90e-05s C 176 1 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)]
2.4% 67.2% 0.003s 1.87e-05s C 176 1 GpuElemwise{Mul}[(0, 1)]
2.4% 69.5% 0.003s 1.84e-05s C 176 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)]
2.3% 71.9% 0.003s 1.84e-05s C 176 1 GpuElemwise{mul,no_inplace}
2.3% 74.2% 0.003s 1.83e-05s C 176 1 GpuCAReduce{add}{1,0}
2.3% 76.5% 0.003s 1.78e-05s C 176 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.3% 78.8% 0.003s 2.52e-06s C 1232 7 MakeVector{dtype='int64'}
2.2% 81.0% 0.003s 1.73e-05s C 176 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.2% 83.2% 0.003s 1.71e-05s C 176 1 GpuElemwise{Sub}[(0, 1)]
2.2% 85.4% 0.003s 1.71e-05s C 176 1 GpuElemwise{Add}[(0, 0)]
2.2% 87.5% 0.003s 1.69e-05s C 176 1 GpuElemwise{TrueDiv}[(0, 0)]
2.1% 89.7% 0.003s 1.67e-05s C 176 1 GpuElemwise{Tanh}[(0, 0)]
... (remaining 21 Ops account for 10.34%(0.01s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.3% 5.3% 0.007s 4.17e-05s 176 10 GpuAdvancedSubtensor1(W, readout_sample_samples)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(1,), strides=(16,)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.5% 9.9% 0.006s 3.54e-05s 176 26 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.3% 14.2% 0.006s 3.36e-05s 176 34 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
3.8% 18.0% 0.005s 3.01e-05s 176 44 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.8% 21.8% 0.005s 2.96e-05s 176 21 GpuDot22(GpuFromHost.0, state_to_gates)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
3.4% 25.2% 0.005s 2.68e-05s 176 30 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
3.3% 28.5% 0.005s 2.60e-05s 176 43 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.2% 31.8% 0.004s 2.54e-05s 176 47 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)].0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.1% 34.9% 0.004s 2.42e-05s 176 53 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
3.0% 37.9% 0.004s 2.38e-05s 176 45 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.9% 40.8% 0.004s 2.23e-05s 176 55 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.4% 43.2% 0.003s 1.91e-05s 176 73 GpuCAReduce{add}{1,0,0}(GpuElemwise{Mul}[(0, 1)].0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.4% 45.6% 0.003s 1.90e-05s 176 50 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=c
input 2: dtype=float32, shape=(1, 1, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=c
2.4% 48.0% 0.003s 1.87e-05s 176 72 GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1, 1), strides=(1, 0, 0)
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
2.4% 50.4% 0.003s 1.84e-05s 176 46 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, GpuFromHost.0, CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(1, 100), strides=(0, 1)
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.3% 52.7% 0.003s 1.84e-05s 176 42 GpuElemwise{mul,no_inplace}(GpuFromHost.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.3% 55.1% 0.003s 1.83e-05s 176 59 GpuCAReduce{add}{1,0}(GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.3% 57.4% 0.003s 1.78e-05s 176 57 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)](GpuReshape{2}.0, GpuDimShuffle{x,0}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
input 2: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
2.2% 59.6% 0.003s 1.73e-05s 176 35 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(1, 200), strides=(0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.2% 61.8% 0.003s 1.71e-05s 176 58 GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[[ 1.]]}, GpuFromHost.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
... (remaining 55 Apply instances account for 38.24%(0.05s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 18KB
CPU + GPU: 20KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(200, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] v GpuReshape{2}(GpuFromHost.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] c GpuFromHost(generator_generate_attended)
9600B [(12, 1, 200)] i GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
4800B [(12, 1, 100)] i GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)].0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
... (remaining 66 Apply account for 13555B/146355B (9.26%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
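Reading the Apply list above together with the memory profile, this function appears to compute the attention readout over the 12 attended positions: the summed energies (Composite{((i0 + i1) + i2)} at id 50) go through a Tanh, a dot with the (100, 1) match vector (id 53), a max-shifted, mask-weighted softmax (ids 55, 57, 59), and a weighted average of the attended sequence (ids 72/73). A minimal NumPy sketch of that arithmetic follows; the function and argument names are mine, only the shapes come from the profile, and the normalising division sits among the Apply nodes not listed above.

    import numpy as np

    def attention_readout(projections, v, attended, mask):
        # projections: (12, 100) summed energies, v: (100, 1) match vector,
        # attended: (12, 200) encoded sequence, mask: (12, 1) source mask.
        scores = np.tanh(projections).dot(v)              # Tanh + GpuDot22 (id 53)
        unnorm = np.exp(scores - scores.max()) * mask     # GpuCAReduce{maximum} (id 55) + Composite{exp((i0 - i1)) * i2} (id 57)
        weights = unnorm / unnorm.sum()                   # GpuCAReduce{add} (id 59); the division is in the remaining nodes
        return (weights * attended).sum(axis=0)           # GpuElemwise{Mul} (id 72) + GpuCAReduce{add} (id 73)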
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:137
Time in 176 calls to Function.__call__: 9.610200e-02s
Time in Function.fn.__call__: 9.091020e-02s (94.598%)
Time in thunks: 3.702688e-02s (38.529%)
Total compile time: 4.753222e+00s
Number of Apply nodes: 14
Theano Optimizer time: 8.387494e-02s
Theano validate time: 2.176523e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.531886e-02s
Import time 3.646135e-03s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.305s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
32.1% 32.1% 0.012s 1.69e-05s C 704 4 theano.sandbox.cuda.basic_ops.GpuElemwise
17.9% 50.0% 0.007s 1.88e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuCAReduce
12.9% 62.9% 0.005s 1.36e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuFromHost
12.9% 75.8% 0.005s 2.71e-05s C 176 1 theano.sandbox.cuda.blas.GpuGemm
12.4% 88.2% 0.005s 2.61e-05s C 176 1 theano.sandbox.cuda.blas.GpuDot22
7.8% 96.0% 0.003s 1.64e-05s C 176 1 theano.sandbox.cuda.basic_ops.HostFromGpu
4.0% 100.0% 0.001s 2.80e-06s C 528 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
12.9% 12.9% 0.005s 1.36e-05s C 352 2 GpuFromHost
12.9% 25.8% 0.005s 2.71e-05s C 176 1 GpuGemm{inplace}
12.4% 38.2% 0.005s 2.61e-05s C 176 1 GpuDot22
9.4% 47.6% 0.003s 1.98e-05s C 176 1 GpuCAReduce{maximum}{0,1}
8.7% 56.3% 0.003s 1.83e-05s C 176 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
8.5% 64.8% 0.003s 1.79e-05s C 176 1 GpuCAReduce{add}{0,1}
7.9% 72.8% 0.003s 1.67e-05s C 176 1 GpuElemwise{Add}[(0, 1)]
7.8% 80.6% 0.003s 1.64e-05s C 176 1 HostFromGpu
7.8% 88.3% 0.003s 1.64e-05s C 176 1 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
7.7% 96.0% 0.003s 1.61e-05s C 176 1 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)]
2.5% 98.5% 0.001s 2.65e-06s C 352 2 GpuDimShuffle{0,x}
1.5% 100.0% 0.001s 3.10e-06s C 176 1 GpuDimShuffle{x,0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
12.9% 12.9% 0.005s 2.71e-05s 176 4 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
12.4% 25.3% 0.005s 2.61e-05s 176 3 GpuDot22(GpuFromHost.0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
9.4% 34.7% 0.003s 1.98e-05s 176 6 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1,), strides=(0,)
8.7% 43.4% 0.003s 1.83e-05s 176 8 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}(GpuElemwise{Add}[(0, 1)].0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 44), strides=c
8.5% 51.9% 0.003s 1.79e-05s 176 9 GpuCAReduce{add}{0,1}(GpuElemwise{Composite{exp((i0 - i1))},no_inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=c
output 0: dtype=float32, shape=(1,), strides=c
7.9% 59.8% 0.003s 1.67e-05s 176 5 GpuElemwise{Add}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.8% 67.6% 0.003s 1.64e-05s 176 13 HostFromGpu(GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=c
7.8% 75.4% 0.003s 1.64e-05s 176 11 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)](GpuDimShuffle{0,x}.0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 1), strides=c
7.7% 83.1% 0.003s 1.61e-05s 176 12 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)](GpuElemwise{Add}[(0, 1)].0, GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.5% 90.6% 0.003s 1.58e-05s 176 0 GpuFromHost(generator_generate_weighted_averages)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
5.4% 96.0% 0.002s 1.15e-05s 176 1 GpuFromHost(generator_generate_states)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
1.5% 97.5% 0.001s 3.10e-06s 176 2 GpuDimShuffle{x,0}(b)
input 0: dtype=float32, shape=(44,), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
1.3% 98.7% 0.000s 2.67e-06s 176 10 GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=c
output 0: dtype=float32, shape=(1, 1), strides=c
1.3% 100.0% 0.000s 2.63e-06s 176 7 GpuDimShuffle{0,x}(GpuCAReduce{maximum}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=(0,)
output 0: dtype=float32, shape=(1, 1), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 14 Apply account for 2452B/2452B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
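The 14-node graph above (beam_search.py:137) reads as the per-step emission cost used by the beam search: project the decoder state and the weighted averages onto the 44-symbol alphabet, add the bias, and convert the logits into negative log-probabilities with a max-shifted (numerically stable) log-softmax. A NumPy sketch of the same arithmetic, with names of my own choosing and shapes taken from the profile:

    import numpy as np

    def emission_costs(states, averages, W_states, W_averages, b):
        # states: (1, 100), averages: (1, 200), W_states: (100, 44),
        # W_averages: (200, 44), b: (44,) -- shapes as printed above.
        logits = states.dot(W_states) + averages.dot(W_averages) + b  # GpuDot22 + GpuGemm + Add (ids 3, 4, 5)
        m = logits.max(axis=1, keepdims=True)                         # GpuCAReduce{maximum} (id 6)
        z = np.exp(logits - m).sum(axis=1, keepdims=True)             # Composite{exp((i0 - i1))} + GpuCAReduce{add} (ids 8, 9)
        return -(logits - (m + np.log(z)))                            # Composite{(i0 + log(i1))} and Composite{(-(i0 - i1))} (ids 11, 12)

Also visible in the Class table: the two GpuFromHost nodes plus the final HostFromGpu take roughly a fifth of the thunk time, which for a graph this small is comparable to the two matrix products themselves.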
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 call to Function.__call__: 1.907349e-05s
Time in Function.fn.__call__: 5.006790e-06s (26.250%)
Total compile time: 5.178439e+00s
Number of Apply nodes: 0
Theano Optimizer time: 5.979061e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 9.393692e-05s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.307s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
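The empty profile above only leaves the hint "try config profiling.time_thunks=1". For reference, that flag, like the allow_gc and optimizer_excluding=inplace settings the memory profiles refer to, can be set without code changes; a minimal sketch, assuming the standard THEANO_FLAGS / theano.config mechanism (the script name below is a placeholder):

    # On the command line:
    #   THEANO_FLAGS='profiling.time_thunks=1' python train.py
    # or in Python, before any function is compiled:
    import theano
    theano.config.profiling.time_thunks = True
    # theano.config.allow_gc = False   # trades the larger "allow_gc=False" peaks
    #                                  # reported above for fewer reallocations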
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 6075 calls to Function.__call__: 3.723266e-01s
Time in Function.fn.__call__: 2.196813e-01s (59.002%)
Time in thunks: 4.040527e-02s (10.852%)
Total compile time: 3.941077e+00s
Number of Apply nodes: 2
Theano Optimizer time: 7.288933e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.483917e-03s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.307s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.040s 3.33e-06s C 12150 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.040s 3.33e-06s C 12150 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
60.4% 60.4% 0.024s 4.01e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
39.6% 100.0% 0.016s 2.64e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 2 Apply account for 192B/192B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/algorithms/__init__.py:253
Time in 100 calls to Function.__call__: 8.755362e+01s
Time in Function.fn.__call__: 8.736853e+01s (99.789%)
Time in thunks: 2.631522e+01s (30.056%)
Total compile time: 2.758291e+02s
Number of Apply nodes: 3579
Theano Optimizer time: 1.544500e+02s
Theano validate time: 5.072355e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.115705e+02s
Import time 1.638190e+00s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.308s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
78.3% 78.3% 20.607s 2.94e-02s Py 700 7 theano.scan_module.scan_op.Scan
6.5% 84.8% 1.718s 2.05e-05s C 83700 837 theano.sandbox.cuda.basic_ops.GpuElemwise
3.9% 88.7% 1.028s 1.03e-02s Py 100 1 lvsr.ops.EditDistanceOp
2.5% 91.3% 0.661s 2.67e-05s C 24700 247 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.1% 93.3% 0.548s 7.40e-05s C 7400 74 theano.sandbox.cuda.blas.GpuDot22
1.4% 94.7% 0.367s 3.68e-06s C 99700 997 theano.tensor.elemwise.Elemwise
1.1% 95.8% 0.276s 1.73e-05s C 16000 160 theano.sandbox.cuda.basic_ops.HostFromGpu
0.6% 96.4% 0.164s 2.28e-05s Py 7200 48 theano.ifelse.IfElse
0.6% 97.0% 0.153s 2.74e-05s C 5600 56 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.5% 97.5% 0.134s 8.20e-06s C 16300 163 theano.sandbox.cuda.basic_ops.GpuReshape
0.5% 98.0% 0.129s 2.58e-05s C 5000 50 theano.sandbox.cuda.basic_ops.GpuAlloc
0.4% 98.4% 0.118s 3.42e-06s C 34600 346 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.2% 98.6% 0.056s 1.99e-05s C 2800 28 theano.compile.ops.DeepCopyOp
0.2% 98.8% 0.051s 3.83e-06s C 13300 133 theano.tensor.opt.MakeVector
0.2% 99.0% 0.047s 4.59e-06s C 10200 102 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.2% 0.039s 3.61e-06s C 10700 107 theano.compile.ops.Shape_i
0.1% 99.3% 0.037s 1.75e-05s C 2100 21 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.4% 0.031s 1.02e-04s Py 300 3 theano.sandbox.cuda.basic_ops.GpuSplit
0.1% 99.5% 0.030s 3.04e-06s C 9800 98 theano.tensor.basic.ScalarFromTensor
0.1% 99.6% 0.021s 5.34e-05s C 400 4 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
... (remaining 21 Classes account for 0.39%(0.10s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
33.1% 33.1% 8.707s 8.71e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
15.6% 48.7% 4.113s 2.06e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
13.0% 61.7% 3.412s 3.41e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
11.3% 73.0% 2.984s 2.98e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan}
5.3% 78.3% 1.390s 6.95e-03s Py 200 2 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
3.9% 82.2% 1.028s 1.03e-02s Py 100 1 EditDistanceOp
2.1% 84.3% 0.548s 7.40e-05s C 7400 74 GpuDot22
1.1% 85.3% 0.276s 1.73e-05s C 16000 160 HostFromGpu
1.0% 86.3% 0.262s 3.12e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.9% 87.2% 0.235s 2.12e-05s C 11100 111 GpuElemwise{add,no_inplace}
0.7% 87.9% 0.182s 2.12e-05s C 8600 86 GpuElemwise{sub,no_inplace}
0.6% 88.5% 0.152s 2.45e-05s Py 6200 39 if{gpu}
0.6% 89.1% 0.148s 2.28e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.5% 89.6% 0.143s 2.99e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.5% 90.1% 0.138s 2.16e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.5% 90.6% 0.128s 1.97e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.5% 91.1% 0.128s 1.88e-05s C 6800 68 GpuElemwise{Mul}[(0, 0)]
0.5% 91.6% 0.127s 2.15e-05s C 5900 59 GpuElemwise{Switch,no_inplace}
0.5% 92.1% 0.126s 1.95e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.5% 92.5% 0.121s 2.06e-05s C 5900 59 GpuElemwise{Composite{(i0 * (i1 ** i2))},no_inplace}
... (remaining 251 Ops account for 7.47%(1.96s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
33.1% 33.1% 8.707s 8.71e-02s 100 2437 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 2: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 3: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 5: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 6: dtype=float32, shape=(15, 10, 1), strides=(-10, 1, 0)
input 7: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 8: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 9: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 11: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 14: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 15: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 16: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 17: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 18: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 19: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 20: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 21: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 22: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 23: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 24: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 25: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 26: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=int64, shape=(), strides=c
input 35: dtype=float32, shape=(100, 200), strides=c
input 36: dtype=float32, shape=(200, 200), strides=c
input 37: dtype=float32, shape=(100, 100), strides=c
input 38: dtype=float32, shape=(200, 100), strides=c
input 39: dtype=float32, shape=(100, 100), strides=c
input 40: dtype=float32, shape=(200, 200), strides=(1, 200)
input 41: dtype=float32, shape=(200, 100), strides=(1, 200)
input 42: dtype=float32, shape=(100, 100), strides=(1, 100)
input 43: dtype=float32, shape=(100, 200), strides=(1, 100)
input 44: dtype=float32, shape=(100, 100), strides=(1, 100)
input 45: dtype=int64, shape=(2,), strides=c
input 46: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 47: dtype=int64, shape=(1,), strides=c
input 48: dtype=float32, shape=(12, 10), strides=(10, 1)
input 49: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 50: dtype=float32, shape=(100, 1), strides=(1, 0)
input 51: dtype=int8, shape=(10,), strides=c
input 52: dtype=float32, shape=(1, 100), strides=(0, 1)
input 53: dtype=float32, shape=(100, 200), strides=c
input 54: dtype=float32, shape=(200, 200), strides=c
input 55: dtype=float32, shape=(100, 100), strides=c
input 56: dtype=float32, shape=(200, 100), strides=c
input 57: dtype=float32, shape=(100, 100), strides=c
input 58: dtype=float32, shape=(200, 200), strides=(1, 200)
input 59: dtype=float32, shape=(200, 100), strides=(1, 200)
input 60: dtype=float32, shape=(100, 100), strides=(1, 100)
input 61: dtype=float32, shape=(100, 200), strides=(1, 100)
input 62: dtype=float32, shape=(100, 100), strides=(1, 100)
input 63: dtype=int64, shape=(2,), strides=c
input 64: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 65: dtype=int64, shape=(1,), strides=c
input 66: dtype=float32, shape=(12, 10), strides=(10, 1)
input 67: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 68: dtype=float32, shape=(100, 1), strides=(1, 0)
input 69: dtype=int8, shape=(10,), strides=c
input 70: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 5: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 6: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 9: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 12: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 13: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 14: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 15: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
output 16: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 17: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 18: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 19: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
13.0% 46.1% 3.412s 3.41e-02s 100 2149 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 6: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 7: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 8: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 9: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=(10, 1)
input 18: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=(1, 0)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=(10, 1)
input 29: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
11.3% 57.4% 2.984s 2.98e-02s 100 1850 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, G
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 10), strides=(10, 1)
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(10,), strides=c
input 23: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 0: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 10), strides=c
7.8% 65.2% 2.057s 2.06e-02s 100 2632 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtenso
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
7.8% 73.0% 2.056s 2.06e-02s 100 2631 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
3.9% 76.9% 1.028s 1.03e-02s 100 2005 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask11)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
2.6% 79.6% 0.696s 6.96e-03s 100 1642 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
2.6% 82.2% 0.694s 6.94e-03s 100 1652 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
0.0% 82.3% 0.013s 1.31e-04s 100 2467 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 82.3% 0.013s 1.31e-04s 100 2463 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 82.4% 0.013s 1.28e-04s 100 2462 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.4% 0.012s 1.25e-04s 100 2468 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.5% 0.012s 1.24e-04s 100 2547 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(1, 100)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.5% 0.012s 1.19e-04s 100 1117 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(120, 200), strides=(200, 1)
0.0% 82.5% 0.012s 1.16e-04s 100 2486 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(1, 100)
output 0: dtype=float32, shape=(120, 200), strides=(200, 1)
0.0% 82.6% 0.012s 1.16e-04s 100 2540 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(1, 100)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.6% 0.012s 1.16e-04s 100 2588 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
0.0% 82.7% 0.012s 1.15e-04s 100 1143 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(120, 200), strides=(200, 1)
0.0% 82.7% 0.011s 1.10e-04s 100 2590 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
0.0% 82.8% 0.011s 1.09e-04s 100 2664 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=(120, 1)
input 1: dtype=float32, shape=(120, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
... (remaining 3559 Apply instances account for 17.24%(4.54s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 58KB (62KB)
GPU: 3739KB (5373KB)
CPU + GPU: 3797KB (5435KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 57KB (62KB)
GPU: 5605KB (6697KB)
CPU + GPU: 5662KB (6758KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 114KB
GPU: 17091KB
CPU + GPU: 17205KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
750480B [(1, 10, 100), (1, 10, 200), (2, 92160), (15, 10)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
391680B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200)] i i i i i forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0)
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6)) - i3)}}.0, max_attended_length, generator_generate_batch_size, Elemwise{add,no_inplace}.0)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6)) - i3)}}.0, Elemwise{sub,no_inplace}.0, Elemwise{switch,no_inplace}.0, Elemwise{add,no_inplace}.0)
160000B [(200, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
160000B [(200, 200)] v Assert{msg='Theano Assert failed!'}(GpuDot22.0, Elemwise{eq,no_inplace}.0, Elemwise{eq,no_inplace}.0)
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] i GpuElemwise{Sub}[(0, 0)](W, GpuElemwise{Switch,no_inplace}.0)
160000B [(200, 200)] c GpuElemwise{Switch,no_inplace}(GpuElemwise{Composite{Cast{float32}(GT((IsNan(i0) + IsInf(i0)), i1))}}[(0, 0)].0, W, GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0)
160000B [(200, 200)] i GpuElemwise{Mul}[(0, 0)](Assert{msg='Theano Assert failed!'}.0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] i GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)](GpuDimShuffle{x,x}.0, GpuElemwise{Mul}[(0, 0)].0, GpuDimShuffle{x,x}.0, variance)
... (remaining 3559 Apply account for 46215415B/54155015B (85.34%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
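Away from the Scan nodes, the parameter updates are recognisable in the Ops table and the memory list above: Composite{((i0 * i1) + (i2 * i3))} and Composite{((i0 * sqr(i1)) + (i2 * i3))} update the running mean and variance, Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))} applies a bias correction, and Composite{((i0 * i1) / (sqrt(i2) + i3))} forms the step, which looks like a bias-corrected Adam rule with an IsNan/IsInf Switch guard around it. A rough NumPy reconstruction that leaves the guard out; every name and default value here is a guess of mine rather than something read off the graph:

    import numpy as np

    def adam_step(W, grad, mean, variance, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
        mean = beta1 * mean + (1.0 - beta1) * grad                   # Composite{((i0 * i1) + (i2 * i3))}
        variance = beta2 * variance + (1.0 - beta2) * grad ** 2      # Composite{((i0 * sqr(i1)) + (i2 * i3))}
        lr_t = lr * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)   # bias-correction composite
        W = W - lr_t * mean / (np.sqrt(variance) + eps)              # step composite, then GpuElemwise{Sub}[(0, 0)](W, ...)
        return W, mean, variance

Either way, the Scan ops dominate this function: the forward and gradient scans account for about 78% of thunk time and EditDistanceOp for another 3.9%, so the elementwise update work is a comparatively thin slice.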
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 6.864700e-01s
Total time spent in calling the VM 6.679530e-01s (97.303%)
Total overhead (computing slices..) 1.851702e-02s (2.697%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.4% 53.4% 0.172s 3.59e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm
41.7% 95.1% 0.134s 1.87e-05s C 7200 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.9% 100.0% 0.016s 3.30e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.4% 53.4% 0.172s 3.59e-05s C 4800 4 GpuGemm{no_inplace}
15.4% 68.8% 0.050s 2.07e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
13.5% 82.3% 0.043s 1.81e-05s C 2400 2 GpuElemwise{mul,no_inplace}
12.8% 95.1% 0.041s 1.72e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
2.6% 97.7% 0.008s 3.48e-06s C 2400 2 GpuSubtensor{::, :int64:}
2.3% 100.0% 0.008s 3.13e-06s C 2400 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
14.5% 14.5% 0.047s 3.90e-05s 1200 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.8% 28.3% 0.045s 3.72e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
12.6% 40.9% 0.041s 3.38e-05s 1200 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
12.5% 53.4% 0.040s 3.37e-05s 1200 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.8% 61.2% 0.025s 2.09e-05s 1200 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.6% 68.8% 0.024s 2.04e-05s 1200 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.9% 75.7% 0.022s 1.84e-05s 1200 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.6% 82.3% 0.021s 1.78e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.5% 88.7% 0.021s 1.74e-05s 1200 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
6.4% 95.1% 0.021s 1.71e-05s 1200 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.3% 96.4% 0.004s 3.56e-06s 1200 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 97.7% 0.004s 3.40e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 98.9% 0.004s 3.22e-06s 1200 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 100.0% 0.004s 3.04e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
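The inner-loop ops above look like a single gated-recurrent (GRU-style) step: GpuGemm adds the recurrent term to the precomputed gate and candidate inputs, GpuElemwise{ScalarSigmoid} plus the two GpuSubtensor nodes split the 200-wide gate activation at Constant{100} into update and reset halves, and the Composite elemwise applies the masked state update (the two (10, 1) column inputs look like a sequence mask and its complement). A minimal NumPy sketch of such a step, assuming the usual GatedRecurrent layout; the helper and argument names below are illustrative, not taken from the gist:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_recurrent_step(h_prev, gate_inputs, inputs, W_gates, W_state, mask):
        # gate_inputs: (batch, 2*dim) precomputed input term for the gates
        # inputs:      (batch, dim)   precomputed input term for the candidate
        # mask:        (batch, 1)     1 where the sequence is still active
        dim = h_prev.shape[1]
        gates = sigmoid(gate_inputs + h_prev.dot(W_gates))            # GpuGemm + ScalarSigmoid
        update, reset = gates[:, :dim], gates[:, dim:]                # the two GpuSubtensor nodes
        candidate = np.tanh(inputs + (h_prev * reset).dot(W_state))   # GpuElemwise{mul} + GpuGemm + tanh
        h_new = candidate * update + h_prev * (1.0 - update)          # the Composite elemwise
        return mask * h_new + (1.0 - mask) * h_prev                   # masking of finished sequences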
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 39KB
CPU + GPU: 39KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
... (remaining 0 Apply account for 0B/72000B ((0.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
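The alternative memory figures above (with optimizer_excluding=inplace and with allow_gc=False) are estimates computed by the profiler itself; reports of this kind are produced by enabling Theano's profiler before compiling. A minimal sketch using standard Theano flags (the script name is illustrative):

    # On the command line:
    #   THEANO_FLAGS=profile=True,profile_memory=True python run.py
    # or, equivalently, in Python before any function is compiled:
    import theano
    theano.config.profile = True
    theano.config.profile_memory = True
    # config.allow_gc and config.optimizer_excluding can then be varied to check
    # the profiler's estimates against actual runs.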
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 6.850390e-01s
Total time spent in calling the VM 6.670289e-01s (97.371%)
Total overhead (computing slices..) 1.801014e-02s (2.629%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.5% 53.5% 0.172s 3.59e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm
41.6% 95.1% 0.134s 1.86e-05s C 7200 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.9% 100.0% 0.016s 3.28e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.5% 53.5% 0.172s 3.59e-05s C 4800 4 GpuGemm{no_inplace}
15.3% 68.8% 0.049s 2.05e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
13.5% 82.2% 0.043s 1.81e-05s C 2400 2 GpuElemwise{mul,no_inplace}
12.9% 95.1% 0.041s 1.73e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
2.6% 97.7% 0.008s 3.48e-06s C 2400 2 GpuSubtensor{::, :int64:}
2.3% 100.0% 0.007s 3.09e-06s C 2400 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
14.5% 14.5% 0.047s 3.90e-05s 1200 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.8% 28.3% 0.045s 3.71e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
12.6% 40.9% 0.041s 3.38e-05s 1200 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
12.6% 53.5% 0.041s 3.38e-05s 1200 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.7% 61.2% 0.025s 2.07e-05s 1200 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.6% 68.8% 0.024s 2.03e-05s 1200 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.8% 75.6% 0.022s 1.83e-05s 1200 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.6% 82.2% 0.021s 1.79e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.4% 88.7% 0.021s 1.73e-05s 1200 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
6.4% 95.1% 0.021s 1.73e-05s 1200 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.3% 96.4% 0.004s 3.49e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 97.7% 0.004s 3.47e-06s 1200 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 98.9% 0.004s 3.09e-06s 1200 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 100.0% 0.004s 3.08e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 39KB
CPU + GPU: 39KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
... (remaining 0 Apply account for 0B/72000B ((0.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( generator_generate_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 2.965537e+00s
Total time spent in calling the VM 2.812608e+00s (94.843%)
Total overhead (computing slices..) 1.529298e-01s (5.157%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
29.5% 29.5% 0.372s 1.91e-05s C 19500 13 theano.sandbox.cuda.basic_ops.GpuElemwise
17.2% 46.7% 0.217s 2.89e-05s C 7500 5 theano.sandbox.cuda.blas.GpuGemm
15.2% 61.9% 0.192s 2.56e-05s C 7500 5 theano.sandbox.cuda.blas.GpuDot22
11.6% 73.4% 0.146s 1.95e-05s C 7500 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
5.2% 78.6% 0.065s 4.35e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
5.2% 83.8% 0.065s 4.35e-05s C 1500 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
4.4% 88.2% 0.056s 1.87e-05s C 3000 2 theano.sandbox.cuda.basic_ops.HostFromGpu
2.2% 90.4% 0.028s 1.84e-05s C 1500 1 theano.tensor.basic.MaxAndArgmax
1.9% 92.3% 0.024s 1.59e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuFromHost
1.8% 94.1% 0.023s 2.56e-06s C 9000 6 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.4% 95.5% 0.018s 2.38e-06s C 7500 5 theano.compile.ops.Shape_i
1.2% 96.8% 0.015s 3.41e-06s C 4500 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.9% 97.7% 0.012s 1.97e-06s C 6000 4 theano.tensor.opt.MakeVector
0.8% 98.5% 0.010s 3.48e-06s C 3000 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.6% 99.1% 0.007s 2.32e-06s C 3000 2 theano.tensor.elemwise.Elemwise
0.5% 99.6% 0.007s 4.48e-06s C 1500 1 theano.sandbox.multinomial.MultinomialFromUniform
0.4% 100.0% 0.005s 3.31e-06s C 1500 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.2% 17.2% 0.217s 2.89e-05s C 7500 5 GpuGemm{inplace}
15.2% 32.4% 0.192s 2.56e-05s C 7500 5 GpuDot22
5.2% 37.6% 0.065s 4.35e-05s C 1500 1 GpuAdvancedSubtensor1
5.2% 42.7% 0.065s 4.35e-05s C 1500 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
5.0% 47.7% 0.063s 2.10e-05s C 3000 2 GpuElemwise{mul,no_inplace}
4.4% 52.2% 0.056s 1.87e-05s C 3000 2 HostFromGpu
2.8% 55.0% 0.036s 2.37e-05s C 1500 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
2.6% 57.6% 0.033s 2.17e-05s C 1500 1 GpuCAReduce{add}{1,0,0}
2.6% 60.1% 0.032s 2.15e-05s C 1500 1 GpuCAReduce{maximum}{1,0}
2.5% 62.6% 0.031s 2.09e-05s C 1500 1 GpuElemwise{add,no_inplace}
2.3% 64.9% 0.029s 1.91e-05s C 1500 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.2% 67.1% 0.028s 1.88e-05s C 1500 1 GpuCAReduce{maximum}{0,1}
2.2% 69.3% 0.028s 1.84e-05s C 1500 1 MaxAndArgmax
2.2% 71.5% 0.027s 1.83e-05s C 1500 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.2% 73.6% 0.027s 1.81e-05s C 1500 1 GpuElemwise{Add}[(0, 1)]
2.1% 75.8% 0.027s 1.80e-05s C 1500 1 GpuElemwise{Tanh}[(0, 0)]
2.1% 77.9% 0.027s 1.80e-05s C 1500 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
2.1% 80.0% 0.027s 1.78e-05s C 1500 1 GpuElemwise{TrueDiv}[(0, 0)]
2.1% 82.1% 0.027s 1.77e-05s C 1500 1 GpuCAReduce{add}{1,0}
2.1% 84.2% 0.027s 1.77e-05s C 1500 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)]
... (remaining 20 Ops account for 15.75%(0.20s) of the runtime)
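The GPU_mrg_uniform, MultinomialFromUniform, MaxAndArgmax and GpuAdvancedSubtensor1(W_copy[cuda], argmax) nodes are the pattern Theano typically emits when a scan step samples the next output from a categorical distribution and then looks up its embedding. A minimal sketch of that pattern; variable names and sizes are illustrative, chosen to match the (45, 100) embedding table above, and the real generator may differ in details:

    import numpy
    import theano
    import theano.tensor as tensor
    from theano.sandbox.rng_mrg import MRG_RandomStreams

    rng = MRG_RandomStreams(seed=2016)
    embeddings = theano.shared(
        numpy.zeros((45, 100), dtype=theano.config.floatX))   # plays the role of W_copy
    probs = tensor.matrix('probs')            # (batch, 45), each row a categorical distribution
    one_hot = rng.multinomial(pvals=probs)    # GPU_mrg_uniform + MultinomialFromUniform
    sample = tensor.argmax(one_hot, axis=1)   # MaxAndArgmax
    embedded = embeddings[sample]             # AdvancedSubtensor1 (GpuAdvancedSubtensor1 on GPU)
    f = theano.function([probs], [sample, embedded])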
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.2% 5.2% 0.065s 4.35e-05s 1500 29 GpuAdvancedSubtensor1(W_copy[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(10,), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.2% 10.3% 0.065s 4.35e-05s 1500 13 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(10,), strides=c
4.2% 14.6% 0.053s 3.55e-05s 1500 10 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.6% 18.2% 0.046s 3.06e-05s 1500 38 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.4% 21.6% 0.043s 2.90e-05s 1500 5 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.3% 25.0% 0.042s 2.79e-05s 1500 8 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
3.2% 28.1% 0.040s 2.67e-05s 1500 32 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.0% 31.1% 0.038s 2.52e-05s 1500 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.9% 34.1% 0.037s 2.47e-05s 1500 37 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.9% 37.0% 0.037s 2.46e-05s 1500 41 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.9% 39.9% 0.037s 2.44e-05s 1500 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=c
2.8% 42.7% 0.036s 2.39e-05s 1500 39 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.8% 45.6% 0.036s 2.37e-05s 1500 40 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.8% 48.3% 0.035s 2.34e-05s 1500 56 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
2.6% 50.9% 0.033s 2.17e-05s 1500 57 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.6% 53.5% 0.032s 2.15e-05s 1500 48 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=float32, shape=(10,), strides=c
2.5% 56.0% 0.031s 2.09e-05s 1500 43 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
2.4% 58.4% 0.030s 2.01e-05s 1500 25 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.3% 60.6% 0.029s 1.91e-05s 1500 33 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.2% 62.9% 0.028s 1.88e-05s 1500 18 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0)
input 0: dtype=float32, shape=(10, 44), strides=c
output 0: dtype=float32, shape=(10,), strides=c
... (remaining 38 Apply instances account for 37.13%(0.47s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 5KB (5KB)
GPU: 465KB (465KB)
CPU + GPU: 471KB (471KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 5KB (5KB)
GPU: 465KB (465KB)
CPU + GPU: 471KB (471KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 5KB
GPU: 540KB
CPU + GPU: 545KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
4000B [(10, 100)] v GpuReshape{2}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
4000B [(10, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax)
4000B [(10, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
4000B [(10, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
4000B [(10, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
... (remaining 38 Apply account for 21274B/709954B ((3.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 3.388380e+00s
Total time spent in calling the VM 3.311884e+00s (97.742%)
Total overhead (computing slices..) 7.649612e-02s (2.258%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
37.8% 37.8% 0.518s 1.92e-05s C 27000 18 theano.sandbox.cuda.basic_ops.GpuElemwise
22.4% 60.2% 0.307s 2.56e-05s C 12000 8 theano.sandbox.cuda.blas.GpuDot22
14.3% 74.4% 0.196s 3.26e-05s C 6000 4 theano.sandbox.cuda.blas.GpuGemm
12.6% 87.0% 0.172s 1.92e-05s C 9000 6 theano.sandbox.cuda.basic_ops.GpuCAReduce
3.5% 90.5% 0.047s 1.58e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuFromHost
2.5% 93.0% 0.035s 2.56e-06s C 13500 9 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.6% 94.5% 0.021s 2.37e-06s C 9000 6 theano.compile.ops.Shape_i
1.5% 96.0% 0.020s 3.37e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.4% 97.4% 0.019s 3.24e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuReshape
1.0% 98.4% 0.014s 2.29e-06s C 6000 4 theano.tensor.elemwise.Elemwise
0.8% 99.3% 0.012s 1.94e-06s C 6000 4 theano.tensor.opt.MakeVector
0.7% 100.0% 0.010s 3.30e-06s C 3000 2 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.4% 22.4% 0.307s 2.56e-05s C 12000 8 GpuDot22
14.3% 36.7% 0.196s 3.26e-05s C 6000 4 GpuGemm{inplace}
9.1% 45.8% 0.125s 2.09e-05s C 6000 4 GpuElemwise{mul,no_inplace}
4.8% 50.6% 0.065s 2.18e-05s C 3000 2 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}
4.7% 55.3% 0.065s 2.17e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
4.5% 59.8% 0.062s 2.07e-05s C 3000 2 GpuElemwise{add,no_inplace}
4.0% 63.8% 0.055s 1.82e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
3.9% 67.7% 0.054s 1.79e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
3.9% 71.6% 0.053s 1.78e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
3.9% 75.5% 0.053s 1.77e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
3.9% 79.4% 0.053s 1.77e-05s C 3000 2 GpuElemwise{Tanh}[(0, 0)]
3.9% 83.2% 0.053s 1.76e-05s C 3000 2 GpuCAReduce{add}{1,0}
3.8% 87.0% 0.052s 1.72e-05s C 3000 2 GpuElemwise{Add}[(0, 0)]
3.5% 90.5% 0.047s 1.58e-05s C 3000 2 GpuFromHost
1.4% 91.9% 0.019s 3.24e-06s C 6000 4 GpuReshape{2}
0.8% 92.7% 0.012s 1.94e-06s C 6000 4 MakeVector{dtype='int64'}
0.8% 93.6% 0.012s 2.56e-06s C 4500 3 GpuDimShuffle{x,0}
0.8% 94.3% 0.011s 3.59e-06s C 3000 2 GpuSubtensor{::, :int64:}
0.7% 95.0% 0.009s 3.15e-06s C 3000 2 GpuSubtensor{::, int64::}
0.7% 95.7% 0.009s 3.02e-06s C 3000 2 Shape_i{1}
... (remaining 10 Ops account for 4.30%(0.06s) of the runtime)
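The GpuCAReduce{maximum}, GpuElemwise{Composite{exp((i0 - i1))}}, GpuCAReduce{add} and TrueDiv nodes, followed by the GpuElemwise{mul} with the attended sequence and GpuCAReduce{add}{1,0,0}, are a numerically stable softmax over the attention energies plus the weighted average it produces. A minimal NumPy sketch, assuming the shapes seen in this profile (time=12, batch=10, attended dim=200); the function name is illustrative:

    import numpy as np

    def attention_read(energies, attended):
        # energies: (time, batch); attended: (time, batch, dim)
        m = energies.max(axis=0)                        # GpuCAReduce{maximum}{1,0}
        w = np.exp(energies - m)                        # GpuElemwise{exp((i0 - i1))}
        w = w / w.sum(axis=0)                           # GpuCAReduce{add}{1,0} + TrueDiv
        return (w[:, :, None] * attended).sum(axis=0)   # GpuElemwise{mul} + GpuCAReduce{add}{1,0,0}

    weighted_averages = attention_read(np.random.rand(12, 10), np.random.rand(12, 10, 200))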
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
3.9% 3.9% 0.053s 3.54e-05s 1500 11 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
3.9% 7.7% 0.053s 3.54e-05s 1500 14 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
3.3% 11.0% 0.045s 2.99e-05s 1500 32 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.3% 14.3% 0.045s 2.99e-05s 1500 33 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.2% 17.5% 0.044s 2.95e-05s 1500 3 GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
3.1% 20.6% 0.043s 2.87e-05s 1500 8 GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.7% 23.3% 0.037s 2.46e-05s 1500 31 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 26.0% 0.037s 2.45e-05s 1500 30 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 28.7% 0.037s 2.44e-05s 1500 47 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=(1, 0)
2.7% 31.4% 0.037s 2.44e-05s 1500 36 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 34.0% 0.036s 2.43e-05s 1500 37 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 36.7% 0.036s 2.43e-05s 1500 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=(1, 0)
2.6% 39.2% 0.035s 2.35e-05s 1500 69 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
2.5% 41.8% 0.035s 2.31e-05s 1500 65 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace1[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
2.4% 44.2% 0.033s 2.21e-05s 1500 50 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=(10, 1)
output 0: dtype=float32, shape=(10,), strides=(1,)
2.4% 46.6% 0.033s 2.20e-05s 1500 34 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(10, 100), strides=(200, 1)
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 49.0% 0.032s 2.16e-05s 1500 35 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace0[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(10, 100), strides=(200, 1)
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.3% 51.3% 0.032s 2.13e-05s 1500 51 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=(10, 1)
output 0: dtype=float32, shape=(10,), strides=(1,)
2.3% 53.6% 0.031s 2.08e-05s 1500 40 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
2.2% 55.8% 0.031s 2.06e-05s 1500 41 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
... (remaining 51 Apply instances account for 44.20%(0.61s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 118KB (118KB)
CPU + GPU: 118KB (118KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 118KB (149KB)
CPU + GPU: 118KB (149KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 345KB
CPU + GPU: 345KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace1[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace1[cuda], GpuGemm{inplace}.0)
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace0[cuda], GpuGemm{inplace}.0)
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda])
4000B [(10, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
... (remaining 51 Apply account for 53988B/613988B ((8.79%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 8.660982e+00s
Total time spent in calling the VM 8.414116e+00s (97.150%)
Total overhead (computing slices..) 2.468655e-01s (2.850%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.6% 45.6% 1.678s 1.86e-05s C 90000 60 theano.sandbox.cuda.basic_ops.GpuElemwise
17.5% 63.1% 0.643s 2.68e-05s C 24000 16 theano.sandbox.cuda.blas.GpuDot22
13.0% 76.1% 0.480s 3.20e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm
9.5% 85.7% 0.351s 1.95e-05s C 18000 12 theano.sandbox.cuda.basic_ops.GpuCAReduce
3.0% 88.7% 0.111s 1.86e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
2.4% 91.1% 0.088s 1.47e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuFromHost
2.2% 93.3% 0.081s 2.47e-06s C 33000 22 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.4% 94.8% 0.053s 1.77e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.4% 96.2% 0.052s 3.47e-06s C 15000 10 theano.sandbox.cuda.basic_ops.GpuReshape
1.1% 97.3% 0.040s 2.25e-06s C 18000 12 theano.compile.ops.Shape_i
0.8% 98.1% 0.030s 2.52e-06s C 12000 8 theano.tensor.elemwise.Elemwise
0.7% 98.8% 0.025s 2.12e-06s C 12000 8 theano.tensor.opt.MakeVector
0.6% 99.4% 0.023s 3.89e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.6% 100.0% 0.021s 3.58e-06s C 6000 4 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.5% 17.5% 0.643s 2.68e-05s C 24000 16 GpuDot22
10.2% 27.7% 0.374s 3.12e-05s C 12000 8 GpuGemm{inplace}
7.7% 35.4% 0.285s 1.90e-05s C 15000 10 GpuElemwise{mul,no_inplace}
4.6% 40.0% 0.168s 1.86e-05s C 9000 6 GpuElemwise{add,no_inplace}
4.2% 44.2% 0.156s 1.73e-05s C 9000 6 GpuCAReduce{add}{1,0}
3.7% 47.9% 0.136s 1.81e-05s C 7500 5 GpuElemwise{Add}[(0, 1)]
3.5% 51.4% 0.128s 1.71e-05s C 7500 5 GpuElemwise{Add}[(0, 0)]
2.9% 54.3% 0.106s 3.53e-05s C 3000 2 GpuGemm{no_inplace}
2.4% 56.7% 0.088s 1.47e-05s C 6000 4 GpuFromHost
2.3% 58.9% 0.084s 2.81e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
1.9% 60.9% 0.071s 2.36e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
1.9% 62.8% 0.070s 2.33e-05s C 3000 2 GpuCAReduce{add}{0,0,1}
1.7% 64.5% 0.063s 2.09e-05s C 3000 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
1.7% 66.2% 0.062s 2.06e-05s C 3000 2 GpuElemwise{Composite{((((i0 / i1) + i2) * i3) * i4)}}[(0, 0)]
1.6% 67.8% 0.060s 1.99e-05s C 3000 2 GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)]
1.6% 69.3% 0.057s 1.91e-05s C 3000 2 GpuIncSubtensor{InplaceInc;::, int64::}
1.5% 70.9% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{((-(i0 * i1)) / i2)},no_inplace}
1.5% 72.4% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
1.5% 73.9% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Composite{(i0 + (i1 * i2 * i3))}}[(0, 0)]
1.5% 75.4% 0.054s 1.81e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
... (remaining 30 Ops account for 24.59%(0.90s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
1.5% 1.5% 0.055s 3.65e-05s 1500 26 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.5% 3.0% 0.054s 3.61e-05s 1500 35 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.5% 4.4% 0.054s 3.57e-05s 1500 146 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states1[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.4% 5.9% 0.053s 3.54e-05s 1500 166 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=c
1.4% 7.3% 0.053s 3.53e-05s 1500 168 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=c
1.4% 8.7% 0.052s 3.48e-05s 1500 147 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states0[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 10.0% 0.047s 3.15e-05s 1500 80 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 11.3% 0.047s 3.15e-05s 1500 82 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 12.5% 0.046s 3.04e-05s 1500 167 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 13.8% 0.045s 3.02e-05s 1500 169 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 15.0% 0.044s 2.96e-05s 1500 2 GpuDot22(transition_apply_states_replace1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.2% 16.1% 0.043s 2.88e-05s 1500 15 GpuDot22(transition_apply_states_replace0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.2% 17.3% 0.043s 2.86e-05s 1500 116 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
input 0: dtype=float32, shape=(1, 10, 200), strides=c
input 1: dtype=float32, shape=(12, 10, 1), strides=c
input 2: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.1% 18.4% 0.042s 2.77e-05s 1500 117 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
input 0: dtype=float32, shape=(1, 10, 200), strides=c
input 1: dtype=float32, shape=(12, 10, 1), strides=c
input 2: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.1% 19.5% 0.041s 2.76e-05s 1500 133 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=c
input 1: dtype=float32, shape=(120, 1), strides=c
output 0: dtype=float32, shape=(100, 1), strides=c
1.1% 20.7% 0.041s 2.75e-05s 1500 131 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=c
input 1: dtype=float32, shape=(120, 1), strides=c
output 0: dtype=float32, shape=(100, 1), strides=c
1.1% 21.8% 0.041s 2.71e-05s 1500 21 GpuDot22(transform_states_apply_input__replace0[cuda], W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 22.9% 0.041s 2.70e-05s 1500 8 GpuDot22(transform_states_apply_input__replace1[cuda], W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 24.0% 0.040s 2.67e-05s 1500 172 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.1% 25.1% 0.040s 2.67e-05s 1500 170 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
... (remaining 156 Apply instances account for 74.95%(2.76s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 275KB (376KB)
CPU + GPU: 275KB (377KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 299KB (377KB)
CPU + GPU: 299KB (378KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 890KB
CPU + GPU: 890KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuElemwise{tanh,no_inplace}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] c GpuElemwise{tanh,no_inplace}(GpuReshape{2}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuElemwise{tanh,no_inplace}.0)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)](GpuReshape{3}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{add,no_inplace}.0)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] c GpuElemwise{tanh,no_inplace}(GpuReshape{2}.0)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)](GpuReshape{3}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{add,no_inplace}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
... (remaining 156 Apply account for 346392B/1498392B ((23.12%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
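
The memory profile above also reports what the peak would be under two alternative settings (optimizer_excluding=inplace and allow_gc=False). A minimal sketch of how those variants could be tried when re-running the profiling, assuming the reports in this gist come from Theano's built-in profiler driven by Theano flags (the actual training entry point is not part of this gist, and "run_training.py" below is a hypothetical name):

    import os

    # Flags must be set before the first "import theano".
    # profile/profile_memory produce reports like the ones in this gist;
    # the commented-out flags correspond to the alternative rows of the
    # memory profile above.
    os.environ['THEANO_FLAGS'] = ','.join([
        'profile=True',
        'profile_memory=True',
        # 'optimizer_excluding=inplace',  # "optimizer_excluding=inplace" row
        # 'allow_gc=False',               # "allow_gc=False" row (~890KB GPU peak here)
    ])

    import theano

The same flags can equivalently be passed on the command line, e.g. THEANO_FLAGS='profile=True,allow_gc=False' python run_training.py.
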
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.039070e+00s
Total time spent in calling the VM 1.921504e+00s (94.234%)
Total overhead (computing slices..) 1.175666e-01s (5.766%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
47.2% 47.2% 0.424s 1.77e-05s C 24000 20 theano.sandbox.cuda.basic_ops.GpuElemwise
27.9% 75.1% 0.251s 3.48e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
9.8% 84.8% 0.088s 1.83e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
6.7% 91.6% 0.060s 2.52e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
4.6% 96.2% 0.041s 1.73e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
2.0% 98.1% 0.018s 3.69e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.2% 99.4% 0.011s 2.34e-06s C 4800 4 theano.compile.ops.Shape_i
0.6% 100.0% 0.005s 2.25e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
19.8% 19.8% 0.178s 3.71e-05s C 4800 4 GpuGemm{no_inplace}
14.5% 34.3% 0.130s 1.81e-05s C 7200 6 GpuElemwise{mul,no_inplace}
8.1% 42.4% 0.073s 3.03e-05s C 2400 2 GpuGemm{inplace}
6.7% 49.1% 0.060s 2.52e-05s C 2400 2 GpuDot22
5.4% 54.5% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
5.0% 59.5% 0.045s 1.89e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
4.9% 64.5% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.7% 69.2% 0.043s 1.77e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
4.6% 73.8% 0.042s 1.74e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.6% 78.4% 0.041s 1.73e-05s C 2400 2 GpuAlloc{memset_0=True}
4.5% 83.0% 0.041s 1.70e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
4.5% 87.5% 0.041s 1.69e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
4.4% 91.9% 0.040s 1.65e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
4.3% 96.2% 0.038s 1.60e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
1.0% 97.2% 0.009s 3.79e-06s C 2400 2 GpuSubtensor{::, int64::}
1.0% 98.1% 0.009s 3.60e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.6% 98.8% 0.006s 2.41e-06s C 2400 2 Shape_i{1}
0.6% 99.4% 0.005s 2.27e-06s C 2400 2 Shape_i{0}
0.6% 100.0% 0.005s 2.25e-06s C 2400 2 GpuDimShuffle{1,0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.4% 5.4% 0.049s 4.07e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.2% 10.6% 0.046s 3.86e-05s 1200 6 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
4.6% 15.2% 0.041s 3.45e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.6% 19.8% 0.041s 3.45e-05s 1200 18 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.0% 23.9% 0.036s 3.03e-05s 1200 40 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.0% 27.9% 0.036s 3.03e-05s 1200 41 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.4% 31.3% 0.030s 2.52e-05s 1200 28 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.3% 34.6% 0.030s 2.51e-05s 1200 29 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 37.3% 0.024s 2.02e-05s 1200 42 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 40.0% 0.024s 2.00e-05s 1200 43 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.6% 42.5% 0.023s 1.91e-05s 1200 34 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.5% 45.0% 0.022s 1.87e-05s 1200 24 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.5% 47.5% 0.022s 1.86e-05s 1200 35 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.5% 50.0% 0.022s 1.85e-05s 1200 16 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 52.4% 0.022s 1.83e-05s 1200 25 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 54.9% 0.022s 1.83e-05s 1200 17 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 57.3% 0.022s 1.81e-05s 1200 3 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 59.7% 0.022s 1.79e-05s 1200 31 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 62.1% 0.021s 1.79e-05s 1200 7 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 64.5% 0.021s 1.79e-05s 1200 30 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
... (remaining 24 Apply instances account for 35.55%(0.32s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(100, 10)] v GpuDimShuffle{1,0}(GpuElemwise{mul,no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
... (remaining 24 Apply nodes account for 80032B/208032B (38.47%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
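
Most of the time inside this scan goes to GpuElemwise and GpuGemm on very small per-step operands (batch 10, hidden size 100). A rough back-of-envelope from the top Apply entry above, assuming the conventional 2*m*k*n flop count for a gemm and ignoring the accumulate term, gives a sense of the achieved throughput:

    # Rough estimate from the profile numbers above, not an independent measurement.
    m, k, n = 10, 100, 200                 # gemm operands: (10, 100) x (100, 200)
    time_per_call = 4.07e-5                # seconds per call for Apply id 2
    flops = 2.0 * m * k * n                # multiply-adds for C = A.dot(B) + C
    print('Mflop per call:', flops / 1e6)            # ~0.4
    print('Gflop/s:', flops / time_per_call / 1e9)   # ~9.8

At roughly 10 Gflop/s these calls are nowhere near GPU peak, which suggests the per-step cost is dominated by kernel-launch and BLAS overhead on tiny matrices rather than by arithmetic; a larger batch size would be expected to improve the ratio.
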
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.040347e+00s
Total time spent in calling the VM 1.923158e+00s (94.256%)
Total overhead (computing slices..) 1.171889e-01s (5.744%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
47.2% 47.2% 0.424s 1.77e-05s C 24000 20 theano.sandbox.cuda.basic_ops.GpuElemwise
27.9% 75.1% 0.251s 3.49e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
9.8% 84.9% 0.088s 1.83e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
6.7% 91.6% 0.060s 2.51e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
4.6% 96.2% 0.041s 1.73e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
2.0% 98.2% 0.018s 3.68e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.2% 99.4% 0.011s 2.29e-06s C 4800 4 theano.compile.ops.Shape_i
0.6% 100.0% 0.005s 2.28e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
19.8% 19.8% 0.178s 3.71e-05s C 4800 4 GpuGemm{no_inplace}
14.5% 34.3% 0.130s 1.81e-05s C 7200 6 GpuElemwise{mul,no_inplace}
8.1% 42.4% 0.073s 3.04e-05s C 2400 2 GpuGemm{inplace}
6.7% 49.1% 0.060s 2.51e-05s C 2400 2 GpuDot22
5.4% 54.5% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
5.0% 59.5% 0.045s 1.89e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
5.0% 64.5% 0.045s 1.86e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.7% 69.2% 0.042s 1.77e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
4.6% 73.9% 0.042s 1.73e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.6% 78.5% 0.041s 1.73e-05s C 2400 2 GpuAlloc{memset_0=True}
4.5% 83.0% 0.041s 1.70e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
4.5% 87.5% 0.040s 1.68e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
4.4% 91.9% 0.040s 1.66e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
4.3% 96.2% 0.039s 1.61e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
1.0% 97.2% 0.009s 3.78e-06s C 2400 2 GpuSubtensor{::, int64::}
1.0% 98.2% 0.009s 3.59e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.6% 98.8% 0.006s 2.37e-06s C 2400 2 Shape_i{1}
0.6% 99.4% 0.005s 2.28e-06s C 2400 2 GpuDimShuffle{1,0}
0.6% 100.0% 0.005s 2.21e-06s C 2400 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.4% 5.4% 0.049s 4.08e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
5.2% 10.6% 0.046s 3.86e-05s 1200 6 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.6% 15.2% 0.041s 3.45e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.6% 19.8% 0.041s 3.45e-05s 1200 18 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.1% 23.9% 0.036s 3.04e-05s 1200 40 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.1% 27.9% 0.036s 3.04e-05s 1200 41 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.4% 31.3% 0.030s 2.51e-05s 1200 28 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 34.6% 0.030s 2.51e-05s 1200 29 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 37.3% 0.024s 2.02e-05s 1200 42 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 40.0% 0.024s 2.01e-05s 1200 43 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.6% 42.6% 0.023s 1.91e-05s 1200 34 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.5% 45.1% 0.023s 1.88e-05s 1200 24 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.5% 47.6% 0.022s 1.86e-05s 1200 35 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.5% 50.0% 0.022s 1.84e-05s 1200 25 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.5% 52.5% 0.022s 1.84e-05s 1200 16 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 54.9% 0.022s 1.82e-05s 1200 17 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 57.3% 0.022s 1.80e-05s 1200 3 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 59.7% 0.022s 1.79e-05s 1200 30 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 62.1% 0.021s 1.79e-05s 1200 31 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 64.5% 0.021s 1.79e-05s 1200 7 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 24 Apply instances account for 35.49%(0.32s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)](GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, gatedrecurrent_apply_states_replace0[cuda])
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
... (remaining 24 Apply nodes account for 80032B/208032B (38.47%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
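
The final block below is the aggregate Theano prints at process exit over all 17 compiled functions. For reference, a minimal stand-alone sketch of how per-function and at-exit reports of this kind are usually obtained with plain Theano (Blocks compiles its functions internally, so this is an illustration rather than the code behind this gist):

    import numpy
    import theano
    import theano.tensor as tt

    # Toy function, unrelated to the model profiled in this gist.
    x = tt.matrix('x')
    f = theano.function([x], tt.tanh(x).sum(), profile=True)
    f(numpy.ones((10, 100), dtype=theano.config.floatX))
    f.profile.summary()  # prints a per-function report in the format used above

    # Alternatively, THEANO_FLAGS=profile=True (or theano.config.profile = True
    # before compiling) makes Theano print every function's report plus the
    # "Sum of all printed profiles at exit" aggregate when the process ends.
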
Function profiling
==================
Message: Sum of all (17) printed profiles at exit, excluding Scan op profiles.
Time in 6938 calls to Function.__call__: 1.007439e+02s
Time in Function.fn.__call__: 1.003767e+02s (99.635%)
Time in thunks: 3.835574e+01s (38.073%)
Total compile time: 3.784477e+02s
Number of Apply nodes: 0
Theano Optimizer time: 1.654243e+02s
Theano validate time: 5.543999e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.313228e+02s
Import time 2.099285e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 676.605s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
61.4% 61.4% 23.536s 2.79e-02s Py 844 11 theano.scan_module.scan_op.Scan
25.3% 86.7% 9.712s 6.03e-02s Py 161 2 lvsr.ops.EditDistanceOp
4.7% 91.3% 1.787s 2.06e-05s C 86853 879 theano.sandbox.cuda.basic_ops.GpuElemwise
1.8% 93.1% 0.678s 2.65e-05s C 25580 252 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.7% 94.8% 0.642s 7.29e-05s C 8805 89 theano.sandbox.cuda.blas.GpuDot22
1.0% 95.8% 0.395s 3.60e-06s C 109687 1234 theano.tensor.elemwise.Elemwise
0.8% 96.6% 0.297s 1.72e-05s C 17247 197 theano.sandbox.cuda.basic_ops.HostFromGpu
0.4% 97.0% 0.166s 2.21e-05s Py 7505 51 theano.ifelse.IfElse
0.4% 97.4% 0.161s 2.71e-05s C 5927 63 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.8% 0.142s 7.60e-06s C 18640 198 theano.sandbox.cuda.basic_ops.GpuReshape
0.4% 98.2% 0.138s 2.62e-05s C 5266 56 theano.sandbox.cuda.basic_ops.GpuAlloc
0.3% 98.5% 0.127s 3.37e-06s C 37733 384 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 98.8% 0.118s 7.43e-06s C 15813 114 theano.compile.ops.DeepCopyOp
0.1% 99.0% 0.057s 3.66e-06s C 15701 169 theano.tensor.opt.MakeVector
0.1% 99.1% 0.054s 1.60e-05s C 3393 29 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.2% 0.050s 4.52e-06s C 11167 119 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.4% 0.048s 3.42e-06s C 14141 158 theano.compile.ops.Shape_i
0.1% 99.4% 0.034s 5.30e-05s C 648 7 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.1% 99.5% 0.033s 2.96e-06s C 10969 127 theano.tensor.basic.ScalarFromTensor
0.1% 99.6% 0.032s 8.55e-05s C 372 5 theano.sandbox.cuda.basic_ops.GpuJoin
... (remaining 22 Classes account for 0.38%(0.15s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
25.3% 25.3% 9.712s 6.03e-02s Py 161 2 EditDistanceOp
22.7% 48.0% 8.707s 8.71e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
13.7% 61.8% 5.270s 3.27e-02s Py 161 2 forall_inplace,gpu,generator_generate_scan}
10.7% 72.5% 4.113s 2.06e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
8.9% 81.4% 3.412s 3.41e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
5.1% 86.5% 1.957s 7.50e-03s Py 261 3 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
1.7% 88.2% 0.642s 7.29e-05s C 8805 89 GpuDot22
0.8% 88.9% 0.297s 1.72e-05s C 17247 197 HostFromGpu
0.7% 89.6% 0.262s 3.12e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.6% 90.2% 0.235s 2.12e-05s C 11100 111 GpuElemwise{add,no_inplace}
0.5% 90.7% 0.186s 2.12e-05s C 8783 89 GpuElemwise{sub,no_inplace}
0.4% 91.1% 0.152s 2.45e-05s Py 6200 39 if{gpu}
0.4% 91.5% 0.148s 2.28e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.4% 91.9% 0.143s 2.99e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.4% 92.2% 0.138s 2.16e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.3% 92.6% 0.128s 1.97e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.3% 92.9% 0.128s 1.88e-05s C 6800 68 GpuElemwise{Mul}[(0, 0)]
0.3% 93.2% 0.127s 2.15e-05s C 5900 59 GpuElemwise{Switch,no_inplace}
0.3% 93.6% 0.126s 1.95e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.3% 93.9% 0.121s 2.06e-05s C 5900 59 GpuElemwise{Composite{(i0 * (i1 ** i2))},no_inplace}
... (remaining 321 Ops account for 6.12%(2.35s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
22.7% 22.7% 8.707s 8.71e-02s 100 2437 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 2: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 3: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 5: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 6: dtype=float32, shape=(15, 10, 1), strides=(-10, 1, 0)
input 7: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 8: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 9: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 11: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 14: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 15: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 16: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 17: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 18: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 19: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 20: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 21: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 22: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 23: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 24: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 25: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 26: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=int64, shape=(), strides=c
input 35: dtype=float32, shape=(100, 200), strides=c
input 36: dtype=float32, shape=(200, 200), strides=c
input 37: dtype=float32, shape=(100, 100), strides=c
input 38: dtype=float32, shape=(200, 100), strides=c
input 39: dtype=float32, shape=(100, 100), strides=c
input 40: dtype=float32, shape=(200, 200), strides=(1, 200)
input 41: dtype=float32, shape=(200, 100), strides=(1, 200)
input 42: dtype=float32, shape=(100, 100), strides=(1, 100)
input 43: dtype=float32, shape=(100, 200), strides=(1, 100)
input 44: dtype=float32, shape=(100, 100), strides=(1, 100)
input 45: dtype=int64, shape=(2,), strides=c
input 46: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 47: dtype=int64, shape=(1,), strides=c
input 48: dtype=float32, shape=(12, 10), strides=(10, 1)
input 49: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 50: dtype=float32, shape=(100, 1), strides=(1, 0)
input 51: dtype=int8, shape=(10,), strides=c
input 52: dtype=float32, shape=(1, 100), strides=(0, 1)
input 53: dtype=float32, shape=(100, 200), strides=c
input 54: dtype=float32, shape=(200, 200), strides=c
input 55: dtype=float32, shape=(100, 100), strides=c
input 56: dtype=float32, shape=(200, 100), strides=c
input 57: dtype=float32, shape=(100, 100), strides=c
input 58: dtype=float32, shape=(200, 200), strides=(1, 200)
input 59: dtype=float32, shape=(200, 100), strides=(1, 200)
input 60: dtype=float32, shape=(100, 100), strides=(1, 100)
input 61: dtype=float32, shape=(100, 200), strides=(1, 100)
input 62: dtype=float32, shape=(100, 100), strides=(1, 100)
input 63: dtype=int64, shape=(2,), strides=c
input 64: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 65: dtype=int64, shape=(1,), strides=c
input 66: dtype=float32, shape=(12, 10), strides=(10, 1)
input 67: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 68: dtype=float32, shape=(100, 1), strides=(1, 0)
input 69: dtype=int8, shape=(10,), strides=c
input 70: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 5: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 6: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 9: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 12: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 13: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 14: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 15: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
output 16: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 17: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 18: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 19: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
22.6% 45.3% 8.684s 1.42e-01s 61 269 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
8.9% 54.2% 3.412s 3.41e-02s 100 2149 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 6: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 7: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 8: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 9: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=(10, 1)
input 18: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=(1, 0)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=(10, 1)
input 29: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
7.8% 62.0% 2.984s 2.98e-02s 100 1850 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, G
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 10), strides=(10, 1)
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(10,), strides=c
input 23: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 0: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 10), strides=c
6.0% 68.0% 2.286s 3.75e-02s 61 260 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
5.4% 73.3% 2.057s 2.06e-02s 100 2632 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtenso
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
5.4% 78.7% 2.056s 2.06e-02s 100 2631 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
2.7% 81.4% 1.028s 1.03e-02s 100 2005 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask11)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
1.8% 83.2% 0.696s 6.96e-03s 100 1642 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
1.8% 85.0% 0.694s 6.94e-03s 100 1652 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
1.5% 86.5% 0.567s 9.29e-03s 61 247 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 5: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 6: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 7: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 8: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 9: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 10: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.1% 86.6% 0.039s 3.52e-03s 11 133 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 86.7% 0.038s 3.46e-03s 11 175 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 86.7% 0.024s 4.01e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 86.8% 0.019s 3.10e-04s 61 140 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 86.8% 0.018s 3.03e-04s 61 142 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 86.9% 0.016s 2.64e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 86.9% 0.013s 1.31e-04s 100 2467 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 87.0% 0.013s 1.31e-04s 100 2463 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 87.0% 0.013s 1.28e-04s 100 2462 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
   ... (remaining 4255 Apply instances account for 13.01% (4.99s) of the runtime)
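For reference, a per-Class/Ops/Apply breakdown of this kind can be reproduced with Theano's built-in profiling hooks. The snippet below is a minimal sketch, assuming the classic Theano 0.x API; the variables (x, W, f) and their shapes are illustrative placeholders, not taken from the profiled model.

    import numpy
    import theano
    import theano.tensor as T

    # Toy graph; names and shapes are placeholders, not the profiled model.
    x = T.fmatrix('x')
    W = theano.shared(numpy.random.randn(100, 200).astype('float32'), name='W')
    y = T.tanh(T.dot(x, W))

    # profile=True attaches a ProfileStats object to the compiled function.
    # The same effect can be obtained globally with
    # THEANO_FLAGS="profile=True,profile_memory=True"
    # (profile_memory also enables the Memory Profile section below).
    f = theano.function([x], y, profile=True)

    for _ in range(100):
        f(numpy.random.randn(10, 100).astype('float32'))

    # Prints the Class / Ops / Apply breakdown in the same format as above.
    f.profile.summary()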
Memory Profile (the max across all functions in the profile)
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 58KB (62KB)
GPU: 3739KB (5373KB)
CPU + GPU: 3797KB (5435KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 57KB (62KB)
GPU: 5605KB (6697KB)
CPU + GPU: 5662KB (6758KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 114KB
GPU: 17091KB
CPU + GPU: 17205KB
---
This list is based on all functions in the profile
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
750480B [(1, 10, 100), (1, 10, 200), (2, 92160), (15, 10)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(900, 200)] v GpuReshape{2}(GpuJoin.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
   ... (remaining 4255 Apply nodes account for 58951889B/73960729B (79.71%) of the dense Apply output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
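The memory summary above refers to two Theano settings (optimizer_excluding=inplace and allow_gc=False) as well as to DebugMode. The lines below are a hedged sketch of how those settings could be tried; the script name train.py is a placeholder, and whether any of them helps depends on the actual model.

    import theano

    # Keep intermediate results alive instead of garbage-collecting them
    # between calls; this is the "allow_gc=False" scenario from the memory
    # summary (a higher GPU peak, but fewer reallocations between calls).
    #   THEANO_FLAGS="allow_gc=False" python train.py   # train.py is a placeholder
    theano.config.allow_gc = False  # should have the same effect if set
                                    # before functions are compiled and run

    # Reproduce the "optimizer_excluding=inplace" numbers by disabling the
    # inplace optimizations at compile time:
    #   THEANO_FLAGS="optimizer_excluding=inplace" python train.py

    # DebugMode (very slow) warns when nodes marked 'inplace'/'view' still
    # allocate memory, as noted above:
    # f_dbg = theano.function(inputs, outputs, mode='DebugMode')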
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.