Created May 6, 2016 15:30
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 100 calls to Function.__call__: 2.154827e-03s | |
Time in Function.fn.__call__: 9.248257e-04s (42.919%) | |
Total compile time: 4.125585e+00s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 6.079912e-03s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 9.608269e-05s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.132s | |
No execution time accumulated (hint: try config profiling.time_thunks=1) | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171 | |
Time in 11 calls to Function.__call__: 2.018499e-02s | |
Time in Function.fn.__call__: 1.745415e-02s (86.471%) | |
Time in thunks: 7.772207e-03s (38.505%) | |
Total compile time: 4.343552e+00s | |
Number of Apply nodes: 43 | |
Theano Optimizer time: 1.791000e-01s | |
Theano validate time: 1.072645e-03s | |
Theano Linker time (includes C, CUDA code generation/compiling): 6.402516e-02s | |
Import time 4.774094e-03s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.132s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
100.0% 100.0% 0.008s 1.64e-05s C 473 43 theano.compile.ops.DeepCopyOp | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
100.0% 100.0% 0.008s 1.64e-05s C 473 43 DeepCopyOp | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
4.7% 4.7% 0.000s 3.34e-05s 11 0 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.8% 7.5% 0.000s 1.99e-05s 11 31 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.7% 10.3% 0.000s 1.93e-05s 11 1 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.6% 12.9% 0.000s 1.84e-05s 11 2 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.6% 15.4% 0.000s 1.81e-05s 11 16 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.6% 18.0% 0.000s 1.81e-05s 11 23 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 20.5% 0.000s 1.80e-05s 11 3 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.5% 23.1% 0.000s 1.80e-05s 11 24 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 25.6% 0.000s 1.80e-05s 11 4 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.5% 28.2% 0.000s 1.79e-05s 11 27 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 30.7% 0.000s 1.78e-05s 11 25 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 33.2% 0.000s 1.78e-05s 11 8 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.5% 35.7% 0.000s 1.78e-05s 11 5 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.5% 38.2% 0.000s 1.77e-05s 11 12 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 40.7% 0.000s 1.77e-05s 11 6 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.5% 43.2% 0.000s 1.76e-05s 11 29 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 45.7% 0.000s 1.75e-05s 11 11 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 48.2% 0.000s 1.75e-05s 11 7 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
2.5% 50.6% 0.000s 1.75e-05s 11 32 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
2.5% 53.1% 0.000s 1.75e-05s 11 13 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
... (remaining 23 Apply instances account for 46.88%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 43 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 10 calls to Function.__call__: 1.222110e-02s | |
Time in Function.fn.__call__: 1.176500e-02s (96.268%) | |
Time in thunks: 4.612923e-03s (37.746%) | |
Total compile time: 4.154817e+00s | |
Number of Apply nodes: 29 | |
Theano Optimizer time: 5.256701e-02s | |
Theano validate time: 1.211166e-04s | |
Theano Linker time (includes C, CUDA code generation/compiling): 4.951882e-02s | |
Import time 1.188660e-02s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.137s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
52.9% 52.9% 0.002s 1.63e-05s C 150 15 theano.sandbox.cuda.basic_ops.HostFromGpu | |
43.7% 96.6% 0.002s 2.24e-05s C 90 9 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.4% 100.0% 0.000s 3.16e-06s C 50 5 theano.tensor.elemwise.Elemwise | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
52.9% 52.9% 0.002s 1.63e-05s C 150 15 HostFromGpu | |
43.7% 96.6% 0.002s 2.24e-05s C 90 9 GpuElemwise{true_div,no_inplace} | |
3.4% 100.0% 0.000s 3.16e-06s C 50 5 Elemwise{true_div,no_inplace} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
10.0% 10.0% 0.000s 4.61e-05s 10 0 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_cost, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
5.8% 15.8% 0.000s 2.68e-05s 10 15 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
4.4% 20.2% 0.000s 2.03e-05s 10 1 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_critic_cost, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
4.3% 24.5% 0.000s 1.98e-05s 10 12 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
4.3% 28.7% 0.000s 1.96e-05s 10 2 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_entropy, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
4.2% 32.9% 0.000s 1.93e-05s 10 13 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
4.2% 37.1% 0.000s 1.93e-05s 10 4 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean2_output, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
4.2% 41.3% 0.000s 1.92e-05s 10 6 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_expected_reward, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
4.1% 45.4% 0.000s 1.91e-05s 10 3 GpuElemwise{true_div,no_inplace}(shared_readout_costs_max_output, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
4.1% 49.5% 0.000s 1.90e-05s 10 5 GpuElemwise{true_div,no_inplace}(shared_mean_last_character_cost, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.6% 53.1% 0.000s 1.65e-05s 10 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
3.5% 56.6% 0.000s 1.63e-05s 10 26 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 60.1% 0.000s 1.59e-05s 10 20 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 63.5% 0.000s 1.58e-05s 10 19 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 66.9% 0.000s 1.57e-05s 10 7 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 70.3% 0.000s 1.56e-05s 10 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 73.7% 0.000s 1.56e-05s 10 27 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 77.1% 0.000s 1.56e-05s 10 8 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.3% 80.4% 0.000s 1.54e-05s 10 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=() | |
output 0: dtype=float32, shape=(), strides=c | |
3.3% 83.7% 0.000s 1.52e-05s 10 14 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
... (remaining 9 Apply instances account for 16.31%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 29 Apply account for 136B/136B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171 | |
Time in 101 calls to Function.__call__: 1.747441e-02s | |
Time in Function.fn.__call__: 1.434040e-02s (82.065%) | |
Time in thunks: 2.486944e-03s (14.232%) | |
Total compile time: 4.068843e+00s | |
Number of Apply nodes: 6 | |
Theano Optimizer time: 1.878691e-02s | |
Theano validate time: 5.388260e-05s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.104212e-02s | |
Import time 7.761240e-03s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.140s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
54.7% 54.7% 0.001s 3.37e-06s C 404 4 theano.compile.ops.Shape_i | |
45.3% 100.0% 0.001s 5.58e-06s C 202 2 theano.tensor.basic.Alloc | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
45.3% 45.3% 0.001s 5.58e-06s C 202 2 Alloc | |
30.7% 76.0% 0.001s 3.78e-06s C 202 2 Shape_i{1} | |
24.0% 100.0% 0.001s 2.95e-06s C 202 2 Shape_i{0} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
28.4% 28.4% 0.001s 7.00e-06s 101 4 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0) | |
input 0: dtype=int64, shape=(1, 1), strides=c | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
output 0: dtype=int64, shape=(15, 10), strides=c | |
19.0% 47.5% 0.000s 4.69e-06s 101 0 Shape_i{1}(shared_recognizer_costs_prediction) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
16.9% 64.3% 0.000s 4.16e-06s 101 5 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0) | |
input 0: dtype=int64, shape=(1, 1), strides=c | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
output 0: dtype=int64, shape=(12, 10), strides=c | |
12.9% 77.2% 0.000s 3.17e-06s 101 1 Shape_i{0}(shared_recognizer_costs_prediction) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
11.7% 88.9% 0.000s 2.88e-06s 101 2 Shape_i{1}(shared_labels) | |
input 0: dtype=int64, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
11.1% 100.0% 0.000s 2.73e-06s 101 3 Shape_i{0}(shared_labels) | |
input 0: dtype=int64, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 2KB (2KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 2KB (2KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 2KB | |
GPU: 0KB | |
CPU + GPU: 2KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
1200B [(15, 10)] c Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0) | |
... (remaining 5 Apply account for 992B/2192B ((45.26%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 100 calls to Function.__call__: 1.629472e-02s | |
Time in Function.fn.__call__: 1.466155e-02s (89.977%) | |
Time in thunks: 9.594440e-03s (58.881%) | |
Total compile time: 4.084757e+00s | |
Number of Apply nodes: 2 | |
Theano Optimizer time: 7.371902e-03s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.080990e-03s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.141s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
100.0% 100.0% 0.010s 4.80e-05s C 200 2 theano.compile.ops.DeepCopyOp | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
100.0% 100.0% 0.010s 4.80e-05s C 200 2 DeepCopyOp | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
95.0% 95.0% 0.009s 9.11e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
output 0: dtype=int64, shape=(15, 10), strides=c | |
5.0% 100.0% 0.000s 4.83e-06s 100 1 DeepCopyOp(shared_labels) | |
input 0: dtype=int64, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(12, 10), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 2KB (2KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 2KB (2KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 2KB | |
GPU: 0KB | |
CPU + GPU: 2KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
1200B [(15, 10)] c DeepCopyOp(shared_recognizer_costs_prediction) | |
... (remaining 1 Apply account for 960B/2160B ((44.44%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171 | |
Time in 2 calls to Function.__call__: 2.764940e-03s | |
Time in Function.fn.__call__: 2.352715e-03s (85.091%) | |
Time in thunks: 1.017094e-03s (36.785%) | |
Total compile time: 4.452709e+00s | |
Number of Apply nodes: 31 | |
Theano Optimizer time: 9.523201e-02s | |
Theano validate time: 7.679462e-04s | |
Theano Linker time (includes C, CUDA code generation/compiling): 4.307699e-02s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.142s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
100.0% 100.0% 0.001s 1.64e-05s C 62 31 theano.compile.ops.DeepCopyOp | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
100.0% 100.0% 0.001s 1.64e-05s C 62 31 DeepCopyOp | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
4.7% 4.7% 0.000s 2.41e-05s 2 0 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.8% 8.6% 0.000s 1.94e-05s 2 6 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.8% 12.3% 0.000s 1.91e-05s 2 14 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.7% 16.0% 0.000s 1.90e-05s 2 7 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.7% 19.7% 0.000s 1.88e-05s 2 2 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.7% 23.4% 0.000s 1.86e-05s 2 12 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.6% 27.0% 0.000s 1.85e-05s 2 4 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.6% 30.7% 0.000s 1.85e-05s 2 1 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.6% 34.2% 0.000s 1.81e-05s 2 16 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.6% 37.8% 0.000s 1.81e-05s 2 9 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.6% 41.4% 0.000s 1.81e-05s 2 8 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.6% 44.9% 0.000s 1.81e-05s 2 5 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.6% 48.5% 0.000s 1.81e-05s 2 3 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
3.5% 52.0% 0.000s 1.80e-05s 2 24 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.5% 55.6% 0.000s 1.80e-05s 2 19 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.5% 59.1% 0.000s 1.80e-05s 2 13 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.5% 62.6% 0.000s 1.79e-05s 2 11 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.5% 66.1% 0.000s 1.79e-05s 2 10 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 69.6% 0.000s 1.75e-05s 2 23 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
3.4% 73.0% 0.000s 1.75e-05s 2 21 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
... (remaining 11 Apply instances account for 26.98%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 31 Apply account for 140B/140B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 1 calls to Function.__call__: 8.559227e-04s | |
Time in Function.fn.__call__: 8.108616e-04s (94.735%) | |
Time in thunks: 3.142357e-04s (36.713%) | |
Total compile time: 4.539160e+00s | |
Number of Apply nodes: 21 | |
Theano Optimizer time: 3.893209e-02s | |
Theano validate time: 8.273125e-05s | |
Theano Linker time (includes C, CUDA code generation/compiling): 2.924204e-02s | |
Import time 2.619028e-03s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.146s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
58.3% 58.3% 0.000s 1.66e-05s C 11 11 theano.sandbox.cuda.basic_ops.HostFromGpu | |
36.9% 95.1% 0.000s 1.93e-05s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise | |
4.9% 100.0% 0.000s 3.81e-06s C 4 4 theano.tensor.elemwise.Elemwise | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
58.3% 58.3% 0.000s 1.66e-05s C 11 11 HostFromGpu | |
36.9% 95.1% 0.000s 1.93e-05s C 6 6 GpuElemwise{true_div,no_inplace} | |
4.9% 100.0% 0.000s 3.81e-06s C 4 4 Elemwise{true_div,no_inplace} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
7.0% 7.0% 0.000s 2.19e-05s 1 0 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
6.7% 13.7% 0.000s 2.10e-05s 1 1 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
6.4% 20.0% 0.000s 2.00e-05s 1 3 GpuElemwise{true_div,no_inplace}(shared_mask_density, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
6.1% 26.1% 0.000s 1.91e-05s 1 7 GpuElemwise{true_div,no_inplace}(shared_mean_attended, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
6.0% 32.1% 0.000s 1.88e-05s 1 8 GpuElemwise{true_div,no_inplace}(shared_weights_entropy, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
6.0% 38.1% 0.000s 1.88e-05s 1 6 GpuElemwise{true_div,no_inplace}(shared_mean_bottom_output, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.8% 43.9% 0.000s 1.81e-05s 1 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.8% 49.6% 0.000s 1.81e-05s 1 2 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.7% 55.3% 0.000s 1.79e-05s 1 12 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.5% 60.8% 0.000s 1.72e-05s 1 11 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.2% 65.9% 0.000s 1.62e-05s 1 13 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.1% 71.0% 0.000s 1.60e-05s 1 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
5.1% 76.1% 0.000s 1.60e-05s 1 5 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
4.8% 80.9% 0.000s 1.50e-05s 1 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
4.8% 85.7% 0.000s 1.50e-05s 1 10 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
4.8% 90.4% 0.000s 1.50e-05s 1 4 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
4.7% 95.1% 0.000s 1.48e-05s 1 9 HostFromGpu(shared_weights_penalty) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
1.6% 96.7% 0.000s 5.01e-06s 1 19 Elemwise{true_div,no_inplace}(HostFromGpu.0, shared_batch_size) | |
input 0: dtype=float32, shape=(), strides=c | |
input 1: dtype=int64, shape=(), strides=c | |
output 0: dtype=float64, shape=(), strides=c | |
1.3% 98.0% 0.000s 4.05e-06s 1 15 Elemwise{true_div,no_inplace}(shared_batch_size, HostFromGpu.0) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float64, shape=(), strides=c | |
1.0% 99.0% 0.000s 3.10e-06s 1 20 Elemwise{true_div,no_inplace}(shared_train_cost, HostFromGpu.0) | |
input 0: dtype=float64, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float64, shape=(), strides=c | |
... (remaining 1 Apply instances account for 0.99%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 21 Apply account for 100B/100B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
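All of the dumps in this file come from Theano's built-in function profiler. To reproduce them, and to collect the per-node execution times that the first profile reports as missing, profiling is enabled through Theano flags; `profiling.time_thunks=1` is the exact hint the profiler itself prints. This is a config fragment, not a definitive recipe, and the script name is a placeholder:

```shell
# Hypothetical invocation; profile, profile_memory and profiling.time_thunks
# are standard Theano config options, your_script.py is a placeholder.
THEANO_FLAGS=profile=True,profile_memory=True,profiling.time_thunks=1 python your_script.py
```

Equivalently, a single compiled function can be profiled with `theano.function(..., profile=True)` and its statistics inspected afterwards via `f.profile.summary()`.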
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171 | |
Time in 1 calls to Function.__call__: 4.639626e-04s | |
Time in Function.fn.__call__: 2.970695e-04s (64.029%) | |
Time in thunks: 1.273155e-04s (27.441%) | |
Total compile time: 4.479136e+00s | |
Number of Apply nodes: 5 | |
Theano Optimizer time: 1.386118e-02s | |
Theano validate time: 1.111031e-04s | |
Theano Linker time (includes C, CUDA code generation/compiling): 6.145954e-03s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.148s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
100.0% 100.0% 0.000s 2.55e-05s C 5 5 theano.compile.ops.DeepCopyOp | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
100.0% 100.0% 0.000s 2.55e-05s C 5 5 DeepCopyOp | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
59.7% 59.7% 0.000s 7.61e-05s 1 0 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
15.7% 75.5% 0.000s 2.00e-05s 1 1 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
13.5% 89.0% 0.000s 1.72e-05s 1 2 DeepCopyOp(CudaNdarrayConstant{0.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
7.9% 96.8% 0.000s 1.00e-05s 1 3 DeepCopyOp(TensorConstant{0}) | |
input 0: dtype=int64, shape=(), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
3.2% 100.0% 0.000s 4.05e-06s 1 4 DeepCopyOp(TensorConstant{0.0}) | |
input 0: dtype=float64, shape=(), strides=c | |
output 0: dtype=float64, shape=(), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 5 Apply account for 28B/28B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 1 calls to Function.__call__: 1.401901e-04s | |
Time in Function.fn.__call__: 1.139641e-04s (81.293%) | |
Time in thunks: 3.004074e-05s (21.429%) | |
Total compile time: 4.912266e+00s | |
Number of Apply nodes: 3 | |
Theano Optimizer time: 1.049495e-02s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 2.658844e-03s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.149s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
66.7% 66.7% 0.000s 2.00e-05s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu | |
19.8% 86.5% 0.000s 5.96e-06s C 1 1 theano.compile.ops.DeepCopyOp | |
13.5% 100.0% 0.000s 4.05e-06s C 1 1 theano.tensor.elemwise.Elemwise | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
66.7% 66.7% 0.000s 2.00e-05s C 1 1 HostFromGpu | |
19.8% 86.5% 0.000s 5.96e-06s C 1 1 DeepCopyOp | |
13.5% 100.0% 0.000s 4.05e-06s C 1 1 Elemwise{true_div,no_inplace} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
66.7% 66.7% 0.000s 2.00e-05s 1 1 HostFromGpu(shared_None) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=c | |
19.8% 86.5% 0.000s 5.96e-06s 1 0 DeepCopyOp(shared_batch_size) | |
input 0: dtype=int64, shape=(), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
13.5% 100.0% 0.000s 4.05e-06s 1 2 Elemwise{true_div,no_inplace}(shared_mean_total_reward, HostFromGpu.0) | |
input 0: dtype=float64, shape=(), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
output 0: dtype=float64, shape=(), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 3 Apply account for 20B/20B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286 | |
Time in 61 calls to Function.__call__: 1.211181e+01s | |
Time in Function.fn.__call__: 1.210473e+01s (99.942%) | |
Time in thunks: 1.171248e+01s (96.703%) | |
Total compile time: 1.925457e+01s | |
Number of Apply nodes: 274 | |
Theano Optimizer time: 5.967708e+00s | |
Theano validate time: 2.864373e-01s | |
Theano Linker time (includes C, CUDA code generation/compiling): 9.222651e+00s | |
Import time 3.308520e-01s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.150s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
74.1% 74.1% 8.684s 1.42e-01s Py 61 1 lvsr.ops.EditDistanceOp | |
24.4% 98.5% 2.853s 2.34e-02s Py 122 2 theano.scan_module.scan_op.Scan | |
0.5% 99.0% 0.064s 2.10e-04s C 305 5 theano.sandbox.cuda.blas.GpuDot22 | |
0.2% 99.2% 0.023s 4.16e-05s C 549 9 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.2% 99.4% 0.021s 2.93e-06s C 7259 119 theano.tensor.elemwise.Elemwise | |
0.1% 99.5% 0.012s 1.93e-04s C 61 1 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.1% 99.6% 0.008s 3.45e-05s C 244 4 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.1% 99.7% 0.007s 2.32e-05s C 305 5 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.0% 99.7% 0.005s 8.45e-05s C 61 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
0.0% 99.7% 0.004s 2.97e-06s C 1464 24 theano.compile.ops.Shape_i | |
0.0% 99.8% 0.004s 2.14e-05s C 183 3 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.0% 99.8% 0.004s 3.71e-06s C 976 16 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.0% 99.8% 0.003s 2.92e-06s C 1098 18 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.0% 99.9% 0.003s 2.92e-06s C 1037 17 theano.tensor.opt.MakeVector | |
0.0% 99.9% 0.003s 2.41e-05s C 122 2 theano.compile.ops.DeepCopyOp | |
0.0% 99.9% 0.002s 2.38e-06s C 1037 17 theano.tensor.basic.ScalarFromTensor | |
0.0% 99.9% 0.002s 7.78e-06s Py 305 3 theano.ifelse.IfElse | |
0.0% 99.9% 0.002s 4.07e-06s C 549 9 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.0% 100.0% 0.002s 5.39e-06s C 305 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty | |
0.0% 100.0% 0.001s 6.56e-06s Py 183 3 theano.compile.ops.Rebroadcast | |
... (remaining 8 Classes account for 0.03%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
74.1% 74.1% 8.684s 1.42e-01s Py 61 1 EditDistanceOp | |
19.5% 93.7% 2.286s 3.75e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan} | |
4.8% 98.5% 0.567s 9.29e-03s Py 61 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
0.5% 99.0% 0.064s 2.10e-04s C 305 5 GpuDot22 | |
0.2% 99.2% 0.018s 5.83e-05s C 305 5 GpuElemwise{Add}[(0, 0)] | |
0.1% 99.3% 0.012s 1.93e-04s C 61 1 GpuJoin | |
0.1% 99.4% 0.007s 2.32e-05s C 305 5 GpuIncSubtensor{InplaceSet;:int64:} | |
0.1% 99.4% 0.007s 3.83e-05s C 183 3 GpuAlloc | |
0.0% 99.5% 0.005s 8.45e-05s C 61 1 GpuAdvancedSubtensor1 | |
0.0% 99.5% 0.004s 2.14e-05s C 183 3 HostFromGpu | |
0.0% 99.5% 0.004s 2.10e-05s C 183 3 GpuElemwise{sub,no_inplace} | |
0.0% 99.6% 0.003s 2.92e-06s C 1037 17 MakeVector{dtype='int64'} | |
0.0% 99.6% 0.003s 2.41e-05s C 122 2 DeepCopyOp | |
0.0% 99.6% 0.002s 3.71e-06s C 671 11 GpuReshape{2} | |
0.0% 99.6% 0.002s 2.38e-06s C 1037 17 ScalarFromTensor | |
0.0% 99.6% 0.002s 2.81e-06s C 793 13 Shape_i{0} | |
0.0% 99.7% 0.002s 3.16e-06s C 671 11 Shape_i{1} | |
0.0% 99.7% 0.002s 2.76e-06s C 671 11 Elemwise{add,no_inplace} | |
0.0% 99.7% 0.002s 2.72e-06s C 610 10 Elemwise{sub,no_inplace} | |
0.0% 99.7% 0.002s 5.39e-06s C 305 5 GpuAllocEmpty | |
... (remaining 72 Ops account for 0.30%(0.03s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
74.1% 74.1% 8.684s 1.42e-01s 61 269 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask) | |
input 0: dtype=int64, shape=(15, 75), strides=c | |
input 1: dtype=float32, shape=(15, 75), strides=c | |
input 2: dtype=int64, shape=(12, 75), strides=c | |
input 3: dtype=float32, shape=(12, 75), strides=c | |
output 0: dtype=int64, shape=(15, 75, 1), strides=c | |
19.5% 93.7% 2.286s 3.75e-02s 61 260 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=float32, shape=(100, 44), strides=c | |
input 6: dtype=float32, shape=(200, 44), strides=c | |
input 7: dtype=float32, shape=(100, 200), strides=c | |
input 8: dtype=float32, shape=(200, 200), strides=c | |
input 9: dtype=float32, shape=(45, 100), strides=c | |
input 10: dtype=float32, shape=(100, 200), strides=c | |
input 11: dtype=float32, shape=(100, 100), strides=c | |
input 12: dtype=float32, shape=(200, 100), strides=c | |
input 13: dtype=float32, shape=(100, 100), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 16: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 17: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 18: dtype=int64, shape=(1,), strides=c | |
input 19: dtype=float32, shape=(12, 75), strides=(75, 1) | |
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 21: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 22: dtype=int8, shape=(75,), strides=c | |
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
output 3: dtype=int64, shape=(15, 75), strides=c | |
4.8% 98.5% 0.567s 9.29e-03s 61 247 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 4: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 5: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1) | |
input 6: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1) | |
input 7: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0) | |
input 8: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 9: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 10: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.2% 98.7% 0.019s 3.10e-04s 61 140 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
0.2% 98.8% 0.018s 3.03e-04s 61 142 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
0.1% 98.9% 0.012s 1.93e-04s 61 255 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 2: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1) | |
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
0.1% 99.0% 0.011s 1.85e-04s 61 257 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(200, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.1% 99.1% 0.008s 1.27e-04s 61 139 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.1% 99.1% 0.008s 1.24e-04s 61 141 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.0% 99.2% 0.005s 8.45e-05s 61 65 GpuAdvancedSubtensor1(W, Reshape{1}.0) | |
input 0: dtype=float32, shape=(44, 100), strides=c | |
input 1: dtype=int64, shape=(900,), strides=c | |
output 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
0.0% 99.2% 0.005s 7.52e-05s 61 170 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
0.0% 99.3% 0.005s 7.49e-05s 61 172 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
0.0% 99.3% 0.003s 4.81e-05s 61 169 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.0% 99.3% 0.003s 4.72e-05s 61 259 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.0% 99.3% 0.003s 4.63e-05s 61 171 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.0% 99.4% 0.002s 4.08e-05s 61 47 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
0.0% 99.4% 0.002s 3.73e-05s 61 107 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, generator_generate_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
0.0% 99.4% 0.002s 3.67e-05s 61 59 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
0.0% 99.4% 0.002s 3.37e-05s 61 160 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
input 0: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 1: dtype=float32, shape=(1, 92160), strides=(0, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
0.0% 99.4% 0.002s 2.63e-05s 61 4 DeepCopyOp(CudaNdarrayConstant{1.0}) | |
input 0: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(), strides=() | |
... (remaining 254 Apply instances account for 0.56%(0.07s) of the runtime) | |
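Everything else in this function is noise next to the first Apply node: `EditDistanceOp`, a Python-side op (type `Py`), accounts for 74.1% of thunk time at roughly 0.14 s per call on a (15, 75) batch. Its actual implementation lives in `lvsr.ops` and is not shown in the profile; as a hedged sketch of what a per-prefix edit-distance op has to do (and why a pure-Python dynamic-programming loop per batch column is this expensive), consider the following, which is illustrative and not the code from `lvsr.ops`:

```python
def prefix_edit_distances(hyp, ref):
    """Levenshtein distance of every prefix of `hyp` against the full `ref`.

    Returns a list d with d[t] = edit_distance(hyp[:t+1], ref).
    Standard O(len(hyp) * len(ref)) dynamic programming per sequence.
    """
    prev = list(range(len(ref) + 1))  # row for the empty hypothesis prefix
    out = []
    for t, h in enumerate(hyp):
        cur = [t + 1]
        for j, r in enumerate(ref):
            cur.append(min(prev[j + 1] + 1,      # deletion from hyp
                           cur[j] + 1,           # insertion into hyp
                           prev[j] + (h != r)))  # substitution or match
        out.append(cur[-1])
        prev = cur
    return out

# One (hyp, ref) pair; the profiled op does this for each of 75 batch columns.
print(prefix_edit_distances([1, 2, 3], [1, 3]))  # [1, 1, 1]
```

The output shape (15, 75, 1) in the profile is consistent with one distance per prediction step per batch column; moving this inner loop to C, or batching the DP across the 75 columns with array operations, is the natural optimization target here.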
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 22KB (22KB) | |
GPU: 3175KB (3660KB) | |
CPU + GPU: 3197KB (3682KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 22KB (22KB) | |
GPU: 3526KB (4334KB) | |
CPU + GPU: 3548KB (4356KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 36KB | |
GPU: 5187KB | |
CPU + GPU: 5223KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(900, 200)] v GpuReshape{2}(GpuJoin.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1}) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0) | |
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>) | |
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0) | |
360000B [(12, 75, 100)] c GpuAllocEmpty(Elemwise{add,no_inplace}.0, Elemwise{Switch}[(0, 1)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0) | |
360000B [(900, 100)] c GpuAdvancedSubtensor1(W, Reshape{1}.0) | |
360000B [(12, 75, 100)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
... (remaining 254 Apply account for 6802854B/19219614B ((35.40%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
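The "Max peak memory if allow_gc=False" rows above quantify a memory-for-speed trade-off: with garbage collection of intermediate results disabled, Theano keeps buffers alive between calls instead of re-allocating them each time, at the cost of the higher peak shown (5187KB vs 3175KB on the GPU here). A config fragment to try it:

```shell
# Standard Theano config flag; disables freeing of intermediate results
# between calls in exchange for the higher peak memory estimated above.
THEANO_FLAGS=allow_gc=False
```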
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 61 calls of the op (for a total of 732 steps) 5.621994e-01s | |
Total time spent in calling the VM 5.386684e-01s (95.814%) | |
Total overhead (computing slices..) 2.353096e-02s (4.186%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
68.8% 68.8% 0.229s 7.80e-05s C 2928 4 theano.sandbox.cuda.blas.GpuGemm | |
28.4% 97.1% 0.094s 2.15e-05s C 4392 6 theano.sandbox.cuda.basic_ops.GpuElemwise | |
2.9% 100.0% 0.010s 3.25e-06s C 2928 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
68.8% 68.8% 0.229s 7.80e-05s C 2928 4 GpuGemm{no_inplace} | |
10.5% 79.3% 0.035s 2.38e-05s C 1464 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace} | |
9.2% 88.4% 0.030s 2.08e-05s C 1464 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
8.7% 97.1% 0.029s 1.98e-05s C 1464 2 GpuElemwise{mul,no_inplace} | |
1.5% 98.7% 0.005s 3.44e-06s C 1464 2 GpuSubtensor{::, :int64:} | |
1.3% 100.0% 0.004s 3.06e-06s C 1464 2 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
23.0% 23.0% 0.076s 1.04e-04s 732 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
22.5% 45.5% 0.075s 1.02e-04s 732 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
11.7% 57.1% 0.039s 5.30e-05s 732 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
11.6% 68.8% 0.039s 5.27e-05s 732 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
5.3% 74.0% 0.018s 2.40e-05s 732 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(75, 1), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(75, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(75, 1), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
5.2% 79.3% 0.017s 2.36e-05s 732 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(75, 1), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(75, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(75, 1), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
4.6% 83.9% 0.015s 2.09e-05s 732 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
4.6% 88.4% 0.015s 2.07e-05s 732 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
4.4% 92.8% 0.015s 2.00e-05s 732 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
4.3% 97.1% 0.014s 1.96e-05s 732 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.8% 97.9% 0.003s 3.46e-06s 732 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.8% 98.7% 0.002s 3.41e-06s 732 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.7% 99.3% 0.002s 3.10e-06s 732 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
0.7% 100.0% 0.002s 3.01e-06s 732 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 146KB (205KB) | |
CPU + GPU: 146KB (205KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 146KB (205KB) | |
CPU + GPU: 146KB (205KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 0KB | |
GPU: 293KB | |
CPU + GPU: 293KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
... (remaining 0 Apply account for 0B/540000B ((0.00%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
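Reports like the one above come from Theano's built-in profiler. A minimal way to reproduce this kind of per-function (and per-Apply) report, as a sketch: `profile`, `profile_memory`, and `profiling.time_thunks` are real Theano config flags, but the script name below is a placeholder for whatever entry point builds and calls the functions being profiled.

```shell
# Enable the profiler via Theano flags before launching the script.
# profiling.time_thunks=1 collects the per-thunk timings that the
# "No execution time accumulated" hint earlier in this log refers to.
THEANO_FLAGS='profile=True,profile_memory=True,profiling.time_thunks=1' python train.py
```

The same report can also be obtained per function by passing `profile=True` to `theano.function` and calling `.profile.summary()` on the result.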
Scan Op profiling ( generator_generate_scan ) | |
================== | |
Message: None | |
Time in 61 calls of the op (for a total of 915 steps) 2.276112e+00s | |
Total time spent in calling the VM 2.183355e+00s (95.925%) | |
Total overhead (computing slices..) 9.275723e-02s (4.075%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
27.2% 27.2% 0.343s 7.49e-05s C 4575 5 theano.sandbox.cuda.blas.GpuGemm | |
25.6% 52.8% 0.322s 2.70e-05s C 11895 13 theano.sandbox.cuda.basic_ops.GpuElemwise | |
21.5% 74.3% 0.271s 5.92e-05s C 4575 5 theano.sandbox.cuda.blas.GpuDot22 | |
8.2% 82.5% 0.103s 2.25e-05s C 4575 5 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
3.2% 85.7% 0.041s 4.44e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
3.1% 88.8% 0.039s 4.23e-05s C 915 1 theano.sandbox.rng_mrg.GPU_mrg_uniform | |
2.9% 91.7% 0.037s 2.02e-05s C 1830 2 theano.sandbox.cuda.basic_ops.HostFromGpu | |
1.9% 93.6% 0.024s 2.64e-05s C 915 1 theano.tensor.basic.MaxAndArgmax | |
1.1% 94.7% 0.014s 1.51e-05s C 915 1 theano.sandbox.multinomial.MultinomialFromUniform | |
1.1% 95.8% 0.013s 2.43e-06s C 5490 6 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.0% 96.8% 0.013s 1.43e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.8% 97.7% 0.011s 2.31e-06s C 4575 5 theano.compile.ops.Shape_i | |
0.7% 98.4% 0.009s 3.28e-06s C 2745 3 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.6% 98.9% 0.007s 1.93e-06s C 3660 4 theano.tensor.opt.MakeVector | |
0.5% 99.4% 0.006s 3.39e-06s C 1830 2 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.3% 99.8% 0.004s 2.30e-06s C 1830 2 theano.tensor.elemwise.Elemwise | |
0.2% 100.0% 0.003s 3.25e-06s C 915 1 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
27.2% 27.2% 0.343s 7.49e-05s C 4575 5 GpuGemm{inplace} | |
21.5% 48.8% 0.271s 5.92e-05s C 4575 5 GpuDot22 | |
7.0% 55.7% 0.088s 4.80e-05s C 1830 2 GpuElemwise{mul,no_inplace} | |
3.5% 59.2% 0.043s 4.75e-05s C 915 1 GpuElemwise{add,no_inplace} | |
3.2% 62.4% 0.041s 4.44e-05s C 915 1 GpuAdvancedSubtensor1 | |
3.1% 65.5% 0.039s 4.23e-05s C 915 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace} | |
2.9% 68.4% 0.037s 2.02e-05s C 1830 2 HostFromGpu | |
2.2% 70.6% 0.028s 3.01e-05s C 915 1 GpuCAReduce{add}{1,0,0} | |
2.2% 72.8% 0.027s 2.98e-05s C 915 1 GpuElemwise{Tanh}[(0, 0)] | |
1.9% 74.7% 0.024s 2.64e-05s C 915 1 MaxAndArgmax | |
1.9% 76.5% 0.023s 2.56e-05s C 915 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
1.8% 78.3% 0.022s 2.42e-05s C 915 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace} | |
1.7% 80.0% 0.021s 2.28e-05s C 915 1 GpuCAReduce{maximum}{0,1} | |
1.6% 81.6% 0.020s 2.24e-05s C 915 1 GpuCAReduce{maximum}{1,0} | |
1.4% 83.0% 0.018s 1.92e-05s C 915 1 GpuElemwise{Add}[(0, 1)] | |
1.4% 84.4% 0.018s 1.92e-05s C 915 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
1.4% 85.7% 0.017s 1.89e-05s C 915 1 GpuCAReduce{add}{0,1} | |
1.4% 87.1% 0.017s 1.89e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace} | |
1.4% 88.5% 0.017s 1.87e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)] | |
1.3% 89.8% 0.017s 1.85e-05s C 915 1 GpuElemwise{TrueDiv}[(0, 0)] | |
... (remaining 20 Ops account for 10.18%(0.13s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
9.7% 9.7% 0.122s 1.34e-04s 915 10 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
7.4% 17.1% 0.093s 1.01e-04s 915 5 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
6.6% 23.7% 0.084s 9.14e-05s 915 32 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
5.6% 29.3% 0.071s 7.73e-05s 915 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(900, 100), strides=c | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(900, 1), strides=c | |
5.5% 34.8% 0.069s 7.57e-05s 915 56 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda]) | |
input 0: dtype=float32, shape=(12, 75, 1), strides=c | |
input 1: dtype=float32, shape=(12, 75, 200), strides=c | |
output 0: dtype=float32, shape=(12, 75, 200), strides=c | |
4.7% 39.5% 0.059s 6.45e-05s 915 38 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.5% 43.0% 0.043s 4.75e-05s 915 43 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 75, 100), strides=c | |
input 1: dtype=float32, shape=(1, 75, 100), strides=c | |
output 0: dtype=float32, shape=(12, 75, 100), strides=c | |
3.2% 46.2% 0.041s 4.44e-05s 915 29 GpuAdvancedSubtensor1(W_copy[cuda], argmax) | |
input 0: dtype=float32, shape=(45, 100), strides=c | |
input 1: dtype=int64, shape=(75,), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.2% 49.4% 0.040s 4.35e-05s 915 8 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 44), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 200), strides=c | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 44), strides=c | |
3.1% 52.5% 0.039s 4.25e-05s 915 37 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.1% 55.5% 0.039s 4.23e-05s 915 13 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(92160,), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=float32, shape=(92160,), strides=c | |
output 1: dtype=float32, shape=(75,), strides=c | |
3.0% 58.6% 0.038s 4.17e-05s 915 41 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
3.0% 61.6% 0.038s 4.14e-05s 915 39 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
2.4% 64.0% 0.031s 3.35e-05s 915 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda]) | |
input 0: dtype=float32, shape=(75, 100), strides=c | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(75, 44), strides=c | |
2.2% 66.2% 0.028s 3.01e-05s 915 57 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
input 0: dtype=float32, shape=(12, 75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
2.2% 68.4% 0.027s 2.98e-05s 915 45 GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=c | |
output 0: dtype=float32, shape=(900, 100), strides=c | |
1.9% 70.3% 0.024s 2.64e-05s 915 27 MaxAndArgmax(MultinomialFromUniform{int64}.0, TensorConstant{(1,) of 1}) | |
input 0: dtype=int64, shape=(75, 44), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=int64, shape=(75,), strides=c | |
output 1: dtype=int64, shape=(75,), strides=c | |
1.9% 72.1% 0.023s 2.56e-05s 915 33 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=float32, shape=(75, 200), strides=c | |
output 0: dtype=float32, shape=(75, 200), strides=c | |
1.8% 73.9% 0.022s 2.42e-05s 915 40 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(75, 100), strides=c | |
input 2: dtype=float32, shape=(75, 100), strides=c | |
input 3: dtype=float32, shape=(75, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(75, 100), strides=c | |
1.7% 75.6% 0.021s 2.29e-05s 915 25 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(75, 44), strides=c | |
output 0: dtype=float32, shape=(75, 44), strides=c | |
... (remaining 38 Apply instances account for 24.45%(0.31s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 39KB (39KB) | |
GPU: 1151KB (1151KB) | |
CPU + GPU: 1190KB (1190KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 39KB (39KB) | |
GPU: 1151KB (1151KB) | |
CPU + GPU: 1190KB (1190KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 41KB | |
GPU: 1709KB | |
CPU + GPU: 1750KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
720000B [(12, 75, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda]) | |
368940B [(92160,), (75,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
360000B [(900, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
360000B [(12, 75, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0) | |
360000B [(900, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
60000B [(75, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda]) | |
60000B [(75, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
60000B [(75, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
60000B [(75, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
60000B [(75, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100}) | |
30000B [(75, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0) | |
30000B [(75, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax) | |
30000B [(1, 75, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0) | |
30000B [(75, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda]) | |
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100}) | |
30000B [(75, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda]) | |
30000B [(75, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
30000B [(75, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
... (remaining 38 Apply account for 158879B/2927819B ((5.43%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
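The memory section above already estimates the peak under `allow_gc=False` (293KB vs 146KB on the GPU for the first function). Trading that extra memory for speed is a one-flag experiment, sketched below; `allow_gc` is a real Theano config flag, while the script name is a placeholder.

```shell
# allow_gc=False keeps intermediate buffers alive between calls instead
# of freeing them, avoiding reallocation at the cost of the higher
# "Max peak memory if allow_gc=False" shown in the memory profile.
THEANO_FLAGS='allow_gc=False,profile=True,profile_memory=True' python train.py
```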
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:103 | |
Time in 11 calls to Function.__call__: 1.319449e-01s | |
Time in Function.fn.__call__: 1.316185e-01s (99.753%) | |
Time in thunks: 8.657598e-02s (65.615%) | |
Total compile time: 1.813622e+01s | |
Number of Apply nodes: 183 | |
Theano Optimizer time: 4.002905e+00s | |
Theano validate time: 1.576922e-01s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.015641e+01s | |
Import time 6.427932e-02s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.235s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
88.6% 88.6% 0.077s 3.49e-03s Py 22 2 theano.scan_module.scan_op.Scan | |
3.5% 92.1% 0.003s 2.82e-06s C 1089 99 theano.tensor.elemwise.Elemwise | |
1.5% 93.6% 0.001s 2.96e-05s C 44 4 theano.sandbox.cuda.blas.GpuDot22 | |
1.0% 94.6% 0.001s 1.91e-05s C 44 4 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.7% 95.3% 0.001s 5.69e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.6% 95.9% 0.000s 2.26e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.6% 96.5% 0.000s 4.40e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
0.5% 97.0% 0.000s 3.59e-06s C 121 11 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.5% 97.5% 0.000s 1.97e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.5% 98.0% 0.000s 2.91e-06s C 143 13 theano.compile.ops.Shape_i | |
0.4% 98.4% 0.000s 2.84e-06s C 121 11 theano.tensor.opt.MakeVector | |
0.4% 98.7% 0.000s 2.31e-06s C 132 12 theano.tensor.basic.ScalarFromTensor | |
0.3% 99.0% 0.000s 4.05e-06s C 66 6 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.3% 99.3% 0.000s 2.84e-06s C 88 8 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.3% 99.6% 0.000s 2.26e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.2% 99.8% 0.000s 6.34e-06s Py 22 2 theano.compile.ops.Rebroadcast | |
0.1% 99.9% 0.000s 5.42e-06s C 22 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty | |
0.1% 100.0% 0.000s 5.35e-06s C 11 1 theano.tensor.basic.Alloc | |
0.0% 100.0% 0.000s 3.19e-06s C 11 1 theano.tensor.basic.Reshape | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
88.6% 88.6% 0.077s 3.49e-03s Py 22 2 forall_inplace,gpu,gatedrecurrent_apply_scan} | |
1.5% 90.1% 0.001s 2.96e-05s C 44 4 GpuDot22 | |
1.0% 91.1% 0.001s 1.91e-05s C 44 4 GpuElemwise{Add}[(0, 0)] | |
0.7% 91.8% 0.001s 5.69e-05s C 11 1 GpuJoin | |
0.6% 92.4% 0.000s 2.26e-05s C 22 2 GpuAlloc | |
0.6% 92.9% 0.000s 4.40e-05s C 11 1 GpuAdvancedSubtensor1 | |
0.5% 93.4% 0.000s 1.97e-05s C 22 2 GpuIncSubtensor{InplaceSet;:int64:} | |
0.4% 93.8% 0.000s 2.84e-06s C 121 11 MakeVector{dtype='int64'} | |
0.4% 94.2% 0.000s 2.31e-06s C 132 12 ScalarFromTensor | |
0.3% 94.5% 0.000s 3.66e-06s C 77 7 GpuReshape{2} | |
0.3% 94.8% 0.000s 2.80e-06s C 99 9 Elemwise{add,no_inplace} | |
0.3% 95.1% 0.000s 2.26e-05s C 11 1 HostFromGpu | |
0.3% 95.4% 0.000s 3.05e-06s C 77 7 Shape_i{0} | |
0.3% 95.6% 0.000s 2.60e-06s C 88 8 Elemwise{le,no_inplace} | |
0.2% 95.9% 0.000s 2.94e-06s C 66 6 GpuDimShuffle{x,x,0} | |
0.2% 96.1% 0.000s 2.75e-06s C 66 6 Shape_i{1} | |
0.2% 96.3% 0.000s 2.53e-06s C 66 6 Elemwise{sub,no_inplace} | |
0.2% 96.5% 0.000s 2.98e-06s C 55 5 Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)] | |
0.2% 96.6% 0.000s 3.46e-06s C 44 4 GpuReshape{3} | |
0.2% 96.8% 0.000s 2.60e-06s C 55 5 Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}} | |
... (remaining 54 Ops account for 3.19%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
44.7% 44.7% 0.039s 3.52e-03s 11 133 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
43.9% 88.6% 0.038s 3.46e-03s 11 175 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.7% 89.3% 0.001s 5.69e-05s 11 181 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
input 0: dtype=int8, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
0.6% 89.9% 0.000s 4.40e-05s 11 26 GpuAdvancedSubtensor1(W, Reshape{1}.0) | |
input 0: dtype=float32, shape=(44, 100), strides=c | |
input 1: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.4% 90.3% 0.000s 3.07e-05s 11 51 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(12, 200), strides=(200, 1) | |
0.4% 90.7% 0.000s 3.06e-05s 11 49 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(12, 200), strides=(200, 1) | |
0.4% 91.0% 0.000s 2.92e-05s 11 50 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.4% 91.4% 0.000s 2.80e-05s 11 48 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(12, 100), strides=(100, 1) | |
0.3% 91.7% 0.000s 2.40e-05s 11 96 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
0.3% 92.0% 0.000s 2.26e-05s 11 182 HostFromGpu(GpuJoin.0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=c | |
0.3% 92.2% 0.000s 2.12e-05s 11 64 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 1: dtype=int64, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int64, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
0.3% 92.5% 0.000s 2.05e-05s 11 130 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.3% 92.8% 0.000s 2.02e-05s 11 71 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 93.0% 0.000s 1.92e-05s 11 73 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 93.2% 0.000s 1.89e-05s 11 160 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.2% 93.5% 0.000s 1.87e-05s 11 72 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
0.2% 93.7% 0.000s 1.85e-05s 11 74 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
0.1% 93.8% 0.000s 6.39e-06s 11 125 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
0.1% 93.9% 0.000s 6.35e-06s 11 159 Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(GE(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), (i2 + i3 + i4 + i5), i0)}((Composite{((Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), (i2 - i3), i0)}(Composite{((i0 - (Switch(LT(i1, i2), i2, i1) - i3)) - i4)}(i0, Composite{(((i0 - i1) // i2) + i3)}(i1, i2, i3, i4), i5, i6, i7), | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(), strides=c | |
input 3: dtype=int8, shape=(), strides=c | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=int8, shape=(), strides=c | |
input 6: dtype=int8, shape=(), strides=c | |
input 7: dtype=int8, shape=(), strides=c | |
input 8: dtype=int8, shape=(), strides=c | |
input 9: dtype=int8, shape=(), strides=c | |
input 10: dtype=int64, shape=(), strides=c | |
input 11: dtype=int64, shape=(), strides=c | |
input 12: dtype=int8, shape=(), strides=c | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
output 0: dtype=int64, shape=(), strides=c | |
0.1% 94.0% 0.000s 6.29e-06s 11 91 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(1, 1, 100), strides=c | |
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1) | |
... (remaining 163 Apply instances account for 6.04%(0.01s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 9KB (9KB) | |
GPU: 28KB (34KB) | |
CPU + GPU: 38KB (43KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 9KB (9KB) | |
GPU: 33KB (38KB) | |
CPU + GPU: 42KB (48KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 10KB | |
GPU: 52KB | |
CPU + GPU: 63KB | |
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(100, 200)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
9600B [(12, 1, 200)] c HostFromGpu(GpuJoin.0)
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
4800B [(12, 1, 100)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
4800B [(12, 1, 100)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
... (remaining 163 Apply account for 65253B/430053B (15.17%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
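Profiles like the blocks above come from Theano's built-in function profiler. A minimal sketch of how to produce one, assuming a Theano 0.8-era setup and a hypothetical entry point train.py (not part of this gist):

```shell
# Hedged sketch: enable the per-function profiler that prints the
# "Function profiling" blocks above. profiling.time_thunks=1 makes
# per-Apply-node ("thunk") times accumulate, avoiding the
# "No execution time accumulated" note seen earlier in this log.
# train.py is a hypothetical entry point.
THEANO_FLAGS='profile=True,profiling.time_thunks=1' python train.py
```

The same profiler can also be enabled per compiled function via theano.function(..., profile=True).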
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 3.813338e-02s
Total time spent in calling the VM 3.587055e-02s (94.066%)
Total overhead (computing slices..) 2.262831e-03s (5.934%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
55.7% 55.7% 0.009s 3.56e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
39.3% 95.0% 0.007s 1.67e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
5.0% 100.0% 0.001s 3.22e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00% (0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
55.7% 55.7% 0.009s 3.56e-05s C 264 2 GpuGemm{no_inplace}
13.4% 69.1% 0.002s 1.71e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
13.0% 82.1% 0.002s 1.66e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
12.9% 95.0% 0.002s 1.64e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.7% 97.6% 0.000s 3.42e-06s C 132 1 GpuSubtensor{::, :int64:}
2.4% 100.0% 0.000s 3.02e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00% (0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.8% 29.8% 0.005s 3.80e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
25.9% 55.7% 0.004s 3.31e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
13.4% 69.1% 0.002s 1.71e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
13.0% 82.1% 0.002s 1.66e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
12.9% 95.0% 0.002s 1.64e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.7% 97.6% 0.000s 3.42e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.4% 100.0% 0.000s 3.02e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
... (remaining 0 Apply instances account for 0.00% (0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 3.749466e-02s
Total time spent in calling the VM 3.560066e-02s (94.949%)
Total overhead (computing slices..) 1.893997e-03s (5.051%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
55.9% 55.9% 0.009s 3.55e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
39.2% 95.0% 0.007s 1.66e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
5.0% 100.0% 0.001s 3.18e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00% (0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
55.9% 55.9% 0.009s 3.55e-05s C 264 2 GpuGemm{no_inplace}
13.3% 69.1% 0.002s 1.69e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
13.0% 82.1% 0.002s 1.65e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
12.9% 95.0% 0.002s 1.65e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.6% 97.6% 0.000s 3.31e-06s C 132 1 GpuSubtensor{::, :int64:}
2.4% 100.0% 0.000s 3.04e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00% (0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.8% 29.8% 0.005s 3.79e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
26.1% 55.9% 0.004s 3.32e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
13.3% 69.1% 0.002s 1.69e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
13.0% 82.1% 0.002s 1.65e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
12.9% 95.0% 0.002s 1.65e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.6% 97.6% 0.000s 3.31e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.4% 100.0% 0.000s 3.04e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00% (0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
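The memory profiles above also estimate peak usage under allow_gc=False. A hedged sketch of trying that speed-for-memory trade-off, again with a hypothetical train.py entry point:

```shell
# allow_gc=False keeps intermediate GPU buffers allocated between
# calls instead of freeing them after each one, which is why the
# profiler reports a higher "Max peak memory if allow_gc=False"
# estimate (e.g. 52KB vs 28KB GPU in the first memory profile).
# train.py is a hypothetical entry point.
THEANO_FLAGS='allow_gc=False' python train.py
```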
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:111
Time in 11 calls to Function.__call__: 2.414465e-03s
Time in Function.fn.__call__: 2.146721e-03s (88.911%)
Time in thunks: 4.596710e-04s (19.038%)
Total compile time: 5.729262e+00s
Number of Apply nodes: 8
Theano Optimizer time: 3.657293e-02s
Theano validate time: 4.487038e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.374197e-02s
Import time 5.259037e-03s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.290s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
43.0% 43.0% 0.000s 1.80e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
20.6% 63.6% 0.000s 4.30e-06s C 22 2 theano.tensor.basic.Alloc
14.9% 78.5% 0.000s 3.12e-06s C 22 2 theano.compile.ops.Shape_i
7.2% 85.7% 0.000s 3.01e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
7.2% 92.9% 0.000s 2.99e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuReshape
7.1% 100.0% 0.000s 2.97e-06s C 11 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00% (0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
43.0% 43.0% 0.000s 1.80e-05s C 11 1 HostFromGpu
20.6% 63.6% 0.000s 4.30e-06s C 22 2 Alloc
14.9% 78.5% 0.000s 3.12e-06s C 22 2 Shape_i{0}
7.2% 85.7% 0.000s 3.01e-06s C 11 1 GpuDimShuffle{x,x,0}
7.2% 92.9% 0.000s 2.99e-06s C 11 1 GpuReshape{2}
7.1% 100.0% 0.000s 2.97e-06s C 11 1 MakeVector{dtype='int64'}
... (remaining 0 Ops account for 0.00% (0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
43.0% 43.0% 0.000s 1.80e-05s 11 7 HostFromGpu(GpuReshape{2}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
10.8% 53.8% 0.000s 4.53e-06s 11 4 Alloc(TensorConstant{0.0}, TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 12), strides=c
9.8% 63.6% 0.000s 4.07e-06s 11 1 Alloc(TensorConstant{0.0}, TensorConstant{1}, TensorConstant{200})
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int16, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
8.5% 72.0% 0.000s 3.53e-06s 11 0 Shape_i{0}(generator_generate_attended)
input 0: dtype=float32, shape=(12, 1, 200), strides=c
output 0: dtype=int64, shape=(), strides=c
7.2% 79.3% 0.000s 3.01e-06s 11 3 GpuDimShuffle{x,x,0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
7.2% 86.4% 0.000s 2.99e-06s 11 6 GpuReshape{2}(GpuDimShuffle{x,x,0}.0, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
7.1% 93.5% 0.000s 2.97e-06s 11 5 MakeVector{dtype='int64'}(TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(2,), strides=c
6.5% 100.0% 0.000s 2.71e-06s 11 2 Shape_i{0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00% (0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 0KB
CPU + GPU: 1KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 8 Apply account for 2080B/2080B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:126
Time in 176 calls to Function.__call__: 4.031258e-01s
Time in Function.fn.__call__: 3.963535e-01s (98.320%)
Time in thunks: 1.376257e-01s (34.140%)
Total compile time: 6.464948e+00s
Number of Apply nodes: 75
Theano Optimizer time: 4.475892e-01s
Theano validate time: 2.268028e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 1.257081e-01s
Import time 3.001761e-02s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.292s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
22.7% 22.7% 0.031s 1.77e-05s C 1760 10 theano.sandbox.cuda.basic_ops.GpuElemwise
17.7% 40.4% 0.024s 2.77e-05s C 880 5 theano.sandbox.cuda.blas.GpuDot22
14.8% 55.3% 0.020s 2.90e-05s C 704 4 theano.sandbox.cuda.blas.GpuGemm
8.6% 63.9% 0.012s 1.34e-05s C 880 5 theano.sandbox.cuda.basic_ops.GpuFromHost
8.1% 72.0% 0.011s 1.58e-05s C 704 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.6% 79.6% 0.011s 1.99e-05s C 528 3 theano.sandbox.cuda.basic_ops.GpuCAReduce
5.3% 84.9% 0.007s 4.17e-05s C 176 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.0% 87.9% 0.004s 2.90e-06s C 1408 8 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.9% 90.8% 0.004s 3.28e-06s C 1232 7 theano.sandbox.cuda.basic_ops.GpuReshape
2.8% 93.6% 0.004s 2.43e-06s C 1584 9 theano.tensor.elemwise.Elemwise
2.6% 96.2% 0.004s 2.54e-06s C 1408 8 theano.compile.ops.Shape_i
2.3% 98.5% 0.003s 2.52e-06s C 1232 7 theano.tensor.opt.MakeVector
0.9% 99.4% 0.001s 3.37e-06s C 352 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.7% 0.000s 2.54e-06s C 176 1 theano.tensor.elemwise.All
0.3% 100.0% 0.000s 2.43e-06s C 176 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00% (0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.7% 17.7% 0.024s 2.77e-05s C 880 5 GpuDot22
14.8% 32.6% 0.020s 2.90e-05s C 704 4 GpuGemm{inplace}
8.6% 41.2% 0.012s 1.34e-05s C 880 5 GpuFromHost
8.1% 49.3% 0.011s 1.58e-05s C 704 4 HostFromGpu
5.3% 54.6% 0.007s 4.17e-05s C 176 1 GpuAdvancedSubtensor1
2.9% 57.5% 0.004s 2.23e-05s C 176 1 GpuCAReduce{maximum}{1,0}
2.5% 59.9% 0.003s 3.22e-06s C 1056 6 GpuReshape{2}
2.4% 62.4% 0.003s 1.91e-05s C 176 1 GpuCAReduce{add}{1,0,0}
2.4% 64.8% 0.003s 1.90e-05s C 176 1 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)]
2.4% 67.2% 0.003s 1.87e-05s C 176 1 GpuElemwise{Mul}[(0, 1)]
2.4% 69.5% 0.003s 1.84e-05s C 176 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)]
2.3% 71.9% 0.003s 1.84e-05s C 176 1 GpuElemwise{mul,no_inplace}
2.3% 74.2% 0.003s 1.83e-05s C 176 1 GpuCAReduce{add}{1,0}
2.3% 76.5% 0.003s 1.78e-05s C 176 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.3% 78.8% 0.003s 2.52e-06s C 1232 7 MakeVector{dtype='int64'}
2.2% 81.0% 0.003s 1.73e-05s C 176 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.2% 83.2% 0.003s 1.71e-05s C 176 1 GpuElemwise{Sub}[(0, 1)]
2.2% 85.4% 0.003s 1.71e-05s C 176 1 GpuElemwise{Add}[(0, 0)]
2.2% 87.5% 0.003s 1.69e-05s C 176 1 GpuElemwise{TrueDiv}[(0, 0)]
2.1% 89.7% 0.003s 1.67e-05s C 176 1 GpuElemwise{Tanh}[(0, 0)]
... (remaining 21 Ops account for 10.34% (0.01s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.3% 5.3% 0.007s 4.17e-05s 176 10 GpuAdvancedSubtensor1(W, readout_sample_samples)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(1,), strides=(16,)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.5% 9.9% 0.006s 3.54e-05s 176 26 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.3% 14.2% 0.006s 3.36e-05s 176 34 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
3.8% 18.0% 0.005s 3.01e-05s 176 44 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.8% 21.8% 0.005s 2.96e-05s 176 21 GpuDot22(GpuFromHost.0, state_to_gates)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
3.4% 25.2% 0.005s 2.68e-05s 176 30 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
3.3% 28.5% 0.005s 2.60e-05s 176 43 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.2% 31.8% 0.004s 2.54e-05s 176 47 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)].0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.1% 34.9% 0.004s 2.42e-05s 176 53 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
3.0% 37.9% 0.004s 2.38e-05s 176 45 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.9% 40.8% 0.004s 2.23e-05s 176 55 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.4% 43.2% 0.003s 1.91e-05s 176 73 GpuCAReduce{add}{1,0,0}(GpuElemwise{Mul}[(0, 1)].0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.4% 45.6% 0.003s 1.90e-05s 176 50 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=c
input 2: dtype=float32, shape=(1, 1, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=c
2.4% 48.0% 0.003s 1.87e-05s 176 72 GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1, 1), strides=(1, 0, 0)
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
2.4% 50.4% 0.003s 1.84e-05s 176 46 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, GpuFromHost.0, CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(1, 100), strides=(0, 1)
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.3% 52.7% 0.003s 1.84e-05s 176 42 GpuElemwise{mul,no_inplace}(GpuFromHost.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.3% 55.1% 0.003s 1.83e-05s 176 59 GpuCAReduce{add}{1,0}(GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.3% 57.4% 0.003s 1.78e-05s 176 57 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)](GpuReshape{2}.0, GpuDimShuffle{x,0}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
input 2: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
2.2% 59.6% 0.003s 1.73e-05s 176 35 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(1, 200), strides=(0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.2% 61.8% 0.003s 1.71e-05s 176 58 GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[[ 1.]]}, GpuFromHost.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
... (remaining 55 Apply instances account for 38.24% (0.05s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 18KB
CPU + GPU: 20KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(200, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] v GpuReshape{2}(GpuFromHost.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] c GpuFromHost(generator_generate_attended)
9600B [(12, 1, 200)] i GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
4800B [(12, 1, 100)] i GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)].0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
... (remaining 66 Apply account for 13555B/146355B (9.26%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
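The memory profiles also report peak usage with inplace optimizations excluded. A hedged sketch of reproducing that configuration, assuming the same hypothetical train.py entry point:

```shell
# optimizer_excluding=inplace disables the inplace rewrites, so Apply
# nodes allocate fresh output storage instead of reusing inputs;
# compare the "optimizer_excluding=inplace" rows in the memory
# profiles above. train.py is a hypothetical entry point.
THEANO_FLAGS='optimizer_excluding=inplace' python train.py
```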
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:137
Time in 176 calls to Function.__call__: 9.610200e-02s
Time in Function.fn.__call__: 9.091020e-02s (94.598%)
Time in thunks: 3.702688e-02s (38.529%)
Total compile time: 4.753222e+00s
Number of Apply nodes: 14
Theano Optimizer time: 8.387494e-02s
Theano validate time: 2.176523e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.531886e-02s
Import time 3.646135e-03s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.305s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
32.1% 32.1% 0.012s 1.69e-05s C 704 4 theano.sandbox.cuda.basic_ops.GpuElemwise
17.9% 50.0% 0.007s 1.88e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuCAReduce
12.9% 62.9% 0.005s 1.36e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuFromHost
12.9% 75.8% 0.005s 2.71e-05s C 176 1 theano.sandbox.cuda.blas.GpuGemm
12.4% 88.2% 0.005s 2.61e-05s C 176 1 theano.sandbox.cuda.blas.GpuDot22
7.8% 96.0% 0.003s 1.64e-05s C 176 1 theano.sandbox.cuda.basic_ops.HostFromGpu
4.0% 100.0% 0.001s 2.80e-06s C 528 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00% (0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
12.9% 12.9% 0.005s 1.36e-05s C 352 2 GpuFromHost
12.9% 25.8% 0.005s 2.71e-05s C 176 1 GpuGemm{inplace}
12.4% 38.2% 0.005s 2.61e-05s C 176 1 GpuDot22
9.4% 47.6% 0.003s 1.98e-05s C 176 1 GpuCAReduce{maximum}{0,1}
8.7% 56.3% 0.003s 1.83e-05s C 176 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
8.5% 64.8% 0.003s 1.79e-05s C 176 1 GpuCAReduce{add}{0,1}
7.9% 72.8% 0.003s 1.67e-05s C 176 1 GpuElemwise{Add}[(0, 1)]
7.8% 80.6% 0.003s 1.64e-05s C 176 1 HostFromGpu
7.8% 88.3% 0.003s 1.64e-05s C 176 1 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
7.7% 96.0% 0.003s 1.61e-05s C 176 1 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)]
2.5% 98.5% 0.001s 2.65e-06s C 352 2 GpuDimShuffle{0,x}
1.5% 100.0% 0.001s 3.10e-06s C 176 1 GpuDimShuffle{x,0}
... (remaining 0 Ops account for 0.00% (0.00s) of the runtime)
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
12.9% 12.9% 0.005s 2.71e-05s 176 4 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
12.4% 25.3% 0.005s 2.61e-05s 176 3 GpuDot22(GpuFromHost.0, W) | |
input 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
9.4% 34.7% 0.003s 1.98e-05s 176 6 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1,), strides=(0,) | |
8.7% 43.4% 0.003s 1.83e-05s 176 8 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}(GpuElemwise{Add}[(0, 1)].0, GpuDimShuffle{0,x}.0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=c | |
8.5% 51.9% 0.003s 1.79e-05s 176 9 GpuCAReduce{add}{0,1}(GpuElemwise{Composite{exp((i0 - i1))},no_inplace}.0) | |
input 0: dtype=float32, shape=(1, 44), strides=c | |
output 0: dtype=float32, shape=(1,), strides=c | |
7.9% 59.8% 0.003s 1.67e-05s 176 5 GpuElemwise{Add}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
7.8% 67.6% 0.003s 1.64e-05s 176 13 HostFromGpu(GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
output 0: dtype=float32, shape=(1, 44), strides=c | |
7.8% 75.4% 0.003s 1.64e-05s 176 11 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)](GpuDimShuffle{0,x}.0, GpuDimShuffle{0,x}.0) | |
input 0: dtype=float32, shape=(1, 1), strides=c | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(1, 1), strides=c | |
7.7% 83.1% 0.003s 1.61e-05s 176 12 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)](GpuElemwise{Add}[(0, 1)].0, GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
7.5% 90.6% 0.003s 1.58e-05s 176 0 GpuFromHost(generator_generate_weighted_averages) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
output 0: dtype=float32, shape=(1, 200), strides=(0, 1) | |
5.4% 96.0% 0.002s 1.15e-05s 176 1 GpuFromHost(generator_generate_states) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
output 0: dtype=float32, shape=(1, 100), strides=(0, 1) | |
1.5% 97.5% 0.001s 3.10e-06s 176 2 GpuDimShuffle{x,0}(b) | |
input 0: dtype=float32, shape=(44,), strides=c | |
output 0: dtype=float32, shape=(1, 44), strides=(0, 1) | |
1.3% 98.7% 0.000s 2.67e-06s 176 10 GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0) | |
input 0: dtype=float32, shape=(1,), strides=c | |
output 0: dtype=float32, shape=(1, 1), strides=c | |
1.3% 100.0% 0.000s 2.63e-06s 176 7 GpuDimShuffle{0,x}(GpuCAReduce{maximum}{0,1}.0) | |
input 0: dtype=float32, shape=(1,), strides=(0,) | |
output 0: dtype=float32, shape=(1, 1), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
    (For values in brackets, it's for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 1KB (1KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 1KB (1KB) | |
CPU + GPU: 2KB (2KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 0KB | |
GPU: 2KB | |
CPU + GPU: 2KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 14 Apply account for 2452B/2452B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
<created/inplace/view> is taken from the Op's declaration. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181 | |
Time in 1 calls to Function.__call__: 1.907349e-05s | |
Time in Function.fn.__call__: 5.006790e-06s (26.250%) | |
Total compile time: 5.178439e+00s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 5.979061e-03s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 9.393692e-05s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.307s | |
No execution time accumulated (hint: try config profiling.time_thunks=1) | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286 | |
Time in 6075 calls to Function.__call__: 3.723266e-01s | |
Time in Function.fn.__call__: 2.196813e-01s (59.002%) | |
Time in thunks: 4.040527e-02s (10.852%) | |
Total compile time: 3.941077e+00s | |
Number of Apply nodes: 2 | |
Theano Optimizer time: 7.288933e-03s | |
Theano validate time: 0.000000e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.483917e-03s | |
Import time 0.000000e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.307s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
100.0% 100.0% 0.040s 3.33e-06s C 12150 2 theano.compile.ops.DeepCopyOp | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
100.0% 100.0% 0.040s 3.33e-06s C 12150 2 DeepCopyOp | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
60.4% 60.4% 0.024s 4.01e-06s 6075 0 DeepCopyOp(labels) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
39.6% 100.0% 0.016s 2.64e-06s 6075 1 DeepCopyOp(inputs) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
    (For values in brackets, it's for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 0KB (0KB) | |
CPU + GPU: 0KB (0KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 0KB | |
GPU: 0KB | |
CPU + GPU: 0KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
... (remaining 2 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes) | |
All Apply nodes have output sizes that take less than 1024B. | |
Function profiling | |
================== | |
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/algorithms/__init__.py:253 | |
Time in 100 calls to Function.__call__: 8.755362e+01s | |
Time in Function.fn.__call__: 8.736853e+01s (99.789%) | |
Time in thunks: 2.631522e+01s (30.056%) | |
Total compile time: 2.758291e+02s | |
Number of Apply nodes: 3579 | |
Theano Optimizer time: 1.544500e+02s | |
Theano validate time: 5.072355e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.115705e+02s | |
Import time 1.638190e+00s | |
Time in all call to theano.grad() 2.838947e+00s | |
Time since theano import 673.308s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
78.3% 78.3% 20.607s 2.94e-02s Py 700 7 theano.scan_module.scan_op.Scan | |
6.5% 84.8% 1.718s 2.05e-05s C 83700 837 theano.sandbox.cuda.basic_ops.GpuElemwise | |
3.9% 88.7% 1.028s 1.03e-02s Py 100 1 lvsr.ops.EditDistanceOp | |
2.5% 91.3% 0.661s 2.67e-05s C 24700 247 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
2.1% 93.3% 0.548s 7.40e-05s C 7400 74 theano.sandbox.cuda.blas.GpuDot22 | |
1.4% 94.7% 0.367s 3.68e-06s C 99700 997 theano.tensor.elemwise.Elemwise | |
1.1% 95.8% 0.276s 1.73e-05s C 16000 160 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.6% 96.4% 0.164s 2.28e-05s Py 7200 48 theano.ifelse.IfElse | |
0.6% 97.0% 0.153s 2.74e-05s C 5600 56 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.5% 97.5% 0.134s 8.20e-06s C 16300 163 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.5% 98.0% 0.129s 2.58e-05s C 5000 50 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.4% 98.4% 0.118s 3.42e-06s C 34600 346 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.2% 98.6% 0.056s 1.99e-05s C 2800 28 theano.compile.ops.DeepCopyOp | |
0.2% 98.8% 0.051s 3.83e-06s C 13300 133 theano.tensor.opt.MakeVector | |
0.2% 99.0% 0.047s 4.59e-06s C 10200 102 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.1% 99.2% 0.039s 3.61e-06s C 10700 107 theano.compile.ops.Shape_i | |
0.1% 99.3% 0.037s 1.75e-05s C 2100 21 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.1% 99.4% 0.031s 1.02e-04s Py 300 3 theano.sandbox.cuda.basic_ops.GpuSplit | |
0.1% 99.5% 0.030s 3.04e-06s C 9800 98 theano.tensor.basic.ScalarFromTensor | |
0.1% 99.6% 0.021s 5.34e-05s C 400 4 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
... (remaining 21 Classes account for 0.39%(0.10s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
33.1% 33.1% 8.707s 8.71e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan} | |
15.6% 48.7% 4.113s 2.06e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan} | |
13.0% 61.7% 3.412s 3.41e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan} | |
11.3% 73.0% 2.984s 2.98e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan} | |
5.3% 78.3% 1.390s 6.95e-03s Py 200 2 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
3.9% 82.2% 1.028s 1.03e-02s Py 100 1 EditDistanceOp | |
2.1% 84.3% 0.548s 7.40e-05s C 7400 74 GpuDot22 | |
1.1% 85.3% 0.276s 1.73e-05s C 16000 160 HostFromGpu | |
1.0% 86.3% 0.262s 3.12e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1} | |
0.9% 87.2% 0.235s 2.12e-05s C 11100 111 GpuElemwise{add,no_inplace} | |
0.7% 87.9% 0.182s 2.12e-05s C 8600 86 GpuElemwise{sub,no_inplace} | |
0.6% 88.5% 0.152s 2.45e-05s Py 6200 39 if{gpu} | |
0.6% 89.1% 0.148s 2.28e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace} | |
0.5% 89.6% 0.143s 2.99e-05s C 4800 48 GpuCAReduce{add}{1,1} | |
0.5% 90.1% 0.138s 2.16e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace} | |
0.5% 90.6% 0.128s 1.97e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)] | |
0.5% 91.1% 0.128s 1.88e-05s C 6800 68 GpuElemwise{Mul}[(0, 0)] | |
0.5% 91.6% 0.127s 2.15e-05s C 5900 59 GpuElemwise{Switch,no_inplace} | |
0.5% 92.1% 0.126s 1.95e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)] | |
0.5% 92.5% 0.121s 2.06e-05s C 5900 59 GpuElemwise{Composite{(i0 * (i1 ** i2))},no_inplace} | |
... (remaining 251 Ops account for 7.47%(1.96s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
33.1% 33.1% 8.707s 8.71e-02s 100 2437 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1) | |
input 2: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 3: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 4: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 5: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 6: dtype=float32, shape=(15, 10, 1), strides=(-10, 1, 0) | |
input 7: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0) | |
input 8: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 9: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1) | |
input 10: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 11: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 12: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 13: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 14: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 15: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
input 16: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
input 17: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
input 18: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
input 19: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
input 20: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
input 21: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
input 22: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
input 23: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
input 24: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
input 25: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
input 26: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
input 27: dtype=int64, shape=(), strides=c | |
input 28: dtype=int64, shape=(), strides=c | |
input 29: dtype=int64, shape=(), strides=c | |
input 30: dtype=int64, shape=(), strides=c | |
input 31: dtype=int64, shape=(), strides=c | |
input 32: dtype=int64, shape=(), strides=c | |
input 33: dtype=int64, shape=(), strides=c | |
input 34: dtype=int64, shape=(), strides=c | |
input 35: dtype=float32, shape=(100, 200), strides=c | |
input 36: dtype=float32, shape=(200, 200), strides=c | |
input 37: dtype=float32, shape=(100, 100), strides=c | |
input 38: dtype=float32, shape=(200, 100), strides=c | |
input 39: dtype=float32, shape=(100, 100), strides=c | |
input 40: dtype=float32, shape=(200, 200), strides=(1, 200) | |
input 41: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 42: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 43: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 44: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 45: dtype=int64, shape=(2,), strides=c | |
input 46: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 47: dtype=int64, shape=(1,), strides=c | |
input 48: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 49: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 50: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 51: dtype=int8, shape=(10,), strides=c | |
input 52: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 53: dtype=float32, shape=(100, 200), strides=c | |
input 54: dtype=float32, shape=(200, 200), strides=c | |
input 55: dtype=float32, shape=(100, 100), strides=c | |
input 56: dtype=float32, shape=(200, 100), strides=c | |
input 57: dtype=float32, shape=(100, 100), strides=c | |
input 58: dtype=float32, shape=(200, 200), strides=(1, 200) | |
input 59: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 60: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 61: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 62: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 63: dtype=int64, shape=(2,), strides=c | |
input 64: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 65: dtype=int64, shape=(1,), strides=c | |
input 66: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 67: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 68: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 69: dtype=int8, shape=(10,), strides=c | |
input 70: dtype=float32, shape=(1, 100), strides=(0, 1) | |
output 0: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
output 1: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
output 3: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
output 4: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
output 5: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
output 6: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
output 9: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
output 12: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 13: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
output 14: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 15: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1) | |
output 16: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 17: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
output 18: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 19: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1) | |
13.0% 46.1% 3.412s 3.41e-02s 100 2149 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
input 2: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
input 3: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0) | |
input 4: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0) | |
input 5: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
input 6: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
input 7: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
input 8: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
input 9: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
input 10: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
input 11: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
input 12: dtype=float32, shape=(100, 200), strides=c | |
input 13: dtype=float32, shape=(200, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(200, 100), strides=c | |
input 16: dtype=float32, shape=(100, 100), strides=c | |
input 17: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 18: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 19: dtype=int64, shape=(1,), strides=c | |
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 21: dtype=int8, shape=(10,), strides=c | |
input 22: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(200, 200), strides=c | |
input 25: dtype=float32, shape=(100, 100), strides=c | |
input 26: dtype=float32, shape=(200, 100), strides=c | |
input 27: dtype=float32, shape=(100, 100), strides=c | |
input 28: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 29: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 30: dtype=int64, shape=(1,), strides=c | |
input 31: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 32: dtype=int8, shape=(10,), strides=c | |
input 33: dtype=float32, shape=(100, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
output 3: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
output 4: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
11.3% 57.4% 2.984s 2.98e-02s 100 1850 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, G | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1) | |
input 2: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1) | |
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=float32, shape=(100, 44), strides=c | |
input 6: dtype=float32, shape=(200, 44), strides=c | |
input 7: dtype=float32, shape=(100, 200), strides=c | |
input 8: dtype=float32, shape=(200, 200), strides=c | |
input 9: dtype=float32, shape=(45, 100), strides=c | |
input 10: dtype=float32, shape=(100, 200), strides=c | |
input 11: dtype=float32, shape=(100, 100), strides=c | |
input 12: dtype=float32, shape=(200, 100), strides=c | |
input 13: dtype=float32, shape=(100, 100), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 16: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 17: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 18: dtype=int64, shape=(1,), strides=c | |
input 19: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 21: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 22: dtype=int8, shape=(10,), strides=c | |
input 23: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 0: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1) | |
output 1: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1) | |
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
output 3: dtype=int64, shape=(15, 10), strides=c | |
7.8% 65.2% 2.057s 2.06e-02s 100 2632 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtenso | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=int64, shape=(), strides=c | |
input 18: dtype=int64, shape=(), strides=c | |
input 19: dtype=float32, shape=(100, 200), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 22: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
input 25: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 26: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
7.8% 73.0% 2.056s 2.06e-02s 100 2631 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=int64, shape=(), strides=c | |
input 18: dtype=int64, shape=(), strides=c | |
input 19: dtype=float32, shape=(100, 200), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 22: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
input 25: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 26: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
3.9% 76.9% 1.028s 1.03e-02s 100 2005 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask11) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
input 1: dtype=float32, shape=(15, 10), strides=c | |
input 2: dtype=int64, shape=(12, 10), strides=c | |
input 3: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(15, 10, 1), strides=c | |
2.6% 79.6% 0.696s 6.96e-03s 100 1642 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}. | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
2.6% 82.2% 0.694s 6.94e-03s 100 1652 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
0.0% 82.3% 0.013s 1.31e-04s 100 2467 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(200, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(200, 200), strides=(200, 1) | |
0.0% 82.3% 0.013s 1.31e-04s 100 2463 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(200, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(200, 200), strides=(200, 1) | |
0.0% 82.4% 0.013s 1.28e-04s 100 2462 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(100, 200), strides=(200, 1) | |
0.0% 82.4% 0.012s 1.25e-04s 100 2468 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(100, 200), strides=(200, 1) | |
0.0% 82.5% 0.012s 1.24e-04s 100 2547 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 150), strides=(1, 100) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(100, 200), strides=(200, 1) | |
0.0% 82.5% 0.012s 1.19e-04s 100 1117 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(120, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(120, 200), strides=(200, 1) | |
0.0% 82.5% 0.012s 1.16e-04s 100 2486 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
input 0: dtype=float32, shape=(120, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(1, 100) | |
output 0: dtype=float32, shape=(120, 200), strides=(200, 1) | |
0.0% 82.6% 0.012s 1.16e-04s 100 2540 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 150), strides=(1, 100) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(100, 200), strides=(200, 1) | |
0.0% 82.6% 0.012s 1.16e-04s 100 2588 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
0.0% 82.7% 0.012s 1.15e-04s 100 1143 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(120, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(120, 200), strides=(200, 1) | |
0.0% 82.7% 0.011s 1.10e-04s 100 2590 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 1: dtype=int8, shape=(), strides=c | |
input 2: dtype=int64, shape=(2,), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
0.0% 82.8% 0.011s 1.09e-04s 100 2664 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 120), strides=(120, 1) | |
input 1: dtype=float32, shape=(120, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(100, 200), strides=(200, 1) | |
... (remaining 3559 Apply instances account for 17.24%(4.54s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 58KB (62KB) | |
GPU: 3739KB (5373KB) | |
CPU + GPU: 3797KB (5435KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 57KB (62KB) | |
GPU: 5605KB (6697KB) | |
CPU + GPU: 5662KB (6758KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 114KB | |
GPU: 17091KB | |
CPU + GPU: 17205KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, 
state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0) | |
750480B [(1, 10, 100), (1, 10, 200), (2, 92160), (15, 10)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
391680B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200)] i i i i i forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0) | |
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0) | |
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>) | |
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0) | |
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6)) - i3)}}.0, max_attended_length, generator_generate_batch_size, Elemwise{add,no_inplace}.0) | |
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6)) - i3)}}.0, Elemwise{sub,no_inplace}.0, Elemwise{switch,no_inplace}.0, Elemwise{add,no_inplace}.0) | |
160000B [(200, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
160000B [(200, 200)] v Assert{msg='Theano Assert failed!'}(GpuDot22.0, Elemwise{eq,no_inplace}.0, Elemwise{eq,no_inplace}.0) | |
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0) | |
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0) | |
160000B [(200, 200)] i GpuElemwise{Sub}[(0, 0)](W, GpuElemwise{Switch,no_inplace}.0) | |
160000B [(200, 200)] c GpuElemwise{Switch,no_inplace}(GpuElemwise{Composite{Cast{float32}(GT((IsNan(i0) + IsInf(i0)), i1))}}[(0, 0)].0, W, GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0) | |
160000B [(200, 200)] i GpuElemwise{Mul}[(0, 0)](Assert{msg='Theano Assert failed!'}.0, GpuDimShuffle{x,x}.0) | |
160000B [(200, 200)] i GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)](GpuDimShuffle{x,x}.0, GpuElemwise{Mul}[(0, 0)].0, GpuDimShuffle{x,x}.0, variance) | |
... (remaining 3559 Apply account for 46215415B/54155015B ((85.34%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
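For reference, profiles like the one above come from Theano's built-in profiler. A minimal sketch of the relevant flags follows; the entry-point script name is a hypothetical placeholder, and the commented variants are assumptions about how the alternative memory figures in the report were obtained, not settings confirmed by this run:

```shell
# Enable the function profiler; per-node timings need time_thunks
# (the report itself hints at profiling.time_thunks=1 when no
# execution time has accumulated).
export THEANO_FLAGS="profile=True,profiling.time_thunks=1"

# The memory section compares peaks under other settings; these flags
# would reproduce those configurations:
# export THEANO_FLAGS="profile=True,allow_gc=False"
# export THEANO_FLAGS="profile=True,optimizer_excluding=inplace"

python train.py   # hypothetical entry point; substitute your own script
```

Note that `allow_gc=False` trades peak GPU memory (17091KB vs 3739KB in the report above) for fewer allocations per call.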
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1200 steps) 6.864700e-01s | |
Total time spent in calling the VM 6.679530e-01s (97.303%) | |
Total overhead (computing slices..) 1.851702e-02s (2.697%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
53.4% 53.4% 0.172s 3.59e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm | |
41.7% 95.1% 0.134s 1.87e-05s C 7200 6 theano.sandbox.cuda.basic_ops.GpuElemwise | |
4.9% 100.0% 0.016s 3.30e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
53.4% 53.4% 0.172s 3.59e-05s C 4800 4 GpuGemm{no_inplace} | |
15.4% 68.8% 0.050s 2.07e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace} | |
13.5% 82.3% 0.043s 1.81e-05s C 2400 2 GpuElemwise{mul,no_inplace} | |
12.8% 95.1% 0.041s 1.72e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
2.6% 97.7% 0.008s 3.48e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
2.3% 100.0% 0.008s 3.13e-06s C 2400 2 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
14.5% 14.5% 0.047s 3.90e-05s 1200 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
13.8% 28.3% 0.045s 3.72e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
12.6% 40.9% 0.041s 3.38e-05s 1200 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
12.5% 53.4% 0.040s 3.37e-05s 1200 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.8% 61.2% 0.025s 2.09e-05s 1200 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.6% 68.8% 0.024s 2.04e-05s 1200 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
6.9% 75.7% 0.022s 1.84e-05s 1200 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
6.6% 82.3% 0.021s 1.78e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
6.5% 88.7% 0.021s 1.74e-05s 1200 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
6.4% 95.1% 0.021s 1.71e-05s 1200 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.3% 96.4% 0.004s 3.56e-06s 1200 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.3% 97.7% 0.004s 3.40e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.2% 98.9% 0.004s 3.22e-06s 1200 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.1% 100.0% 0.004s 3.04e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 20KB (27KB) | |
CPU + GPU: 20KB (27KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 20KB (27KB) | |
CPU + GPU: 20KB (27KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 39KB | |
CPU + GPU: 39KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
... (remaining 0 Apply account for 0B/72000B ((0.00%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1200 steps) 6.850390e-01s | |
Total time spent in calling the VM 6.670289e-01s (97.371%) | |
Total overhead (computing slices..) 1.801014e-02s (2.629%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
53.5% 53.5% 0.172s 3.59e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm | |
41.6% 95.1% 0.134s 1.86e-05s C 7200 6 theano.sandbox.cuda.basic_ops.GpuElemwise | |
4.9% 100.0% 0.016s 3.28e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
53.5% 53.5% 0.172s 3.59e-05s C 4800 4 GpuGemm{no_inplace} | |
15.3% 68.8% 0.049s 2.05e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace} | |
13.5% 82.2% 0.043s 1.81e-05s C 2400 2 GpuElemwise{mul,no_inplace} | |
12.9% 95.1% 0.041s 1.73e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
2.6% 97.7% 0.008s 3.48e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
2.3% 100.0% 0.007s 3.09e-06s C 2400 2 GpuSubtensor{::, int64::} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
14.5% 14.5% 0.047s 3.90e-05s 1200 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
13.8% 28.3% 0.045s 3.71e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
12.6% 40.9% 0.041s 3.38e-05s 1200 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
12.6% 53.5% 0.041s 3.38e-05s 1200 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.7% 61.2% 0.025s 2.07e-05s 1200 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
7.6% 68.8% 0.024s 2.03e-05s 1200 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
input 5: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
6.8% 75.6% 0.022s 1.83e-05s 1200 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
6.6% 82.2% 0.021s 1.79e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
6.4% 88.7% 0.021s 1.73e-05s 1200 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
6.4% 95.1% 0.021s 1.73e-05s 1200 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.3% 96.4% 0.004s 3.49e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.3% 97.7% 0.004s 3.47e-06s 1200 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.1% 98.9% 0.004s 3.09e-06s 1200 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.1% 100.0% 0.004s 3.08e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
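The fused `GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))}}` nodes above are the masked gated-recurrent state update. A minimal NumPy sketch (not Theano code; the names `mask`, `preact`, `z`, and `h_prev` are assumptions read off the op's inputs, with `i5` assumed to be the complement `1 - mask`):

```python
import numpy as np

def gru_masked_update(mask, preact, z, h_prev):
    """Sketch of the fused elemwise
    (i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3):
    rows of the batch whose mask is 0 (past end of sequence)
    simply carry their previous state forward."""
    h_new = np.tanh(preact) * z + h_prev * (1.0 - z)
    return mask * h_new + (1.0 - mask) * h_prev

rng = np.random.RandomState(0)
mask = np.array([[1.0]] * 5 + [[0.0]] * 5, dtype=np.float32)    # (10, 1)
preact = rng.randn(10, 100).astype(np.float32)                  # GpuGemm output
z = 1.0 / (1.0 + np.exp(-rng.randn(10, 100).astype(np.float32)))  # update gate
h_prev = rng.randn(10, 100).astype(np.float32)

h = gru_masked_update(mask, preact, z, h_prev)
# masked-out rows are carried over unchanged
assert np.allclose(h[5:], h_prev[5:])
```

Fusing the whole update into one elemwise is why a single kernel launch covers it, instead of one launch per arithmetic op.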
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 20KB (27KB) | |
CPU + GPU: 20KB (27KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 20KB (27KB) | |
CPU + GPU: 20KB (27KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 39KB | |
CPU + GPU: 39KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
... (remaining 0 Apply account for 0B/72000B ((0.00%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Scan Op profiling ( generator_generate_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1500 steps) 2.965537e+00s | |
Total time spent in calling the VM 2.812608e+00s (94.843%) | |
Total overhead (computing slices..) 1.529298e-01s (5.157%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
29.5% 29.5% 0.372s 1.91e-05s C 19500 13 theano.sandbox.cuda.basic_ops.GpuElemwise | |
17.2% 46.7% 0.217s 2.89e-05s C 7500 5 theano.sandbox.cuda.blas.GpuGemm | |
15.2% 61.9% 0.192s 2.56e-05s C 7500 5 theano.sandbox.cuda.blas.GpuDot22 | |
11.6% 73.4% 0.146s 1.95e-05s C 7500 5 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
5.2% 78.6% 0.065s 4.35e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
5.2% 83.8% 0.065s 4.35e-05s C 1500 1 theano.sandbox.rng_mrg.GPU_mrg_uniform | |
4.4% 88.2% 0.056s 1.87e-05s C 3000 2 theano.sandbox.cuda.basic_ops.HostFromGpu | |
2.2% 90.4% 0.028s 1.84e-05s C 1500 1 theano.tensor.basic.MaxAndArgmax | |
1.9% 92.3% 0.024s 1.59e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuFromHost | |
1.8% 94.1% 0.023s 2.56e-06s C 9000 6 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.4% 95.5% 0.018s 2.38e-06s C 7500 5 theano.compile.ops.Shape_i | |
1.2% 96.8% 0.015s 3.41e-06s C 4500 3 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.9% 97.7% 0.012s 1.97e-06s C 6000 4 theano.tensor.opt.MakeVector | |
0.8% 98.5% 0.010s 3.48e-06s C 3000 2 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.6% 99.1% 0.007s 2.32e-06s C 3000 2 theano.tensor.elemwise.Elemwise | |
0.5% 99.6% 0.007s 4.48e-06s C 1500 1 theano.sandbox.multinomial.MultinomialFromUniform | |
0.4% 100.0% 0.005s 3.31e-06s C 1500 1 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
17.2% 17.2% 0.217s 2.89e-05s C 7500 5 GpuGemm{inplace} | |
15.2% 32.4% 0.192s 2.56e-05s C 7500 5 GpuDot22 | |
5.2% 37.6% 0.065s 4.35e-05s C 1500 1 GpuAdvancedSubtensor1 | |
5.2% 42.7% 0.065s 4.35e-05s C 1500 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace} | |
5.0% 47.7% 0.063s 2.10e-05s C 3000 2 GpuElemwise{mul,no_inplace} | |
4.4% 52.2% 0.056s 1.87e-05s C 3000 2 HostFromGpu | |
2.8% 55.0% 0.036s 2.37e-05s C 1500 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace} | |
2.6% 57.6% 0.033s 2.17e-05s C 1500 1 GpuCAReduce{add}{1,0,0} | |
2.6% 60.1% 0.032s 2.15e-05s C 1500 1 GpuCAReduce{maximum}{1,0} | |
2.5% 62.6% 0.031s 2.09e-05s C 1500 1 GpuElemwise{add,no_inplace} | |
2.3% 64.9% 0.029s 1.91e-05s C 1500 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
2.2% 67.1% 0.028s 1.88e-05s C 1500 1 GpuCAReduce{maximum}{0,1} | |
2.2% 69.3% 0.028s 1.84e-05s C 1500 1 MaxAndArgmax | |
2.2% 71.5% 0.027s 1.83e-05s C 1500 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
2.2% 73.6% 0.027s 1.81e-05s C 1500 1 GpuElemwise{Add}[(0, 1)] | |
2.1% 75.8% 0.027s 1.80e-05s C 1500 1 GpuElemwise{Tanh}[(0, 0)] | |
2.1% 77.9% 0.027s 1.80e-05s C 1500 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace} | |
2.1% 80.0% 0.027s 1.78e-05s C 1500 1 GpuElemwise{TrueDiv}[(0, 0)] | |
2.1% 82.1% 0.027s 1.77e-05s C 1500 1 GpuCAReduce{add}{1,0} | |
2.1% 84.2% 0.027s 1.77e-05s C 1500 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)] | |
... (remaining 20 Ops account for 15.75%(0.20s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
5.2% 5.2% 0.065s 4.35e-05s 1500 29 GpuAdvancedSubtensor1(W_copy[cuda], argmax) | |
input 0: dtype=float32, shape=(45, 100), strides=c | |
input 1: dtype=int64, shape=(10,), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
5.2% 10.3% 0.065s 4.35e-05s 1500 13 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
input 0: dtype=float32, shape=(92160,), strides=c | |
input 1: dtype=int64, shape=(1,), strides=c | |
output 0: dtype=float32, shape=(92160,), strides=c | |
output 1: dtype=float32, shape=(10,), strides=c | |
4.2% 14.6% 0.053s 3.55e-05s 1500 10 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
3.6% 18.2% 0.046s 3.06e-05s 1500 38 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.4% 21.6% 0.043s 2.90e-05s 1500 5 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
3.3% 25.0% 0.042s 2.79e-05s 1500 8 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 44), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 44), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
3.2% 28.1% 0.040s 2.67e-05s 1500 32 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
3.0% 31.1% 0.038s 2.52e-05s 1500 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 44), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
2.9% 34.1% 0.037s 2.47e-05s 1500 37 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.9% 37.0% 0.037s 2.46e-05s 1500 41 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.9% 39.9% 0.037s 2.44e-05s 1500 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(120, 100), strides=c | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(120, 1), strides=c | |
2.8% 42.7% 0.036s 2.39e-05s 1500 39 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.8% 45.6% 0.036s 2.37e-05s 1500 40 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
input 0: dtype=float32, shape=(1, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(1, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.8% 48.3% 0.035s 2.34e-05s 1500 56 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda]) | |
input 0: dtype=float32, shape=(12, 10, 1), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
2.6% 50.9% 0.033s 2.17e-05s 1500 57 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
input 0: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.6% 53.5% 0.032s 2.15e-05s 1500 48 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=float32, shape=(10,), strides=c | |
2.5% 56.0% 0.031s 2.09e-05s 1500 43 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 10, 100), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=c | |
output 0: dtype=float32, shape=(12, 10, 100), strides=c | |
2.4% 58.4% 0.030s 2.01e-05s 1500 25 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 44), strides=c | |
output 0: dtype=float32, shape=(10, 44), strides=c | |
2.3% 60.6% 0.029s 1.91e-05s 1500 33 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(1, 200), strides=c | |
input 1: dtype=float32, shape=(10, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.2% 62.9% 0.028s 1.88e-05s 1500 18 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0) | |
input 0: dtype=float32, shape=(10, 44), strides=c | |
output 0: dtype=float32, shape=(10,), strides=c | |
... (remaining 38 Apply instances account for 37.13%(0.47s) of the runtime) | |
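The chain `GpuCAReduce{maximum}` → `exp(i0 - i1)` → `TrueDiv` → `mul` → `GpuCAReduce{add}{1,0,0}` visible above is a numerically stable softmax over the time axis followed by a weighted average of the attended sequence. A NumPy sketch under those assumptions (shapes taken from the profile: energies `(12, 10)`, attended `(12, 10, 200)`):

```python
import numpy as np

def attention_weighted_average(energies, attended):
    """Stable softmax over time (subtract the per-column max before
    exp), then sum weights * attended over the time axis."""
    m = energies.max(axis=0, keepdims=True)        # GpuCAReduce{maximum}
    w = np.exp(energies - m)                       # exp(i0 - i1)
    w /= w.sum(axis=0, keepdims=True)              # TrueDiv: softmax weights
    return (w[:, :, None] * attended).sum(axis=0)  # GpuCAReduce{add}{1,0,0}

energies = np.random.RandomState(1).randn(12, 10).astype(np.float32)
attended = np.random.RandomState(2).randn(12, 10, 200).astype(np.float32)
avg = attention_weighted_average(energies, attended)
assert avg.shape == (10, 200)
```

Subtracting the maximum does not change the softmax result but keeps `exp` from overflowing in float32.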
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 5KB (5KB) | |
GPU: 465KB (465KB) | |
CPU + GPU: 471KB (471KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 5KB (5KB) | |
GPU: 465KB (465KB) | |
CPU + GPU: 471KB (471KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 5KB | |
GPU: 540KB | |
CPU + GPU: 545KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda]) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda]) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
4000B [(10, 100)] v GpuReshape{2}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0) | |
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0) | |
4000B [(10, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax) | |
4000B [(10, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda]) | |
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda]) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100}) | |
4000B [(10, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]}) | |
... (remaining 38 Apply account for 21274B/709954B ((3.00%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
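The per-Apply byte counts in these memory profiles are simply the output element count times 4 bytes (float32). A quick check against the shapes reported above:

```python
# float32 = 4 bytes per element
assert 12 * 10 * 200 * 4 == 96000   # (12, 10, 200) weighted-average elemwise outputs
assert 120 * 100 * 4 == 48000       # (120, 100) reshaped energy matrices
assert 10 * 200 * 4 == 8000         # (10, 200) per-step GEMM / sigmoid outputs
assert 10 * 100 * 4 == 4000         # (10, 100) per-step state outputs
```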
Scan Op profiling ( attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1500 steps) 3.388380e+00s | |
Total time spent in calling the VM 3.311884e+00s (97.742%) | |
Total overhead (computing slices..) 7.649612e-02s (2.258%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
37.8% 37.8% 0.518s 1.92e-05s C 27000 18 theano.sandbox.cuda.basic_ops.GpuElemwise | |
22.4% 60.2% 0.307s 2.56e-05s C 12000 8 theano.sandbox.cuda.blas.GpuDot22 | |
14.3% 74.4% 0.196s 3.26e-05s C 6000 4 theano.sandbox.cuda.blas.GpuGemm | |
12.6% 87.0% 0.172s 1.92e-05s C 9000 6 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
3.5% 90.5% 0.047s 1.58e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuFromHost | |
2.5% 93.0% 0.035s 2.56e-06s C 13500 9 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.6% 94.5% 0.021s 2.37e-06s C 9000 6 theano.compile.ops.Shape_i | |
1.5% 96.0% 0.020s 3.37e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
1.4% 97.4% 0.019s 3.24e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuReshape | |
1.0% 98.4% 0.014s 2.29e-06s C 6000 4 theano.tensor.elemwise.Elemwise | |
0.8% 99.3% 0.012s 1.94e-06s C 6000 4 theano.tensor.opt.MakeVector | |
0.7% 100.0% 0.010s 3.30e-06s C 3000 2 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
22.4% 22.4% 0.307s 2.56e-05s C 12000 8 GpuDot22 | |
14.3% 36.7% 0.196s 3.26e-05s C 6000 4 GpuGemm{inplace} | |
9.1% 45.8% 0.125s 2.09e-05s C 6000 4 GpuElemwise{mul,no_inplace} | |
4.8% 50.6% 0.065s 2.18e-05s C 3000 2 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace} | |
4.7% 55.3% 0.065s 2.17e-05s C 3000 2 GpuCAReduce{maximum}{1,0} | |
4.5% 59.8% 0.062s 2.07e-05s C 3000 2 GpuElemwise{add,no_inplace} | |
4.0% 63.8% 0.055s 1.82e-05s C 3000 2 GpuCAReduce{add}{1,0,0} | |
3.9% 67.7% 0.054s 1.79e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)] | |
3.9% 71.6% 0.053s 1.78e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)] | |
3.9% 75.5% 0.053s 1.77e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)] | |
3.9% 79.4% 0.053s 1.77e-05s C 3000 2 GpuElemwise{Tanh}[(0, 0)] | |
3.9% 83.2% 0.053s 1.76e-05s C 3000 2 GpuCAReduce{add}{1,0} | |
3.8% 87.0% 0.052s 1.72e-05s C 3000 2 GpuElemwise{Add}[(0, 0)] | |
3.5% 90.5% 0.047s 1.58e-05s C 3000 2 GpuFromHost | |
1.4% 91.9% 0.019s 3.24e-06s C 6000 4 GpuReshape{2} | |
0.8% 92.7% 0.012s 1.94e-06s C 6000 4 MakeVector{dtype='int64'} | |
0.8% 93.6% 0.012s 2.56e-06s C 4500 3 GpuDimShuffle{x,0} | |
0.8% 94.3% 0.011s 3.59e-06s C 3000 2 GpuSubtensor{::, :int64:} | |
0.7% 95.0% 0.009s 3.15e-06s C 3000 2 GpuSubtensor{::, int64::} | |
0.7% 95.7% 0.009s 3.02e-06s C 3000 2 Shape_i{1} | |
... (remaining 10 Ops account for 4.30%(0.06s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
3.9% 3.9% 0.053s 3.54e-05s 1500 11 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
3.9% 7.7% 0.053s 3.54e-05s 1500 14 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
3.3% 11.0% 0.045s 2.99e-05s 1500 32 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
3.3% 14.3% 0.045s 2.99e-05s 1500 33 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
3.2% 17.5% 0.044s 2.95e-05s 1500 3 GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
3.1% 20.6% 0.043s 2.87e-05s 1500 8 GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.7% 23.3% 0.037s 2.46e-05s 1500 31 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.7% 26.0% 0.037s 2.45e-05s 1500 30 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.7% 28.7% 0.037s 2.44e-05s 1500 47 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(120, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(120, 1), strides=(1, 0) | |
2.7% 31.4% 0.037s 2.44e-05s 1500 36 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.7% 34.0% 0.036s 2.43e-05s 1500 37 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.7% 36.7% 0.036s 2.43e-05s 1500 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>) | |
input 0: dtype=float32, shape=(120, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 1), strides=c | |
output 0: dtype=float32, shape=(120, 1), strides=(1, 0) | |
2.6% 39.2% 0.035s 2.35e-05s 1500 69 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda]) | |
input 0: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
2.5% 41.8% 0.035s 2.31e-05s 1500 65 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace1[cuda]) | |
input 0: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 1: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
2.4% 44.2% 0.033s 2.21e-05s 1500 50 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 10), strides=(10, 1) | |
output 0: dtype=float32, shape=(10,), strides=(1,) | |
2.4% 46.6% 0.033s 2.20e-05s 1500 34 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 4: dtype=float32, shape=(10, 100), strides=c | |
input 5: dtype=float32, shape=(1, 1), strides=c | |
input 6: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 49.0% 0.032s 2.16e-05s 1500 35 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace0[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 1), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 4: dtype=float32, shape=(10, 100), strides=c | |
input 5: dtype=float32, shape=(1, 1), strides=c | |
input 6: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.3% 51.3% 0.032s 2.13e-05s 1500 51 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(12, 10), strides=(10, 1) | |
output 0: dtype=float32, shape=(10,), strides=(1,) | |
2.3% 53.6% 0.031s 2.08e-05s 1500 40 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 10, 100), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1) | |
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
2.2% 55.8% 0.031s 2.06e-05s 1500 41 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0) | |
input 0: dtype=float32, shape=(12, 10, 100), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1) | |
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
... (remaining 51 Apply instances account for 44.20%(0.61s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 118KB (118KB) | |
CPU + GPU: 118KB (118KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 118KB (149KB) | |
CPU + GPU: 118KB (149KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 345KB | |
CPU + GPU: 345KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda]) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace1[cuda]) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace1[cuda], GpuGemm{inplace}.0) | |
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda]) | |
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda]) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0) | |
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace0[cuda], GpuGemm{inplace}.0) | |
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100}) | |
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda]) | |
4000B [(10, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
... (remaining 51 Apply account for 53988B/613988B ((8.79%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
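For reference, profiles like the sections above come from Theano's built-in profiler. A minimal sketch of enabling it, assuming a hypothetical entry-point script `train.py` (the flag names `profile`, `profile_memory`, `allow_gc`, and `profiling.time_thunks` are real Theano config options; the script name is an assumption):

```shell
# Enable Theano's function profiler for one run.
# profile=True prints the per-Class/Op/Apply timing tables seen above;
# profile_memory=True adds the "Memory Profile" sections.
THEANO_FLAGS='profile=True,profile_memory=True' python train.py

# Per-thunk timing (addresses the "No execution time accumulated" hint):
THEANO_FLAGS='profile=True,profiling.time_thunks=1' python train.py

# To actually run the "allow_gc=False" peak-memory scenario the profiler
# estimates, disable garbage collection of intermediate results:
THEANO_FLAGS='profile=True,allow_gc=False' python train.py
```

Trading memory for speed with `allow_gc=False` keeps intermediate buffers alive between calls, which is why the profiler reports a separate, larger peak for that setting.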
Scan Op profiling ( grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1500 steps) 8.660982e+00s | |
Total time spent in calling the VM 8.414116e+00s (97.150%) | |
Total overhead (computing slices..) 2.468655e-01s (2.850%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
45.6% 45.6% 1.678s 1.86e-05s C 90000 60 theano.sandbox.cuda.basic_ops.GpuElemwise | |
17.5% 63.1% 0.643s 2.68e-05s C 24000 16 theano.sandbox.cuda.blas.GpuDot22 | |
13.0% 76.1% 0.480s 3.20e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm | |
9.5% 85.7% 0.351s 1.95e-05s C 18000 12 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
3.0% 88.7% 0.111s 1.86e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
2.4% 91.1% 0.088s 1.47e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuFromHost | |
2.2% 93.3% 0.081s 2.47e-06s C 33000 22 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
1.4% 94.8% 0.053s 1.77e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
1.4% 96.2% 0.052s 3.47e-06s C 15000 10 theano.sandbox.cuda.basic_ops.GpuReshape | |
1.1% 97.3% 0.040s 2.25e-06s C 18000 12 theano.compile.ops.Shape_i | |
0.8% 98.1% 0.030s 2.52e-06s C 12000 8 theano.tensor.elemwise.Elemwise | |
0.7% 98.8% 0.025s 2.12e-06s C 12000 8 theano.tensor.opt.MakeVector | |
0.6% 99.4% 0.023s 3.89e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.6% 100.0% 0.021s 3.58e-06s C 6000 4 theano.tensor.elemwise.DimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
17.5% 17.5% 0.643s 2.68e-05s C 24000 16 GpuDot22 | |
10.2% 27.7% 0.374s 3.12e-05s C 12000 8 GpuGemm{inplace} | |
7.7% 35.4% 0.285s 1.90e-05s C 15000 10 GpuElemwise{mul,no_inplace} | |
4.6% 40.0% 0.168s 1.86e-05s C 9000 6 GpuElemwise{add,no_inplace} | |
4.2% 44.2% 0.156s 1.73e-05s C 9000 6 GpuCAReduce{add}{1,0} | |
3.7% 47.9% 0.136s 1.81e-05s C 7500 5 GpuElemwise{Add}[(0, 1)] | |
3.5% 51.4% 0.128s 1.71e-05s C 7500 5 GpuElemwise{Add}[(0, 0)] | |
2.9% 54.3% 0.106s 3.53e-05s C 3000 2 GpuGemm{no_inplace} | |
2.4% 56.7% 0.088s 1.47e-05s C 6000 4 GpuFromHost | |
2.3% 58.9% 0.084s 2.81e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace} | |
1.9% 60.9% 0.071s 2.36e-05s C 3000 2 GpuCAReduce{maximum}{1,0} | |
1.9% 62.8% 0.070s 2.33e-05s C 3000 2 GpuCAReduce{add}{0,0,1} | |
1.7% 64.5% 0.063s 2.09e-05s C 3000 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)] | |
1.7% 66.2% 0.062s 2.06e-05s C 3000 2 GpuElemwise{Composite{((((i0 / i1) + i2) * i3) * i4)}}[(0, 0)] | |
1.6% 67.8% 0.060s 1.99e-05s C 3000 2 GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)] | |
1.6% 69.3% 0.057s 1.91e-05s C 3000 2 GpuIncSubtensor{InplaceInc;::, int64::} | |
1.5% 70.9% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{((-(i0 * i1)) / i2)},no_inplace} | |
1.5% 72.4% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace} | |
1.5% 73.9% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Composite{(i0 + (i1 * i2 * i3))}}[(0, 0)] | |
1.5% 75.4% 0.054s 1.81e-05s C 3000 2 GpuCAReduce{add}{1,0,0} | |
... (remaining 30 Ops account for 24.59%(0.90s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
1.5% 1.5% 0.055s 3.65e-05s 1500 26 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.5% 3.0% 0.054s 3.61e-05s 1500 35 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.5% 4.4% 0.054s 3.57e-05s 1500 146 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states1[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.4% 5.9% 0.053s 3.54e-05s 1500 166 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(200, 200), strides=(1, 200) | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.4% 7.3% 0.053s 3.53e-05s 1500 168 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(200, 200), strides=(1, 200) | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.4% 8.7% 0.052s 3.48e-05s 1500 147 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states0[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.3% 10.0% 0.047s 3.15e-05s 1500 80 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.3% 11.3% 0.047s 3.15e-05s 1500 82 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.2% 12.5% 0.046s 3.04e-05s 1500 167 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.2% 13.8% 0.045s 3.02e-05s 1500 169 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.2% 15.0% 0.044s 2.96e-05s 1500 2 GpuDot22(transition_apply_states_replace1[cuda], state_to_gates_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.2% 16.1% 0.043s 2.88e-05s 1500 15 GpuDot22(transition_apply_states_replace0[cuda], state_to_gates_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 200), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.2% 17.3% 0.043s 2.86e-05s 1500 116 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
input 0: dtype=float32, shape=(1, 10, 200), strides=c | |
input 1: dtype=float32, shape=(12, 10, 1), strides=c | |
input 2: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
1.1% 18.4% 0.042s 2.77e-05s 1500 117 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
input 0: dtype=float32, shape=(1, 10, 200), strides=c | |
input 1: dtype=float32, shape=(12, 10, 1), strides=c | |
input 2: dtype=float32, shape=(12, 10, 200), strides=c | |
output 0: dtype=float32, shape=(12, 10, 200), strides=c | |
1.1% 19.5% 0.041s 2.76e-05s 1500 133 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 120), strides=c | |
input 1: dtype=float32, shape=(120, 1), strides=c | |
output 0: dtype=float32, shape=(100, 1), strides=c | |
1.1% 20.7% 0.041s 2.75e-05s 1500 131 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 120), strides=c | |
input 1: dtype=float32, shape=(120, 1), strides=c | |
output 0: dtype=float32, shape=(100, 1), strides=c | |
1.1% 21.8% 0.041s 2.71e-05s 1500 21 GpuDot22(transform_states_apply_input__replace0[cuda], W_copy0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.1% 22.9% 0.041s 2.70e-05s 1500 8 GpuDot22(transform_states_apply_input__replace1[cuda], W_copy1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
1.1% 24.0% 0.040s 2.67e-05s 1500 172 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
1.1% 25.1% 0.040s 2.67e-05s 1500 170 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
... (remaining 156 Apply instances account for 74.95%(2.76s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 275KB (376KB) | |
CPU + GPU: 275KB (377KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 299KB (377KB) | |
CPU + GPU: 299KB (378KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 890KB | |
CPU + GPU: 890KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace1[cuda]) | |
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace0[cuda]) | |
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuElemwise{tanh,no_inplace}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0) | |
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] c GpuElemwise{tanh,no_inplace}(GpuReshape{2}.0) | |
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0) | |
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuElemwise{tanh,no_inplace}.0) | |
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)](GpuReshape{3}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{add,no_inplace}.0) | |
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
48000B [(120, 100)] c GpuElemwise{tanh,no_inplace}(GpuReshape{2}.0) | |
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>) | |
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)](GpuReshape{3}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{add,no_inplace}.0) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>) | |
... (remaining 156 Apply account for 346392B/1498392B ((23.12%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1200 steps) 2.039070e+00s | |
Total time spent in calling the VM 1.921504e+00s (94.234%) | |
Total overhead (computing slices..) 1.175666e-01s (5.766%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
47.2% 47.2% 0.424s 1.77e-05s C 24000 20 theano.sandbox.cuda.basic_ops.GpuElemwise | |
27.9% 75.1% 0.251s 3.48e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm | |
9.8% 84.8% 0.088s 1.83e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
6.7% 91.6% 0.060s 2.52e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22 | |
4.6% 96.2% 0.041s 1.73e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
2.0% 98.1% 0.018s 3.69e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
1.2% 99.4% 0.011s 2.34e-06s C 4800 4 theano.compile.ops.Shape_i | |
0.6% 100.0% 0.005s 2.25e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
19.8% 19.8% 0.178s 3.71e-05s C 4800 4 GpuGemm{no_inplace} | |
14.5% 34.3% 0.130s 1.81e-05s C 7200 6 GpuElemwise{mul,no_inplace} | |
8.1% 42.4% 0.073s 3.03e-05s C 2400 2 GpuGemm{inplace} | |
6.7% 49.1% 0.060s 2.52e-05s C 2400 2 GpuDot22 | |
5.4% 54.5% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)] | |
5.0% 59.5% 0.045s 1.89e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::} | |
4.9% 64.5% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace} | |
4.7% 69.2% 0.043s 1.77e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:} | |
4.6% 73.8% 0.042s 1.74e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
4.6% 78.4% 0.041s 1.73e-05s C 2400 2 GpuAlloc{memset_0=True} | |
4.5% 83.0% 0.041s 1.70e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace} | |
4.5% 87.5% 0.041s 1.69e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)] | |
4.4% 91.9% 0.040s 1.65e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)] | |
4.3% 96.2% 0.038s 1.60e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)] | |
1.0% 97.2% 0.009s 3.79e-06s C 2400 2 GpuSubtensor{::, int64::} | |
1.0% 98.1% 0.009s 3.60e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
0.6% 98.8% 0.006s 2.41e-06s C 2400 2 Shape_i{1} | |
0.6% 99.4% 0.005s 2.27e-06s C 2400 2 Shape_i{0} | |
0.6% 100.0% 0.005s 2.25e-06s C 2400 2 GpuDimShuffle{1,0} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
5.4% 5.4% 0.049s 4.07e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
5.2% 10.6% 0.046s 3.86e-05s 1200 6 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
4.6% 15.2% 0.041s 3.45e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.6% 19.8% 0.041s 3.45e-05s 1200 18 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.0% 23.9% 0.036s 3.03e-05s 1200 40 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
4.0% 27.9% 0.036s 3.03e-05s 1200 41 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
3.4% 31.3% 0.030s 2.52e-05s 1200 28 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
3.3% 34.6% 0.030s 2.51e-05s 1200 29 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.7% 37.3% 0.024s 2.02e-05s 1200 42 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=c | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.7% 40.0% 0.024s 2.00e-05s 1200 43 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=c | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.6% 42.5% 0.023s 1.91e-05s 1200 34 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.5% 45.0% 0.022s 1.87e-05s 1200 24 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.5% 47.5% 0.022s 1.86e-05s 1200 35 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=(200, 1) | |
2.5% 50.0% 0.022s 1.85e-05s 1200 16 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 52.4% 0.022s 1.83e-05s 1200 25 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=(100, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 54.9% 0.022s 1.83e-05s 1200 17 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 57.3% 0.022s 1.81e-05s 1200 3 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 59.7% 0.022s 1.79e-05s 1200 31 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 62.1% 0.021s 1.79e-05s 1200 7 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
2.4% 64.5% 0.021s 1.79e-05s 1200 30 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(10, 100), strides=(200, 1) | |
output 0: dtype=float32, shape=(10, 100), strides=(100, 1) | |
... (remaining 24 Apply instances account for 35.55%(0.32s) of the runtime) | |
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py) | |
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 55KB (78KB) | |
CPU + GPU: 55KB (78KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 66KB (86KB) | |
CPU + GPU: 66KB (86KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference) | |
CPU: 0KB | |
GPU: 94KB | |
CPU + GPU: 94KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(100, 10)] v GpuDimShuffle{1,0}(GpuElemwise{mul,no_inplace}.0) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>) | |
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
... (remaining 24 Apply account for 80032B/208032B (38.47%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
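For context, profiles like the ones in this dump are produced by enabling Theano's profiling flags before running the script. A minimal sketch (the script name `train.py` is a hypothetical placeholder; the flags themselves are standard Theano config options):

```shell
# Enable per-function profiling, memory profiling, and per-thunk timing.
# profiling.time_thunks=1 enables the per-Apply timing shown in the Apply sections.
export THEANO_FLAGS="profile=True,profile_memory=True,profiling.time_thunks=1"
python train.py   # hypothetical training script; profiles print at exit
```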
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan ) | |
================== | |
Message: None | |
Time in 100 calls of the op (for a total of 1200 steps) 2.040347e+00s | |
Total time spent in calling the VM 1.923158e+00s (94.256%) | |
Total overhead (computing slices, etc.) 1.171889e-01s (5.744%)
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
47.2% 47.2% 0.424s 1.77e-05s C 24000 20 theano.sandbox.cuda.basic_ops.GpuElemwise | |
27.9% 75.1% 0.251s 3.49e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm | |
9.8% 84.9% 0.088s 1.83e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
6.7% 91.6% 0.060s 2.51e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22 | |
4.6% 96.2% 0.041s 1.73e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
2.0% 98.2% 0.018s 3.68e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
1.2% 99.4% 0.011s 2.29e-06s C 4800 4 theano.compile.ops.Shape_i | |
0.6% 100.0% 0.005s 2.28e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
... (remaining 0 Classes account for 0.00% (0.00s) of the runtime)
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
19.8% 19.8% 0.178s 3.71e-05s C 4800 4 GpuGemm{no_inplace} | |
14.5% 34.3% 0.130s 1.81e-05s C 7200 6 GpuElemwise{mul,no_inplace} | |
8.1% 42.4% 0.073s 3.04e-05s C 2400 2 GpuGemm{inplace} | |
6.7% 49.1% 0.060s 2.51e-05s C 2400 2 GpuDot22 | |
5.4% 54.5% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)] | |
5.0% 59.5% 0.045s 1.89e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::} | |
5.0% 64.5% 0.045s 1.86e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace} | |
4.7% 69.2% 0.042s 1.77e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:} | |
4.6% 73.9% 0.042s 1.73e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)] | |
4.6% 78.5% 0.041s 1.73e-05s C 2400 2 GpuAlloc{memset_0=True} | |
4.5% 83.0% 0.041s 1.70e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace} | |
4.5% 87.5% 0.040s 1.68e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)] | |
4.4% 91.9% 0.040s 1.66e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)] | |
4.3% 96.2% 0.039s 1.61e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)] | |
1.0% 97.2% 0.009s 3.78e-06s C 2400 2 GpuSubtensor{::, int64::} | |
1.0% 98.2% 0.009s 3.59e-06s C 2400 2 GpuSubtensor{::, :int64:} | |
0.6% 98.8% 0.006s 2.37e-06s C 2400 2 Shape_i{1} | |
0.6% 99.4% 0.005s 2.28e-06s C 2400 2 GpuDimShuffle{1,0} | |
0.6% 100.0% 0.005s 2.21e-06s C 2400 2 Shape_i{0} | |
... (remaining 0 Ops account for 0.00% (0.00s) of the runtime)
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
5.4% 5.4% 0.049s 4.08e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
5.2% 10.6% 0.046s 3.86e-05s 1200 6 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 200), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
4.6% 15.2% 0.041s 3.45e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
4.6% 19.8% 0.041s 3.45e-05s 1200 18 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(100, 100), strides=c | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
4.1% 23.9% 0.036s 3.04e-05s 1200 40 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
4.1% 27.9% 0.036s 3.04e-05s 1200 41 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0}) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(), strides=c | |
input 2: dtype=float32, shape=(10, 200), strides=c | |
input 3: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 4: dtype=float32, shape=(), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.4% 31.3% 0.030s 2.51e-05s 1200 28 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
3.3% 34.6% 0.030s 2.51e-05s 1200 29 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda]) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.7% 37.3% 0.024s 2.02e-05s 1200 42 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=c | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.7% 40.0% 0.024s 2.01e-05s 1200 43 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(1, 1), strides=c | |
input 2: dtype=float32, shape=(10, 100), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
input 4: dtype=float32, shape=(10, 1), strides=c | |
input 5: dtype=float32, shape=(10, 100), strides=c | |
input 6: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.6% 42.6% 0.023s 1.91e-05s 1200 34 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.5% 45.1% 0.023s 1.88e-05s 1200 24 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.5% 47.6% 0.022s 1.86e-05s 1200 35 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
input 0: dtype=float32, shape=(10, 200), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=int64, shape=8, strides=c | |
output 0: dtype=float32, shape=(10, 200), strides=c | |
2.5% 50.0% 0.022s 1.84e-05s 1200 25 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
input 2: dtype=float32, shape=(1, 1), strides=c | |
input 3: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.5% 52.5% 0.022s 1.84e-05s 1200 16 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 54.9% 0.022s 1.82e-05s 1200 17 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 57.3% 0.022s 1.80e-05s 1200 3 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 59.7% 0.022s 1.79e-05s 1200 30 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 62.1% 0.021s 1.79e-05s 1200 31 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 100), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
2.4% 64.5% 0.021s 1.79e-05s 1200 7 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>) | |
input 0: dtype=float32, shape=(10, 100), strides=c | |
input 1: dtype=float32, shape=(10, 1), strides=c | |
output 0: dtype=float32, shape=(10, 100), strides=c | |
... (remaining 24 Apply instances account for 35.49% (0.32s) of the runtime)
Memory Profile | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py)
--- | |
Max peak memory with current setting | |
CPU: 0KB (0KB) | |
GPU: 55KB (78KB) | |
CPU + GPU: 55KB (78KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 0KB (0KB) | |
GPU: 66KB (86KB) | |
CPU + GPU: 66KB (86KB) | |
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB | |
GPU: 94KB | |
CPU + GPU: 94KB | |
--- | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100}) | |
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0}) | |
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]}) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0) | |
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)](GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, gatedrecurrent_apply_states_replace0[cuda]) | |
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0) | |
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0) | |
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100}) | |
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0) | |
... (remaining 24 Apply account for 80032B/208032B (38.47%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. | |
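The memory profile above also estimates peak usage under two alternative configurations (optimizer_excluding=inplace and allow_gc=False). To actually measure those, the corresponding Theano flags can be set explicitly; a sketch under the same hypothetical-script assumption as above:

```shell
# Re-run with in-place optimizations excluded, as in the second estimate
export THEANO_FLAGS="profile=True,profile_memory=True,optimizer_excluding=inplace"
python train.py   # hypothetical training script

# Re-run with garbage collection of intermediates disabled, as in the third estimate
export THEANO_FLAGS="profile=True,profile_memory=True,allow_gc=False"
python train.py
```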
Function profiling | |
================== | |
Message: Sum of all (17) printed profiles at exit, excluding Scan op profiles.
Time in 6938 calls to Function.__call__: 1.007439e+02s | |
Time in Function.fn.__call__: 1.003767e+02s (99.635%) | |
Time in thunks: 3.835574e+01s (38.073%) | |
Total compile time: 3.784477e+02s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 1.654243e+02s | |
Theano validate time: 5.543999e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 1.313228e+02s | |
Import time 2.099285e+00s | |
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 676.605s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
61.4% 61.4% 23.536s 2.79e-02s Py 844 11 theano.scan_module.scan_op.Scan | |
25.3% 86.7% 9.712s 6.03e-02s Py 161 2 lvsr.ops.EditDistanceOp | |
4.7% 91.3% 1.787s 2.06e-05s C 86853 879 theano.sandbox.cuda.basic_ops.GpuElemwise | |
1.8% 93.1% 0.678s 2.65e-05s C 25580 252 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
1.7% 94.8% 0.642s 7.29e-05s C 8805 89 theano.sandbox.cuda.blas.GpuDot22 | |
1.0% 95.8% 0.395s 3.60e-06s C 109687 1234 theano.tensor.elemwise.Elemwise | |
0.8% 96.6% 0.297s 1.72e-05s C 17247 197 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.4% 97.0% 0.166s 2.21e-05s Py 7505 51 theano.ifelse.IfElse | |
0.4% 97.4% 0.161s 2.71e-05s C 5927 63 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.4% 97.8% 0.142s 7.60e-06s C 18640 198 theano.sandbox.cuda.basic_ops.GpuReshape | |
0.4% 98.2% 0.138s 2.62e-05s C 5266 56 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.3% 98.5% 0.127s 3.37e-06s C 37733 384 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.3% 98.8% 0.118s 7.43e-06s C 15813 114 theano.compile.ops.DeepCopyOp | |
0.1% 99.0% 0.057s 3.66e-06s C 15701 169 theano.tensor.opt.MakeVector | |
0.1% 99.1% 0.054s 1.60e-05s C 3393 29 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.1% 99.2% 0.050s 4.52e-06s C 11167 119 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.1% 99.4% 0.048s 3.42e-06s C 14141 158 theano.compile.ops.Shape_i | |
0.1% 99.4% 0.034s 5.30e-05s C 648 7 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1 | |
0.1% 99.5% 0.033s 2.96e-06s C 10969 127 theano.tensor.basic.ScalarFromTensor | |
0.1% 99.6% 0.032s 8.55e-05s C 372 5 theano.sandbox.cuda.basic_ops.GpuJoin | |
... (remaining 22 Classes account for 0.38% (0.15s) of the runtime)
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
25.3% 25.3% 9.712s 6.03e-02s Py 161 2 EditDistanceOp | |
22.7% 48.0% 8.707s 8.71e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan} | |
13.7% 61.8% 5.270s 3.27e-02s Py 161 2 forall_inplace,gpu,generator_generate_scan} | |
10.7% 72.5% 4.113s 2.06e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan} | |
8.9% 81.4% 3.412s 3.41e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan} | |
5.1% 86.5% 1.957s 7.50e-03s Py 261 3 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan} | |
1.7% 88.2% 0.642s 7.29e-05s C 8805 89 GpuDot22 | |
0.8% 88.9% 0.297s 1.72e-05s C 17247 197 HostFromGpu | |
0.7% 89.6% 0.262s 3.12e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1} | |
0.6% 90.2% 0.235s 2.12e-05s C 11100 111 GpuElemwise{add,no_inplace} | |
0.5% 90.7% 0.186s 2.12e-05s C 8783 89 GpuElemwise{sub,no_inplace} | |
0.4% 91.1% 0.152s 2.45e-05s Py 6200 39 if{gpu} | |
0.4% 91.5% 0.148s 2.28e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace} | |
0.4% 91.9% 0.143s 2.99e-05s C 4800 48 GpuCAReduce{add}{1,1} | |
0.4% 92.2% 0.138s 2.16e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace} | |
0.3% 92.6% 0.128s 1.97e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)] | |
0.3% 92.9% 0.128s 1.88e-05s C 6800 68 GpuElemwise{Mul}[(0, 0)] | |
0.3% 93.2% 0.127s 2.15e-05s C 5900 59 GpuElemwise{Switch,no_inplace} | |
0.3% 93.6% 0.126s 1.95e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)] | |
0.3% 93.9% 0.121s 2.06e-05s C 5900 59 GpuElemwise{Composite{(i0 * (i1 ** i2))},no_inplace} | |
... (remaining 321 Ops account for 6.12% (2.35s) of the runtime)
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> | |
22.7% 22.7% 8.707s 8.71e-02s 100 2437 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1) | |
input 2: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 3: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 4: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 5: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 6: dtype=float32, shape=(15, 10, 1), strides=(-10, 1, 0) | |
input 7: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0) | |
input 8: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 9: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1) | |
input 10: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 11: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 12: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 13: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1) | |
input 14: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1) | |
input 15: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
input 16: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
input 17: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
input 18: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
input 19: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
input 20: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
input 21: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
input 22: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
input 23: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
input 24: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
input 25: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
input 26: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
input 27: dtype=int64, shape=(), strides=c | |
input 28: dtype=int64, shape=(), strides=c | |
input 29: dtype=int64, shape=(), strides=c | |
input 30: dtype=int64, shape=(), strides=c | |
input 31: dtype=int64, shape=(), strides=c | |
input 32: dtype=int64, shape=(), strides=c | |
input 33: dtype=int64, shape=(), strides=c | |
input 34: dtype=int64, shape=(), strides=c | |
input 35: dtype=float32, shape=(100, 200), strides=c | |
input 36: dtype=float32, shape=(200, 200), strides=c | |
input 37: dtype=float32, shape=(100, 100), strides=c | |
input 38: dtype=float32, shape=(200, 100), strides=c | |
input 39: dtype=float32, shape=(100, 100), strides=c | |
input 40: dtype=float32, shape=(200, 200), strides=(1, 200) | |
input 41: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 42: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 43: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 44: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 45: dtype=int64, shape=(2,), strides=c | |
input 46: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 47: dtype=int64, shape=(1,), strides=c | |
input 48: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 49: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 50: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 51: dtype=int8, shape=(10,), strides=c | |
input 52: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 53: dtype=float32, shape=(100, 200), strides=c | |
input 54: dtype=float32, shape=(200, 200), strides=c | |
input 55: dtype=float32, shape=(100, 100), strides=c | |
input 56: dtype=float32, shape=(200, 100), strides=c | |
input 57: dtype=float32, shape=(100, 100), strides=c | |
input 58: dtype=float32, shape=(200, 200), strides=(1, 200) | |
input 59: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 60: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 61: dtype=float32, shape=(100, 200), strides=(1, 100) | |
input 62: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 63: dtype=int64, shape=(2,), strides=c | |
input 64: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 65: dtype=int64, shape=(1,), strides=c | |
input 66: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 67: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 68: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 69: dtype=int8, shape=(10,), strides=c | |
input 70: dtype=float32, shape=(1, 100), strides=(0, 1) | |
output 0: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
output 1: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
output 3: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1) | |
output 4: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1) | |
output 5: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
output 6: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
output 9: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0) | |
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1) | |
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1) | |
output 12: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 13: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
output 14: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 15: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1) | |
output 16: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 17: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
output 18: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
output 19: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1) | |
22.6% 45.3% 8.684s 1.42e-01s 61 269 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask) | |
input 0: dtype=int64, shape=(15, 75), strides=c | |
input 1: dtype=float32, shape=(15, 75), strides=c | |
input 2: dtype=int64, shape=(12, 75), strides=c | |
input 3: dtype=float32, shape=(12, 75), strides=c | |
output 0: dtype=int64, shape=(15, 75, 1), strides=c | |
8.9% 54.2% 3.412s 3.41e-02s 100 2149 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
input 2: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
input 3: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0) | |
input 4: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0) | |
input 5: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1) | |
input 6: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1) | |
input 7: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
input 8: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
input 9: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
input 10: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
input 11: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
input 12: dtype=float32, shape=(100, 200), strides=c | |
input 13: dtype=float32, shape=(200, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(200, 100), strides=c | |
input 16: dtype=float32, shape=(100, 100), strides=c | |
input 17: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 18: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 19: dtype=int64, shape=(1,), strides=c | |
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 21: dtype=int8, shape=(10,), strides=c | |
input 22: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(200, 200), strides=c | |
input 25: dtype=float32, shape=(100, 100), strides=c | |
input 26: dtype=float32, shape=(200, 100), strides=c | |
input 27: dtype=float32, shape=(100, 100), strides=c | |
input 28: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 29: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 30: dtype=int64, shape=(1,), strides=c | |
input 31: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 32: dtype=int8, shape=(10,), strides=c | |
input 33: dtype=float32, shape=(100, 1), strides=(1, 0) | |
output 0: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1) | |
output 3: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1) | |
output 4: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1) | |
7.8% 62.0% 2.984s 2.98e-02s 100 1850 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, G | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1) | |
input 2: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1) | |
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=float32, shape=(100, 44), strides=c | |
input 6: dtype=float32, shape=(200, 44), strides=c | |
input 7: dtype=float32, shape=(100, 200), strides=c | |
input 8: dtype=float32, shape=(200, 200), strides=c | |
input 9: dtype=float32, shape=(45, 100), strides=c | |
input 10: dtype=float32, shape=(100, 200), strides=c | |
input 11: dtype=float32, shape=(100, 100), strides=c | |
input 12: dtype=float32, shape=(200, 100), strides=c | |
input 13: dtype=float32, shape=(100, 100), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 16: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 17: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 18: dtype=int64, shape=(1,), strides=c | |
input 19: dtype=float32, shape=(12, 10), strides=(10, 1) | |
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 21: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 22: dtype=int8, shape=(10,), strides=c | |
input 23: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 0: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1) | |
output 1: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1) | |
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
output 3: dtype=int64, shape=(15, 10), strides=c | |
6.0% 68.0% 2.286s 3.75e-02s 61 260 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
input 4: dtype=int64, shape=(), strides=c | |
input 5: dtype=float32, shape=(100, 44), strides=c | |
input 6: dtype=float32, shape=(200, 44), strides=c | |
input 7: dtype=float32, shape=(100, 200), strides=c | |
input 8: dtype=float32, shape=(200, 200), strides=c | |
input 9: dtype=float32, shape=(45, 100), strides=c | |
input 10: dtype=float32, shape=(100, 200), strides=c | |
input 11: dtype=float32, shape=(100, 100), strides=c | |
input 12: dtype=float32, shape=(200, 100), strides=c | |
input 13: dtype=float32, shape=(100, 100), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
input 15: dtype=float32, shape=(1, 44), strides=(0, 1) | |
input 16: dtype=float32, shape=(1, 200), strides=(0, 1) | |
input 17: dtype=float32, shape=(1, 100), strides=(0, 1) | |
input 18: dtype=int64, shape=(1,), strides=c | |
input 19: dtype=float32, shape=(12, 75), strides=(75, 1) | |
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 21: dtype=float32, shape=(100, 1), strides=(1, 0) | |
input 22: dtype=int8, shape=(75,), strides=c | |
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1) | |
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1) | |
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1) | |
output 3: dtype=int64, shape=(15, 75), strides=c | |
5.4% 73.3% 2.057s 2.06e-02s 100 2632 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtenso | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=int64, shape=(), strides=c | |
input 18: dtype=int64, shape=(), strides=c | |
input 19: dtype=float32, shape=(100, 200), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 22: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
input 25: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 26: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
5.4% 78.7% 2.056s 2.06e-02s 100 2631 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
input 13: dtype=int64, shape=(), strides=c | |
input 14: dtype=int64, shape=(), strides=c | |
input 15: dtype=int64, shape=(), strides=c | |
input 16: dtype=int64, shape=(), strides=c | |
input 17: dtype=int64, shape=(), strides=c | |
input 18: dtype=int64, shape=(), strides=c | |
input 19: dtype=float32, shape=(100, 200), strides=c | |
input 20: dtype=float32, shape=(100, 100), strides=c | |
input 21: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 22: dtype=float32, shape=(100, 100), strides=(1, 100) | |
input 23: dtype=float32, shape=(100, 200), strides=c | |
input 24: dtype=float32, shape=(100, 100), strides=c | |
input 25: dtype=float32, shape=(200, 100), strides=(1, 200) | |
input 26: dtype=float32, shape=(100, 100), strides=(1, 100) | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1) | |
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1) | |
2.7% 81.4% 1.028s 1.03e-02s 100 2005 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask11) | |
input 0: dtype=int64, shape=(15, 10), strides=c | |
input 1: dtype=float32, shape=(15, 10), strides=c | |
input 2: dtype=int64, shape=(12, 10), strides=c | |
input 3: dtype=float32, shape=(12, 10), strides=c | |
output 0: dtype=int64, shape=(15, 10, 1), strides=c | |
1.8% 83.2% 0.696s 6.96e-03s 100 1642 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}. | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
1.8% 85.0% 0.694s 6.94e-03s 100 1652 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1) | |
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1) | |
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1) | |
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1) | |
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0) | |
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0) | |
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1) | |
1.5% 86.5% 0.567s 9.29e-03s 61 247 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1) | |
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 4: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 5: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1) | |
input 6: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1) | |
input 7: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0) | |
input 8: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0) | |
input 9: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 10: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
input 11: dtype=float32, shape=(100, 200), strides=c | |
input 12: dtype=float32, shape=(100, 100), strides=c | |
input 13: dtype=float32, shape=(100, 200), strides=c | |
input 14: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1) | |
0.1% 86.6% 0.039s 3.52e-03s 11 133 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.1% 86.7% 0.038s 3.46e-03s 11 175 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state) | |
input 0: dtype=int64, shape=(), strides=c | |
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1) | |
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1) | |
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
input 4: dtype=float32, shape=(100, 200), strides=c | |
input 5: dtype=float32, shape=(100, 100), strides=c | |
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1) | |
0.1% 86.7% 0.024s 4.01e-06s 6075 0 DeepCopyOp(labels) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
0.0% 86.8% 0.019s 3.10e-04s 61 140 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
0.0% 86.8% 0.018s 3.03e-04s 61 142 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(900, 100), strides=(100, 1) | |
input 1: dtype=float32, shape=(100, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(900, 200), strides=(200, 1) | |
0.0% 86.9% 0.016s 2.64e-06s 6075 1 DeepCopyOp(inputs) | |
input 0: dtype=int64, shape=(12,), strides=c | |
output 0: dtype=int64, shape=(12,), strides=c | |
0.0% 86.9% 0.013s 1.31e-04s 100 2467 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(200, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(200, 200), strides=(200, 1) | |
0.0% 87.0% 0.013s 1.31e-04s 100 2463 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(200, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(200, 200), strides=(200, 1) | |
0.0% 87.0% 0.013s 1.28e-04s 100 2462 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
input 0: dtype=float32, shape=(100, 150), strides=(150, 1) | |
input 1: dtype=float32, shape=(150, 200), strides=(200, 1) | |
output 0: dtype=float32, shape=(100, 200), strides=(200, 1) | |
... (remaining 4255 Apply instances account for 13.01%(4.99s) of the runtime) | |
Memory Profile (the max between all functions in that profile) | |
(Sparse variables are ignored) | |
(For values in brackets, it's for linker = c|py | |
--- | |
Max peak memory with current setting | |
CPU: 58KB (62KB) | |
GPU: 3739KB (5373KB) | |
CPU + GPU: 3797KB (5435KB) | |
Max peak memory with current setting and Theano flag optimizer_excluding=inplace | |
CPU: 57KB (62KB) | |
GPU: 5605KB (6697KB) | |
CPU + GPU: 5662KB (6758KB) | |
Max peak memory if allow_gc=False (linker don't make a difference) | |
CPU: 114KB | |
GPU: 17091KB | |
CPU + GPU: 17205KB | |
--- | |
This list is based on all functions in the profile | |
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node> | |
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0) | |
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0) | |
750480B [(1, 10, 100), (1, 10, 200), (2, 92160), (15, 10)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0) | |
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0) | |
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1}) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}) | |
720000B [(900, 200)] v GpuReshape{2}(GpuJoin.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state) | |
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1}) | |
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0) | |
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0) | |
... (remaining 4255 Apply account for 58951889B/73960729B ((79.71%)) of the Apply with dense outputs sizes) | |
<created/inplace/view> is taken from the Op's declaration. | |
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases. | |
Here are tips to potentially make your code run faster | |
(if you think of new ones, suggest them on the mailing list). | |
Test them first, as they are not guaranteed to always provide a speedup. | |
Sorry, no tip for today. |
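A profile like the one above is typically produced by enabling Theano's profiler via configuration flags; the script name (`train.py`) below is hypothetical, but `profile`, `profile_memory`, `profiling.time_thunks`, `allow_gc`, and `optimizer_excluding` are standard Theano config options (the last three correspond to the hints printed in the profile itself):

```shell
# Sketch of reproducing a profile such as the one above (train.py is a
# placeholder for the actual training script). profile=True enables the
# function profiler, profile_memory=True adds the memory profile, and
# profiling.time_thunks=1 accumulates per-thunk execution times.
THEANO_FLAGS='profile=True,profile_memory=True,profiling.time_thunks=1' python train.py

# To measure the alternatives reported in the memory profile, re-run
# with garbage collection disabled or inplace optimizations excluded:
THEANO_FLAGS='profile=True,allow_gc=False' python train.py
THEANO_FLAGS='profile=True,optimizer_excluding=inplace' python train.py
```

With `profile=True`, Theano prints one "Function profiling" block per compiled function at interpreter exit, followed by the combined memory profile.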