@rizar
Created May 6, 2016 15:30
New profile
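What follows is raw Theano profiler output, captured while running the monitoring functions compiled by the Blocks evaluators (libs/blocks/blocks/monitoring/evaluators.py) in fully-neural-lvsr. The gist does not show how profiling was switched on for this run; as a hedged sketch, reports in this format are normally obtained either by running the whole script with THEANO_FLAGS=profile=True,profile_memory=True (a report per compiled function is printed at exit) or by compiling an individual function with profile=True, as in the illustrative snippet below. The snippet assumes a 2016-era Theano and uses made-up variables.

import numpy as np
import theano
import theano.tensor as T

# Illustrative only: compile a throwaway function with profiling enabled.
x = T.matrix('x')
y = T.nnet.sigmoid(x.sum(axis=1))
f = theano.function([x], y, profile=True)

f(np.random.rand(75, 100).astype(theano.config.floatX))

# Prints a "Function profiling" report in the same format as the dump below.
f.profile.summary()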
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 2.154827e-03s
Time in Function.fn.__call__: 9.248257e-04s (42.919%)
Total compile time: 4.125585e+00s
Number of Apply nodes: 0
Theano Optimizer time: 6.079912e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 9.608269e-05s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.132s
No execution time accumulated (hint: try config profiling.time_thunks=1)
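The block above shows no accumulated execution time because per-thunk timing was disabled for this function. The hint's flag can be turned on as sketched below; the flag name is taken from the hint itself, and how the original run was configured is not shown in the gist.

# Enable per-thunk timing before any function is compiled, e.g. via
#   THEANO_FLAGS=profile=True,profiling.time_thunks=True python <script>.py
# or programmatically (the environment-variable route is the safest):
import theano
theano.config.profile = True
theano.config.profiling.time_thunks = True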
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 11 calls to Function.__call__: 2.018499e-02s
Time in Function.fn.__call__: 1.745415e-02s (86.471%)
Time in thunks: 7.772207e-03s (38.505%)
Total compile time: 4.343552e+00s
Number of Apply nodes: 43
Theano Optimizer time: 1.791000e-01s
Theano validate time: 1.072645e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 6.402516e-02s
Import time 4.774094e-03s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.132s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.008s 1.64e-05s C 473 43 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.008s 1.64e-05s C 473 43 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.7% 4.7% 0.000s 3.34e-05s 11 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.8% 7.5% 0.000s 1.99e-05s 11 31 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.7% 10.3% 0.000s 1.93e-05s 11 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.6% 12.9% 0.000s 1.84e-05s 11 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.6% 15.4% 0.000s 1.81e-05s 11 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 18.0% 0.000s 1.81e-05s 11 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 20.5% 0.000s 1.80e-05s 11 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 23.1% 0.000s 1.80e-05s 11 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 25.6% 0.000s 1.80e-05s 11 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 28.2% 0.000s 1.79e-05s 11 27 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 30.7% 0.000s 1.78e-05s 11 25 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 33.2% 0.000s 1.78e-05s 11 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 35.7% 0.000s 1.78e-05s 11 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 38.2% 0.000s 1.77e-05s 11 12 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 40.7% 0.000s 1.77e-05s 11 6 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 43.2% 0.000s 1.76e-05s 11 29 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 45.7% 0.000s 1.75e-05s 11 11 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 48.2% 0.000s 1.75e-05s 11 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
2.5% 50.6% 0.000s 1.75e-05s 11 32 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 53.1% 0.000s 1.75e-05s 11 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 23 Apply instances account for 46.88%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 43 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
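The memory figures above are estimates for three scenarios rather than measurements from separate runs. A hedged sketch of actually running under each scenario, using only the flag names printed in the report (exact behaviour depends on the Theano version):

# current settings:
#   THEANO_FLAGS=profile=True,profile_memory=True python <script>.py
# without in-place optimizations:
#   THEANO_FLAGS=profile=True,profile_memory=True,optimizer_excluding=inplace python <script>.py
# keeping intermediates alive between calls:
#   THEANO_FLAGS=profile=True,profile_memory=True,allow_gc=False python <script>.py
#
# The footnote's extra warnings for 'inplace'/'view' nodes require DebugMode,
# which is far slower and intended for debugging only:
import theano
import theano.tensor as T

x = T.vector('x')
f = theano.function([x], 2 * x, mode='DebugMode')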
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 10 calls to Function.__call__: 1.222110e-02s
Time in Function.fn.__call__: 1.176500e-02s (96.268%)
Time in thunks: 4.612923e-03s (37.746%)
Total compile time: 4.154817e+00s
Number of Apply nodes: 29
Theano Optimizer time: 5.256701e-02s
Theano validate time: 1.211166e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.951882e-02s
Import time 1.188660e-02s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.137s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
52.9% 52.9% 0.002s 1.63e-05s C 150 15 theano.sandbox.cuda.basic_ops.HostFromGpu
43.7% 96.6% 0.002s 2.24e-05s C 90 9 theano.sandbox.cuda.basic_ops.GpuElemwise
3.4% 100.0% 0.000s 3.16e-06s C 50 5 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
52.9% 52.9% 0.002s 1.63e-05s C 150 15 HostFromGpu
43.7% 96.6% 0.002s 2.24e-05s C 90 9 GpuElemwise{true_div,no_inplace}
3.4% 100.0% 0.000s 3.16e-06s C 50 5 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
10.0% 10.0% 0.000s 4.61e-05s 10 0 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
5.8% 15.8% 0.000s 2.68e-05s 10 15 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
4.4% 20.2% 0.000s 2.03e-05s 10 1 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_critic_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 24.5% 0.000s 1.98e-05s 10 12 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 28.7% 0.000s 1.96e-05s 10 2 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 32.9% 0.000s 1.93e-05s 10 13 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.2% 37.1% 0.000s 1.93e-05s 10 4 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean2_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 41.3% 0.000s 1.92e-05s 10 6 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_expected_reward, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 45.4% 0.000s 1.91e-05s 10 3 GpuElemwise{true_div,no_inplace}(shared_readout_costs_max_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 49.5% 0.000s 1.90e-05s 10 5 GpuElemwise{true_div,no_inplace}(shared_mean_last_character_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 53.1% 0.000s 1.65e-05s 10 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.5% 56.6% 0.000s 1.63e-05s 10 26 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 60.1% 0.000s 1.59e-05s 10 20 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 63.5% 0.000s 1.58e-05s 10 19 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 66.9% 0.000s 1.57e-05s 10 7 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 70.3% 0.000s 1.56e-05s 10 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 73.7% 0.000s 1.56e-05s 10 27 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 77.1% 0.000s 1.56e-05s 10 8 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 80.4% 0.000s 1.54e-05s 10 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 83.7% 0.000s 1.52e-05s 10 14 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 9 Apply instances account for 16.31%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 29 Apply account for 136B/136B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 101 calls to Function.__call__: 1.747441e-02s
Time in Function.fn.__call__: 1.434040e-02s (82.065%)
Time in thunks: 2.486944e-03s (14.232%)
Total compile time: 4.068843e+00s
Number of Apply nodes: 6
Theano Optimizer time: 1.878691e-02s
Theano validate time: 5.388260e-05s
Theano Linker time (includes C, CUDA code generation/compiling): 1.104212e-02s
Import time 7.761240e-03s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.140s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.7% 54.7% 0.001s 3.37e-06s C 404 4 theano.compile.ops.Shape_i
45.3% 100.0% 0.001s 5.58e-06s C 202 2 theano.tensor.basic.Alloc
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.3% 45.3% 0.001s 5.58e-06s C 202 2 Alloc
30.7% 76.0% 0.001s 3.78e-06s C 202 2 Shape_i{1}
24.0% 100.0% 0.001s 2.95e-06s C 202 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
28.4% 28.4% 0.001s 7.00e-06s 101 4 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
19.0% 47.5% 0.000s 4.69e-06s 101 0 Shape_i{1}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
16.9% 64.3% 0.000s 4.16e-06s 101 5 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
12.9% 77.2% 0.000s 3.17e-06s 101 1 Shape_i{0}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.7% 88.9% 0.000s 2.88e-06s 101 2 Shape_i{1}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.1% 100.0% 0.000s 2.73e-06s 101 3 Shape_i{0}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
... (remaining 5 Apply account for 992B/2192B ((45.26%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.629472e-02s
Time in Function.fn.__call__: 1.466155e-02s (89.977%)
Time in thunks: 9.594440e-03s (58.881%)
Total compile time: 4.084757e+00s
Number of Apply nodes: 2
Theano Optimizer time: 7.371902e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.080990e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.141s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.010s 4.80e-05s C 200 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.010s 4.80e-05s C 200 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
95.0% 95.0% 0.009s 9.11e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
5.0% 100.0% 0.000s 4.83e-06s 100 1 DeepCopyOp(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c DeepCopyOp(shared_recognizer_costs_prediction)
... (remaining 1 Apply account for 960B/2160B ((44.44%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 2 calls to Function.__call__: 2.764940e-03s
Time in Function.fn.__call__: 2.352715e-03s (85.091%)
Time in thunks: 1.017094e-03s (36.785%)
Total compile time: 4.452709e+00s
Number of Apply nodes: 31
Theano Optimizer time: 9.523201e-02s
Theano validate time: 7.679462e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.307699e-02s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.142s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.001s 1.64e-05s C 62 31 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.001s 1.64e-05s C 62 31 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.7% 4.7% 0.000s 2.41e-05s 2 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.8% 8.6% 0.000s 1.94e-05s 2 6 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.8% 12.3% 0.000s 1.91e-05s 2 14 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 16.0% 0.000s 1.90e-05s 2 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.7% 19.7% 0.000s 1.88e-05s 2 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.7% 23.4% 0.000s 1.86e-05s 2 12 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 27.0% 0.000s 1.85e-05s 2 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 30.7% 0.000s 1.85e-05s 2 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 34.2% 0.000s 1.81e-05s 2 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 37.8% 0.000s 1.81e-05s 2 9 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 41.4% 0.000s 1.81e-05s 2 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 44.9% 0.000s 1.81e-05s 2 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.6% 48.5% 0.000s 1.81e-05s 2 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
3.5% 52.0% 0.000s 1.80e-05s 2 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 55.6% 0.000s 1.80e-05s 2 19 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 59.1% 0.000s 1.80e-05s 2 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 62.6% 0.000s 1.79e-05s 2 11 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 66.1% 0.000s 1.79e-05s 2 10 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 69.6% 0.000s 1.75e-05s 2 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 73.0% 0.000s 1.75e-05s 2 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 11 Apply instances account for 26.98%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 31 Apply account for 140B/140B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 8.559227e-04s
Time in Function.fn.__call__: 8.108616e-04s (94.735%)
Time in thunks: 3.142357e-04s (36.713%)
Total compile time: 4.539160e+00s
Number of Apply nodes: 21
Theano Optimizer time: 3.893209e-02s
Theano validate time: 8.273125e-05s
Theano Linker time (includes C, CUDA code generation/compiling): 2.924204e-02s
Import time 2.619028e-03s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.146s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
58.3% 58.3% 0.000s 1.66e-05s C 11 11 theano.sandbox.cuda.basic_ops.HostFromGpu
36.9% 95.1% 0.000s 1.93e-05s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.9% 100.0% 0.000s 3.81e-06s C 4 4 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
58.3% 58.3% 0.000s 1.66e-05s C 11 11 HostFromGpu
36.9% 95.1% 0.000s 1.93e-05s C 6 6 GpuElemwise{true_div,no_inplace}
4.9% 100.0% 0.000s 3.81e-06s C 4 4 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
7.0% 7.0% 0.000s 2.19e-05s 1 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.7% 13.7% 0.000s 2.10e-05s 1 1 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.4% 20.0% 0.000s 2.00e-05s 1 3 GpuElemwise{true_div,no_inplace}(shared_mask_density, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 26.1% 0.000s 1.91e-05s 1 7 GpuElemwise{true_div,no_inplace}(shared_mean_attended, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.0% 32.1% 0.000s 1.88e-05s 1 8 GpuElemwise{true_div,no_inplace}(shared_weights_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.0% 38.1% 0.000s 1.88e-05s 1 6 GpuElemwise{true_div,no_inplace}(shared_mean_bottom_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.8% 43.9% 0.000s 1.81e-05s 1 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.8% 49.6% 0.000s 1.81e-05s 1 2 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.7% 55.3% 0.000s 1.79e-05s 1 12 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.5% 60.8% 0.000s 1.72e-05s 1 11 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.2% 65.9% 0.000s 1.62e-05s 1 13 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.1% 71.0% 0.000s 1.60e-05s 1 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.1% 76.1% 0.000s 1.60e-05s 1 5 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 80.9% 0.000s 1.50e-05s 1 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 85.7% 0.000s 1.50e-05s 1 10 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 90.4% 0.000s 1.50e-05s 1 4 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 95.1% 0.000s 1.48e-05s 1 9 HostFromGpu(shared_weights_penalty)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
1.6% 96.7% 0.000s 5.01e-06s 1 19 Elemwise{true_div,no_inplace}(HostFromGpu.0, shared_batch_size)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.3% 98.0% 0.000s 4.05e-06s 1 15 Elemwise{true_div,no_inplace}(shared_batch_size, HostFromGpu.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.0% 99.0% 0.000s 3.10e-06s 1 20 Elemwise{true_div,no_inplace}(shared_train_cost, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 1 Apply instances account for 0.99%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 21 Apply account for 100B/100B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 1 calls to Function.__call__: 4.639626e-04s
Time in Function.fn.__call__: 2.970695e-04s (64.029%)
Time in thunks: 1.273155e-04s (27.441%)
Total compile time: 4.479136e+00s
Number of Apply nodes: 5
Theano Optimizer time: 1.386118e-02s
Theano validate time: 1.111031e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 6.145954e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.148s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.55e-05s C 5 5 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.000s 2.55e-05s C 5 5 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
59.7% 59.7% 0.000s 7.61e-05s 1 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
15.7% 75.5% 0.000s 2.00e-05s 1 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
13.5% 89.0% 0.000s 1.72e-05s 1 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
7.9% 96.8% 0.000s 1.00e-05s 1 3 DeepCopyOp(TensorConstant{0})
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
3.2% 100.0% 0.000s 4.05e-06s 1 4 DeepCopyOp(TensorConstant{0.0})
input 0: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 5 Apply account for 28B/28B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 1.401901e-04s
Time in Function.fn.__call__: 1.139641e-04s (81.293%)
Time in thunks: 3.004074e-05s (21.429%)
Total compile time: 4.912266e+00s
Number of Apply nodes: 3
Theano Optimizer time: 1.049495e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.658844e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.149s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
66.7% 66.7% 0.000s 2.00e-05s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
19.8% 86.5% 0.000s 5.96e-06s C 1 1 theano.compile.ops.DeepCopyOp
13.5% 100.0% 0.000s 4.05e-06s C 1 1 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
66.7% 66.7% 0.000s 2.00e-05s C 1 1 HostFromGpu
19.8% 86.5% 0.000s 5.96e-06s C 1 1 DeepCopyOp
13.5% 100.0% 0.000s 4.05e-06s C 1 1 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
66.7% 66.7% 0.000s 2.00e-05s 1 1 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
19.8% 86.5% 0.000s 5.96e-06s 1 0 DeepCopyOp(shared_batch_size)
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
13.5% 100.0% 0.000s 4.05e-06s 1 2 Elemwise{true_div,no_inplace}(shared_mean_total_reward, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 3 Apply account for 20B/20B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 61 calls to Function.__call__: 1.211181e+01s
Time in Function.fn.__call__: 1.210473e+01s (99.942%)
Time in thunks: 1.171248e+01s (96.703%)
Total compile time: 1.925457e+01s
Number of Apply nodes: 274
Theano Optimizer time: 5.967708e+00s
Theano validate time: 2.864373e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 9.222651e+00s
Import time 3.308520e-01s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 673.150s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
74.1% 74.1% 8.684s 1.42e-01s Py 61 1 lvsr.ops.EditDistanceOp
24.4% 98.5% 2.853s 2.34e-02s Py 122 2 theano.scan_module.scan_op.Scan
0.5% 99.0% 0.064s 2.10e-04s C 305 5 theano.sandbox.cuda.blas.GpuDot22
0.2% 99.2% 0.023s 4.16e-05s C 549 9 theano.sandbox.cuda.basic_ops.GpuElemwise
0.2% 99.4% 0.021s 2.93e-06s C 7259 119 theano.tensor.elemwise.Elemwise
0.1% 99.5% 0.012s 1.93e-04s C 61 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.6% 0.008s 3.45e-05s C 244 4 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 99.7% 0.007s 2.32e-05s C 305 5 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 99.7% 0.005s 8.45e-05s C 61 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.0% 99.7% 0.004s 2.97e-06s C 1464 24 theano.compile.ops.Shape_i
0.0% 99.8% 0.004s 2.14e-05s C 183 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 99.8% 0.004s 3.71e-06s C 976 16 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 99.8% 0.003s 2.92e-06s C 1098 18 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 99.9% 0.003s 2.92e-06s C 1037 17 theano.tensor.opt.MakeVector
0.0% 99.9% 0.003s 2.41e-05s C 122 2 theano.compile.ops.DeepCopyOp
0.0% 99.9% 0.002s 2.38e-06s C 1037 17 theano.tensor.basic.ScalarFromTensor
0.0% 99.9% 0.002s 7.78e-06s Py 305 3 theano.ifelse.IfElse
0.0% 99.9% 0.002s 4.07e-06s C 549 9 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.002s 5.39e-06s C 305 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.001s 6.56e-06s Py 183 3 theano.compile.ops.Rebroadcast
... (remaining 8 Classes account for 0.03%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
74.1% 74.1% 8.684s 1.42e-01s Py 61 1 EditDistanceOp
19.5% 93.7% 2.286s 3.75e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan}
4.8% 98.5% 0.567s 9.29e-03s Py 61 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
0.5% 99.0% 0.064s 2.10e-04s C 305 5 GpuDot22
0.2% 99.2% 0.018s 5.83e-05s C 305 5 GpuElemwise{Add}[(0, 0)]
0.1% 99.3% 0.012s 1.93e-04s C 61 1 GpuJoin
0.1% 99.4% 0.007s 2.32e-05s C 305 5 GpuIncSubtensor{InplaceSet;:int64:}
0.1% 99.4% 0.007s 3.83e-05s C 183 3 GpuAlloc
0.0% 99.5% 0.005s 8.45e-05s C 61 1 GpuAdvancedSubtensor1
0.0% 99.5% 0.004s 2.14e-05s C 183 3 HostFromGpu
0.0% 99.5% 0.004s 2.10e-05s C 183 3 GpuElemwise{sub,no_inplace}
0.0% 99.6% 0.003s 2.92e-06s C 1037 17 MakeVector{dtype='int64'}
0.0% 99.6% 0.003s 2.41e-05s C 122 2 DeepCopyOp
0.0% 99.6% 0.002s 3.71e-06s C 671 11 GpuReshape{2}
0.0% 99.6% 0.002s 2.38e-06s C 1037 17 ScalarFromTensor
0.0% 99.6% 0.002s 2.81e-06s C 793 13 Shape_i{0}
0.0% 99.7% 0.002s 3.16e-06s C 671 11 Shape_i{1}
0.0% 99.7% 0.002s 2.76e-06s C 671 11 Elemwise{add,no_inplace}
0.0% 99.7% 0.002s 2.72e-06s C 610 10 Elemwise{sub,no_inplace}
0.0% 99.7% 0.002s 5.39e-06s C 305 5 GpuAllocEmpty
... (remaining 72 Ops account for 0.30%(0.03s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
74.1% 74.1% 8.684s 1.42e-01s 61 269 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
19.5% 93.7% 2.286s 3.75e-02s 61 260 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
4.8% 98.5% 0.567s 9.29e-03s 61 247 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 5: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 6: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 7: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 8: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 9: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 10: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.2% 98.7% 0.019s 3.10e-04s 61 140 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.2% 98.8% 0.018s 3.03e-04s 61 142 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.1% 98.9% 0.012s 1.93e-04s 61 255 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.1% 99.0% 0.011s 1.85e-04s 61 257 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.1% 99.1% 0.008s 1.27e-04s 61 139 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.1% 99.1% 0.008s 1.24e-04s 61 141 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.005s 8.45e-05s 61 65 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(900,), strides=c
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.005s 7.52e-05s 61 170 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.005s 7.49e-05s 61 172 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.003s 4.81e-05s 61 169 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.3% 0.003s 4.72e-05s 61 259 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.3% 0.003s 4.63e-05s 61 171 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 4.08e-05s 61 47 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.73e-05s 61 107 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, generator_generate_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.67e-05s 61 59 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.37e-05s 61 160 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 1: dtype=float32, shape=(1, 92160), strides=(0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(2, 92160), strides=(92160, 1)
0.0% 99.4% 0.002s 2.63e-05s 61 4 DeepCopyOp(CudaNdarrayConstant{1.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
... (remaining 254 Apply instances account for 0.56%(0.07s) of the runtime)
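Nearly all of the runtime in this block is attributed to nodes whose type column reads Py: lvsr.ops.EditDistanceOp (74.1%) and the two Scan nodes (24.4%). Py ops execute their perform() method in plain Python on the host on every call, which is typically why they dominate such profiles. For illustration only, a minimal perform-based op looks roughly like the toy sketch below; it is a stand-in, not the real lvsr.ops.EditDistanceOp.

import numpy as np
import theano
import theano.tensor as T


class ZeroDistanceOp(theano.Op):
    # Toy op returning a zero 'distance' per batch column, to show the shape
    # of a perform-based ('Py') op. Not the real edit-distance computation.
    __props__ = ()

    def make_node(self, predictions, labels):
        predictions = T.as_tensor_variable(predictions)
        labels = T.as_tensor_variable(labels)
        return theano.Apply(self, [predictions, labels], [T.lvector()])

    def perform(self, node, inputs, output_storage):
        predictions, labels = inputs
        # Placeholder body; a real op would loop over the batch here, and that
        # Python-side loop is exactly the cost the profile reports under 'Py'.
        output_storage[0][0] = np.zeros(predictions.shape[1], dtype='int64')


preds = T.lmatrix('preds')
labs = T.lmatrix('labs')
f = theano.function([preds, labs], ZeroDistanceOp()(preds, labs))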
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 22KB (22KB)
GPU: 3175KB (3660KB)
CPU + GPU: 3197KB (3682KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 22KB (22KB)
GPU: 3526KB (4334KB)
CPU + GPU: 3548KB (4356KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 36KB
GPU: 5187KB
CPU + GPU: 5223KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(900, 200)] v GpuReshape{2}(GpuJoin.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0)
360000B [(12, 75, 100)] c GpuAllocEmpty(Elemwise{add,no_inplace}.0, Elemwise{Switch}[(0, 1)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
360000B [(900, 100)] c GpuAdvancedSubtensor1(W, Reshape{1}.0)
360000B [(12, 75, 100)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
... (remaining 254 Apply account for 6802854B/19219614B ((35.40%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 732 steps) 5.621994e-01s
Total time spent in calling the VM 5.386684e-01s (95.814%)
Total overhead (computing slices..) 2.353096e-02s (4.186%)
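The per-step breakdown that follows is available because the Scan node itself carries a profile. A hedged sketch of requesting this explicitly for a loop of one's own, using the standard theano.scan interface and illustrative names:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
outputs, updates = theano.scan(
    fn=lambda row, acc: acc + row,      # running sum over the rows of x
    sequences=x,
    outputs_info=T.zeros_like(x[0]),
    profile='toy_scan',                 # True also works; a string names the report
)

f = theano.function([x], outputs[-1], updates=updates, profile=True)
f(np.ones((10, 4), dtype=theano.config.floatX))

# The report should then contain a separate "Scan Op profiling" section for this loop.
f.profile.summary()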
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
68.8% 68.8% 0.229s 7.80e-05s C 2928 4 theano.sandbox.cuda.blas.GpuGemm
28.4% 97.1% 0.094s 2.15e-05s C 4392 6 theano.sandbox.cuda.basic_ops.GpuElemwise
2.9% 100.0% 0.010s 3.25e-06s C 2928 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
68.8% 68.8% 0.229s 7.80e-05s C 2928 4 GpuGemm{no_inplace}
10.5% 79.3% 0.035s 2.38e-05s C 1464 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
9.2% 88.4% 0.030s 2.08e-05s C 1464 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
8.7% 97.1% 0.029s 1.98e-05s C 1464 2 GpuElemwise{mul,no_inplace}
1.5% 98.7% 0.005s 3.44e-06s C 1464 2 GpuSubtensor{::, :int64:}
1.3% 100.0% 0.004s 3.06e-06s C 1464 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
23.0% 23.0% 0.076s 1.04e-04s 732 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
22.5% 45.5% 0.075s 1.02e-04s 732 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
11.7% 57.1% 0.039s 5.30e-05s 732 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
11.6% 68.8% 0.039s 5.27e-05s 732 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.3% 74.0% 0.018s 2.40e-05s 732 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.2% 79.3% 0.017s 2.36e-05s 732 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.6% 83.9% 0.015s 2.09e-05s 732 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
4.6% 88.4% 0.015s 2.07e-05s 732 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
4.4% 92.8% 0.015s 2.00e-05s 732 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.3% 97.1% 0.014s 1.96e-05s 732 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.8% 97.9% 0.003s 3.46e-06s 732 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.8% 98.7% 0.002s 3.41e-06s 732 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.7% 99.3% 0.002s 3.10e-06s 732 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.7% 100.0% 0.002s 3.01e-06s 732 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 146KB (205KB)
CPU + GPU: 146KB (205KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 146KB (205KB)
CPU + GPU: 146KB (205KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 293KB
CPU + GPU: 293KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
... (remaining 0 Apply account for 0B/540000B ((0.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
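
As a sanity check on the scan above, the fused kernel GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))}} together with the GpuGemm calls can be read back as the usual masked GRU step. This is an interpretation of the graph, not something the profiler states; m_t is the sequence mask, z_t and r_t are the two halves of the ScalarSigmoid output, and W_{zr}, W_{ss} stand for state_to_gates and state_to_state:

    z_t,\; r_t \;=\; \mathrm{split}\!\big(\sigma(h_{t-1} W_{zr} + x^{\mathrm{gates}}_t)\big)
    \tilde h_t \;=\; \tanh\!\big((r_t \odot h_{t-1})\, W_{ss} + x_t\big)
    h_t \;=\; m_t \odot \big(z_t \odot \tilde h_t + (1 - z_t) \odot h_{t-1}\big) + (1 - m_t) \odot h_{t-1}

The four GpuGemm{no_inplace} nodes (75x100 times 100x200 and 75x100 times 100x100, one pair per recurrence fused into this scan) are the matrix products in the first two lines and account for roughly 69% of the scan's runtime.
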
Scan Op profiling ( generator_generate_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 915 steps) 2.276112e+00s
Total time spent in calling the VM 2.183355e+00s (95.925%)
Total overhead (computing slices...) 9.275723e-02s (4.075%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
27.2% 27.2% 0.343s 7.49e-05s C 4575 5 theano.sandbox.cuda.blas.GpuGemm
25.6% 52.8% 0.322s 2.70e-05s C 11895 13 theano.sandbox.cuda.basic_ops.GpuElemwise
21.5% 74.3% 0.271s 5.92e-05s C 4575 5 theano.sandbox.cuda.blas.GpuDot22
8.2% 82.5% 0.103s 2.25e-05s C 4575 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
3.2% 85.7% 0.041s 4.44e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.1% 88.8% 0.039s 4.23e-05s C 915 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
2.9% 91.7% 0.037s 2.02e-05s C 1830 2 theano.sandbox.cuda.basic_ops.HostFromGpu
1.9% 93.6% 0.024s 2.64e-05s C 915 1 theano.tensor.basic.MaxAndArgmax
1.1% 94.7% 0.014s 1.51e-05s C 915 1 theano.sandbox.multinomial.MultinomialFromUniform
1.1% 95.8% 0.013s 2.43e-06s C 5490 6 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.0% 96.8% 0.013s 1.43e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuFromHost
0.8% 97.7% 0.011s 2.31e-06s C 4575 5 theano.compile.ops.Shape_i
0.7% 98.4% 0.009s 3.28e-06s C 2745 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.6% 98.9% 0.007s 1.93e-06s C 3660 4 theano.tensor.opt.MakeVector
0.5% 99.4% 0.006s 3.39e-06s C 1830 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.8% 0.004s 2.30e-06s C 1830 2 theano.tensor.elemwise.Elemwise
0.2% 100.0% 0.003s 3.25e-06s C 915 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
27.2% 27.2% 0.343s 7.49e-05s C 4575 5 GpuGemm{inplace}
21.5% 48.8% 0.271s 5.92e-05s C 4575 5 GpuDot22
7.0% 55.7% 0.088s 4.80e-05s C 1830 2 GpuElemwise{mul,no_inplace}
3.5% 59.2% 0.043s 4.75e-05s C 915 1 GpuElemwise{add,no_inplace}
3.2% 62.4% 0.041s 4.44e-05s C 915 1 GpuAdvancedSubtensor1
3.1% 65.5% 0.039s 4.23e-05s C 915 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
2.9% 68.4% 0.037s 2.02e-05s C 1830 2 HostFromGpu
2.2% 70.6% 0.028s 3.01e-05s C 915 1 GpuCAReduce{add}{1,0,0}
2.2% 72.8% 0.027s 2.98e-05s C 915 1 GpuElemwise{Tanh}[(0, 0)]
1.9% 74.7% 0.024s 2.64e-05s C 915 1 MaxAndArgmax
1.9% 76.5% 0.023s 2.56e-05s C 915 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
1.8% 78.3% 0.022s 2.42e-05s C 915 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
1.7% 80.0% 0.021s 2.28e-05s C 915 1 GpuCAReduce{maximum}{0,1}
1.6% 81.6% 0.020s 2.24e-05s C 915 1 GpuCAReduce{maximum}{1,0}
1.4% 83.0% 0.018s 1.92e-05s C 915 1 GpuElemwise{Add}[(0, 1)]
1.4% 84.4% 0.018s 1.92e-05s C 915 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
1.4% 85.7% 0.017s 1.89e-05s C 915 1 GpuCAReduce{add}{0,1}
1.4% 87.1% 0.017s 1.89e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
1.4% 88.5% 0.017s 1.87e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)]
1.3% 89.8% 0.017s 1.85e-05s C 915 1 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 20 Ops account for 10.18%(0.13s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
9.7% 9.7% 0.122s 1.34e-04s 915 10 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
7.4% 17.1% 0.093s 1.01e-04s 915 5 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
6.6% 23.7% 0.084s 9.14e-05s 915 32 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
5.6% 29.3% 0.071s 7.73e-05s 915 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(900, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(900, 1), strides=c
5.5% 34.8% 0.069s 7.57e-05s 915 56 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
input 0: dtype=float32, shape=(12, 75, 1), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(12, 75, 200), strides=c
4.7% 39.5% 0.059s 6.45e-05s 915 38 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.5% 43.0% 0.043s 4.75e-05s 915 43 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=c
3.2% 46.2% 0.041s 4.44e-05s 915 29 GpuAdvancedSubtensor1(W_copy[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(75,), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.2% 49.4% 0.040s 4.35e-05s 915 8 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
3.1% 52.5% 0.039s 4.25e-05s 915 37 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.1% 55.5% 0.039s 4.23e-05s 915 13 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(75,), strides=c
3.0% 58.6% 0.038s 4.17e-05s 915 41 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.0% 61.6% 0.038s 4.14e-05s 915 39 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
2.4% 64.0% 0.031s 3.35e-05s 915 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
2.2% 66.2% 0.028s 3.01e-05s 915 57 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
2.2% 68.4% 0.027s 2.98e-05s 915 45 GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=c
output 0: dtype=float32, shape=(900, 100), strides=c
1.9% 70.3% 0.024s 2.64e-05s 915 27 MaxAndArgmax(MultinomialFromUniform{int64}.0, TensorConstant{(1,) of 1})
input 0: dtype=int64, shape=(75, 44), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=int64, shape=(75,), strides=c
output 1: dtype=int64, shape=(75,), strides=c
1.9% 72.1% 0.023s 2.56e-05s 915 33 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
1.8% 73.9% 0.022s 2.42e-05s 915 40 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
1.7% 75.6% 0.021s 2.29e-05s 915 25 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(75, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
... (remaining 38 Apply instances account for 24.45%(0.31s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 41KB
GPU: 1709KB
CPU + GPU: 1750KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
720000B [(12, 75, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
368940B [(92160,), (75,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
360000B [(900, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
360000B [(12, 75, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
360000B [(900, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
60000B [(75, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
60000B [(75, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
60000B [(75, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax)
30000B [(1, 75, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
30000B [(75, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
30000B [(75, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
... (remaining 38 Apply account for 158879B/2927819B ((5.43%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
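
The attention-related kernels in the generator scan above (the broadcasted add with the preprocessed attended sequence, Tanh, the GpuDot22 against a (100, 1) vector, the maximum/exp/sum/TrueDiv chain, and the final GpuElemwise{mul} + GpuCAReduce{add}{1,0,0}) follow the familiar content-based attention pattern. Read as math (an interpretation; P_j is the preprocessed attended vector at position j, a_j the attended vector, s the current state, W the state projection, and v the (100, 1) scoring vector):

    e_j \;=\; v^{\top} \tanh\!\big(P_j + W s\big), \qquad j = 1, \dots, 12
    \alpha_j \;=\; \frac{\exp(e_j - \max_k e_k)}{\sum_k \exp(e_k - \max_k e_k)}
    c \;=\; \sum_j \alpha_j\, a_j

The subtraction of the per-column maximum before the exponential is the standard numerically stable softmax, which is why GpuCAReduce{maximum} shows up next to the exp kernels.
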
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:103
Time in 11 calls to Function.__call__: 1.319449e-01s
Time in Function.fn.__call__: 1.316185e-01s (99.753%)
Time in thunks: 8.657598e-02s (65.615%)
Total compile time: 1.813622e+01s
Number of Apply nodes: 183
Theano Optimizer time: 4.002905e+00s
Theano validate time: 1.576922e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 1.015641e+01s
Import time 6.427932e-02s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.235s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
88.6% 88.6% 0.077s 3.49e-03s Py 22 2 theano.scan_module.scan_op.Scan
3.5% 92.1% 0.003s 2.82e-06s C 1089 99 theano.tensor.elemwise.Elemwise
1.5% 93.6% 0.001s 2.96e-05s C 44 4 theano.sandbox.cuda.blas.GpuDot22
1.0% 94.6% 0.001s 1.91e-05s C 44 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.7% 95.3% 0.001s 5.69e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.6% 95.9% 0.000s 2.26e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.6% 96.5% 0.000s 4.40e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.5% 97.0% 0.000s 3.59e-06s C 121 11 theano.sandbox.cuda.basic_ops.GpuReshape
0.5% 97.5% 0.000s 1.97e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.5% 98.0% 0.000s 2.91e-06s C 143 13 theano.compile.ops.Shape_i
0.4% 98.4% 0.000s 2.84e-06s C 121 11 theano.tensor.opt.MakeVector
0.4% 98.7% 0.000s 2.31e-06s C 132 12 theano.tensor.basic.ScalarFromTensor
0.3% 99.0% 0.000s 4.05e-06s C 66 6 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.3% 0.000s 2.84e-06s C 88 8 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 99.6% 0.000s 2.26e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.2% 99.8% 0.000s 6.34e-06s Py 22 2 theano.compile.ops.Rebroadcast
0.1% 99.9% 0.000s 5.42e-06s C 22 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.1% 100.0% 0.000s 5.35e-06s C 11 1 theano.tensor.basic.Alloc
0.0% 100.0% 0.000s 3.19e-06s C 11 1 theano.tensor.basic.Reshape
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
88.6% 88.6% 0.077s 3.49e-03s Py 22 2 forall_inplace,gpu,gatedrecurrent_apply_scan}
1.5% 90.1% 0.001s 2.96e-05s C 44 4 GpuDot22
1.0% 91.1% 0.001s 1.91e-05s C 44 4 GpuElemwise{Add}[(0, 0)]
0.7% 91.8% 0.001s 5.69e-05s C 11 1 GpuJoin
0.6% 92.4% 0.000s 2.26e-05s C 22 2 GpuAlloc
0.6% 92.9% 0.000s 4.40e-05s C 11 1 GpuAdvancedSubtensor1
0.5% 93.4% 0.000s 1.97e-05s C 22 2 GpuIncSubtensor{InplaceSet;:int64:}
0.4% 93.8% 0.000s 2.84e-06s C 121 11 MakeVector{dtype='int64'}
0.4% 94.2% 0.000s 2.31e-06s C 132 12 ScalarFromTensor
0.3% 94.5% 0.000s 3.66e-06s C 77 7 GpuReshape{2}
0.3% 94.8% 0.000s 2.80e-06s C 99 9 Elemwise{add,no_inplace}
0.3% 95.1% 0.000s 2.26e-05s C 11 1 HostFromGpu
0.3% 95.4% 0.000s 3.05e-06s C 77 7 Shape_i{0}
0.3% 95.6% 0.000s 2.60e-06s C 88 8 Elemwise{le,no_inplace}
0.2% 95.9% 0.000s 2.94e-06s C 66 6 GpuDimShuffle{x,x,0}
0.2% 96.1% 0.000s 2.75e-06s C 66 6 Shape_i{1}
0.2% 96.3% 0.000s 2.53e-06s C 66 6 Elemwise{sub,no_inplace}
0.2% 96.5% 0.000s 2.98e-06s C 55 5 Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)]
0.2% 96.6% 0.000s 3.46e-06s C 44 4 GpuReshape{3}
0.2% 96.8% 0.000s 2.60e-06s C 55 5 Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}}
... (remaining 54 Ops account for 3.19%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
44.7% 44.7% 0.039s 3.52e-03s 11 133 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
43.9% 88.6% 0.038s 3.46e-03s 11 175 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.7% 89.3% 0.001s 5.69e-05s 11 181 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.6% 89.9% 0.000s 4.40e-05s 11 26 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(12,), strides=c
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.4% 90.3% 0.000s 3.07e-05s 11 51 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.4% 90.7% 0.000s 3.06e-05s 11 49 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.4% 91.0% 0.000s 2.92e-05s 11 50 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.4% 91.4% 0.000s 2.80e-05s 11 48 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.3% 91.7% 0.000s 2.40e-05s 11 96 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.3% 92.0% 0.000s 2.26e-05s 11 182 HostFromGpu(GpuJoin.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=c
0.3% 92.2% 0.000s 2.12e-05s 11 64 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.3% 92.5% 0.000s 2.05e-05s 11 130 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.3% 92.8% 0.000s 2.02e-05s 11 71 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.0% 0.000s 1.92e-05s 11 73 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.2% 0.000s 1.89e-05s 11 160 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.5% 0.000s 1.87e-05s 11 72 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.2% 93.7% 0.000s 1.85e-05s 11 74 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.1% 93.8% 0.000s 6.39e-06s 11 125 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.1% 93.9% 0.000s 6.35e-06s 11 159 Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(GE(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), (i2 + i3 + i4 + i5), i0)}((Composite{((Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), (i2 - i3), i0)}(Composite{((i0 - (Switch(LT(i1, i2), i2, i1) - i3)) - i4)}(i0, Composite{(((i0 - i1) // i2) + i3)}(i1, i2, i3, i4), i5, i6, i7),
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int8, shape=(), strides=c
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=int8, shape=(), strides=c
input 6: dtype=int8, shape=(), strides=c
input 7: dtype=int8, shape=(), strides=c
input 8: dtype=int8, shape=(), strides=c
input 9: dtype=int8, shape=(), strides=c
input 10: dtype=int64, shape=(), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int8, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
0.1% 94.0% 0.000s 6.29e-06s 11 91 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
... (remaining 163 Apply instances account for 6.04%(0.01s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 9KB (9KB)
GPU: 28KB (34KB)
CPU + GPU: 38KB (43KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 9KB (9KB)
GPU: 33KB (38KB)
CPU + GPU: 42KB (48KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 10KB
GPU: 52KB
CPU + GPU: 63KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(100, 200)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
9600B [(12, 1, 200)] c HostFromGpu(GpuJoin.0)
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
4800B [(12, 1, 100)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
4800B [(12, 1, 100)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
... (remaining 163 Apply account for 65253B/430053B ((15.17%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
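
The two what-if blocks in the memory profile above refer to ordinary Theano flags (optimizer_excluding=inplace and allow_gc=False). A minimal sketch of trying them, assuming the flags are applied process-wide via the environment (they are read when theano is imported):

    import os
    # allow_gc=False keeps intermediate buffers alive between calls
    # (more memory, fewer (de)allocations); optimizer_excluding=inplace
    # turns off the inplace rewrites that the second what-if measures.
    os.environ['THEANO_FLAGS'] = 'allow_gc=False,optimizer_excluding=inplace'
    import theano  # must come after the flags are set

As the tips note says, such changes should be timed first; neither flag is guaranteed to help.
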
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 3.813338e-02s
Total time spent in calling the VM 3.587055e-02s (94.066%)
Total overhead (computing slices...) 2.262831e-03s (5.934%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
55.7% 55.7% 0.009s 3.56e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
39.3% 95.0% 0.007s 1.67e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
5.0% 100.0% 0.001s 3.22e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
55.7% 55.7% 0.009s 3.56e-05s C 264 2 GpuGemm{no_inplace}
13.4% 69.1% 0.002s 1.71e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
13.0% 82.1% 0.002s 1.66e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
12.9% 95.0% 0.002s 1.64e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.7% 97.6% 0.000s 3.42e-06s C 132 1 GpuSubtensor{::, :int64:}
2.4% 100.0% 0.000s 3.02e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.8% 29.8% 0.005s 3.80e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
25.9% 55.7% 0.004s 3.31e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
13.4% 69.1% 0.002s 1.71e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
13.0% 82.1% 0.002s 1.66e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
12.9% 95.0% 0.002s 1.64e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.7% 97.6% 0.000s 3.42e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.4% 100.0% 0.000s 3.02e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 3.749466e-02s
Total time spent in calling the VM 3.560066e-02s (94.949%)
Total overhead (computing slices...) 1.893997e-03s (5.051%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
55.9% 55.9% 0.009s 3.55e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
39.2% 95.0% 0.007s 1.66e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
5.0% 100.0% 0.001s 3.18e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
55.9% 55.9% 0.009s 3.55e-05s C 264 2 GpuGemm{no_inplace}
13.3% 69.1% 0.002s 1.69e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
13.0% 82.1% 0.002s 1.65e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
12.9% 95.0% 0.002s 1.65e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.6% 97.6% 0.000s 3.31e-06s C 132 1 GpuSubtensor{::, :int64:}
2.4% 100.0% 0.000s 3.04e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.8% 29.8% 0.005s 3.79e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
26.1% 55.9% 0.004s 3.32e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
13.3% 69.1% 0.002s 1.69e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
13.0% 82.1% 0.002s 1.65e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
12.9% 95.0% 0.002s 1.65e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.6% 97.6% 0.000s 3.31e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.4% 100.0% 0.000s 3.04e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:111
Time in 11 calls to Function.__call__: 2.414465e-03s
Time in Function.fn.__call__: 2.146721e-03s (88.911%)
Time in thunks: 4.596710e-04s (19.038%)
Total compile time: 5.729262e+00s
Number of Apply nodes: 8
Theano Optimizer time: 3.657293e-02s
Theano validate time: 4.487038e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.374197e-02s
Import time 5.259037e-03s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.290s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
43.0% 43.0% 0.000s 1.80e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
20.6% 63.6% 0.000s 4.30e-06s C 22 2 theano.tensor.basic.Alloc
14.9% 78.5% 0.000s 3.12e-06s C 22 2 theano.compile.ops.Shape_i
7.2% 85.7% 0.000s 3.01e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
7.2% 92.9% 0.000s 2.99e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuReshape
7.1% 100.0% 0.000s 2.97e-06s C 11 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
43.0% 43.0% 0.000s 1.80e-05s C 11 1 HostFromGpu
20.6% 63.6% 0.000s 4.30e-06s C 22 2 Alloc
14.9% 78.5% 0.000s 3.12e-06s C 22 2 Shape_i{0}
7.2% 85.7% 0.000s 3.01e-06s C 11 1 GpuDimShuffle{x,x,0}
7.2% 92.9% 0.000s 2.99e-06s C 11 1 GpuReshape{2}
7.1% 100.0% 0.000s 2.97e-06s C 11 1 MakeVector{dtype='int64'}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
43.0% 43.0% 0.000s 1.80e-05s 11 7 HostFromGpu(GpuReshape{2}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
10.8% 53.8% 0.000s 4.53e-06s 11 4 Alloc(TensorConstant{0.0}, TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 12), strides=c
9.8% 63.6% 0.000s 4.07e-06s 11 1 Alloc(TensorConstant{0.0}, TensorConstant{1}, TensorConstant{200})
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int16, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
8.5% 72.0% 0.000s 3.53e-06s 11 0 Shape_i{0}(generator_generate_attended)
input 0: dtype=float32, shape=(12, 1, 200), strides=c
output 0: dtype=int64, shape=(), strides=c
7.2% 79.3% 0.000s 3.01e-06s 11 3 GpuDimShuffle{x,x,0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
7.2% 86.4% 0.000s 2.99e-06s 11 6 GpuReshape{2}(GpuDimShuffle{x,x,0}.0, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
7.1% 93.5% 0.000s 2.97e-06s 11 5 MakeVector{dtype='int64'}(TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(2,), strides=c
6.5% 100.0% 0.000s 2.71e-06s 11 2 Shape_i{0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 0KB
CPU + GPU: 1KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 8 Apply account for 2080B/2080B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
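
For completeness, function reports like the ones in this dump are what Theano prints when profiling is enabled. A minimal, self-contained sketch (toy graph and shapes are placeholders, not the profiled model):

    import os
    # Must be set before theano is imported; profile_memory adds the
    # "Memory Profile" blocks seen above.
    os.environ['THEANO_FLAGS'] = 'profile=True,profile_memory=True'
    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    f = theano.function([x], T.nnet.sigmoid(x).sum())
    f(numpy.random.rand(75, 100).astype('float32'))  # stats accumulate per call
    f.profile.summary()  # print this function's report on demand
                         # (a summary is also printed automatically at exit)

The per-scan sections (e.g. "Scan Op profiling ( gatedrecurrent_apply_scan )") are tied to the scan being created with theano.scan(..., profile=True), which attaches a profiler to the inner function.
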
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:126
Time in 176 calls to Function.__call__: 4.031258e-01s
Time in Function.fn.__call__: 3.963535e-01s (98.320%)
Time in thunks: 1.376257e-01s (34.140%)
Total compile time: 6.464948e+00s
Number of Apply nodes: 75
Theano Optimizer time: 4.475892e-01s
Theano validate time: 2.268028e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 1.257081e-01s
Import time 3.001761e-02s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.292s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
22.7% 22.7% 0.031s 1.77e-05s C 1760 10 theano.sandbox.cuda.basic_ops.GpuElemwise
17.7% 40.4% 0.024s 2.77e-05s C 880 5 theano.sandbox.cuda.blas.GpuDot22
14.8% 55.3% 0.020s 2.90e-05s C 704 4 theano.sandbox.cuda.blas.GpuGemm
8.6% 63.9% 0.012s 1.34e-05s C 880 5 theano.sandbox.cuda.basic_ops.GpuFromHost
8.1% 72.0% 0.011s 1.58e-05s C 704 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.6% 79.6% 0.011s 1.99e-05s C 528 3 theano.sandbox.cuda.basic_ops.GpuCAReduce
5.3% 84.9% 0.007s 4.17e-05s C 176 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.0% 87.9% 0.004s 2.90e-06s C 1408 8 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.9% 90.8% 0.004s 3.28e-06s C 1232 7 theano.sandbox.cuda.basic_ops.GpuReshape
2.8% 93.6% 0.004s 2.43e-06s C 1584 9 theano.tensor.elemwise.Elemwise
2.6% 96.2% 0.004s 2.54e-06s C 1408 8 theano.compile.ops.Shape_i
2.3% 98.5% 0.003s 2.52e-06s C 1232 7 theano.tensor.opt.MakeVector
0.9% 99.4% 0.001s 3.37e-06s C 352 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.7% 0.000s 2.54e-06s C 176 1 theano.tensor.elemwise.All
0.3% 100.0% 0.000s 2.43e-06s C 176 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.7% 17.7% 0.024s 2.77e-05s C 880 5 GpuDot22
14.8% 32.6% 0.020s 2.90e-05s C 704 4 GpuGemm{inplace}
8.6% 41.2% 0.012s 1.34e-05s C 880 5 GpuFromHost
8.1% 49.3% 0.011s 1.58e-05s C 704 4 HostFromGpu
5.3% 54.6% 0.007s 4.17e-05s C 176 1 GpuAdvancedSubtensor1
2.9% 57.5% 0.004s 2.23e-05s C 176 1 GpuCAReduce{maximum}{1,0}
2.5% 59.9% 0.003s 3.22e-06s C 1056 6 GpuReshape{2}
2.4% 62.4% 0.003s 1.91e-05s C 176 1 GpuCAReduce{add}{1,0,0}
2.4% 64.8% 0.003s 1.90e-05s C 176 1 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)]
2.4% 67.2% 0.003s 1.87e-05s C 176 1 GpuElemwise{Mul}[(0, 1)]
2.4% 69.5% 0.003s 1.84e-05s C 176 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)]
2.3% 71.9% 0.003s 1.84e-05s C 176 1 GpuElemwise{mul,no_inplace}
2.3% 74.2% 0.003s 1.83e-05s C 176 1 GpuCAReduce{add}{1,0}
2.3% 76.5% 0.003s 1.78e-05s C 176 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.3% 78.8% 0.003s 2.52e-06s C 1232 7 MakeVector{dtype='int64'}
2.2% 81.0% 0.003s 1.73e-05s C 176 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.2% 83.2% 0.003s 1.71e-05s C 176 1 GpuElemwise{Sub}[(0, 1)]
2.2% 85.4% 0.003s 1.71e-05s C 176 1 GpuElemwise{Add}[(0, 0)]
2.2% 87.5% 0.003s 1.69e-05s C 176 1 GpuElemwise{TrueDiv}[(0, 0)]
2.1% 89.7% 0.003s 1.67e-05s C 176 1 GpuElemwise{Tanh}[(0, 0)]
... (remaining 21 Ops account for 10.34%(0.01s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.3% 5.3% 0.007s 4.17e-05s 176 10 GpuAdvancedSubtensor1(W, readout_sample_samples)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(1,), strides=(16,)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.5% 9.9% 0.006s 3.54e-05s 176 26 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.3% 14.2% 0.006s 3.36e-05s 176 34 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
3.8% 18.0% 0.005s 3.01e-05s 176 44 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.8% 21.8% 0.005s 2.96e-05s 176 21 GpuDot22(GpuFromHost.0, state_to_gates)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
3.4% 25.2% 0.005s 2.68e-05s 176 30 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
3.3% 28.5% 0.005s 2.60e-05s 176 43 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.2% 31.8% 0.004s 2.54e-05s 176 47 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)].0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
3.1% 34.9% 0.004s 2.42e-05s 176 53 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
3.0% 37.9% 0.004s 2.38e-05s 176 45 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.9% 40.8% 0.004s 2.23e-05s 176 55 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.4% 43.2% 0.003s 1.91e-05s 176 73 GpuCAReduce{add}{1,0,0}(GpuElemwise{Mul}[(0, 1)].0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.4% 45.6% 0.003s 1.90e-05s 176 50 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=c
input 2: dtype=float32, shape=(1, 1, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=c
2.4% 48.0% 0.003s 1.87e-05s 176 72 GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1, 1), strides=(1, 0, 0)
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
2.4% 50.4% 0.003s 1.84e-05s 176 46 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, GpuFromHost.0, CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(1, 100), strides=(0, 1)
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.3% 52.7% 0.003s 1.84e-05s 176 42 GpuElemwise{mul,no_inplace}(GpuFromHost.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.3% 55.1% 0.003s 1.83e-05s 176 59 GpuCAReduce{add}{1,0}(GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.3% 57.4% 0.003s 1.78e-05s 176 57 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)](GpuReshape{2}.0, GpuDimShuffle{x,0}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
input 2: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
2.2% 59.6% 0.003s 1.73e-05s 176 35 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(1, 200), strides=(0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.2% 61.8% 0.003s 1.71e-05s 176 58 GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[[ 1.]]}, GpuFromHost.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
... (remaining 55 Apply instances account for 38.24%(0.05s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 18KB
CPU + GPU: 20KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(200, 100)] v GpuReshape{2}(W, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] v GpuReshape{2}(GpuFromHost.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] c GpuFromHost(generator_generate_attended)
9600B [(12, 1, 200)] i GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
4800B [(12, 1, 100)] i GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)].0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
... (remaining 66 Apply account for 13555B/146355B (9.26%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
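Reading the Apply list above together with the memory profile, this function appears to compute the attention readout over the 12 attended positions: the summed energies (Composite{((i0 + i1) + i2)} at id 50) go through a Tanh, a dot with the (100, 1) match vector (id 53), a max-shifted, mask-weighted softmax (ids 55, 57, 59), and a weighted average of the attended sequence (ids 72/73). A minimal NumPy sketch of that arithmetic follows; the function and argument names are mine, only the shapes come from the profile, and the normalising division sits among the Apply nodes not listed above.

    import numpy as np

    def attention_readout(projections, v, attended, mask):
        # projections: (12, 100) summed energies, v: (100, 1) match vector,
        # attended: (12, 200) encoded sequence, mask: (12, 1) source mask.
        scores = np.tanh(projections).dot(v)              # Tanh + GpuDot22 (id 53)
        unnorm = np.exp(scores - scores.max()) * mask     # GpuCAReduce{maximum} (id 55) + Composite{exp((i0 - i1)) * i2} (id 57)
        weights = unnorm / unnorm.sum()                   # GpuCAReduce{add} (id 59); the division is in the remaining nodes
        return (weights * attended).sum(axis=0)           # GpuElemwise{Mul} (id 72) + GpuCAReduce{add} (id 73)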
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:137
Time in 176 calls to Function.__call__: 9.610200e-02s
Time in Function.fn.__call__: 9.091020e-02s (94.598%)
Time in thunks: 3.702688e-02s (38.529%)
Total compile time: 4.753222e+00s
Number of Apply nodes: 14
Theano Optimizer time: 8.387494e-02s
Theano validate time: 2.176523e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.531886e-02s
Import time 3.646135e-03s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.305s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
32.1% 32.1% 0.012s 1.69e-05s C 704 4 theano.sandbox.cuda.basic_ops.GpuElemwise
17.9% 50.0% 0.007s 1.88e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuCAReduce
12.9% 62.9% 0.005s 1.36e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuFromHost
12.9% 75.8% 0.005s 2.71e-05s C 176 1 theano.sandbox.cuda.blas.GpuGemm
12.4% 88.2% 0.005s 2.61e-05s C 176 1 theano.sandbox.cuda.blas.GpuDot22
7.8% 96.0% 0.003s 1.64e-05s C 176 1 theano.sandbox.cuda.basic_ops.HostFromGpu
4.0% 100.0% 0.001s 2.80e-06s C 528 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
12.9% 12.9% 0.005s 1.36e-05s C 352 2 GpuFromHost
12.9% 25.8% 0.005s 2.71e-05s C 176 1 GpuGemm{inplace}
12.4% 38.2% 0.005s 2.61e-05s C 176 1 GpuDot22
9.4% 47.6% 0.003s 1.98e-05s C 176 1 GpuCAReduce{maximum}{0,1}
8.7% 56.3% 0.003s 1.83e-05s C 176 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
8.5% 64.8% 0.003s 1.79e-05s C 176 1 GpuCAReduce{add}{0,1}
7.9% 72.8% 0.003s 1.67e-05s C 176 1 GpuElemwise{Add}[(0, 1)]
7.8% 80.6% 0.003s 1.64e-05s C 176 1 HostFromGpu
7.8% 88.3% 0.003s 1.64e-05s C 176 1 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
7.7% 96.0% 0.003s 1.61e-05s C 176 1 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)]
2.5% 98.5% 0.001s 2.65e-06s C 352 2 GpuDimShuffle{0,x}
1.5% 100.0% 0.001s 3.10e-06s C 176 1 GpuDimShuffle{x,0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
12.9% 12.9% 0.005s 2.71e-05s 176 4 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
12.4% 25.3% 0.005s 2.61e-05s 176 3 GpuDot22(GpuFromHost.0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
9.4% 34.7% 0.003s 1.98e-05s 176 6 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1,), strides=(0,)
8.7% 43.4% 0.003s 1.83e-05s 176 8 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}(GpuElemwise{Add}[(0, 1)].0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 44), strides=c
8.5% 51.9% 0.003s 1.79e-05s 176 9 GpuCAReduce{add}{0,1}(GpuElemwise{Composite{exp((i0 - i1))},no_inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=c
output 0: dtype=float32, shape=(1,), strides=c
7.9% 59.8% 0.003s 1.67e-05s 176 5 GpuElemwise{Add}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.8% 67.6% 0.003s 1.64e-05s 176 13 HostFromGpu(GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=c
7.8% 75.4% 0.003s 1.64e-05s 176 11 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)](GpuDimShuffle{0,x}.0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 1), strides=c
7.7% 83.1% 0.003s 1.61e-05s 176 12 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)](GpuElemwise{Add}[(0, 1)].0, GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.5% 90.6% 0.003s 1.58e-05s 176 0 GpuFromHost(generator_generate_weighted_averages)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
5.4% 96.0% 0.002s 1.15e-05s 176 1 GpuFromHost(generator_generate_states)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
1.5% 97.5% 0.001s 3.10e-06s 176 2 GpuDimShuffle{x,0}(b)
input 0: dtype=float32, shape=(44,), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
1.3% 98.7% 0.000s 2.67e-06s 176 10 GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=c
output 0: dtype=float32, shape=(1, 1), strides=c
1.3% 100.0% 0.000s 2.63e-06s 176 7 GpuDimShuffle{0,x}(GpuCAReduce{maximum}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=(0,)
output 0: dtype=float32, shape=(1, 1), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 14 Apply account for 2452B/2452B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
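The 14-node graph above (beam_search.py:137) reads as the per-step emission cost used by the beam search: project the decoder state and the weighted averages onto the 44-symbol alphabet, add the bias, and convert the logits into negative log-probabilities with a max-shifted (numerically stable) log-softmax. A NumPy sketch of the same arithmetic, with names of my own choosing and shapes taken from the profile:

    import numpy as np

    def emission_costs(states, averages, W_states, W_averages, b):
        # states: (1, 100), averages: (1, 200), W_states: (100, 44),
        # W_averages: (200, 44), b: (44,) -- shapes as printed above.
        logits = states.dot(W_states) + averages.dot(W_averages) + b  # GpuDot22 + GpuGemm + Add (ids 3, 4, 5)
        m = logits.max(axis=1, keepdims=True)                         # GpuCAReduce{maximum} (id 6)
        z = np.exp(logits - m).sum(axis=1, keepdims=True)             # Composite{exp((i0 - i1))} + GpuCAReduce{add} (ids 8, 9)
        return -(logits - (m + np.log(z)))                            # Composite{(i0 + log(i1))} and Composite{(-(i0 - i1))} (ids 11, 12)

Also visible in the Class table: the two GpuFromHost nodes plus the final HostFromGpu take roughly a fifth of the thunk time, which for a graph this small is comparable to the two matrix products themselves.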
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 call to Function.__call__: 1.907349e-05s
Time in Function.fn.__call__: 5.006790e-06s (26.250%)
Total compile time: 5.178439e+00s
Number of Apply nodes: 0
Theano Optimizer time: 5.979061e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 9.393692e-05s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.307s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
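The empty profile above only leaves the hint "try config profiling.time_thunks=1". For reference, that flag, like the allow_gc and optimizer_excluding=inplace settings the memory profiles refer to, can be set without code changes; a minimal sketch, assuming the standard THEANO_FLAGS / theano.config mechanism (the script name below is a placeholder):

    # On the command line:
    #   THEANO_FLAGS='profiling.time_thunks=1' python train.py
    # or in Python, before any function is compiled:
    import theano
    theano.config.profiling.time_thunks = True
    # theano.config.allow_gc = False   # trades the larger "allow_gc=False" peaks
    #                                  # reported above for fewer reallocations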
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 6075 calls to Function.__call__: 3.723266e-01s
Time in Function.fn.__call__: 2.196813e-01s (59.002%)
Time in thunks: 4.040527e-02s (10.852%)
Total compile time: 3.941077e+00s
Number of Apply nodes: 2
Theano Optimizer time: 7.288933e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.483917e-03s
Import time 0.000000e+00s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.307s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.040s 3.33e-06s C 12150 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.040s 3.33e-06s C 12150 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
60.4% 60.4% 0.024s 4.01e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
39.6% 100.0% 0.016s 2.64e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 2 Apply account for 192B/192B (100.00%) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/algorithms/__init__.py:253
Time in 100 calls to Function.__call__: 8.755362e+01s
Time in Function.fn.__call__: 8.736853e+01s (99.789%)
Time in thunks: 2.631522e+01s (30.056%)
Total compile time: 2.758291e+02s
Number of Apply nodes: 3579
Theano Optimizer time: 1.544500e+02s
Theano validate time: 5.072355e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.115705e+02s
Import time 1.638190e+00s
Time in all calls to theano.grad() 2.838947e+00s
Time since theano import 673.308s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
78.3% 78.3% 20.607s 2.94e-02s Py 700 7 theano.scan_module.scan_op.Scan
6.5% 84.8% 1.718s 2.05e-05s C 83700 837 theano.sandbox.cuda.basic_ops.GpuElemwise
3.9% 88.7% 1.028s 1.03e-02s Py 100 1 lvsr.ops.EditDistanceOp
2.5% 91.3% 0.661s 2.67e-05s C 24700 247 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.1% 93.3% 0.548s 7.40e-05s C 7400 74 theano.sandbox.cuda.blas.GpuDot22
1.4% 94.7% 0.367s 3.68e-06s C 99700 997 theano.tensor.elemwise.Elemwise
1.1% 95.8% 0.276s 1.73e-05s C 16000 160 theano.sandbox.cuda.basic_ops.HostFromGpu
0.6% 96.4% 0.164s 2.28e-05s Py 7200 48 theano.ifelse.IfElse
0.6% 97.0% 0.153s 2.74e-05s C 5600 56 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.5% 97.5% 0.134s 8.20e-06s C 16300 163 theano.sandbox.cuda.basic_ops.GpuReshape
0.5% 98.0% 0.129s 2.58e-05s C 5000 50 theano.sandbox.cuda.basic_ops.GpuAlloc
0.4% 98.4% 0.118s 3.42e-06s C 34600 346 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.2% 98.6% 0.056s 1.99e-05s C 2800 28 theano.compile.ops.DeepCopyOp
0.2% 98.8% 0.051s 3.83e-06s C 13300 133 theano.tensor.opt.MakeVector
0.2% 99.0% 0.047s 4.59e-06s C 10200 102 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.2% 0.039s 3.61e-06s C 10700 107 theano.compile.ops.Shape_i
0.1% 99.3% 0.037s 1.75e-05s C 2100 21 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.4% 0.031s 1.02e-04s Py 300 3 theano.sandbox.cuda.basic_ops.GpuSplit
0.1% 99.5% 0.030s 3.04e-06s C 9800 98 theano.tensor.basic.ScalarFromTensor
0.1% 99.6% 0.021s 5.34e-05s C 400 4 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
... (remaining 21 Classes account for 0.39%(0.10s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
33.1% 33.1% 8.707s 8.71e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
15.6% 48.7% 4.113s 2.06e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
13.0% 61.7% 3.412s 3.41e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
11.3% 73.0% 2.984s 2.98e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan}
5.3% 78.3% 1.390s 6.95e-03s Py 200 2 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
3.9% 82.2% 1.028s 1.03e-02s Py 100 1 EditDistanceOp
2.1% 84.3% 0.548s 7.40e-05s C 7400 74 GpuDot22
1.1% 85.3% 0.276s 1.73e-05s C 16000 160 HostFromGpu
1.0% 86.3% 0.262s 3.12e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.9% 87.2% 0.235s 2.12e-05s C 11100 111 GpuElemwise{add,no_inplace}
0.7% 87.9% 0.182s 2.12e-05s C 8600 86 GpuElemwise{sub,no_inplace}
0.6% 88.5% 0.152s 2.45e-05s Py 6200 39 if{gpu}
0.6% 89.1% 0.148s 2.28e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.5% 89.6% 0.143s 2.99e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.5% 90.1% 0.138s 2.16e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.5% 90.6% 0.128s 1.97e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.5% 91.1% 0.128s 1.88e-05s C 6800 68 GpuElemwise{Mul}[(0, 0)]
0.5% 91.6% 0.127s 2.15e-05s C 5900 59 GpuElemwise{Switch,no_inplace}
0.5% 92.1% 0.126s 1.95e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.5% 92.5% 0.121s 2.06e-05s C 5900 59 GpuElemwise{Composite{(i0 * (i1 ** i2))},no_inplace}
... (remaining 251 Ops account for 7.47%(1.96s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
33.1% 33.1% 8.707s 8.71e-02s 100 2437 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 2: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 3: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 5: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 6: dtype=float32, shape=(15, 10, 1), strides=(-10, 1, 0)
input 7: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 8: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 9: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 11: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 14: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 15: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 16: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 17: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 18: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 19: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 20: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 21: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 22: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 23: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 24: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 25: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 26: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=int64, shape=(), strides=c
input 35: dtype=float32, shape=(100, 200), strides=c
input 36: dtype=float32, shape=(200, 200), strides=c
input 37: dtype=float32, shape=(100, 100), strides=c
input 38: dtype=float32, shape=(200, 100), strides=c
input 39: dtype=float32, shape=(100, 100), strides=c
input 40: dtype=float32, shape=(200, 200), strides=(1, 200)
input 41: dtype=float32, shape=(200, 100), strides=(1, 200)
input 42: dtype=float32, shape=(100, 100), strides=(1, 100)
input 43: dtype=float32, shape=(100, 200), strides=(1, 100)
input 44: dtype=float32, shape=(100, 100), strides=(1, 100)
input 45: dtype=int64, shape=(2,), strides=c
input 46: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 47: dtype=int64, shape=(1,), strides=c
input 48: dtype=float32, shape=(12, 10), strides=(10, 1)
input 49: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 50: dtype=float32, shape=(100, 1), strides=(1, 0)
input 51: dtype=int8, shape=(10,), strides=c
input 52: dtype=float32, shape=(1, 100), strides=(0, 1)
input 53: dtype=float32, shape=(100, 200), strides=c
input 54: dtype=float32, shape=(200, 200), strides=c
input 55: dtype=float32, shape=(100, 100), strides=c
input 56: dtype=float32, shape=(200, 100), strides=c
input 57: dtype=float32, shape=(100, 100), strides=c
input 58: dtype=float32, shape=(200, 200), strides=(1, 200)
input 59: dtype=float32, shape=(200, 100), strides=(1, 200)
input 60: dtype=float32, shape=(100, 100), strides=(1, 100)
input 61: dtype=float32, shape=(100, 200), strides=(1, 100)
input 62: dtype=float32, shape=(100, 100), strides=(1, 100)
input 63: dtype=int64, shape=(2,), strides=c
input 64: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 65: dtype=int64, shape=(1,), strides=c
input 66: dtype=float32, shape=(12, 10), strides=(10, 1)
input 67: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 68: dtype=float32, shape=(100, 1), strides=(1, 0)
input 69: dtype=int8, shape=(10,), strides=c
input 70: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 5: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 6: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 9: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 12: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 13: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 14: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 15: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
output 16: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 17: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 18: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 19: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
13.0% 46.1% 3.412s 3.41e-02s 100 2149 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 6: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 7: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 8: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 9: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=(10, 1)
input 18: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=(1, 0)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=(10, 1)
input 29: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
11.3% 57.4% 2.984s 2.98e-02s 100 1850 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, G
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 10), strides=(10, 1)
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(10,), strides=c
input 23: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 0: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 10), strides=c
7.8% 65.2% 2.057s 2.06e-02s 100 2632 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtenso
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
7.8% 73.0% 2.056s 2.06e-02s 100 2631 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
3.9% 76.9% 1.028s 1.03e-02s 100 2005 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask11)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
2.6% 79.6% 0.696s 6.96e-03s 100 1642 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
2.6% 82.2% 0.694s 6.94e-03s 100 1652 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
0.0% 82.3% 0.013s 1.31e-04s 100 2467 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 82.3% 0.013s 1.31e-04s 100 2463 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 82.4% 0.013s 1.28e-04s 100 2462 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.4% 0.012s 1.25e-04s 100 2468 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.5% 0.012s 1.24e-04s 100 2547 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(1, 100)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.5% 0.012s 1.19e-04s 100 1117 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(120, 200), strides=(200, 1)
0.0% 82.5% 0.012s 1.16e-04s 100 2486 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(1, 100)
output 0: dtype=float32, shape=(120, 200), strides=(200, 1)
0.0% 82.6% 0.012s 1.16e-04s 100 2540 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(1, 100)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
0.0% 82.6% 0.012s 1.16e-04s 100 2588 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
0.0% 82.7% 0.012s 1.15e-04s 100 1143 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(120, 200), strides=(200, 1)
0.0% 82.7% 0.011s 1.10e-04s 100 2590 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
0.0% 82.8% 0.011s 1.09e-04s 100 2664 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=(120, 1)
input 1: dtype=float32, shape=(120, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
... (remaining 3559 Apply instances account for 17.24%(4.54s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 58KB (62KB)
GPU: 3739KB (5373KB)
CPU + GPU: 3797KB (5435KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 57KB (62KB)
GPU: 5605KB (6697KB)
CPU + GPU: 5662KB (6758KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 114KB
GPU: 17091KB
CPU + GPU: 17205KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
750480B [(1, 10, 100), (1, 10, 200), (2, 92160), (15, 10)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
391680B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200)] i i i i i forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0)
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6)) - i3)}}.0, max_attended_length, generator_generate_batch_size, Elemwise{add,no_inplace}.0)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i4 + i5)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i6)) - i3)}}.0, Elemwise{sub,no_inplace}.0, Elemwise{switch,no_inplace}.0, Elemwise{add,no_inplace}.0)
160000B [(200, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
160000B [(200, 200)] v Assert{msg='Theano Assert failed!'}(GpuDot22.0, Elemwise{eq,no_inplace}.0, Elemwise{eq,no_inplace}.0)
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] i GpuElemwise{Sub}[(0, 0)](W, GpuElemwise{Switch,no_inplace}.0)
160000B [(200, 200)] c GpuElemwise{Switch,no_inplace}(GpuElemwise{Composite{Cast{float32}(GT((IsNan(i0) + IsInf(i0)), i1))}}[(0, 0)].0, W, GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0)
160000B [(200, 200)] i GpuElemwise{Mul}[(0, 0)](Assert{msg='Theano Assert failed!'}.0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] i GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)](GpuDimShuffle{x,x}.0, GpuElemwise{Mul}[(0, 0)].0, GpuDimShuffle{x,x}.0, variance)
... (remaining 3559 Apply account for 46215415B/54155015B (85.34%) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
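Away from the Scan nodes, the parameter updates are recognisable in the Ops table and the memory list above: Composite{((i0 * i1) + (i2 * i3))} and Composite{((i0 * sqr(i1)) + (i2 * i3))} update the running mean and variance, Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))} applies a bias correction, and Composite{((i0 * i1) / (sqrt(i2) + i3))} forms the step, which looks like a bias-corrected Adam rule with an IsNan/IsInf Switch guard around it. A rough NumPy reconstruction that leaves the guard out; every name and default value here is a guess of mine rather than something read off the graph:

    import numpy as np

    def adam_step(W, grad, mean, variance, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
        mean = beta1 * mean + (1.0 - beta1) * grad                   # Composite{((i0 * i1) + (i2 * i3))}
        variance = beta2 * variance + (1.0 - beta2) * grad ** 2      # Composite{((i0 * sqr(i1)) + (i2 * i3))}
        lr_t = lr * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)   # bias-correction composite
        W = W - lr_t * mean / (np.sqrt(variance) + eps)              # step composite, then GpuElemwise{Sub}[(0, 0)](W, ...)
        return W, mean, variance

Either way, the Scan ops dominate this function: the forward and gradient scans account for about 78% of thunk time and EditDistanceOp for another 3.9%, so the elementwise update work is a comparatively thin slice.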
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 6.864700e-01s
Total time spent in calling the VM 6.679530e-01s (97.303%)
Total overhead (computing slices..) 1.851702e-02s (2.697%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.4% 53.4% 0.172s 3.59e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm
41.7% 95.1% 0.134s 1.87e-05s C 7200 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.9% 100.0% 0.016s 3.30e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.4% 53.4% 0.172s 3.59e-05s C 4800 4 GpuGemm{no_inplace}
15.4% 68.8% 0.050s 2.07e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
13.5% 82.3% 0.043s 1.81e-05s C 2400 2 GpuElemwise{mul,no_inplace}
12.8% 95.1% 0.041s 1.72e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
2.6% 97.7% 0.008s 3.48e-06s C 2400 2 GpuSubtensor{::, :int64:}
2.3% 100.0% 0.008s 3.13e-06s C 2400 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
14.5% 14.5% 0.047s 3.90e-05s 1200 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.8% 28.3% 0.045s 3.72e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
12.6% 40.9% 0.041s 3.38e-05s 1200 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
12.5% 53.4% 0.040s 3.37e-05s 1200 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.8% 61.2% 0.025s 2.09e-05s 1200 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.6% 68.8% 0.024s 2.04e-05s 1200 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.9% 75.7% 0.022s 1.84e-05s 1200 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.6% 82.3% 0.021s 1.78e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.5% 88.7% 0.021s 1.74e-05s 1200 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
6.4% 95.1% 0.021s 1.71e-05s 1200 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.3% 96.4% 0.004s 3.56e-06s 1200 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 97.7% 0.004s 3.40e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 98.9% 0.004s 3.22e-06s 1200 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 100.0% 0.004s 3.04e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
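The inner-loop ops above look like a single gated-recurrent (GRU-style) step: GpuGemm adds the recurrent term to the precomputed gate and candidate inputs, GpuElemwise{ScalarSigmoid} plus the two GpuSubtensor nodes split the 200-wide gate activation at Constant{100} into update and reset halves, and the Composite elemwise applies the masked state update (the two (10, 1) column inputs look like a sequence mask and its complement). A minimal NumPy sketch of such a step, assuming the usual GatedRecurrent layout; the helper and argument names below are illustrative, not taken from the gist:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_recurrent_step(h_prev, gate_inputs, inputs, W_gates, W_state, mask):
        # gate_inputs: (batch, 2*dim) precomputed input term for the gates
        # inputs:      (batch, dim)   precomputed input term for the candidate
        # mask:        (batch, 1)     1 where the sequence is still active
        dim = h_prev.shape[1]
        gates = sigmoid(gate_inputs + h_prev.dot(W_gates))            # GpuGemm + ScalarSigmoid
        update, reset = gates[:, :dim], gates[:, dim:]                # the two GpuSubtensor nodes
        candidate = np.tanh(inputs + (h_prev * reset).dot(W_state))   # GpuElemwise{mul} + GpuGemm + tanh
        h_new = candidate * update + h_prev * (1.0 - update)          # the Composite elemwise
        return mask * h_new + (1.0 - mask) * h_prev                   # masking of finished sequences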
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 39KB
CPU + GPU: 39KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
... (remaining 0 Apply account for 0B/72000B ((0.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
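The alternative memory figures above (with optimizer_excluding=inplace and with allow_gc=False) are estimates computed by the profiler itself; reports of this kind are produced by enabling Theano's profiler before compiling. A minimal sketch using standard Theano flags (the script name is illustrative):

    # On the command line:
    #   THEANO_FLAGS=profile=True,profile_memory=True python run.py
    # or, equivalently, in Python before any function is compiled:
    import theano
    theano.config.profile = True
    theano.config.profile_memory = True
    # config.allow_gc and config.optimizer_excluding can then be varied to check
    # the profiler's estimates against actual runs.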
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 6.850390e-01s
Total time spent in calling the VM 6.670289e-01s (97.371%)
Total overhead (computing slices..) 1.801014e-02s (2.629%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.5% 53.5% 0.172s 3.59e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm
41.6% 95.1% 0.134s 1.86e-05s C 7200 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.9% 100.0% 0.016s 3.28e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.5% 53.5% 0.172s 3.59e-05s C 4800 4 GpuGemm{no_inplace}
15.3% 68.8% 0.049s 2.05e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
13.5% 82.2% 0.043s 1.81e-05s C 2400 2 GpuElemwise{mul,no_inplace}
12.9% 95.1% 0.041s 1.73e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
2.6% 97.7% 0.008s 3.48e-06s C 2400 2 GpuSubtensor{::, :int64:}
2.3% 100.0% 0.007s 3.09e-06s C 2400 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
14.5% 14.5% 0.047s 3.90e-05s 1200 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.8% 28.3% 0.045s 3.71e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
12.6% 40.9% 0.041s 3.38e-05s 1200 10 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
12.6% 53.5% 0.041s 3.38e-05s 1200 11 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.7% 61.2% 0.025s 2.07e-05s 1200 12 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.6% 68.8% 0.024s 2.03e-05s 1200 13 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.8% 75.6% 0.022s 1.83e-05s 1200 8 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.6% 82.2% 0.021s 1.79e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
6.4% 88.7% 0.021s 1.73e-05s 1200 2 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
6.4% 95.1% 0.021s 1.73e-05s 1200 3 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.3% 96.4% 0.004s 3.49e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 97.7% 0.004s 3.47e-06s 1200 4 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 98.9% 0.004s 3.09e-06s 1200 5 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 100.0% 0.004s 3.08e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 39KB
CPU + GPU: 39KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
... (remaining 0 Apply account for 0B/72000B ((0.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( generator_generate_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 2.965537e+00s
Total time spent in calling the VM 2.812608e+00s (94.843%)
Total overhead (computing slices..) 1.529298e-01s (5.157%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
29.5% 29.5% 0.372s 1.91e-05s C 19500 13 theano.sandbox.cuda.basic_ops.GpuElemwise
17.2% 46.7% 0.217s 2.89e-05s C 7500 5 theano.sandbox.cuda.blas.GpuGemm
15.2% 61.9% 0.192s 2.56e-05s C 7500 5 theano.sandbox.cuda.blas.GpuDot22
11.6% 73.4% 0.146s 1.95e-05s C 7500 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
5.2% 78.6% 0.065s 4.35e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
5.2% 83.8% 0.065s 4.35e-05s C 1500 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
4.4% 88.2% 0.056s 1.87e-05s C 3000 2 theano.sandbox.cuda.basic_ops.HostFromGpu
2.2% 90.4% 0.028s 1.84e-05s C 1500 1 theano.tensor.basic.MaxAndArgmax
1.9% 92.3% 0.024s 1.59e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuFromHost
1.8% 94.1% 0.023s 2.56e-06s C 9000 6 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.4% 95.5% 0.018s 2.38e-06s C 7500 5 theano.compile.ops.Shape_i
1.2% 96.8% 0.015s 3.41e-06s C 4500 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.9% 97.7% 0.012s 1.97e-06s C 6000 4 theano.tensor.opt.MakeVector
0.8% 98.5% 0.010s 3.48e-06s C 3000 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.6% 99.1% 0.007s 2.32e-06s C 3000 2 theano.tensor.elemwise.Elemwise
0.5% 99.6% 0.007s 4.48e-06s C 1500 1 theano.sandbox.multinomial.MultinomialFromUniform
0.4% 100.0% 0.005s 3.31e-06s C 1500 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.2% 17.2% 0.217s 2.89e-05s C 7500 5 GpuGemm{inplace}
15.2% 32.4% 0.192s 2.56e-05s C 7500 5 GpuDot22
5.2% 37.6% 0.065s 4.35e-05s C 1500 1 GpuAdvancedSubtensor1
5.2% 42.7% 0.065s 4.35e-05s C 1500 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
5.0% 47.7% 0.063s 2.10e-05s C 3000 2 GpuElemwise{mul,no_inplace}
4.4% 52.2% 0.056s 1.87e-05s C 3000 2 HostFromGpu
2.8% 55.0% 0.036s 2.37e-05s C 1500 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
2.6% 57.6% 0.033s 2.17e-05s C 1500 1 GpuCAReduce{add}{1,0,0}
2.6% 60.1% 0.032s 2.15e-05s C 1500 1 GpuCAReduce{maximum}{1,0}
2.5% 62.6% 0.031s 2.09e-05s C 1500 1 GpuElemwise{add,no_inplace}
2.3% 64.9% 0.029s 1.91e-05s C 1500 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.2% 67.1% 0.028s 1.88e-05s C 1500 1 GpuCAReduce{maximum}{0,1}
2.2% 69.3% 0.028s 1.84e-05s C 1500 1 MaxAndArgmax
2.2% 71.5% 0.027s 1.83e-05s C 1500 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.2% 73.6% 0.027s 1.81e-05s C 1500 1 GpuElemwise{Add}[(0, 1)]
2.1% 75.8% 0.027s 1.80e-05s C 1500 1 GpuElemwise{Tanh}[(0, 0)]
2.1% 77.9% 0.027s 1.80e-05s C 1500 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
2.1% 80.0% 0.027s 1.78e-05s C 1500 1 GpuElemwise{TrueDiv}[(0, 0)]
2.1% 82.1% 0.027s 1.77e-05s C 1500 1 GpuCAReduce{add}{1,0}
2.1% 84.2% 0.027s 1.77e-05s C 1500 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)]
... (remaining 20 Ops account for 15.75%(0.20s) of the runtime)
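The GPU_mrg_uniform, MultinomialFromUniform, MaxAndArgmax and GpuAdvancedSubtensor1(W_copy[cuda], argmax) nodes are the pattern Theano typically emits when a scan step samples the next output from a categorical distribution and then looks up its embedding. A minimal sketch of that pattern; variable names and sizes are illustrative, chosen to match the (45, 100) embedding table above, and the real generator may differ in details:

    import numpy
    import theano
    import theano.tensor as tensor
    from theano.sandbox.rng_mrg import MRG_RandomStreams

    rng = MRG_RandomStreams(seed=2016)
    embeddings = theano.shared(
        numpy.zeros((45, 100), dtype=theano.config.floatX))   # plays the role of W_copy
    probs = tensor.matrix('probs')            # (batch, 45), each row a categorical distribution
    one_hot = rng.multinomial(pvals=probs)    # GPU_mrg_uniform + MultinomialFromUniform
    sample = tensor.argmax(one_hot, axis=1)   # MaxAndArgmax
    embedded = embeddings[sample]             # AdvancedSubtensor1 (GpuAdvancedSubtensor1 on GPU)
    f = theano.function([probs], [sample, embedded])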
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.2% 5.2% 0.065s 4.35e-05s 1500 29 GpuAdvancedSubtensor1(W_copy[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(10,), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.2% 10.3% 0.065s 4.35e-05s 1500 13 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(10,), strides=c
4.2% 14.6% 0.053s 3.55e-05s 1500 10 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.6% 18.2% 0.046s 3.06e-05s 1500 38 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.4% 21.6% 0.043s 2.90e-05s 1500 5 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.3% 25.0% 0.042s 2.79e-05s 1500 8 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
3.2% 28.1% 0.040s 2.67e-05s 1500 32 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.0% 31.1% 0.038s 2.52e-05s 1500 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.9% 34.1% 0.037s 2.47e-05s 1500 37 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.9% 37.0% 0.037s 2.46e-05s 1500 41 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.9% 39.9% 0.037s 2.44e-05s 1500 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=c
2.8% 42.7% 0.036s 2.39e-05s 1500 39 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.8% 45.6% 0.036s 2.37e-05s 1500 40 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.8% 48.3% 0.035s 2.34e-05s 1500 56 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
2.6% 50.9% 0.033s 2.17e-05s 1500 57 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.6% 53.5% 0.032s 2.15e-05s 1500 48 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=float32, shape=(10,), strides=c
2.5% 56.0% 0.031s 2.09e-05s 1500 43 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
2.4% 58.4% 0.030s 2.01e-05s 1500 25 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.3% 60.6% 0.029s 1.91e-05s 1500 33 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.2% 62.9% 0.028s 1.88e-05s 1500 18 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0)
input 0: dtype=float32, shape=(10, 44), strides=c
output 0: dtype=float32, shape=(10,), strides=c
... (remaining 38 Apply instances account for 37.13%(0.47s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 5KB (5KB)
GPU: 465KB (465KB)
CPU + GPU: 471KB (471KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 5KB (5KB)
GPU: 465KB (465KB)
CPU + GPU: 471KB (471KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 5KB
GPU: 540KB
CPU + GPU: 545KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace[cuda], GpuDimShuffle{x,0,1}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
4000B [(10, 100)] v GpuReshape{2}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
4000B [(10, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax)
4000B [(10, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
4000B [(10, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
4000B [(10, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
... (remaining 38 Apply account for 21274B/709954B ((3.00%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 3.388380e+00s
Total time spent in calling the VM 3.311884e+00s (97.742%)
Total overhead (computing slices..) 7.649612e-02s (2.258%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
37.8% 37.8% 0.518s 1.92e-05s C 27000 18 theano.sandbox.cuda.basic_ops.GpuElemwise
22.4% 60.2% 0.307s 2.56e-05s C 12000 8 theano.sandbox.cuda.blas.GpuDot22
14.3% 74.4% 0.196s 3.26e-05s C 6000 4 theano.sandbox.cuda.blas.GpuGemm
12.6% 87.0% 0.172s 1.92e-05s C 9000 6 theano.sandbox.cuda.basic_ops.GpuCAReduce
3.5% 90.5% 0.047s 1.58e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuFromHost
2.5% 93.0% 0.035s 2.56e-06s C 13500 9 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.6% 94.5% 0.021s 2.37e-06s C 9000 6 theano.compile.ops.Shape_i
1.5% 96.0% 0.020s 3.37e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.4% 97.4% 0.019s 3.24e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuReshape
1.0% 98.4% 0.014s 2.29e-06s C 6000 4 theano.tensor.elemwise.Elemwise
0.8% 99.3% 0.012s 1.94e-06s C 6000 4 theano.tensor.opt.MakeVector
0.7% 100.0% 0.010s 3.30e-06s C 3000 2 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.4% 22.4% 0.307s 2.56e-05s C 12000 8 GpuDot22
14.3% 36.7% 0.196s 3.26e-05s C 6000 4 GpuGemm{inplace}
9.1% 45.8% 0.125s 2.09e-05s C 6000 4 GpuElemwise{mul,no_inplace}
4.8% 50.6% 0.065s 2.18e-05s C 3000 2 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}
4.7% 55.3% 0.065s 2.17e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
4.5% 59.8% 0.062s 2.07e-05s C 3000 2 GpuElemwise{add,no_inplace}
4.0% 63.8% 0.055s 1.82e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
3.9% 67.7% 0.054s 1.79e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
3.9% 71.6% 0.053s 1.78e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
3.9% 75.5% 0.053s 1.77e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
3.9% 79.4% 0.053s 1.77e-05s C 3000 2 GpuElemwise{Tanh}[(0, 0)]
3.9% 83.2% 0.053s 1.76e-05s C 3000 2 GpuCAReduce{add}{1,0}
3.8% 87.0% 0.052s 1.72e-05s C 3000 2 GpuElemwise{Add}[(0, 0)]
3.5% 90.5% 0.047s 1.58e-05s C 3000 2 GpuFromHost
1.4% 91.9% 0.019s 3.24e-06s C 6000 4 GpuReshape{2}
0.8% 92.7% 0.012s 1.94e-06s C 6000 4 MakeVector{dtype='int64'}
0.8% 93.6% 0.012s 2.56e-06s C 4500 3 GpuDimShuffle{x,0}
0.8% 94.3% 0.011s 3.59e-06s C 3000 2 GpuSubtensor{::, :int64:}
0.7% 95.0% 0.009s 3.15e-06s C 3000 2 GpuSubtensor{::, int64::}
0.7% 95.7% 0.009s 3.02e-06s C 3000 2 Shape_i{1}
... (remaining 10 Ops account for 4.30%(0.06s) of the runtime)
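The GpuCAReduce{maximum}, GpuElemwise{Composite{exp((i0 - i1))}}, GpuCAReduce{add} and TrueDiv nodes, followed by the GpuElemwise{mul} with the attended sequence and GpuCAReduce{add}{1,0,0}, are a numerically stable softmax over the attention energies plus the weighted average it produces. A minimal NumPy sketch, assuming the shapes seen in this profile (time=12, batch=10, attended dim=200); the function name is illustrative:

    import numpy as np

    def attention_read(energies, attended):
        # energies: (time, batch); attended: (time, batch, dim)
        m = energies.max(axis=0)                        # GpuCAReduce{maximum}{1,0}
        w = np.exp(energies - m)                        # GpuElemwise{exp((i0 - i1))}
        w = w / w.sum(axis=0)                           # GpuCAReduce{add}{1,0} + TrueDiv
        return (w[:, :, None] * attended).sum(axis=0)   # GpuElemwise{mul} + GpuCAReduce{add}{1,0,0}

    weighted_averages = attention_read(np.random.rand(12, 10), np.random.rand(12, 10, 200))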
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
3.9% 3.9% 0.053s 3.54e-05s 1500 11 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
3.9% 7.7% 0.053s 3.54e-05s 1500 14 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
3.3% 11.0% 0.045s 2.99e-05s 1500 32 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.3% 14.3% 0.045s 2.99e-05s 1500 33 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.2% 17.5% 0.044s 2.95e-05s 1500 3 GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
3.1% 20.6% 0.043s 2.87e-05s 1500 8 GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.7% 23.3% 0.037s 2.46e-05s 1500 31 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 26.0% 0.037s 2.45e-05s 1500 30 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 28.7% 0.037s 2.44e-05s 1500 47 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=(1, 0)
2.7% 31.4% 0.037s 2.44e-05s 1500 36 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 34.0% 0.036s 2.43e-05s 1500 37 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 36.7% 0.036s 2.43e-05s 1500 46 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=(1, 0)
2.6% 39.2% 0.035s 2.35e-05s 1500 69 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
2.5% 41.8% 0.035s 2.31e-05s 1500 65 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace1[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
2.4% 44.2% 0.033s 2.21e-05s 1500 50 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=(10, 1)
output 0: dtype=float32, shape=(10,), strides=(1,)
2.4% 46.6% 0.033s 2.20e-05s 1500 34 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(10, 100), strides=(200, 1)
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 49.0% 0.032s 2.16e-05s 1500 35 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace0[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(10, 100), strides=(200, 1)
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.3% 51.3% 0.032s 2.13e-05s 1500 51 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=(10, 1)
output 0: dtype=float32, shape=(10,), strides=(1,)
2.3% 53.6% 0.031s 2.08e-05s 1500 40 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
2.2% 55.8% 0.031s 2.06e-05s 1500 41 GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 0: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
... (remaining 51 Apply instances account for 44.20%(0.61s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 118KB (118KB)
CPU + GPU: 118KB (118KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 118KB (149KB)
CPU + GPU: 118KB (149KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 345KB
CPU + GPU: 345KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace1[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace1[cuda], GpuGemm{inplace}.0)
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace0[cuda], GpuGemm{inplace}.0)
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda])
4000B [(10, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
... (remaining 51 Apply account for 53988B/613988B ((8.79%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 8.660982e+00s
Total time spent in calling the VM 8.414116e+00s (97.150%)
Total overhead (computing slices..) 2.468655e-01s (2.850%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.6% 45.6% 1.678s 1.86e-05s C 90000 60 theano.sandbox.cuda.basic_ops.GpuElemwise
17.5% 63.1% 0.643s 2.68e-05s C 24000 16 theano.sandbox.cuda.blas.GpuDot22
13.0% 76.1% 0.480s 3.20e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm
9.5% 85.7% 0.351s 1.95e-05s C 18000 12 theano.sandbox.cuda.basic_ops.GpuCAReduce
3.0% 88.7% 0.111s 1.86e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
2.4% 91.1% 0.088s 1.47e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuFromHost
2.2% 93.3% 0.081s 2.47e-06s C 33000 22 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.4% 94.8% 0.053s 1.77e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.4% 96.2% 0.052s 3.47e-06s C 15000 10 theano.sandbox.cuda.basic_ops.GpuReshape
1.1% 97.3% 0.040s 2.25e-06s C 18000 12 theano.compile.ops.Shape_i
0.8% 98.1% 0.030s 2.52e-06s C 12000 8 theano.tensor.elemwise.Elemwise
0.7% 98.8% 0.025s 2.12e-06s C 12000 8 theano.tensor.opt.MakeVector
0.6% 99.4% 0.023s 3.89e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.6% 100.0% 0.021s 3.58e-06s C 6000 4 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.5% 17.5% 0.643s 2.68e-05s C 24000 16 GpuDot22
10.2% 27.7% 0.374s 3.12e-05s C 12000 8 GpuGemm{inplace}
7.7% 35.4% 0.285s 1.90e-05s C 15000 10 GpuElemwise{mul,no_inplace}
4.6% 40.0% 0.168s 1.86e-05s C 9000 6 GpuElemwise{add,no_inplace}
4.2% 44.2% 0.156s 1.73e-05s C 9000 6 GpuCAReduce{add}{1,0}
3.7% 47.9% 0.136s 1.81e-05s C 7500 5 GpuElemwise{Add}[(0, 1)]
3.5% 51.4% 0.128s 1.71e-05s C 7500 5 GpuElemwise{Add}[(0, 0)]
2.9% 54.3% 0.106s 3.53e-05s C 3000 2 GpuGemm{no_inplace}
2.4% 56.7% 0.088s 1.47e-05s C 6000 4 GpuFromHost
2.3% 58.9% 0.084s 2.81e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
1.9% 60.9% 0.071s 2.36e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
1.9% 62.8% 0.070s 2.33e-05s C 3000 2 GpuCAReduce{add}{0,0,1}
1.7% 64.5% 0.063s 2.09e-05s C 3000 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
1.7% 66.2% 0.062s 2.06e-05s C 3000 2 GpuElemwise{Composite{((((i0 / i1) + i2) * i3) * i4)}}[(0, 0)]
1.6% 67.8% 0.060s 1.99e-05s C 3000 2 GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)]
1.6% 69.3% 0.057s 1.91e-05s C 3000 2 GpuIncSubtensor{InplaceInc;::, int64::}
1.5% 70.9% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{((-(i0 * i1)) / i2)},no_inplace}
1.5% 72.4% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
1.5% 73.9% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Composite{(i0 + (i1 * i2 * i3))}}[(0, 0)]
1.5% 75.4% 0.054s 1.81e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
... (remaining 30 Ops account for 24.59%(0.90s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
1.5% 1.5% 0.055s 3.65e-05s 1500 26 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.5% 3.0% 0.054s 3.61e-05s 1500 35 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.5% 4.4% 0.054s 3.57e-05s 1500 146 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states1[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.4% 5.9% 0.053s 3.54e-05s 1500 166 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=c
1.4% 7.3% 0.053s 3.53e-05s 1500 168 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=c
1.4% 8.7% 0.052s 3.48e-05s 1500 147 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states0[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 10.0% 0.047s 3.15e-05s 1500 80 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.3% 11.3% 0.047s 3.15e-05s 1500 82 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 12.5% 0.046s 3.04e-05s 1500 167 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 13.8% 0.045s 3.02e-05s 1500 169 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.2% 15.0% 0.044s 2.96e-05s 1500 2 GpuDot22(transition_apply_states_replace1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.2% 16.1% 0.043s 2.88e-05s 1500 15 GpuDot22(transition_apply_states_replace0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.2% 17.3% 0.043s 2.86e-05s 1500 116 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
input 0: dtype=float32, shape=(1, 10, 200), strides=c
input 1: dtype=float32, shape=(12, 10, 1), strides=c
input 2: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.1% 18.4% 0.042s 2.77e-05s 1500 117 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
input 0: dtype=float32, shape=(1, 10, 200), strides=c
input 1: dtype=float32, shape=(12, 10, 1), strides=c
input 2: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.1% 19.5% 0.041s 2.76e-05s 1500 133 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=c
input 1: dtype=float32, shape=(120, 1), strides=c
output 0: dtype=float32, shape=(100, 1), strides=c
1.1% 20.7% 0.041s 2.75e-05s 1500 131 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=c
input 1: dtype=float32, shape=(120, 1), strides=c
output 0: dtype=float32, shape=(100, 1), strides=c
1.1% 21.8% 0.041s 2.71e-05s 1500 21 GpuDot22(transform_states_apply_input__replace0[cuda], W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 22.9% 0.041s 2.70e-05s 1500 8 GpuDot22(transform_states_apply_input__replace1[cuda], W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.1% 24.0% 0.040s 2.67e-05s 1500 172 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.1% 25.1% 0.040s 2.67e-05s 1500 170 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
... (remaining 156 Apply instances account for 74.95%(2.76s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 275KB (376KB)
CPU + GPU: 275KB (377KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 299KB (377KB)
CPU + GPU: 299KB (378KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 890KB
CPU + GPU: 890KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuElemwise{tanh,no_inplace}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] c GpuElemwise{tanh,no_inplace}(GpuReshape{2}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuElemwise{tanh,no_inplace}.0)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)](GpuReshape{3}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{add,no_inplace}.0)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] c GpuElemwise{tanh,no_inplace}(GpuReshape{2}.0)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)](GpuReshape{3}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{add,no_inplace}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(tanh(i2))))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
... (remaining 156 Apply account for 346392B/1498392B ((23.12%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
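
The memory profile above also reports what the peak would be under two alternative settings (optimizer_excluding=inplace and allow_gc=False). A minimal sketch of how those variants could be tried when re-running the profiling, assuming the reports in this gist come from Theano's built-in profiler driven by Theano flags (the actual training entry point is not part of this gist, and "run_training.py" below is a hypothetical name):

    import os

    # Flags must be set before the first "import theano".
    # profile/profile_memory produce reports like the ones in this gist;
    # the commented-out flags correspond to the alternative rows of the
    # memory profile above.
    os.environ['THEANO_FLAGS'] = ','.join([
        'profile=True',
        'profile_memory=True',
        # 'optimizer_excluding=inplace',  # "optimizer_excluding=inplace" row
        # 'allow_gc=False',               # "allow_gc=False" row (~890KB GPU peak here)
    ])

    import theano

The same flags can equivalently be passed on the command line, e.g. THEANO_FLAGS='profile=True,allow_gc=False' python run_training.py.
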
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.039070e+00s
Total time spent in calling the VM 1.921504e+00s (94.234%)
Total overhead (computing slices..) 1.175666e-01s (5.766%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
47.2% 47.2% 0.424s 1.77e-05s C 24000 20 theano.sandbox.cuda.basic_ops.GpuElemwise
27.9% 75.1% 0.251s 3.48e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
9.8% 84.8% 0.088s 1.83e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
6.7% 91.6% 0.060s 2.52e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
4.6% 96.2% 0.041s 1.73e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
2.0% 98.1% 0.018s 3.69e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.2% 99.4% 0.011s 2.34e-06s C 4800 4 theano.compile.ops.Shape_i
0.6% 100.0% 0.005s 2.25e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
19.8% 19.8% 0.178s 3.71e-05s C 4800 4 GpuGemm{no_inplace}
14.5% 34.3% 0.130s 1.81e-05s C 7200 6 GpuElemwise{mul,no_inplace}
8.1% 42.4% 0.073s 3.03e-05s C 2400 2 GpuGemm{inplace}
6.7% 49.1% 0.060s 2.52e-05s C 2400 2 GpuDot22
5.4% 54.5% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
5.0% 59.5% 0.045s 1.89e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
4.9% 64.5% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.7% 69.2% 0.043s 1.77e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
4.6% 73.8% 0.042s 1.74e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.6% 78.4% 0.041s 1.73e-05s C 2400 2 GpuAlloc{memset_0=True}
4.5% 83.0% 0.041s 1.70e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
4.5% 87.5% 0.041s 1.69e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
4.4% 91.9% 0.040s 1.65e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
4.3% 96.2% 0.038s 1.60e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
1.0% 97.2% 0.009s 3.79e-06s C 2400 2 GpuSubtensor{::, int64::}
1.0% 98.1% 0.009s 3.60e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.6% 98.8% 0.006s 2.41e-06s C 2400 2 Shape_i{1}
0.6% 99.4% 0.005s 2.27e-06s C 2400 2 Shape_i{0}
0.6% 100.0% 0.005s 2.25e-06s C 2400 2 GpuDimShuffle{1,0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.4% 5.4% 0.049s 4.07e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.2% 10.6% 0.046s 3.86e-05s 1200 6 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
4.6% 15.2% 0.041s 3.45e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.6% 19.8% 0.041s 3.45e-05s 1200 18 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.0% 23.9% 0.036s 3.03e-05s 1200 40 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.0% 27.9% 0.036s 3.03e-05s 1200 41 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.4% 31.3% 0.030s 2.52e-05s 1200 28 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
3.3% 34.6% 0.030s 2.51e-05s 1200 29 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 37.3% 0.024s 2.02e-05s 1200 42 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.7% 40.0% 0.024s 2.00e-05s 1200 43 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.6% 42.5% 0.023s 1.91e-05s 1200 34 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.5% 45.0% 0.022s 1.87e-05s 1200 24 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.5% 47.5% 0.022s 1.86e-05s 1200 35 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.5% 50.0% 0.022s 1.85e-05s 1200 16 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 52.4% 0.022s 1.83e-05s 1200 25 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 54.9% 0.022s 1.83e-05s 1200 17 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 57.3% 0.022s 1.81e-05s 1200 3 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 59.7% 0.022s 1.79e-05s 1200 31 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 62.1% 0.021s 1.79e-05s 1200 7 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.4% 64.5% 0.021s 1.79e-05s 1200 30 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
... (remaining 24 Apply instances account for 35.55%(0.32s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(100, 10)] v GpuDimShuffle{1,0}(GpuElemwise{mul,no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
... (remaining 24 Apply nodes account for 80032B/208032B (38.47%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
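
Most of the time inside this scan goes to GpuElemwise and GpuGemm on very small per-step operands (batch 10, hidden size 100). A rough back-of-envelope from the top Apply entry above, assuming the conventional 2*m*k*n flop count for a gemm and ignoring the accumulate term, gives a sense of the achieved throughput:

    # Rough estimate from the profile numbers above, not an independent measurement.
    m, k, n = 10, 100, 200                 # gemm operands: (10, 100) x (100, 200)
    time_per_call = 4.07e-5                # seconds per call for Apply id 2
    flops = 2.0 * m * k * n                # multiply-adds for C = A.dot(B) + C
    print('Mflop per call:', flops / 1e6)            # ~0.4
    print('Gflop/s:', flops / time_per_call / 1e9)   # ~9.8

At roughly 10 Gflop/s these calls are nowhere near GPU peak, which suggests the per-step cost is dominated by kernel-launch and BLAS overhead on tiny matrices rather than by arithmetic; a larger batch size would be expected to improve the ratio.
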
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.040347e+00s
Total time spent in calling the VM 1.923158e+00s (94.256%)
Total overhead (computing slices..) 1.171889e-01s (5.744%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
47.2% 47.2% 0.424s 1.77e-05s C 24000 20 theano.sandbox.cuda.basic_ops.GpuElemwise
27.9% 75.1% 0.251s 3.49e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
9.8% 84.9% 0.088s 1.83e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
6.7% 91.6% 0.060s 2.51e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
4.6% 96.2% 0.041s 1.73e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
2.0% 98.2% 0.018s 3.68e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.2% 99.4% 0.011s 2.29e-06s C 4800 4 theano.compile.ops.Shape_i
0.6% 100.0% 0.005s 2.28e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
19.8% 19.8% 0.178s 3.71e-05s C 4800 4 GpuGemm{no_inplace}
14.5% 34.3% 0.130s 1.81e-05s C 7200 6 GpuElemwise{mul,no_inplace}
8.1% 42.4% 0.073s 3.04e-05s C 2400 2 GpuGemm{inplace}
6.7% 49.1% 0.060s 2.51e-05s C 2400 2 GpuDot22
5.4% 54.5% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
5.0% 59.5% 0.045s 1.89e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
5.0% 64.5% 0.045s 1.86e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.7% 69.2% 0.042s 1.77e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
4.6% 73.9% 0.042s 1.73e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.6% 78.5% 0.041s 1.73e-05s C 2400 2 GpuAlloc{memset_0=True}
4.5% 83.0% 0.041s 1.70e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
4.5% 87.5% 0.040s 1.68e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
4.4% 91.9% 0.040s 1.66e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
4.3% 96.2% 0.039s 1.61e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
1.0% 97.2% 0.009s 3.78e-06s C 2400 2 GpuSubtensor{::, int64::}
1.0% 98.2% 0.009s 3.59e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.6% 98.8% 0.006s 2.37e-06s C 2400 2 Shape_i{1}
0.6% 99.4% 0.005s 2.28e-06s C 2400 2 GpuDimShuffle{1,0}
0.6% 100.0% 0.005s 2.21e-06s C 2400 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.4% 5.4% 0.049s 4.08e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
5.2% 10.6% 0.046s 3.86e-05s 1200 6 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.6% 15.2% 0.041s 3.45e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.6% 19.8% 0.041s 3.45e-05s 1200 18 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.1% 23.9% 0.036s 3.04e-05s 1200 40 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.1% 27.9% 0.036s 3.04e-05s 1200 41 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.4% 31.3% 0.030s 2.51e-05s 1200 28 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 34.6% 0.030s 2.51e-05s 1200 29 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 37.3% 0.024s 2.02e-05s 1200 42 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 40.0% 0.024s 2.01e-05s 1200 43 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=c
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.6% 42.6% 0.023s 1.91e-05s 1200 34 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.5% 45.1% 0.023s 1.88e-05s 1200 24 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.5% 47.6% 0.022s 1.86e-05s 1200 35 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.5% 50.0% 0.022s 1.84e-05s 1200 25 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.5% 52.5% 0.022s 1.84e-05s 1200 16 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 54.9% 0.022s 1.82e-05s 1200 17 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 57.3% 0.022s 1.80e-05s 1200 3 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 59.7% 0.022s 1.79e-05s 1200 30 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 62.1% 0.021s 1.79e-05s 1200 31 GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 64.5% 0.021s 1.79e-05s 1200 7 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 24 Apply instances account for 35.49%(0.32s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)](GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, gatedrecurrent_apply_states_replace0[cuda])
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
... (remaining 24 Apply nodes account for 80032B/208032B (38.47%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
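
The final block below is the aggregate Theano prints at process exit over all 17 compiled functions. For reference, a minimal stand-alone sketch of how per-function and at-exit reports of this kind are usually obtained with plain Theano (Blocks compiles its functions internally, so this is an illustration rather than the code behind this gist):

    import numpy
    import theano
    import theano.tensor as tt

    # Toy function, unrelated to the model profiled in this gist.
    x = tt.matrix('x')
    f = theano.function([x], tt.tanh(x).sum(), profile=True)
    f(numpy.ones((10, 100), dtype=theano.config.floatX))
    f.profile.summary()  # prints a per-function report in the format used above

    # Alternatively, THEANO_FLAGS=profile=True (or theano.config.profile = True
    # before compiling) makes Theano print every function's report plus the
    # "Sum of all printed profiles at exit" aggregate when the process ends.
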
Function profiling
==================
Message: Sum of all (17) printed profiles at exit, excluding Scan op profiles.
Time in 6938 calls to Function.__call__: 1.007439e+02s
Time in Function.fn.__call__: 1.003767e+02s (99.635%)
Time in thunks: 3.835574e+01s (38.073%)
Total compile time: 3.784477e+02s
Number of Apply nodes: 0
Theano Optimizer time: 1.654243e+02s
Theano validate time: 5.543999e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.313228e+02s
Import time 2.099285e+00s
Time in all call to theano.grad() 2.838947e+00s
Time since theano import 676.605s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
61.4% 61.4% 23.536s 2.79e-02s Py 844 11 theano.scan_module.scan_op.Scan
25.3% 86.7% 9.712s 6.03e-02s Py 161 2 lvsr.ops.EditDistanceOp
4.7% 91.3% 1.787s 2.06e-05s C 86853 879 theano.sandbox.cuda.basic_ops.GpuElemwise
1.8% 93.1% 0.678s 2.65e-05s C 25580 252 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.7% 94.8% 0.642s 7.29e-05s C 8805 89 theano.sandbox.cuda.blas.GpuDot22
1.0% 95.8% 0.395s 3.60e-06s C 109687 1234 theano.tensor.elemwise.Elemwise
0.8% 96.6% 0.297s 1.72e-05s C 17247 197 theano.sandbox.cuda.basic_ops.HostFromGpu
0.4% 97.0% 0.166s 2.21e-05s Py 7505 51 theano.ifelse.IfElse
0.4% 97.4% 0.161s 2.71e-05s C 5927 63 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.8% 0.142s 7.60e-06s C 18640 198 theano.sandbox.cuda.basic_ops.GpuReshape
0.4% 98.2% 0.138s 2.62e-05s C 5266 56 theano.sandbox.cuda.basic_ops.GpuAlloc
0.3% 98.5% 0.127s 3.37e-06s C 37733 384 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 98.8% 0.118s 7.43e-06s C 15813 114 theano.compile.ops.DeepCopyOp
0.1% 99.0% 0.057s 3.66e-06s C 15701 169 theano.tensor.opt.MakeVector
0.1% 99.1% 0.054s 1.60e-05s C 3393 29 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.2% 0.050s 4.52e-06s C 11167 119 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.4% 0.048s 3.42e-06s C 14141 158 theano.compile.ops.Shape_i
0.1% 99.4% 0.034s 5.30e-05s C 648 7 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.1% 99.5% 0.033s 2.96e-06s C 10969 127 theano.tensor.basic.ScalarFromTensor
0.1% 99.6% 0.032s 8.55e-05s C 372 5 theano.sandbox.cuda.basic_ops.GpuJoin
... (remaining 22 Classes account for 0.38%(0.15s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
25.3% 25.3% 9.712s 6.03e-02s Py 161 2 EditDistanceOp
22.7% 48.0% 8.707s 8.71e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
13.7% 61.8% 5.270s 3.27e-02s Py 161 2 forall_inplace,gpu,generator_generate_scan}
10.7% 72.5% 4.113s 2.06e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
8.9% 81.4% 3.412s 3.41e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
5.1% 86.5% 1.957s 7.50e-03s Py 261 3 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
1.7% 88.2% 0.642s 7.29e-05s C 8805 89 GpuDot22
0.8% 88.9% 0.297s 1.72e-05s C 17247 197 HostFromGpu
0.7% 89.6% 0.262s 3.12e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.6% 90.2% 0.235s 2.12e-05s C 11100 111 GpuElemwise{add,no_inplace}
0.5% 90.7% 0.186s 2.12e-05s C 8783 89 GpuElemwise{sub,no_inplace}
0.4% 91.1% 0.152s 2.45e-05s Py 6200 39 if{gpu}
0.4% 91.5% 0.148s 2.28e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.4% 91.9% 0.143s 2.99e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.4% 92.2% 0.138s 2.16e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.3% 92.6% 0.128s 1.97e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.3% 92.9% 0.128s 1.88e-05s C 6800 68 GpuElemwise{Mul}[(0, 0)]
0.3% 93.2% 0.127s 2.15e-05s C 5900 59 GpuElemwise{Switch,no_inplace}
0.3% 93.6% 0.126s 1.95e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.3% 93.9% 0.121s 2.06e-05s C 5900 59 GpuElemwise{Composite{(i0 * (i1 ** i2))},no_inplace}
... (remaining 321 Ops account for 6.12%(2.35s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
22.7% 22.7% 8.707s 8.71e-02s 100 2437 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 2: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 3: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 5: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 6: dtype=float32, shape=(15, 10, 1), strides=(-10, 1, 0)
input 7: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 8: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 9: dtype=float32, shape=(15, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 11: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=float32, shape=(15, 10, 100), strides=(-1000, 100, 1)
input 14: dtype=float32, shape=(15, 10, 200), strides=(-2000, 200, 1)
input 15: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 16: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 17: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 18: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
input 19: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
input 20: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 21: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 22: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 23: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 24: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
input 25: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
input 26: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=int64, shape=(), strides=c
input 35: dtype=float32, shape=(100, 200), strides=c
input 36: dtype=float32, shape=(200, 200), strides=c
input 37: dtype=float32, shape=(100, 100), strides=c
input 38: dtype=float32, shape=(200, 100), strides=c
input 39: dtype=float32, shape=(100, 100), strides=c
input 40: dtype=float32, shape=(200, 200), strides=(1, 200)
input 41: dtype=float32, shape=(200, 100), strides=(1, 200)
input 42: dtype=float32, shape=(100, 100), strides=(1, 100)
input 43: dtype=float32, shape=(100, 200), strides=(1, 100)
input 44: dtype=float32, shape=(100, 100), strides=(1, 100)
input 45: dtype=int64, shape=(2,), strides=c
input 46: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 47: dtype=int64, shape=(1,), strides=c
input 48: dtype=float32, shape=(12, 10), strides=(10, 1)
input 49: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 50: dtype=float32, shape=(100, 1), strides=(1, 0)
input 51: dtype=int8, shape=(10,), strides=c
input 52: dtype=float32, shape=(1, 100), strides=(0, 1)
input 53: dtype=float32, shape=(100, 200), strides=c
input 54: dtype=float32, shape=(200, 200), strides=c
input 55: dtype=float32, shape=(100, 100), strides=c
input 56: dtype=float32, shape=(200, 100), strides=c
input 57: dtype=float32, shape=(100, 100), strides=c
input 58: dtype=float32, shape=(200, 200), strides=(1, 200)
input 59: dtype=float32, shape=(200, 100), strides=(1, 200)
input 60: dtype=float32, shape=(100, 100), strides=(1, 100)
input 61: dtype=float32, shape=(100, 200), strides=(1, 100)
input 62: dtype=float32, shape=(100, 100), strides=(1, 100)
input 63: dtype=int64, shape=(2,), strides=c
input 64: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 65: dtype=int64, shape=(1,), strides=c
input 66: dtype=float32, shape=(12, 10), strides=(10, 1)
input 67: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 68: dtype=float32, shape=(100, 1), strides=(1, 0)
input 69: dtype=int8, shape=(10,), strides=c
input 70: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(-1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(-2000, 200, 1)
output 5: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 6: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 9: dtype=float32, shape=(2, 100, 1), strides=(100, 1, 0)
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=(24000, 2000, 200, 1)
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=(12000, 1000, 100, 1)
output 12: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 13: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 14: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 15: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
output 16: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 17: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
output 18: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
output 19: dtype=float32, shape=(15, 100, 10), strides=(1000, 10, 1)
22.6% 45.3% 8.684s 1.42e-01s 61 269 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
8.9% 54.2% 3.412s 3.41e-02s 100 2149 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Elemwise{Composite{maximum(minimum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum(maximum((i0 - i1), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), (i3 - i1)), (i0 - i1)), (i2 - i1)), (i3 - i1)), (i0 - i1)), (i3 - i1)), i4), i1)}}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(15, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(15, 10, 200), strides=(2000, 200, 1)
input 6: dtype=float32, shape=(15, 10, 100), strides=(1000, 100, 1)
input 7: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 8: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 9: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
input 10: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=(10, 1)
input 18: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=(1, 0)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=(10, 1)
input 29: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
output 2: dtype=float32, shape=(16, 10, 12), strides=(120, 12, 1)
output 3: dtype=float32, shape=(16, 10, 100), strides=(1000, 100, 1)
output 4: dtype=float32, shape=(16, 10, 200), strides=(2000, 200, 1)
7.8% 62.0% 2.984s 2.98e-02s 100 1850 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, G
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 10), strides=(10, 1)
input 20: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(10,), strides=c
input 23: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 0: dtype=float32, shape=(1, 10, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 10, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 10), strides=c
6.0% 68.0% 2.286s 3.75e-02s 61 260 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
5.4% 73.3% 2.057s 2.06e-02s 100 2632 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtenso
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
5.4% 78.7% 2.056s 2.06e-02s 100 2631 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 4: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 7: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 8: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 9: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 10: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 11: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 12: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=int64, shape=(), strides=c
input 18: dtype=int64, shape=(), strides=c
input 19: dtype=float32, shape=(100, 200), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(200, 100), strides=(1, 200)
input 22: dtype=float32, shape=(100, 100), strides=(1, 100)
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
input 25: dtype=float32, shape=(200, 100), strides=(1, 200)
input 26: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(-1000, 100, 1)
output 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 3: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 4: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
output 5: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
output 6: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
output 7: dtype=float32, shape=(12, 100, 10), strides=(1000, 10, 1)
2.7% 81.4% 1.028s 1.03e-02s 100 2005 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask11)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
1.8% 83.2% 0.696s 6.96e-03s 100 1642 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
1.8% 85.0% 0.694s 6.94e-03s 100 1652 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=(2000, 200, 1)
input 2: dtype=float32, shape=(12, 10, 100), strides=(1000, 100, 1)
input 3: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 4: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 5: dtype=float32, shape=(12, 10, 200), strides=(-2000, 200, 1)
input 6: dtype=float32, shape=(12, 10, 100), strides=(-1000, 100, 1)
input 7: dtype=float32, shape=(12, 10, 1), strides=(-10, 1, 0)
input 8: dtype=float32, shape=(12, 10, 1), strides=(10, 1, 0)
input 9: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 10: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
output 1: dtype=float32, shape=(13, 10, 100), strides=(1000, 100, 1)
1.5% 86.5% 0.567s 9.29e-03s 61 247 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 5: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 6: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 7: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 8: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 9: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 10: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.1% 86.6% 0.039s 3.52e-03s 11 133 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 86.7% 0.038s 3.46e-03s 11 175 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 86.7% 0.024s 4.01e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 86.8% 0.019s 3.10e-04s 61 140 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 86.8% 0.018s 3.03e-04s 61 142 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 86.9% 0.016s 2.64e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 86.9% 0.013s 1.31e-04s 100 2467 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 87.0% 0.013s 1.31e-04s 100 2463 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(200, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(200, 200), strides=(200, 1)
0.0% 87.0% 0.013s 1.28e-04s 100 2462 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 150), strides=(150, 1)
input 1: dtype=float32, shape=(150, 200), strides=(200, 1)
output 0: dtype=float32, shape=(100, 200), strides=(200, 1)
   ... (remaining 4255 Apply instances account for 13.01% (4.99s) of the runtime)
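For reference, a per-Class/Ops/Apply breakdown of this kind can be reproduced with Theano's built-in profiling hooks. The snippet below is a minimal sketch, assuming the classic Theano 0.x API; the variables (x, W, f) and their shapes are illustrative placeholders, not taken from the profiled model.

    import numpy
    import theano
    import theano.tensor as T

    # Toy graph; names and shapes are placeholders, not the profiled model.
    x = T.fmatrix('x')
    W = theano.shared(numpy.random.randn(100, 200).astype('float32'), name='W')
    y = T.tanh(T.dot(x, W))

    # profile=True attaches a ProfileStats object to the compiled function.
    # The same effect can be obtained globally with
    # THEANO_FLAGS="profile=True,profile_memory=True"
    # (profile_memory also enables the Memory Profile section below).
    f = theano.function([x], y, profile=True)

    for _ in range(100):
        f(numpy.random.randn(10, 100).astype('float32'))

    # Prints the Class / Ops / Apply breakdown in the same format as above.
    f.profile.summary()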
Memory Profile (the max across all functions in the profile)
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 58KB (62KB)
GPU: 3739KB (5373KB)
CPU + GPU: 3797KB (5435KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 57KB (62KB)
GPU: 5605KB (6697KB)
CPU + GPU: 5662KB (6758KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 114KB
GPU: 17091KB
CPU + GPU: 17205KB
---
This list is based on all functions in the profile
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(recognizer_generate_n_steps000000000111111111, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, recognizer_generate_n_steps000000000111111111, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
750480B [(1, 10, 100), (1, 10, 200), (2, 92160), (15, 10)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps000000000111111111, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps000000000111111111, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(900, 200)] v GpuReshape{2}(GpuJoin.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
   ... (remaining 4255 Apply nodes account for 58951889B/73960729B (79.71%) of the dense Apply output sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
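The memory summary above refers to two Theano settings (optimizer_excluding=inplace and allow_gc=False) as well as to DebugMode. The lines below are a hedged sketch of how those settings could be tried; the script name train.py is a placeholder, and whether any of them helps depends on the actual model.

    import theano

    # Keep intermediate results alive instead of garbage-collecting them
    # between calls; this is the "allow_gc=False" scenario from the memory
    # summary (a higher GPU peak, but fewer reallocations between calls).
    #   THEANO_FLAGS="allow_gc=False" python train.py   # train.py is a placeholder
    theano.config.allow_gc = False  # should have the same effect if set
                                    # before functions are compiled and run

    # Reproduce the "optimizer_excluding=inplace" numbers by disabling the
    # inplace optimizations at compile time:
    #   THEANO_FLAGS="optimizer_excluding=inplace" python train.py

    # DebugMode (very slow) warns when nodes marked 'inplace'/'view' still
    # allocate memory, as noted above:
    # f_dbg = theano.function(inputs, outputs, mode='DebugMode')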
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.