Old profile
===========
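For context, a minimal sketch of how a Theano function profile like the reports below is usually produced. The graph, names, and shapes here are illustrative only and not taken from the run above:

    import numpy
    import theano
    import theano.tensor as T

    # Profiling can also be enabled globally with
    # THEANO_FLAGS="profile=True,profile_memory=True".
    x = T.matrix('x')
    w = theano.shared(numpy.random.randn(10, 5).astype('float32'), name='w')
    y = T.dot(x, w).sum()

    # profile=True attaches a profiler to this function; Theano prints the
    # per-function and aggregated summaries (like the report below) when the
    # process exits.
    f = theano.function([x], y, profile=True)
    f(numpy.random.randn(4, 10).astype('float32'))
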
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.984119e-03s
Time in Function.fn.__call__: 8.468628e-04s (42.682%)
Total compile time: 5.483155e+00s
Number of Apply nodes: 0
Theano Optimizer time: 1.670289e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.310276e-04s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.781s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
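The report above accumulated no per-thunk execution time; its own hint points at the profiling.time_thunks config flag. A minimal, hedged way to turn it on, assuming it is set before the profiled functions are compiled:

    import theano
    # Equivalent to THEANO_FLAGS="profiling.time_thunks=True"; records the
    # time spent in each thunk so the Class/Ops/Apply tables get filled in.
    theano.config.profiling.time_thunks = True
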
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 11 calls to Function.__call__: 2.355814e-02s
Time in Function.fn.__call__: 2.024937e-02s (85.955%)
Time in thunks: 9.337664e-03s (39.637%)
Total compile time: 6.343132e+00s
Number of Apply nodes: 43
Theano Optimizer time: 3.600280e-01s
Theano validate time: 2.064705e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.223059e-01s
Import time 3.409195e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.781s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.009s 1.97e-05s C 473 43 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.009s 1.97e-05s C 473 43 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.8% 4.8% 0.000s 4.09e-05s 11 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.8% 7.6% 0.000s 2.36e-05s 11 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.8% 10.4% 0.000s 2.34e-05s 11 25 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.7% 13.0% 0.000s 2.27e-05s 11 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 15.7% 0.000s 2.23e-05s 11 27 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 18.3% 0.000s 2.21e-05s 11 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 20.9% 0.000s 2.21e-05s 11 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 23.5% 0.000s 2.19e-05s 11 32 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 26.0% 0.000s 2.19e-05s 11 17 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.6% 28.6% 0.000s 2.17e-05s 11 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 31.1% 0.000s 2.16e-05s 11 24 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 33.7% 0.000s 2.15e-05s 11 31 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 36.2% 0.000s 2.15e-05s 11 29 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 38.7% 0.000s 2.14e-05s 11 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 41.2% 0.000s 2.11e-05s 11 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 43.7% 0.000s 2.10e-05s 11 28 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 46.2% 0.000s 2.10e-05s 11 36 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 48.6% 0.000s 2.09e-05s 11 33 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 51.1% 0.000s 2.09e-05s 11 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
2.5% 53.5% 0.000s 2.09e-05s 11 35 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 23 Apply instances account for 46.46%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 43 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
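The memory profile above compares the current settings against the Theano flags optimizer_excluding=inplace and allow_gc=False. A minimal sketch of trying the allow_gc variant; it is an assumption that this is set before compilation, and it can equally be passed through THEANO_FLAGS:

    import theano
    # Equivalent to THEANO_FLAGS="allow_gc=False"; keeps intermediate
    # results allocated between calls, trading memory for speed.
    theano.config.allow_gc = False
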
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 10 calls to Function.__call__: 1.226211e-02s
Time in Function.fn.__call__: 1.183033e-02s (96.479%)
Time in thunks: 4.946470e-03s (40.339%)
Total compile time: 6.681131e+00s
Number of Apply nodes: 29
Theano Optimizer time: 1.198421e-01s
Theano validate time: 2.441406e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.311059e-01s
Import time 6.275487e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.787s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
52.6% 52.6% 0.003s 1.73e-05s C 150 15 theano.sandbox.cuda.basic_ops.HostFromGpu
44.3% 96.8% 0.002s 2.43e-05s C 90 9 theano.sandbox.cuda.basic_ops.GpuElemwise
3.2% 100.0% 0.000s 3.13e-06s C 50 5 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
52.6% 52.6% 0.003s 1.73e-05s C 150 15 HostFromGpu
44.3% 96.8% 0.002s 2.43e-05s C 90 9 GpuElemwise{true_div,no_inplace}
3.2% 100.0% 0.000s 3.13e-06s C 50 5 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
10.8% 10.8% 0.001s 5.32e-05s 10 0 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
6.2% 17.0% 0.000s 3.09e-05s 10 15 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
4.4% 21.4% 0.000s 2.20e-05s 10 13 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.3% 25.8% 0.000s 2.14e-05s 10 1 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_critic_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.2% 29.9% 0.000s 2.06e-05s 10 12 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 34.1% 0.000s 2.05e-05s 10 2 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_actor_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 38.2% 0.000s 2.05e-05s 10 4 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean2_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 42.3% 0.000s 2.03e-05s 10 5 GpuElemwise{true_div,no_inplace}(shared_mean_last_character_cost, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 46.4% 0.000s 2.03e-05s 10 3 GpuElemwise{true_div,no_inplace}(shared_readout_costs_max_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
4.1% 50.5% 0.000s 2.01e-05s 10 6 GpuElemwise{true_div,no_inplace}(shared_readout_costs_mean_expected_reward, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 54.1% 0.000s 1.77e-05s 10 19 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.5% 57.5% 0.000s 1.72e-05s 10 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 61.0% 0.000s 1.70e-05s 10 7 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 64.4% 0.000s 1.67e-05s 10 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.4% 67.7% 0.000s 1.67e-05s 10 8 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 71.0% 0.000s 1.64e-05s 10 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 74.4% 0.000s 1.64e-05s 10 26 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 77.7% 0.000s 1.64e-05s 10 21 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.3% 81.0% 0.000s 1.63e-05s 10 27 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
3.3% 84.2% 0.000s 1.62e-05s 10 20 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=()
output 0: dtype=float32, shape=(), strides=c
... (remaining 9 Apply instances account for 15.76%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 29 Apply account for 136B/136B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 101 calls to Function.__call__: 1.714706e-02s
Time in Function.fn.__call__: 1.415157e-02s (82.531%)
Time in thunks: 2.484560e-03s (14.490%)
Total compile time: 6.216795e+00s
Number of Apply nodes: 6
Theano Optimizer time: 4.745817e-02s
Theano validate time: 1.499653e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 2.376604e-02s
Import time 1.632404e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.791s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.5% 54.5% 0.001s 3.35e-06s C 404 4 theano.compile.ops.Shape_i
45.5% 100.0% 0.001s 5.60e-06s C 202 2 theano.tensor.basic.Alloc
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.5% 45.5% 0.001s 5.60e-06s C 202 2 Alloc
31.1% 76.6% 0.001s 3.82e-06s C 202 2 Shape_i{1}
23.4% 100.0% 0.001s 2.88e-06s C 202 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
28.2% 28.2% 0.001s 6.94e-06s 101 4 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
19.2% 47.4% 0.000s 4.71e-06s 101 0 Shape_i{1}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
17.3% 64.7% 0.000s 4.27e-06s 101 5 Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
input 0: dtype=int64, shape=(1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
12.8% 77.5% 0.000s 3.16e-06s 101 1 Shape_i{0}(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
11.9% 89.4% 0.000s 2.92e-06s 101 2 Shape_i{1}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
10.6% 100.0% 0.000s 2.60e-06s 101 3 Shape_i{0}(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c Alloc(TensorConstant{(1, 1) of 0}, Shape_i{0}.0, Shape_i{1}.0)
... (remaining 5 Apply account for 992B/2192B ((45.26%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 100 calls to Function.__call__: 1.662898e-02s
Time in Function.fn.__call__: 1.507092e-02s (90.630%)
Time in thunks: 1.027775e-02s (61.806%)
Total compile time: 5.965592e+00s
Number of Apply nodes: 2
Theano Optimizer time: 1.966500e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.714872e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.793s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.010s 5.14e-05s C 200 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.010s 5.14e-05s C 200 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
95.6% 95.6% 0.010s 9.82e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
4.4% 100.0% 0.000s 4.54e-06s 100 1 DeepCopyOp(shared_labels)
input 0: dtype=int64, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(12, 10), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 2KB (2KB)
GPU: 0KB (0KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 2KB
GPU: 0KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1200B [(15, 10)] c DeepCopyOp(shared_recognizer_costs_prediction)
... (remaining 1 Apply account for 960B/2160B ((44.44%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 2 calls to Function.__call__: 5.192757e-03s
Time in Function.fn.__call__: 4.395008e-03s (84.637%)
Time in thunks: 1.830101e-03s (35.243%)
Total compile time: 5.798583e+00s
Number of Apply nodes: 31
Theano Optimizer time: 1.590829e-01s
Theano validate time: 1.525164e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 4.815388e-02s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.794s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.002s 2.95e-05s C 62 31 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.002s 2.95e-05s C 62 31 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.6% 4.6% 0.000s 4.20e-05s 2 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.3% 8.9% 0.000s 3.96e-05s 2 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.1% 13.1% 0.000s 3.79e-05s 2 23 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.1% 17.1% 0.000s 3.74e-05s 2 13 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.9% 21.0% 0.000s 3.55e-05s 2 4 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 24.7% 0.000s 3.40e-05s 2 21 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 28.5% 0.000s 3.40e-05s 2 14 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.7% 32.2% 0.000s 3.40e-05s 2 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 35.8% 0.000s 3.30e-05s 2 3 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 39.3% 0.000s 3.25e-05s 2 8 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.6% 42.9% 0.000s 3.25e-05s 2 7 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 46.4% 0.000s 3.21e-05s 2 15 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 49.9% 0.000s 3.21e-05s 2 9 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 53.3% 0.000s 3.16e-05s 2 16 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.5% 56.8% 0.000s 3.16e-05s 2 5 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 60.2% 0.000s 3.15e-05s 2 22 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 63.7% 0.000s 3.15e-05s 2 20 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 67.1% 0.000s 3.15e-05s 2 19 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 70.6% 0.000s 3.15e-05s 2 18 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
3.4% 74.0% 0.000s 3.11e-05s 2 17 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 11 Apply instances account for 26.04%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 31 Apply account for 140B/140B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 8.800030e-04s
Time in Function.fn.__call__: 8.380413e-04s (95.232%)
Time in thunks: 3.595352e-04s (40.856%)
Total compile time: 6.387939e+00s
Number of Apply nodes: 21
Theano Optimizer time: 8.277297e-02s
Theano validate time: 1.749992e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 4.883909e-02s
Import time 4.663944e-03s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.798s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
53.4% 53.4% 0.000s 1.75e-05s C 11 11 theano.sandbox.cuda.basic_ops.HostFromGpu
42.3% 95.8% 0.000s 2.54e-05s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise
4.2% 100.0% 0.000s 3.81e-06s C 4 4 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
53.4% 53.4% 0.000s 1.75e-05s C 11 11 HostFromGpu
42.3% 95.8% 0.000s 2.54e-05s C 6 6 GpuElemwise{true_div,no_inplace}
4.2% 100.0% 0.000s 3.81e-06s C 4 4 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
11.7% 11.7% 0.000s 4.20e-05s 1 8 GpuElemwise{true_div,no_inplace}(shared_weights_entropy, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.4% 18.0% 0.000s 2.29e-05s 1 1 GpuElemwise{true_div,no_inplace}(shared_total_gradient_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.2% 24.2% 0.000s 2.22e-05s 1 3 GpuElemwise{true_div,no_inplace}(shared_mask_density, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 30.3% 0.000s 2.19e-05s 1 7 GpuElemwise{true_div,no_inplace}(shared_mean_attended, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 36.4% 0.000s 2.19e-05s 1 2 GpuElemwise{true_div,no_inplace}(shared_total_step_norm, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
6.1% 42.5% 0.000s 2.19e-05s 1 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.9% 48.4% 0.000s 2.12e-05s 1 6 GpuElemwise{true_div,no_inplace}(shared_mean_bottom_output, shared_None)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 53.4% 0.000s 1.81e-05s 1 16 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 58.5% 0.000s 1.81e-05s 1 12 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
5.0% 63.5% 0.000s 1.79e-05s 1 11 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 68.2% 0.000s 1.72e-05s 1 17 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.8% 73.0% 0.000s 1.72e-05s 1 13 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 77.7% 0.000s 1.69e-05s 1 18 HostFromGpu(GpuElemwise{true_div,no_inplace}.0)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.7% 82.4% 0.000s 1.69e-05s 1 5 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 86.9% 0.000s 1.60e-05s 1 10 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 91.3% 0.000s 1.60e-05s 1 9 HostFromGpu(shared_weights_penalty)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
4.4% 95.8% 0.000s 1.60e-05s 1 4 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
1.4% 97.1% 0.000s 5.01e-06s 1 19 Elemwise{true_div,no_inplace}(HostFromGpu.0, shared_batch_size)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
1.1% 98.3% 0.000s 4.05e-06s 1 20 Elemwise{true_div,no_inplace}(shared_train_cost, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
0.9% 99.1% 0.000s 3.10e-06s 1 15 Elemwise{true_div,no_inplace}(shared_batch_size, HostFromGpu.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 1 Apply instances account for 0.86%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 21 Apply account for 100B/100B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:171
Time in 1 calls to Function.__call__: 4.670620e-04s
Time in Function.fn.__call__: 3.008842e-04s (64.421%)
Time in thunks: 1.330376e-04s (28.484%)
Total compile time: 7.051143e+00s
Number of Apply nodes: 5
Theano Optimizer time: 3.080988e-02s
Theano validate time: 2.636909e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 9.856939e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.801s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.66e-05s C 5 5 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.000s 2.66e-05s C 5 5 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
57.2% 57.2% 0.000s 7.61e-05s 1 0 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
16.5% 73.7% 0.000s 2.19e-05s 1 1 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
15.1% 88.7% 0.000s 2.00e-05s 1 2 DeepCopyOp(CudaNdarrayConstant{0.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
8.2% 97.0% 0.000s 1.10e-05s 1 3 DeepCopyOp(TensorConstant{0})
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
3.0% 100.0% 0.000s 4.05e-06s 1 4 DeepCopyOp(TensorConstant{0.0})
input 0: dtype=float64, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 5 Apply account for 28B/28B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 1.440048e-04s
Time in Function.fn.__call__: 1.199245e-04s (83.278%)
Time in thunks: 3.504753e-05s (24.338%)
Total compile time: 5.531962e+00s
Number of Apply nodes: 3
Theano Optimizer time: 2.350092e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 4.795074e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.802s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.4% 71.4% 0.000s 2.50e-05s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
17.0% 88.4% 0.000s 5.96e-06s C 1 1 theano.compile.ops.DeepCopyOp
11.6% 100.0% 0.000s 4.05e-06s C 1 1 theano.tensor.elemwise.Elemwise
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.4% 71.4% 0.000s 2.50e-05s C 1 1 HostFromGpu
17.0% 88.4% 0.000s 5.96e-06s C 1 1 DeepCopyOp
11.6% 100.0% 0.000s 4.05e-06s C 1 1 Elemwise{true_div,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
71.4% 71.4% 0.000s 2.50e-05s 1 1 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
17.0% 88.4% 0.000s 5.96e-06s 1 0 DeepCopyOp(shared_batch_size)
input 0: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(), strides=c
11.6% 100.0% 0.000s 4.05e-06s 1 2 Elemwise{true_div,no_inplace}(shared_mean_total_reward, HostFromGpu.0)
input 0: dtype=float64, shape=(), strides=c
input 1: dtype=float32, shape=(), strides=c
output 0: dtype=float64, shape=(), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 3 Apply account for 20B/20B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 61 calls to Function.__call__: 1.151932e+01s
Time in Function.fn.__call__: 1.151220e+01s (99.938%)
Time in thunks: 1.112233e+01s (96.554%)
Total compile time: 6.020690e+01s
Number of Apply nodes: 284
Theano Optimizer time: 6.218818e+00s
Theano validate time: 2.867708e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 4.509264e+01s
Import time 3.776977e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.803s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
76.0% 76.0% 8.452s 1.39e-01s Py 61 1 lvsr.ops.EditDistanceOp
23.0% 99.0% 2.554s 2.09e-02s Py 122 2 theano.scan_module.scan_op.Scan
0.2% 99.1% 0.021s 2.85e-06s C 7320 120 theano.tensor.elemwise.Elemwise
0.2% 99.3% 0.020s 6.64e-05s C 305 5 theano.sandbox.cuda.blas.GpuDot22
0.1% 99.4% 0.013s 3.00e-05s C 427 7 theano.sandbox.cuda.basic_ops.GpuElemwise
0.1% 99.5% 0.008s 3.37e-05s C 244 4 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 99.6% 0.007s 1.22e-04s C 61 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.6% 0.007s 2.16e-05s C 305 5 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 99.7% 0.005s 2.85e-06s C 1586 26 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 99.7% 0.004s 2.92e-06s C 1464 24 theano.compile.ops.Shape_i
0.0% 99.8% 0.004s 2.23e-05s C 183 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 99.8% 0.003s 5.64e-05s C 61 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.0% 99.8% 0.003s 3.31e-06s C 1037 17 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 99.8% 0.003s 2.76e-05s C 122 2 theano.compile.ops.DeepCopyOp
0.0% 99.9% 0.003s 2.74e-06s C 1098 18 theano.tensor.opt.MakeVector
0.0% 99.9% 0.002s 2.32e-06s C 1037 17 theano.tensor.basic.ScalarFromTensor
0.0% 99.9% 0.002s 7.52e-06s Py 305 3 theano.ifelse.IfElse
0.0% 99.9% 0.002s 4.18e-06s C 549 9 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.002s 5.26e-06s C 305 5 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.001s 6.62e-06s Py 183 3 theano.compile.ops.Rebroadcast
... (remaining 8 Classes account for 0.04%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
76.0% 76.0% 8.452s 1.39e-01s Py 61 1 EditDistanceOp
18.2% 94.2% 2.026s 3.32e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan}
4.8% 99.0% 0.528s 8.66e-03s Py 61 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
0.2% 99.1% 0.020s 6.64e-05s C 305 5 GpuDot22
0.1% 99.2% 0.010s 3.33e-05s C 305 5 GpuElemwise{Add}[(0, 0)]
0.1% 99.3% 0.007s 1.22e-04s C 61 1 GpuJoin
0.1% 99.4% 0.007s 3.74e-05s C 183 3 GpuAlloc
0.1% 99.4% 0.007s 2.16e-05s C 305 5 GpuIncSubtensor{InplaceSet;:int64:}
0.0% 99.5% 0.004s 2.23e-05s C 183 3 HostFromGpu
0.0% 99.5% 0.003s 5.64e-05s C 61 1 GpuAdvancedSubtensor1
0.0% 99.5% 0.003s 2.76e-05s C 122 2 DeepCopyOp
0.0% 99.5% 0.003s 2.74e-06s C 1098 18 MakeVector{dtype='int64'}
0.0% 99.6% 0.002s 2.32e-06s C 1037 17 ScalarFromTensor
0.0% 99.6% 0.002s 2.80e-06s C 793 13 Shape_i{0}
0.0% 99.6% 0.002s 3.21e-06s C 671 11 GpuReshape{2}
0.0% 99.6% 0.002s 3.06e-06s C 671 11 Shape_i{1}
0.0% 99.6% 0.002s 2.64e-06s C 671 11 Elemwise{add,no_inplace}
0.0% 99.7% 0.002s 2.66e-06s C 610 10 Elemwise{sub,no_inplace}
0.0% 99.7% 0.002s 5.26e-06s C 305 5 GpuAllocEmpty
0.0% 99.7% 0.002s 8.29e-06s Py 183 2 if{inplace}
... (remaining 76 Ops account for 0.32%(0.04s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
76.0% 76.0% 8.452s 1.39e-01s 61 279 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
18.2% 94.2% 2.026s 3.32e-02s 61 268 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
4.8% 99.0% 0.528s 8.66e-03s 61 254 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 5: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 6: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 7: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 8: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.1% 99.0% 0.007s 1.22e-04s 61 262 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.1% 0.005s 7.75e-05s 61 148 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 99.1% 0.005s 7.65e-05s 61 150 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(900, 200), strides=(200, 1)
0.0% 99.1% 0.004s 7.11e-05s 61 265 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.003s 5.64e-05s 61 72 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(900,), strides=c
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.003s 5.48e-05s 61 147 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.2% 0.003s 5.23e-05s 61 149 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(900, 100), strides=(100, 1)
0.0% 99.3% 0.002s 3.99e-05s 61 53 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.3% 0.002s 3.81e-05s 61 178 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.002s 3.76e-05s 61 180 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
0.0% 99.3% 0.002s 3.63e-05s 61 65 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.3% 0.002s 3.61e-05s 61 116 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, generator_generate_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
0.0% 99.4% 0.002s 3.24e-05s 61 4 DeepCopyOp(CudaNdarrayConstant{1.0})
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=()
0.0% 99.4% 0.002s 3.13e-05s 61 177 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 3.02e-05s 61 267 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 2.95e-05s 61 179 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.0% 99.4% 0.002s 2.76e-05s 61 0 HostFromGpu(shared_None)
input 0: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 264 Apply instances account for 0.58%(0.06s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
Max peak memory with current setting
CPU: 18KB (18KB)
GPU: 3168KB (3653KB)
CPU + GPU: 3185KB (3671KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 18KB (18KB)
GPU: 3519KB (4327KB)
CPU + GPU: 3537KB (4345KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 37KB
GPU: 5180KB
CPU + GPU: 5217KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(900, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 200)] v GpuDimShuffle{0,1,2}(GpuJoin.0)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan}.2, ScalarFromTensor.0)
360000B [(900, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
360000B [(900, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
... (remaining 264 Apply account for 8196678B/20973438B ((39.08%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 732 steps) 5.235906e-01s
Total time spent in calling the VM 5.032728e-01s (96.120%)
Total overhead (computing slices..) 2.031779e-02s (3.880%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.6% 54.6% 0.153s 5.23e-05s C 2928 4 theano.sandbox.cuda.blas.GpuGemm
42.1% 96.7% 0.118s 2.02e-05s C 5856 8 theano.sandbox.cuda.basic_ops.GpuElemwise
3.3% 100.0% 0.009s 3.15e-06s C 2928 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
54.6% 54.6% 0.153s 5.23e-05s C 2928 4 GpuGemm{no_inplace}
11.6% 66.2% 0.033s 2.22e-05s C 1464 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
10.4% 76.6% 0.029s 2.00e-05s C 1464 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
10.1% 86.7% 0.028s 1.93e-05s C 1464 2 GpuElemwise{mul,no_inplace}
10.0% 96.7% 0.028s 1.92e-05s C 1464 2 GpuElemwise{sub,no_inplace}
1.8% 98.5% 0.005s 3.36e-06s C 1464 2 GpuSubtensor{::, :int64:}
1.5% 100.0% 0.004s 2.93e-06s C 1464 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
13.8% 13.8% 0.039s 5.31e-05s 732 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
13.6% 27.5% 0.038s 5.22e-05s 732 3 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
13.6% 41.0% 0.038s 5.20e-05s 732 12 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
13.6% 54.6% 0.038s 5.20e-05s 732 13 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.9% 60.5% 0.016s 2.25e-05s 732 14 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.7% 66.2% 0.016s 2.20e-05s 732 15 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(75, 1), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.3% 71.5% 0.015s 2.02e-05s 732 4 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
5.2% 76.7% 0.015s 2.00e-05s 732 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 1), strides=c
5.2% 81.9% 0.015s 1.98e-05s 732 5 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
5.0% 86.9% 0.014s 1.93e-05s 732 10 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
5.0% 91.9% 0.014s 1.93e-05s 732 11 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.8% 96.7% 0.013s 1.83e-05s 732 2 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(75, 1), strides=c
output 0: dtype=float32, shape=(75, 1), strides=c
0.9% 97.6% 0.003s 3.44e-06s 732 8 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.9% 98.5% 0.002s 3.29e-06s 732 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.8% 99.3% 0.002s 3.11e-06s 732 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
0.7% 100.0% 0.002s 2.75e-06s 732 9 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
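A reading of the dominant Composite elemwise above, inferred only from the printed expression, the shapes, and the sub nodes (not from the model code): with m the (75, 1) mask column, z the gate slice produced by GpuSubtensor{::, :int64:}, g the GpuGemm output feeding the tanh, and h the previous state,

    h_t = m * (z * tanh(g) + h * (1 - z)) + (1 - m) * h

i.e. the usual gated-recurrent state update, with masked-out rows carrying the previous state forward unchanged.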
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 147KB (206KB)
CPU + GPU: 147KB (206KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 147KB (206KB)
CPU + GPU: 147KB (206KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 0KB
GPU: 294KB
CPU + GPU: 294KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
60000B [(75, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
30000B [(75, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
... (remaining 2 Apply account for 600B/540600B ((0.11%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
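The alternative peak-memory figures in the memory profile above refer to standard Theano configuration flags; a minimal sketch of how they would be set (flag names as printed in the profile, everything else illustrative):

    # via the environment, before Python starts:
    #   THEANO_FLAGS='allow_gc=False,optimizer_excluding=inplace' python script.py
    # or programmatically, before compiling any function:
    import theano
    theano.config.allow_gc = False                 # keep intermediates allocated between calls
    theano.config.optimizer_excluding = 'inplace'  # skip in-place optimizations
    theano.config.mode = 'DebugMode'               # the DebugMode referred to in the note above

Disabling the inplace optimizations and using DebugMode trade speed for diagnostics; allow_gc=False trades memory for speed, which is what the last set of numbers estimates.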
Scan Op profiling ( generator_generate_scan )
==================
Message: None
Time in 61 calls of the op (for a total of 915 steps) 2.016554e+00s
Total time spent in calling the VM 1.933907e+00s (95.902%)
  Total overhead (computing slices...) 8.264709e-02s (4.098%)
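For scale, from the header above: 2.016554 s over 915 steps is about 2.2 ms per generation step, roughly three times the ~0.7 ms per step of the gatedrecurrent_apply_scan profiled earlier (5.235906e-01 s / 732 steps).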
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
27.2% 27.2% 0.275s 2.31e-05s C 11895 13 theano.sandbox.cuda.basic_ops.GpuElemwise
20.7% 47.9% 0.209s 4.58e-05s C 4575 5 theano.sandbox.cuda.blas.GpuDot22
20.3% 68.3% 0.205s 4.49e-05s C 4575 5 theano.sandbox.cuda.blas.GpuGemm
10.4% 78.7% 0.105s 2.29e-05s C 4575 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
4.0% 82.7% 0.041s 4.47e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.8% 86.5% 0.038s 2.09e-05s C 1830 2 theano.sandbox.cuda.basic_ops.HostFromGpu
3.4% 89.9% 0.034s 3.75e-05s C 915 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
2.3% 92.2% 0.024s 2.58e-05s C 915 1 theano.tensor.basic.MaxAndArgmax
1.4% 93.7% 0.014s 2.26e-06s C 6405 7 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.3% 95.0% 0.014s 1.48e-05s C 915 1 theano.sandbox.multinomial.MultinomialFromUniform
1.2% 96.2% 0.012s 1.35e-05s C 915 1 theano.sandbox.cuda.basic_ops.GpuFromHost
1.0% 97.2% 0.010s 2.21e-06s C 4575 5 theano.compile.ops.Shape_i
0.9% 98.1% 0.009s 3.13e-06s C 2745 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.7% 98.7% 0.007s 1.84e-06s C 3660 4 theano.tensor.opt.MakeVector
0.6% 99.3% 0.006s 3.25e-06s C 1830 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.4% 99.7% 0.004s 2.12e-06s C 1830 2 theano.tensor.elemwise.Elemwise
0.3% 100.0% 0.003s 3.20e-06s C 915 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
20.7% 20.7% 0.209s 4.58e-05s C 4575 5 GpuDot22
20.3% 41.1% 0.205s 4.49e-05s C 4575 5 GpuGemm{inplace}
5.4% 46.5% 0.055s 3.01e-05s C 1830 2 GpuElemwise{mul,no_inplace}
4.0% 50.6% 0.041s 4.47e-05s C 915 1 GpuAdvancedSubtensor1
3.8% 54.3% 0.038s 2.09e-05s C 1830 2 HostFromGpu
3.4% 57.7% 0.034s 3.75e-05s C 915 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
2.7% 60.4% 0.027s 2.98e-05s C 915 1 GpuElemwise{add,no_inplace}
2.6% 63.0% 0.026s 2.83e-05s C 915 1 GpuCAReduce{add}{1,0,0}
2.4% 65.4% 0.024s 2.62e-05s C 915 1 GpuCAReduce{maximum}{1,0}
2.3% 67.7% 0.024s 2.58e-05s C 915 1 MaxAndArgmax
2.3% 70.0% 0.023s 2.57e-05s C 915 1 GpuElemwise{Tanh}[(0, 0)]
2.1% 72.2% 0.022s 2.36e-05s C 915 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
2.0% 74.2% 0.021s 2.25e-05s C 915 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
1.9% 76.1% 0.019s 2.06e-05s C 915 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
1.9% 77.9% 0.019s 2.05e-05s C 915 1 GpuCAReduce{maximum}{0,1}
1.8% 79.8% 0.019s 2.03e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
1.8% 81.6% 0.018s 2.00e-05s C 915 1 GpuElemwise{TrueDiv}[(0, 0)]
1.8% 83.4% 0.018s 2.00e-05s C 915 1 GpuCAReduce{add}{1,0}
1.8% 85.2% 0.018s 1.99e-05s C 915 1 GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)]
1.8% 87.0% 0.018s 1.99e-05s C 915 1 GpuElemwise{Add}[(0, 1)]
... (remaining 21 Ops account for 13.00%(0.13s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
6.7% 6.7% 0.067s 7.36e-05s 915 47 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(900, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(900, 1), strides=c
4.4% 11.1% 0.045s 4.87e-05s 915 39 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
4.4% 15.4% 0.044s 4.83e-05s 915 11 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
4.3% 19.8% 0.044s 4.76e-05s 915 9 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
4.0% 23.8% 0.041s 4.47e-05s 915 30 GpuAdvancedSubtensor1(W_copy[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(75,), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.8% 27.6% 0.039s 4.24e-05s 915 1 GpuDot22(generator_initial_states_states[t-1][cuda], W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
3.6% 31.3% 0.037s 4.01e-05s 915 40 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.6% 34.9% 0.037s 4.00e-05s 915 57 GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
input 0: dtype=float32, shape=(12, 75, 1), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(12, 75, 200), strides=c
3.6% 38.5% 0.036s 3.99e-05s 915 33 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(75, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
3.5% 42.0% 0.035s 3.84e-05s 915 6 GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
3.4% 45.4% 0.034s 3.75e-05s 915 14 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(75,), strides=c
3.4% 48.8% 0.034s 3.73e-05s 915 38 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
3.4% 52.1% 0.034s 3.73e-05s 915 42 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
input 0: dtype=float32, shape=(75, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
2.7% 54.8% 0.027s 2.98e-05s 915 44 GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 75, 100), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=c
2.6% 57.4% 0.026s 2.83e-05s 915 58 GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
input 0: dtype=float32, shape=(12, 75, 200), strides=c
output 0: dtype=float32, shape=(75, 200), strides=c
2.4% 59.8% 0.024s 2.62e-05s 915 49 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=float32, shape=(75,), strides=c
2.3% 62.1% 0.024s 2.58e-05s 915 28 MaxAndArgmax(MultinomialFromUniform{int64}.0, TensorConstant{(1,) of 1})
input 0: dtype=int64, shape=(75, 44), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=int64, shape=(75,), strides=c
output 1: dtype=int64, shape=(75,), strides=c
2.3% 64.4% 0.023s 2.57e-05s 915 46 GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
input 0: dtype=float32, shape=(900, 100), strides=c
output 0: dtype=float32, shape=(900, 100), strides=c
2.1% 66.6% 0.022s 2.36e-05s 915 26 HostFromGpu(GpuElemwise{Composite{exp((i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(75, 44), strides=c
output 0: dtype=float32, shape=(75, 44), strides=c
2.1% 68.7% 0.022s 2.36e-05s 915 41 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(75, 100), strides=c
input 2: dtype=float32, shape=(75, 100), strides=c
input 3: dtype=float32, shape=(75, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(75, 100), strides=c
... (remaining 39 Apply instances account for 31.29%(0.32s) of the runtime)
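Several of the ops above combine into a numerically stable masked softmax over the 12 attended positions followed by a weighted average; a sketch inferred from the op names and shapes only: with e_i the energies (the (900, 1) GpuDot22 output reshaped to (12, 75)), m_i a per-position weighting (the third input of the exp Composite) and A_i the (12, 75, 200) attended sequence,

    alpha_i = exp(e_i - max_j e_j) * m_i / sum_j (exp(e_j - max_j e_j) * m_j)
    w       = sum_i alpha_i * A_i

which is what the GpuCAReduce{maximum}, Composite{exp(i0 - i1) * i2}, GpuCAReduce{add}, TrueDiv, GpuElemwise{mul,no_inplace} and GpuCAReduce{add}{1,0,0} nodes implement step by step.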
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 39KB (39KB)
GPU: 1151KB (1151KB)
CPU + GPU: 1190KB (1190KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 41KB
GPU: 1709KB
CPU + GPU: 1750KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
720000B [(12, 75, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace[cuda])
368940B [(92160,), (75,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
360000B [(12, 75, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace[cuda])
360000B [(900, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
360000B [(900, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
360000B [(12, 75, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
60000B [(75, 200)] c GpuDot22(generator_initial_states_states[t-1][cuda], state_to_gates_copy[cuda])
60000B [(75, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
60000B [(75, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
60000B [(75, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
60000B [(75, 200)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] c GpuElemwise{mul,no_inplace}(generator_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
30000B [(75, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] c GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy[cuda])
30000B [(75, 100)] c GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}(<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, generator_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
30000B [(75, 100)] i GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1][cuda], W_copy[cuda], TensorConstant{1.0})
30000B [(75, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)].0, Constant{100})
30000B [(75, 100)] c GpuAdvancedSubtensor1(W_copy[cuda], argmax)
30000B [(75, 100)] c GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda])
... (remaining 39 Apply account for 188879B/3287819B ((5.74%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:103
Time in 11 calls to Function.__call__: 1.403611e-01s
Time in Function.fn.__call__: 1.400502e-01s (99.779%)
Time in thunks: 9.480190e-02s (67.541%)
Total compile time: 6.756872e+01s
Number of Apply nodes: 190
Theano Optimizer time: 4.246896e+00s
Theano validate time: 1.580198e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 5.792800e+01s
Import time 1.193612e-01s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.896s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
89.0% 89.0% 0.084s 3.84e-03s Py 22 2 theano.scan_module.scan_op.Scan
3.0% 92.0% 0.003s 2.59e-06s C 1089 99 theano.tensor.elemwise.Elemwise
1.9% 93.9% 0.002s 4.05e-05s C 44 4 theano.sandbox.cuda.blas.GpuDot22
0.9% 94.8% 0.001s 1.99e-05s C 44 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.7% 95.5% 0.001s 6.24e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.6% 96.1% 0.001s 2.50e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.5% 96.7% 0.001s 4.63e-05s C 11 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
0.5% 97.1% 0.000s 2.08e-05s C 22 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.6% 0.000s 3.18e-06s C 132 12 theano.sandbox.cuda.basic_ops.GpuReshape
0.4% 98.0% 0.000s 2.77e-06s C 143 13 theano.compile.ops.Shape_i
0.4% 98.4% 0.000s 2.77e-06s C 143 13 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 98.8% 0.000s 2.48e-06s C 132 12 theano.tensor.opt.MakeVector
0.3% 99.1% 0.000s 2.21e-06s C 132 12 theano.tensor.basic.ScalarFromTensor
0.3% 99.3% 0.000s 2.39e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.3% 99.6% 0.000s 3.71e-06s C 66 6 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.2% 99.8% 0.000s 6.49e-06s Py 22 2 theano.compile.ops.Rebroadcast
0.1% 99.9% 0.000s 6.40e-06s C 22 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.1% 100.0% 0.000s 5.25e-06s C 11 1 theano.tensor.basic.Alloc
0.0% 100.0% 0.000s 3.21e-06s C 11 1 theano.tensor.basic.Reshape
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
89.0% 89.0% 0.084s 3.84e-03s Py 22 2 forall_inplace,gpu,gatedrecurrent_apply_scan}
1.9% 90.9% 0.002s 4.05e-05s C 44 4 GpuDot22
0.9% 91.8% 0.001s 1.99e-05s C 44 4 GpuElemwise{Add}[(0, 0)]
0.7% 92.6% 0.001s 6.24e-05s C 11 1 GpuJoin
0.6% 93.1% 0.001s 2.50e-05s C 22 2 GpuAlloc
0.5% 93.7% 0.001s 4.63e-05s C 11 1 GpuAdvancedSubtensor1
0.5% 94.2% 0.000s 2.08e-05s C 22 2 GpuIncSubtensor{InplaceSet;:int64:}
0.3% 94.5% 0.000s 2.48e-06s C 132 12 MakeVector{dtype='int64'}
0.3% 94.8% 0.000s 2.21e-06s C 132 12 ScalarFromTensor
0.3% 95.1% 0.000s 2.39e-05s C 11 1 HostFromGpu
0.3% 95.4% 0.000s 2.57e-06s C 99 9 Elemwise{add,no_inplace}
0.3% 95.6% 0.000s 3.19e-06s C 77 7 GpuReshape{2}
0.2% 95.9% 0.000s 2.97e-06s C 77 7 Shape_i{0}
0.2% 96.1% 0.000s 2.43e-06s C 88 8 Elemwise{le,no_inplace}
0.2% 96.3% 0.000s 2.83e-06s C 66 6 GpuDimShuffle{x,x,0}
0.2% 96.5% 0.000s 3.16e-06s C 55 5 GpuReshape{3}
0.2% 96.6% 0.000s 2.55e-06s C 66 6 Shape_i{1}
0.2% 96.8% 0.000s 2.50e-06s C 66 6 Elemwise{sub,no_inplace}
0.2% 97.0% 0.000s 6.49e-06s Py 22 2 Rebroadcast{0}
0.1% 97.1% 0.000s 6.40e-06s C 22 2 GpuAllocEmpty
... (remaining 56 Ops account for 2.88%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
45.0% 45.0% 0.043s 3.88e-03s 11 140 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
44.0% 89.0% 0.042s 3.79e-03s 11 182 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.7% 89.8% 0.001s 6.24e-05s 11 188 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.5% 90.3% 0.001s 4.63e-05s 11 31 GpuAdvancedSubtensor1(W, Reshape{1}.0)
input 0: dtype=float32, shape=(44, 100), strides=c
input 1: dtype=int64, shape=(12,), strides=c
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.5% 90.8% 0.000s 4.20e-05s 11 58 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.5% 91.2% 0.000s 4.02e-05s 11 55 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.5% 91.7% 0.000s 4.00e-05s 11 57 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.5% 92.2% 0.000s 3.98e-05s 11 56 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 200), strides=(200, 1)
output 0: dtype=float32, shape=(12, 200), strides=(200, 1)
0.3% 92.5% 0.000s 2.64e-05s 11 103 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.3% 92.8% 0.000s 2.39e-05s 11 189 HostFromGpu(GpuJoin.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=c
0.3% 93.0% 0.000s 2.36e-05s 11 71 GpuAlloc(GpuDimShuffle{x,x,0}.0, TensorConstant{1}, gatedrecurrent_initial_states_batch_size, Shape_i{0}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.2% 93.3% 0.000s 2.14e-05s 11 137 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.5% 0.000s 2.01e-05s 11 167 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 93.7% 0.000s 2.01e-05s 11 78 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 94.0% 0.000s 1.99e-05s 11 79 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.2% 94.2% 0.000s 1.99e-05s 11 80 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.2% 94.4% 0.000s 1.97e-05s 11 81 GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 1: dtype=float32, shape=(1, 1, 200), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
0.1% 94.5% 0.000s 7.91e-06s 11 124 GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0, Elemwise{Composite{Switch(EQ(i0, i1), i2, i0)}}[(0, 0)].0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 94.6% 0.000s 6.59e-06s 11 132 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
0.1% 94.7% 0.000s 6.39e-06s 11 98 Rebroadcast{0}(GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
... (remaining 170 Apply instances account for 5.32%(0.01s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 9KB (9KB)
GPU: 28KB (34KB)
CPU + GPU: 38KB (43KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 9KB (9KB)
GPU: 33KB (38KB)
CPU + GPU: 42KB (48KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 10KB
GPU: 52KB
CPU + GPU: 63KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(100, 200)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
80000B [(100, 200)] v GpuDimShuffle{0,1}(W)
80000B [(100, 200)] v GpuDimShuffle{0,1}(W)
40000B [(100, 100)] v GpuDimShuffle{0,1}(W)
40000B [(100, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
40000B [(100, 100)] v GpuDimShuffle{0,1}(W)
40000B [(100, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
9600B [(12, 1, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
9600B [(12, 1, 200)] c HostFromGpu(GpuJoin.0)
4800B [(12, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
... (remaining 170 Apply account for 94077B/679677B ((13.84%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 4.214001e-02s
Total time spent in calling the VM 4.016280e-02s (95.308%)
  Total overhead (computing slices...) 1.977205e-03s (4.692%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
61.2% 61.2% 0.013s 4.91e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
34.9% 96.2% 0.007s 1.87e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
3.8% 100.0% 0.001s 3.07e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
61.2% 61.2% 0.013s 4.91e-05s C 264 2 GpuGemm{no_inplace}
11.9% 73.1% 0.003s 1.90e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
11.6% 84.7% 0.002s 1.86e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
11.5% 96.2% 0.002s 1.84e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.1% 98.3% 0.000s 3.34e-06s C 132 1 GpuSubtensor{::, :int64:}
1.7% 100.0% 0.000s 2.80e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
31.4% 31.4% 0.007s 5.04e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
29.8% 61.2% 0.006s 4.79e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
11.9% 73.1% 0.003s 1.90e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
11.6% 84.7% 0.002s 1.86e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
11.5% 96.2% 0.002s 1.84e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.1% 98.3% 0.000s 3.34e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
1.7% 100.0% 0.000s 2.80e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Scan Op profiling ( gatedrecurrent_apply_scan )
==================
Message: None
Time in 11 calls of the op (for a total of 132 steps) 4.123449e-02s
Total time spent in calling the VM 3.931022e-02s (95.333%)
  Total overhead (computing slices...) 1.924276e-03s (4.667%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
61.1% 61.1% 0.013s 4.84e-05s C 264 2 theano.sandbox.cuda.blas.GpuGemm
35.1% 96.2% 0.007s 1.85e-05s C 396 3 theano.sandbox.cuda.basic_ops.GpuElemwise
3.8% 100.0% 0.001s 3.01e-06s C 264 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
61.1% 61.1% 0.013s 4.84e-05s C 264 2 GpuGemm{no_inplace}
12.1% 73.2% 0.003s 1.92e-05s C 132 1 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}
11.6% 84.8% 0.002s 1.84e-05s C 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
11.4% 96.2% 0.002s 1.80e-05s C 132 1 GpuElemwise{mul,no_inplace}
2.1% 98.3% 0.000s 3.31e-06s C 132 1 GpuSubtensor{::, :int64:}
1.7% 100.0% 0.000s 2.72e-06s C 132 1 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
31.3% 31.3% 0.007s 4.96e-05s 132 0 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1][cuda], state_to_gates_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
29.8% 61.1% 0.006s 4.72e-05s 132 5 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
12.1% 73.2% 0.003s 1.92e-05s 132 6 GpuElemwise{Composite{((tanh(i0) * i1) + (i2 * (i3 - i1)))},no_inplace}(GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1][cuda], CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
input 2: dtype=float32, shape=(1, 100), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
11.6% 84.8% 0.002s 1.84e-05s 132 1 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
11.4% 96.2% 0.002s 1.80e-05s 132 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1][cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=c
input 1: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
2.1% 98.3% 0.000s 3.31e-06s 132 2 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
1.7% 100.0% 0.000s 2.72e-06s 132 3 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 2KB (2KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 7 Apply account for 3600B/3600B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:111
Time in 11 calls to Function.__call__: 2.407074e-03s
Time in Function.fn.__call__: 2.131939e-03s (88.570%)
Time in thunks: 4.451275e-04s (18.492%)
Total compile time: 2.637064e+01s
Number of Apply nodes: 8
Theano Optimizer time: 8.200908e-02s
Theano validate time: 1.047134e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.873722e+01s
Import time 9.109974e-03s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.952s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.8% 45.8% 0.000s 1.85e-05s C 11 1 theano.sandbox.cuda.basic_ops.HostFromGpu
19.9% 65.7% 0.000s 4.02e-06s C 22 2 theano.tensor.basic.Alloc
15.8% 81.5% 0.000s 3.20e-06s C 22 2 theano.compile.ops.Shape_i
6.9% 88.3% 0.000s 2.77e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuReshape
6.2% 94.5% 0.000s 2.49e-06s C 11 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
5.5% 100.0% 0.000s 2.23e-06s C 11 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
45.8% 45.8% 0.000s 1.85e-05s C 11 1 HostFromGpu
19.9% 65.7% 0.000s 4.02e-06s C 22 2 Alloc
15.8% 81.5% 0.000s 3.20e-06s C 22 2 Shape_i{0}
6.9% 88.3% 0.000s 2.77e-06s C 11 1 GpuReshape{2}
6.2% 94.5% 0.000s 2.49e-06s C 11 1 GpuDimShuffle{x,x,0}
5.5% 100.0% 0.000s 2.23e-06s C 11 1 MakeVector{dtype='int64'}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
45.8% 45.8% 0.000s 1.85e-05s 11 7 HostFromGpu(GpuReshape{2}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=c
10.6% 56.4% 0.000s 4.29e-06s 11 4 Alloc(TensorConstant{0.0}, TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(1, 12), strides=c
9.4% 65.8% 0.000s 3.79e-06s 11 0 Shape_i{0}(generator_generate_attended)
input 0: dtype=float32, shape=(12, 1, 200), strides=c
output 0: dtype=int64, shape=(), strides=c
9.3% 75.0% 0.000s 3.75e-06s 11 1 Alloc(TensorConstant{0.0}, TensorConstant{1}, TensorConstant{200})
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int16, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
6.9% 81.9% 0.000s 2.77e-06s 11 6 GpuReshape{2}(GpuDimShuffle{x,x,0}.0, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 1: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
6.4% 88.3% 0.000s 2.60e-06s 11 2 Shape_i{0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=int64, shape=(), strides=c
6.2% 94.5% 0.000s 2.49e-06s 11 3 GpuDimShuffle{x,x,0}(initial_state)
input 0: dtype=float32, shape=(100,), strides=c
output 0: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
5.5% 100.0% 0.000s 2.23e-06s 11 5 MakeVector{dtype='int64'}(TensorConstant{1}, Shape_i{0}.0)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
output 0: dtype=int64, shape=(2,), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 0KB (0KB)
CPU + GPU: 1KB (1KB)
Max peak memory if allow_gc=False (the linker makes no difference)
CPU: 1KB
GPU: 0KB
CPU + GPU: 1KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 8 Apply account for 2080B/2080B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:126
Time in 176 calls to Function.__call__: 4.303689e-01s
Time in Function.fn.__call__: 4.239194e-01s (98.501%)
Time in thunks: 1.613367e-01s (37.488%)
Total compile time: 9.262143e+00s
Number of Apply nodes: 79
Theano Optimizer time: 5.638268e-01s
Theano validate time: 2.706265e-02s
Theano Linker time (includes C, CUDA code generation/compiling): 2.979231e-01s
Import time 1.633863e-01s
Time in all calls to theano.grad() 2.823545e+00s
Time since theano import 830.954s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
21.9% 21.9% 0.035s 4.02e-05s C 880 5 theano.sandbox.cuda.blas.GpuDot22
20.7% 42.7% 0.033s 1.90e-05s C 1760 10 theano.sandbox.cuda.basic_ops.GpuElemwise
17.9% 60.6% 0.029s 4.10e-05s C 704 4 theano.sandbox.cuda.blas.GpuGemm
7.3% 67.8% 0.012s 1.66e-05s C 704 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.2% 75.0% 0.012s 2.19e-05s C 528 3 theano.sandbox.cuda.basic_ops.GpuCAReduce
7.1% 82.1% 0.012s 1.31e-05s C 880 5 theano.sandbox.cuda.basic_ops.GpuFromHost
4.8% 86.9% 0.008s 4.41e-05s C 176 1 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
3.5% 90.4% 0.006s 2.68e-06s C 2112 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.2% 92.7% 0.004s 2.29e-06s C 1584 9 theano.tensor.elemwise.Elemwise
2.2% 94.9% 0.004s 2.94e-06s C 1232 7 theano.sandbox.cuda.basic_ops.GpuReshape
2.1% 97.1% 0.003s 2.45e-06s C 1408 8 theano.compile.ops.Shape_i
1.7% 98.8% 0.003s 2.28e-06s C 1232 7 theano.tensor.opt.MakeVector
0.7% 99.5% 0.001s 3.15e-06s C 352 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.3% 99.8% 0.000s 2.45e-06s C 176 1 theano.tensor.elemwise.All
0.2% 100.0% 0.000s 2.15e-06s C 176 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
21.9% 21.9% 0.035s 4.02e-05s C 880 5 GpuDot22
17.9% 39.8% 0.029s 4.10e-05s C 704 4 GpuGemm{inplace}
7.3% 47.1% 0.012s 1.66e-05s C 704 4 HostFromGpu
7.1% 54.2% 0.012s 1.31e-05s C 880 5 GpuFromHost
4.8% 59.0% 0.008s 4.41e-05s C 176 1 GpuAdvancedSubtensor1
2.6% 61.7% 0.004s 2.42e-05s C 176 1 GpuCAReduce{maximum}{1,0}
2.3% 64.0% 0.004s 2.12e-05s C 176 1 GpuCAReduce{add}{1,0,0}
2.2% 66.2% 0.004s 2.02e-05s C 176 1 GpuCAReduce{add}{1,0}
2.2% 68.4% 0.003s 1.98e-05s C 176 1 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)]
2.1% 70.5% 0.003s 1.96e-05s C 176 1 GpuElemwise{mul,no_inplace}
2.1% 72.6% 0.003s 1.94e-05s C 176 1 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)]
2.1% 74.7% 0.003s 1.93e-05s C 176 1 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.1% 76.8% 0.003s 1.92e-05s C 176 1 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.1% 78.9% 0.003s 1.92e-05s C 176 1 GpuElemwise{Mul}[(0, 1)]
2.0% 80.9% 0.003s 1.86e-05s C 176 1 GpuElemwise{Add}[(0, 0)]
2.0% 83.0% 0.003s 1.86e-05s C 176 1 GpuElemwise{TrueDiv}[(0, 0)]
2.0% 85.0% 0.003s 1.83e-05s C 176 1 GpuElemwise{Sub}[(0, 1)]
2.0% 86.9% 0.003s 1.82e-05s C 176 1 GpuElemwise{Tanh}[(0, 0)]
1.9% 88.8% 0.003s 2.89e-06s C 1056 6 GpuReshape{2}
1.7% 90.6% 0.003s 2.28e-06s C 1232 7 MakeVector{dtype='int64'}
... (remaining 23 Ops account for 9.43%(0.02s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.4% 5.4% 0.009s 4.94e-05s 176 37 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
4.9% 10.3% 0.008s 4.54e-05s 176 29 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.9% 15.2% 0.008s 4.49e-05s 176 48 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.8% 20.1% 0.008s 4.41e-05s 176 12 GpuAdvancedSubtensor1(W, readout_sample_samples)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(1,), strides=(16,)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.3% 24.3% 0.007s 3.92e-05s 176 24 GpuDot22(GpuFromHost.0, state_to_gates)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.1% 28.4% 0.007s 3.77e-05s 176 47 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.1% 32.6% 0.007s 3.77e-05s 176 57 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
4.0% 36.6% 0.007s 3.70e-05s 176 51 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)].0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.0% 40.6% 0.006s 3.69e-05s 176 49 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
4.0% 44.6% 0.006s 3.68e-05s 176 33 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 200), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
2.6% 47.3% 0.004s 2.42e-05s 176 59 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.3% 49.6% 0.004s 2.12e-05s 176 77 GpuCAReduce{add}{1,0,0}(GpuElemwise{Mul}[(0, 1)].0)
input 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
2.2% 51.8% 0.004s 2.02e-05s 176 63 GpuCAReduce{add}{1,0}(GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(1,), strides=(0,)
2.2% 53.9% 0.003s 1.98e-05s 176 54 GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 1: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
input 2: dtype=float32, shape=(1, 1, 100), strides=(0, 0, 1)
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
2.1% 56.1% 0.003s 1.96e-05s 176 46 GpuElemwise{mul,no_inplace}(GpuFromHost.0, GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.1% 58.2% 0.003s 1.94e-05s 176 50 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, GpuFromHost.0, CudaNdarrayConstant{[[ 1.]]})
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(1, 100), strides=(0, 1)
input 2: dtype=float32, shape=(1, 100), strides=(0, 1)
input 3: dtype=float32, shape=(1, 100), strides=(0, 1)
input 4: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
2.1% 60.3% 0.003s 1.93e-05s 176 38 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 200), strides=c
input 1: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=c
2.1% 62.4% 0.003s 1.92e-05s 176 61 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)](GpuReshape{2}.0, GpuDimShuffle{x,0}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
input 2: dtype=float32, shape=(12, 1), strides=(1, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
2.1% 64.5% 0.003s 1.92e-05s 176 76 GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
input 0: dtype=float32, shape=(12, 1, 1), strides=(1, 0, 0)
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
output 0: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
2.0% 66.5% 0.003s 1.86e-05s 176 72 GpuElemwise{TrueDiv}[(0, 0)](GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)].0, GpuElemwise{Add}[(0, 0)].0)
input 0: dtype=float32, shape=(12, 1), strides=(1, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(12, 1), strides=(1, 0)
... (remaining 59 Apply instances account for 33.48%(0.05s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 1KB (1KB)
GPU: 14KB (16KB)
CPU + GPU: 15KB (18KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 1KB
GPU: 18KB
CPU + GPU: 20KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
80000B [(200, 100)] v GpuDimShuffle{0,1}(W)
80000B [(200, 100)] v GpuReshape{2}(GpuDimShuffle{0,1}.0, MakeVector{dtype='int64'}.0)
9600B [(12, 1, 200)] v GpuDimShuffle{0,1,2}(GpuFromHost.0)
9600B [(12, 1, 200)] i GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{0,1,x}.0, GpuFromHost.0)
9600B [(12, 1, 200)] c GpuFromHost(generator_generate_attended)
9600B [(12, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
4800B [(12, 1, 100)] i GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)](GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,x,0}.0, GpuDimShuffle{x,0,1}.0)
4800B [(12, 1, 100)] v GpuDimShuffle{0,1,2}(GpuReshape{3}.0)
4800B [(12, 1, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
4800B [(12, 100)] v GpuReshape{2}(GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 0)].0, MakeVector{dtype='int64'}.0)
4800B [(12, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
... (remaining 67 Apply account for 13955B/241155B ((5.79%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
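
A rough reading of the hot Apply nodes above, based only on the expressions the profiler prints (not on the project source): ids 33, 47, 49 and 51 are the GEMMs of a single decoder step, id 50's Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))} appears to be the gated-recurrent state update h_new = z * tanh(candidate) + (1 - z) * h_prev, and ids 54-77 score the 12 attended positions and collapse them into one 200-dimensional context vector with a masked softmax. A minimal NumPy sketch of that attention tail, with e (12, 1) the energies, mask (12, 1) the attended mask and attended (12, 1, 200) the encoded sequence (all three names are assumptions):

import numpy as np

def masked_softmax_context(e, mask, attended):
    # GpuCAReduce{maximum} + Composite{(exp((i0 - i1)) * i2)}: stable, masked exponentials
    w = np.exp(e - e.max(axis=0, keepdims=True)) * mask
    # GpuCAReduce{add} + GpuElemwise{TrueDiv}: normalise into attention weights
    w = w / w.sum(axis=0, keepdims=True)
    # GpuElemwise{Mul} + GpuCAReduce{add}{1,0,0}: weighted average of the attended states
    return (w[:, :, None] * attended).sum(axis=0)   # shape (1, 200)
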
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks-extras/blocks_extras/beam_search.py:137
Time in 176 calls to Function.__call__: 1.020806e-01s
Time in Function.fn.__call__: 9.711361e-02s (95.134%)
Time in thunks: 4.424906e-02s (43.347%)
Total compile time: 5.741991e+00s
Number of Apply nodes: 14
Theano Optimizer time: 1.551719e-01s
Theano validate time: 3.836393e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 6.299686e-02s
Import time 3.151894e-02s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.967s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
29.7% 29.7% 0.013s 1.87e-05s C 704 4 theano.sandbox.cuda.basic_ops.GpuElemwise
18.0% 47.7% 0.008s 4.51e-05s C 176 1 theano.sandbox.cuda.blas.GpuGemm
16.2% 63.9% 0.007s 2.04e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuCAReduce
15.7% 79.6% 0.007s 3.94e-05s C 176 1 theano.sandbox.cuda.blas.GpuDot22
10.4% 90.0% 0.005s 1.31e-05s C 352 2 theano.sandbox.cuda.basic_ops.GpuFromHost
6.8% 96.8% 0.003s 1.71e-05s C 176 1 theano.sandbox.cuda.basic_ops.HostFromGpu
3.2% 100.0% 0.001s 2.67e-06s C 528 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
18.0% 18.0% 0.008s 4.51e-05s C 176 1 GpuGemm{inplace}
15.7% 33.6% 0.007s 3.94e-05s C 176 1 GpuDot22
10.4% 44.0% 0.005s 1.31e-05s C 352 2 GpuFromHost
8.4% 52.4% 0.004s 2.10e-05s C 176 1 GpuCAReduce{maximum}{0,1}
8.0% 60.4% 0.004s 2.00e-05s C 176 1 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
7.9% 68.2% 0.003s 1.98e-05s C 176 1 GpuCAReduce{add}{0,1}
7.4% 75.6% 0.003s 1.86e-05s C 176 1 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
7.4% 83.0% 0.003s 1.86e-05s C 176 1 GpuElemwise{Add}[(0, 1)]
7.0% 90.0% 0.003s 1.76e-05s C 176 1 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)]
6.8% 96.8% 0.003s 1.71e-05s C 176 1 HostFromGpu
2.0% 98.8% 0.001s 2.55e-06s C 352 2 GpuDimShuffle{0,x}
1.2% 100.0% 0.001s 2.90e-06s C 176 1 GpuDimShuffle{x,0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
18.0% 18.0% 0.008s 4.51e-05s 176 4 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuFromHost.0, W, TensorConstant{1.0})
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(1, 200), strides=(0, 1)
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
15.7% 33.6% 0.007s 3.94e-05s 176 3 GpuDot22(GpuFromHost.0, W)
input 0: dtype=float32, shape=(1, 100), strides=(0, 1)
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
8.4% 42.0% 0.004s 2.10e-05s 176 6 GpuCAReduce{maximum}{0,1}(GpuElemwise{Add}[(0, 1)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1,), strides=(0,)
8.0% 49.9% 0.004s 2.00e-05s 176 8 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}(GpuElemwise{Add}[(0, 1)].0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.9% 57.8% 0.003s 1.98e-05s 176 9 GpuCAReduce{add}{0,1}(GpuElemwise{Composite{exp((i0 - i1))},no_inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1,), strides=(0,)
7.4% 65.2% 0.003s 1.86e-05s 176 11 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)](GpuDimShuffle{0,x}.0, GpuDimShuffle{0,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=(0, 0)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(1, 1), strides=(0, 0)
7.4% 72.6% 0.003s 1.86e-05s 176 5 GpuElemwise{Add}[(0, 1)](GpuDimShuffle{x,0}.0, GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
7.0% 79.6% 0.003s 1.76e-05s 176 12 GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)](GpuElemwise{Add}[(0, 1)].0, GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
input 1: dtype=float32, shape=(1, 1), strides=(0, 0)
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
6.8% 86.4% 0.003s 1.71e-05s 176 13 HostFromGpu(GpuElemwise{Composite{(-(i0 - i1))}}[(0, 0)].0)
input 0: dtype=float32, shape=(1, 44), strides=(0, 1)
output 0: dtype=float32, shape=(1, 44), strides=c
6.0% 92.4% 0.003s 1.51e-05s 176 0 GpuFromHost(generator_generate_weighted_averages)
input 0: dtype=float32, shape=(1, 200), strides=c
output 0: dtype=float32, shape=(1, 200), strides=(0, 1)
4.4% 96.8% 0.002s 1.11e-05s 176 1 GpuFromHost(generator_generate_states)
input 0: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(1, 100), strides=(0, 1)
1.2% 98.0% 0.001s 2.90e-06s 176 2 GpuDimShuffle{x,0}(b)
input 0: dtype=float32, shape=(44,), strides=c
output 0: dtype=float32, shape=(1, 44), strides=(0, 1)
1.1% 99.0% 0.000s 2.65e-06s 176 7 GpuDimShuffle{0,x}(GpuCAReduce{maximum}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=(0,)
output 0: dtype=float32, shape=(1, 1), strides=(0, 0)
1.0% 100.0% 0.000s 2.46e-06s 176 10 GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0)
input 0: dtype=float32, shape=(1,), strides=(0,)
output 0: dtype=float32, shape=(1, 1), strides=(0, 0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 1KB (1KB)
CPU + GPU: 2KB (2KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 2KB
CPU + GPU: 2KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 14 Apply account for 2452B/2452B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
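
The 14 Apply nodes of this beam-search step read end to end: GpuDot22(states, W), the GpuGemm on the weighted averages and the bias b (id 5) produce the (1, 44) readout energies, and ids 6-12 turn them into per-symbol costs with the numerically stable log-softmax, cost_i = -(x_i - max_j x_j - log(sum_j exp(x_j - max_j x_j))), i.e. the negative log-probabilities that id 13 copies back to the host for the beam search. A NumPy sketch of that tail (the function name is an assumption):

import numpy as np

def neg_log_softmax(x):                              # x: (1, 44) readout energies
    m = x.max(axis=1, keepdims=True)                 # GpuCAReduce{maximum}{0,1}
    s = np.exp(x - m).sum(axis=1, keepdims=True)     # Composite{exp((i0 - i1))} + GpuCAReduce{add}{0,1}
    return -(x - (m + np.log(s)))                    # Composite{(i0 + log(i1))} + Composite{(-(i0 - i1))}
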
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:181
Time in 1 calls to Function.__call__: 1.502037e-05s
Time in Function.fn.__call__: 6.198883e-06s (41.270%)
Total compile time: 5.506201e+00s
Number of Apply nodes: 0
Theano Optimizer time: 1.379991e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.969337e-04s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.970s
No execution time accumulated (hint: try config profiling.time_thunks=1)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
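
This function compiled down to zero Apply nodes, so there is nothing for the thunk timer to accumulate; the profiling.time_thunks hint above is the profiler's generic advice rather than a problem with this graph. For functions that do contain Apply nodes, per-node timing is controlled by the Theano flag profiling.time_thunks (e.g. THEANO_FLAGS='profile=True,profiling.time_thunks=True'); a sketch of the in-process equivalent, assuming the flags are still settable at that point in the run:

import theano

# Collect per-Apply thunk timings in subsequently compiled/called functions
# (same effect as THEANO_FLAGS=profile=True,profiling.time_thunks=True).
theano.config.profile = True
theano.config.profiling.time_thunks = True
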
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/monitoring/evaluators.py:286
Time in 6075 calls to Function.__call__: 3.570089e-01s
Time in Function.fn.__call__: 2.095068e-01s (58.684%)
Time in thunks: 3.889871e-02s (10.896%)
Total compile time: 7.101128e+00s
Number of Apply nodes: 2
Theano Optimizer time: 1.691389e-02s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.875090e-03s
Import time 0.000000e+00s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.970s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.039s 3.20e-06s C 12150 2 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 0.039s 3.20e-06s C 12150 2 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
59.5% 59.5% 0.023s 3.81e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
40.5% 100.0% 0.016s 2.59e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 0KB (0KB)
CPU + GPU: 0KB (0KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 0KB
CPU + GPU: 0KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
... (remaining 2 Apply account for 192B/192B ((100.00%)) of the Apply with dense outputs sizes)
All Apply nodes have output sizes that take less than 1024B.
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
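
Simple arithmetic on the figures above shows where this function's time goes: 3.57e-01s over 6075 calls is roughly 59us per call, of which only about 6.4us (the 10.9% "Time in thunks") is spent in the two DeepCopyOp nodes, each copying a 12-element int64 vector (96B). The remaining ~90% is Python-side Function.__call__ overhead, which is expected for a graph this small rather than a GPU or graph-optimisation problem.
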
Function profiling
==================
Message: /u/bahdanau/Dist/fully-neural-lvsr/libs/blocks/blocks/algorithms/__init__.py:253
Time in 100 calls to Function.__call__: 9.018593e+01s
Time in Function.fn.__call__: 8.999730e+01s (99.791%)
Time in thunks: 3.194728e+01s (35.424%)
Total compile time: 3.881262e+02s
Number of Apply nodes: 3574
Theano Optimizer time: 2.013044e+02s
Theano validate time: 4.291104e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 1.755457e+02s
Import time 1.107465e+01s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 830.971s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
82.1% 82.1% 26.232s 3.75e-02s Py 700 7 theano.scan_module.scan_op.Scan
5.7% 87.8% 1.812s 2.16e-05s C 83700 837 theano.sandbox.cuda.basic_ops.GpuElemwise
3.1% 90.9% 1.002s 1.00e-02s Py 100 1 lvsr.ops.EditDistanceOp
2.4% 93.3% 0.761s 3.08e-05s C 24700 247 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.0% 94.3% 0.330s 4.65e-05s C 7100 71 theano.sandbox.cuda.blas.GpuDot22
1.0% 95.3% 0.313s 3.64e-06s C 86000 860 theano.tensor.elemwise.Elemwise
0.9% 96.2% 0.291s 1.82e-05s C 16000 160 theano.sandbox.cuda.basic_ops.HostFromGpu
0.5% 96.8% 0.173s 2.54e-05s C 6800 68 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.5% 97.3% 0.166s 2.30e-05s Py 7200 48 theano.ifelse.IfElse
0.4% 97.7% 0.142s 2.57e-05s C 5500 55 theano.sandbox.cuda.basic_ops.GpuAlloc
0.4% 98.2% 0.139s 3.32e-06s C 41800 418 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.4% 98.6% 0.129s 7.69e-06s C 16800 168 theano.sandbox.cuda.basic_ops.GpuReshape
0.2% 98.8% 0.063s 2.11e-05s C 3000 30 theano.compile.ops.DeepCopyOp
0.1% 98.9% 0.048s 4.30e-06s C 11100 111 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.1% 0.048s 3.69e-06s C 12900 129 theano.tensor.opt.MakeVector
0.1% 99.2% 0.039s 1.68e-05s C 2300 23 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.3% 0.036s 3.41e-06s C 10600 106 theano.compile.ops.Shape_i
0.1% 99.4% 0.030s 9.88e-05s Py 300 3 theano.sandbox.cuda.basic_ops.GpuSplit
0.1% 99.5% 0.026s 6.61e-05s C 400 4 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.5% 0.025s 2.81e-06s C 8800 88 theano.tensor.basic.ScalarFromTensor
... (remaining 24 Classes account for 0.45%(0.14s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
29.2% 29.2% 9.321s 9.32e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
19.3% 48.5% 6.165s 6.16e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}
14.4% 62.9% 4.615s 2.31e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
11.5% 74.4% 3.680s 3.68e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
5.0% 79.4% 1.599s 1.60e-02s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
3.1% 82.6% 1.002s 1.00e-02s Py 100 1 EditDistanceOp
2.7% 85.2% 0.851s 8.51e-03s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
1.0% 86.3% 0.330s 4.65e-05s C 7100 71 GpuDot22
1.0% 87.3% 0.319s 3.80e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.9% 88.2% 0.291s 1.82e-05s C 16000 160 HostFromGpu
0.6% 88.8% 0.207s 2.13e-05s C 9700 97 GpuElemwise{add,no_inplace}
0.5% 89.4% 0.172s 2.20e-05s C 7800 78 GpuElemwise{sub,no_inplace}
0.5% 89.9% 0.171s 3.57e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.5% 90.4% 0.155s 2.38e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.5% 90.9% 0.154s 2.49e-05s Py 6200 39 if{gpu}
0.5% 91.3% 0.150s 2.34e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.4% 91.8% 0.134s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.4% 92.2% 0.133s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.4% 92.6% 0.133s 2.29e-05s C 5800 58 GpuElemwise{Switch,no_inplace}
0.4% 93.0% 0.131s 1.99e-05s C 6600 66 GpuElemwise{Mul}[(0, 0)]
... (remaining 262 Ops account for 6.99%(2.23s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
29.2% 29.2% 9.321s 9.32e-02s 100 2406 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=c
input 2: dtype=float32, shape=(15, 10, 200), strides=c
input 3: dtype=float32, shape=(15, 10, 100), strides=c
input 4: dtype=float32, shape=(15, 10, 100), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(15, 10, 1), strides=c
input 7: dtype=float32, shape=(15, 10, 200), strides=c
input 8: dtype=float32, shape=(15, 10, 12), strides=c
input 9: dtype=float32, shape=(15, 10, 200), strides=c
input 10: dtype=float32, shape=(15, 10, 100), strides=c
input 11: dtype=float32, shape=(15, 10, 100), strides=c
input 12: dtype=float32, shape=(15, 10, 100), strides=c
input 13: dtype=float32, shape=(15, 10, 200), strides=c
input 14: dtype=float32, shape=(16, 10, 100), strides=c
input 15: dtype=float32, shape=(16, 10, 200), strides=c
input 16: dtype=float32, shape=(16, 10, 12), strides=c
input 17: dtype=float32, shape=(16, 10, 100), strides=c
input 18: dtype=float32, shape=(16, 10, 200), strides=c
input 19: dtype=float32, shape=(16, 10, 12), strides=c
input 20: dtype=float32, shape=(2, 100, 1), strides=c
input 21: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 22: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 23: dtype=float32, shape=(2, 100, 1), strides=c
input 24: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 25: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 26: dtype=int64, shape=(), strides=c
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=float32, shape=(100, 200), strides=c
input 35: dtype=float32, shape=(200, 200), strides=c
input 36: dtype=float32, shape=(100, 100), strides=c
input 37: dtype=float32, shape=(200, 100), strides=c
input 38: dtype=float32, shape=(100, 100), strides=c
input 39: dtype=float32, shape=(200, 200), strides=c
input 40: dtype=float32, shape=(200, 100), strides=c
input 41: dtype=float32, shape=(100, 100), strides=c
input 42: dtype=float32, shape=(100, 200), strides=c
input 43: dtype=float32, shape=(100, 100), strides=c
input 44: dtype=int64, shape=(2,), strides=c
input 45: dtype=float32, shape=(12, 10, 100), strides=c
input 46: dtype=int64, shape=(1,), strides=c
input 47: dtype=float32, shape=(12, 10), strides=c
input 48: dtype=float32, shape=(12, 10, 200), strides=c
input 49: dtype=float32, shape=(100, 1), strides=c
input 50: dtype=int8, shape=(10,), strides=c
input 51: dtype=float32, shape=(1, 100), strides=c
input 52: dtype=float32, shape=(100, 200), strides=c
input 53: dtype=float32, shape=(200, 200), strides=c
input 54: dtype=float32, shape=(100, 100), strides=c
input 55: dtype=float32, shape=(200, 100), strides=c
input 56: dtype=float32, shape=(100, 100), strides=c
input 57: dtype=float32, shape=(200, 200), strides=c
input 58: dtype=float32, shape=(200, 100), strides=c
input 59: dtype=float32, shape=(100, 100), strides=c
input 60: dtype=float32, shape=(100, 200), strides=c
input 61: dtype=float32, shape=(100, 100), strides=c
input 62: dtype=int64, shape=(2,), strides=c
input 63: dtype=float32, shape=(12, 10, 100), strides=c
input 64: dtype=int64, shape=(1,), strides=c
input 65: dtype=float32, shape=(12, 10), strides=c
input 66: dtype=float32, shape=(12, 10, 200), strides=c
input 67: dtype=float32, shape=(100, 1), strides=c
input 68: dtype=int8, shape=(10,), strides=c
input 69: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
output 6: dtype=float32, shape=(2, 100, 1), strides=c
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 9: dtype=float32, shape=(2, 100, 1), strides=c
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 12: dtype=float32, shape=(15, 10, 100), strides=c
output 13: dtype=float32, shape=(15, 10, 200), strides=c
output 14: dtype=float32, shape=(15, 10, 100), strides=c
output 15: dtype=float32, shape=(15, 100, 10), strides=c
output 16: dtype=float32, shape=(15, 10, 100), strides=c
output 17: dtype=float32, shape=(15, 10, 200), strides=c
output 18: dtype=float32, shape=(15, 10, 100), strides=c
output 19: dtype=float32, shape=(15, 100, 10), strides=c
19.3% 48.5% 6.165s 6.16e-02s 100 1795 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuD
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
input 2: dtype=float32, shape=(1, 10, 200), strides=c
input 3: dtype=float32, shape=(1, 92160), strides=c
input 4: dtype=float32, shape=(1, 10, 100), strides=c
input 5: dtype=float32, shape=(1, 10, 200), strides=c
input 6: dtype=float32, shape=(2, 92160), strides=c
input 7: dtype=int64, shape=(), strides=c
input 8: dtype=int64, shape=(), strides=c
input 9: dtype=float32, shape=(100, 44), strides=c
input 10: dtype=float32, shape=(200, 44), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(200, 200), strides=c
input 13: dtype=float32, shape=(45, 100), strides=c
input 14: dtype=float32, shape=(100, 200), strides=c
input 15: dtype=float32, shape=(100, 100), strides=c
input 16: dtype=float32, shape=(200, 100), strides=c
input 17: dtype=float32, shape=(100, 100), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(1, 44), strides=c
input 20: dtype=float32, shape=(1, 200), strides=c
input 21: dtype=float32, shape=(1, 100), strides=c
input 22: dtype=int64, shape=(1,), strides=c
input 23: dtype=float32, shape=(12, 10), strides=c
input 24: dtype=float32, shape=(12, 10, 200), strides=c
input 25: dtype=float32, shape=(100, 1), strides=c
input 26: dtype=int8, shape=(10,), strides=c
input 27: dtype=float32, shape=(12, 10, 100), strides=c
input 28: dtype=float32, shape=(12, 10, 200), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(1, 10, 100), strides=c
output 1: dtype=float32, shape=(1, 10, 200), strides=c
output 2: dtype=float32, shape=(1, 92160), strides=c
output 3: dtype=float32, shape=(1, 10, 100), strides=c
output 4: dtype=float32, shape=(1, 10, 200), strides=c
output 5: dtype=float32, shape=(2, 92160), strides=c
output 6: dtype=int64, shape=(15, 10), strides=c
output 7: dtype=int64, shape=(15, 10), strides=c
11.5% 60.0% 3.680s 3.68e-02s 100 2157 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=c
input 2: dtype=float32, shape=(15, 10, 100), strides=c
input 3: dtype=float32, shape=(15, 10, 1), strides=c
input 4: dtype=float32, shape=(15, 10, 200), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(16, 10, 100), strides=c
input 7: dtype=float32, shape=(16, 10, 200), strides=c
input 8: dtype=float32, shape=(16, 10, 12), strides=c
input 9: dtype=float32, shape=(16, 10, 100), strides=c
input 10: dtype=float32, shape=(16, 10, 200), strides=c
input 11: dtype=float32, shape=(16, 10, 12), strides=c
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=c
input 18: dtype=float32, shape=(12, 10, 100), strides=c
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=c
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=c
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=c
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
7.2% 67.2% 2.311s 2.31e-02s 100 2602 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
7.2% 74.4% 2.305s 2.30e-02s 100 2603 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
5.0% 79.4% 1.599s 1.60e-02s 100 1601 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(12, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(13, 10, 100), strides=c
3.1% 82.6% 1.002s 1.00e-02s 100 1861 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask10)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
2.7% 85.2% 0.851s 8.51e-03s 100 1611 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(13, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
0.0% 85.3% 0.011s 1.11e-04s 100 2572 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 85.3% 0.010s 1.05e-04s 100 2573 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 85.3% 0.008s 8.06e-05s 100 2356 GpuSplit{2}(GpuElemwise{mul,no_inplace}.0, TensorConstant{0}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(15, 10), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(14, 10), strides=c
output 1: dtype=float32, shape=(1, 10), strides=c
0.0% 85.4% 0.007s 7.49e-05s 100 1739 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 100), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
0.0% 85.4% 0.007s 7.41e-05s 100 1731 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 100), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
0.0% 85.4% 0.007s 7.37e-05s 100 1682 GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
input 0: dtype=int8, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 100), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
0.0% 85.4% 0.007s 7.11e-05s 100 2477 GpuCAReduce{pre=sqr,red=add}{1,1}(Assert{msg='Theano Assert failed!'}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.96e-05s 100 3110 GpuCAReduce{add}{1,1}(GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.84e-05s 100 2488 GpuCAReduce{pre=sqr,red=add}{1,1}(Assert{msg='Theano Assert failed!'}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.74e-05s 100 3370 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuElemwise{Switch,no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.71e-05s 100 3367 GpuCAReduce{add}{1,1}(GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.0% 85.5% 0.007s 6.63e-05s 100 3565 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuElemwise{Switch,no_inplace}.0)
input 0: dtype=float32, shape=(200, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
... (remaining 3554 Apply instances account for 14.46%(4.62s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 57KB (61KB)
GPU: 4979KB (6661KB)
CPU + GPU: 5035KB (6721KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 56KB (61KB)
GPU: 6160KB (7107KB)
CPU + GPU: 6216KB (7167KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 115KB
GPU: 16958KB
CPU + GPU: 17073KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
1132320B [(1, 10, 100), (1, 10, 200), (1, 92160), (1, 10, 100), (1, 10, 200), (2, 92160), (15, 10), (15, 10)] i i i i i i c c forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0, GpuJoin.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
399360B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12)] i i i i i i forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0, state_to_gates, W, state_to_state, W, W, GpuFromHost.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuJoin.0, All{0}.0, GpuReshape{2}.0)
368640B [(92160,)] v GpuSubtensor{int64}(forall_inplace,gpu,generator_generate_scan&generator_generate_scan}.5, ScalarFromTensor.0)
368640B [(1, 92160)] v GpuDimShuffle{x,0}(<CudaNdarrayType(float32, vector)>)
368640B [(1, 92160)] c GpuAllocEmpty(TensorConstant{1}, Shape_i{0}.0)
368640B [(1, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, GpuDimShuffle{x,0}.0, Constant{1})
368640B [(1, 92160)] v Rebroadcast{0}(GpuDimShuffle{x,0}.0)
200000B [(12, 10, 100), (13, 10, 100), (12, 10, 100), (13, 10, 100)] i i i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5)) - i3)}}.0, Elemwise{sub,no_inplace}.0, Elemwise{switch,no_inplace}.0, Elemwise{add,no_inplace}.0)
192000B [(2, 12, 10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i3), Switch(LT((Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4), i3), i3, (Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2) + i2 + i4)), Switch(LT(Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5), Composite{maximum(maximum(i0, i1), i2)}(i0, i1, i2), i5)) - i3)}}.0, max_attended_length, generator_generate_batch_size, Elemwise{add,no_inplace}.0)
160000B [(200, 200)] v GpuDimShuffle{1,0}(W)
160000B [(200, 200)] i GpuElemwise{Mul}[(0, 0)](Assert{msg='Theano Assert failed!'}.0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] c GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}(GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}.0, GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)].0, GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)].0, GpuDimShuffle{x,x}.0)
160000B [(200, 200)] v GpuDimShuffle{1,0}(W)
160000B [(200, 200)] v Assert{msg='Theano Assert failed!'}(GpuDot22.0, Elemwise{eq,no_inplace}.0, Elemwise{Composite{EQ(i0, Switch(i1, (i2 // (-(i3 * i4 * i5))), i3))}}.0)
... (remaining 3554 Apply account for 51935459B/60721859B ((85.53%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
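
Two things stand out in this training-step profile. First, 82.1% of the thunk time is in the seven Scan ops, which run as Python ("Py" in the Class table): what their names suggest are the forward and gradient loops of the attention recurrent transition and of the encoder GRUs, plus the generation (sampling) loop. Each scan step dispatches its inner graph through the Python VM, and with per-step operands of only (10, 100) or (10, 200) float32 the GPU kernels are too small to hide that overhead; the per-step breakdown follows in the Scan Op profiling sections below. Second, the per-parameter elemwise composites in the Ops table are consistent with an Adam-style update; a NumPy sketch of the rule they appear to implement (the names lr, beta1, beta2, eps, t and the exact pairing of composite inputs are assumptions):

import numpy as np

def adam_step(p, g, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                           # Composite{((i0 * i1) + (i2 * i3))}
    v = beta2 * v + (1 - beta2) * np.square(g)                # Composite{((i0 * sqr(i1)) + (i2 * i3))}
    lr_t = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)    # Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))}
    p = p - lr_t * m / (np.sqrt(v) + eps)                     # Composite{((i0 * i1) / (sqrt(i2) + i3))}
    return p, m, v

The memory table also quantifies the usual allow_gc trade-off for this graph: disabling garbage collection (Theano flag allow_gc=False) would raise the GPU peak from about 4979KB to about 16958KB in exchange for fewer per-call allocations.
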
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 1.585470e+00s
Total time spent in calling the VM 1.543819e+00s (97.373%)
Total overhead (computing slices..) 4.165101e-02s (2.627%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
57.3% 57.3% 0.500s 5.21e-05s C 9600 8 theano.sandbox.cuda.blas.GpuGemm
39.2% 96.5% 0.343s 2.04e-05s C 16800 14 theano.sandbox.cuda.basic_ops.GpuElemwise
3.5% 100.0% 0.031s 3.19e-06s C 9600 8 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
57.3% 57.3% 0.500s 5.21e-05s C 9600 8 GpuGemm{no_inplace}
12.5% 69.7% 0.109s 2.27e-05s C 4800 4 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
10.8% 80.5% 0.094s 1.96e-05s C 4800 4 GpuElemwise{mul,no_inplace}
10.6% 91.1% 0.093s 1.93e-05s C 4800 4 GpuElemwise{ScalarSigmoid}[(0, 0)]
5.4% 96.5% 0.047s 1.96e-05s C 2400 2 GpuElemwise{sub,no_inplace}
1.9% 98.4% 0.017s 3.49e-06s C 4800 4 GpuSubtensor{::, :int64:}
1.6% 100.0% 0.014s 2.89e-06s C 4800 4 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
7.3% 7.3% 0.063s 5.28e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.2% 14.5% 0.063s 5.25e-05s 1200 4 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.2% 21.7% 0.063s 5.25e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.2% 28.8% 0.063s 5.21e-05s 1200 5 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
7.1% 36.0% 0.062s 5.18e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.1% 43.1% 0.062s 5.17e-05s 1200 23 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.1% 50.2% 0.062s 5.16e-05s 1200 25 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
7.1% 57.3% 0.062s 5.16e-05s 1200 24 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.2% 60.4% 0.028s 2.31e-05s 1200 26 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]23[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.1% 63.5% 0.027s 2.27e-05s 1200 27 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]23[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.1% 66.7% 0.027s 2.26e-05s 1200 28 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.1% 69.7% 0.027s 2.24e-05s 1200 29 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.9% 72.6% 0.025s 2.09e-05s 1200 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
2.7% 75.3% 0.024s 1.97e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 78.0% 0.024s 1.96e-05s 1200 21 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 80.7% 0.023s 1.96e-05s 1200 20 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 83.4% 0.023s 1.95e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.7% 86.1% 0.023s 1.95e-05s 1200 6 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.7% 88.7% 0.023s 1.93e-05s 1200 8 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.6% 91.3% 0.023s 1.92e-05s 1200 7 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
... (remaining 10 Apply instances account for 8.66%(0.08s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 27KB (51KB)
CPU + GPU: 27KB (51KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 27KB (51KB)
CPU + GPU: 27KB (51KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 78KB
CPU + GPU: 78KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace01[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace23[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]23[cuda], state_to_gates_copy23[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]01[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace01[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy01[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]23[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]01[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace23[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy23[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
... (remaining 10 Apply account for 32080B/144080B ((22.27%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
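The memory profiles in this dump quote two Theano settings, allow_gc=False and optimizer_excluding=inplace, when estimating alternative peak-memory figures. As a minimal sketch (assuming the profiled functions are compiled with theano.function in the usual way; the particular values below are illustrative, not the ones used for this run), these settings can be changed before compilation, or passed through the THEANO_FLAGS environment variable:

    import theano

    # Enable the per-function profiles that produced this output, and try the
    # alternative settings the memory profile refers to. These are standard
    # Theano config options; the chosen values are only an example.
    theano.config.profile = True     # print Function/Scan profiles like the ones above
    theano.config.allow_gc = False   # corresponds to the "if allow_gc=False" peak-memory lines
    # The "optimizer_excluding=inplace" estimate is most easily reproduced via the
    # environment, e.g. THEANO_FLAGS='profile=True,optimizer_excluding=inplace'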
Scan Op profiling ( gatedrecurrent_apply_scan&gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 8.424067e-01s
Total time spent in calling the VM 8.255837e-01s (98.003%)
Total overhead (computing slices..) 1.682305e-02s (1.997%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
54.5% 54.5% 0.248s 5.17e-05s C 4800 4 theano.sandbox.cuda.blas.GpuGemm
42.3% 96.7% 0.193s 2.01e-05s C 9600 8 theano.sandbox.cuda.basic_ops.GpuElemwise
3.3% 100.0% 0.015s 3.10e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
54.5% 54.5% 0.248s 5.17e-05s C 4800 4 GpuGemm{no_inplace}
11.7% 66.2% 0.054s 2.23e-05s C 2400 2 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}
10.3% 76.5% 0.047s 1.95e-05s C 2400 2 GpuElemwise{mul,no_inplace}
10.2% 86.7% 0.047s 1.94e-05s C 2400 2 GpuElemwise{sub,no_inplace}
10.0% 96.7% 0.046s 1.91e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
1.8% 98.5% 0.008s 3.35e-06s C 2400 2 GpuSubtensor{::, :int64:}
1.5% 100.0% 0.007s 2.85e-06s C 2400 2 GpuSubtensor{::, int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
13.8% 13.8% 0.063s 5.24e-05s 1200 1 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.6% 27.4% 0.062s 5.17e-05s 1200 3 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
13.6% 40.9% 0.062s 5.15e-05s 1200 13 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
13.5% 54.5% 0.062s 5.14e-05s 1200 12 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.9% 60.4% 0.027s 2.25e-05s 1200 14 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.8% 66.2% 0.027s 2.22e-05s 1200 15 GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(1, 1), strides=c
input 5: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.4% 71.6% 0.025s 2.04e-05s 1200 0 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
5.1% 76.7% 0.023s 1.95e-05s 1200 10 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.1% 81.9% 0.023s 1.95e-05s 1200 11 GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
5.0% 86.9% 0.023s 1.91e-05s 1200 4 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
5.0% 91.9% 0.023s 1.90e-05s 1200 5 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.8% 96.7% 0.022s 1.84e-05s 1200 2 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
0.9% 97.6% 0.004s 3.39e-06s 1200 6 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
0.9% 98.5% 0.004s 3.31e-06s 1200 8 GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
0.8% 99.3% 0.004s 2.97e-06s 1200 7 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
0.7% 100.0% 0.003s 2.74e-06s 1200 9 GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 20KB (27KB)
CPU + GPU: 20KB (27KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 39KB
CPU + GPU: 39KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]1[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))},no_inplace}(<CudaNdarrayType(float32, col)>, GpuGemm{no_inplace}.0, GpuSubtensor{::, :int64:}.0, gatedrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_initial_states_states[t-1]0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
... (remaining 2 Apply account for 80B/72080B ((0.11%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
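Reading the gatedrecurrent_apply_scan body above: the GpuGemm{no_inplace} pairs, the ScalarSigmoid, the GpuSubtensor slices and the fused Composite{((i0 * ((tanh(i1) * i2) + (i3 * (i4 - i2)))) + (i5 * i3))} together form one masked GRU step per direction. A minimal NumPy sketch of the same arithmetic, under that reading (the names and the dim=100 split are inferred from the printed shapes, not taken from the model code):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(inputs, gate_inputs, h_prev, W_state, W_gates, mask, dim=100):
        # GpuGemm + ScalarSigmoid: gates = sigmoid(gate_inputs + h_prev . state_to_gates)
        gates = sigmoid(gate_inputs + h_prev.dot(W_gates))            # (batch, 2*dim)
        update, reset = gates[:, :dim], gates[:, dim:]                # the GpuSubtensor slices
        # GpuElemwise{mul} + GpuGemm: candidate = tanh(inputs + (h_prev * reset) . state_to_state)
        candidate = np.tanh(inputs + (h_prev * reset).dot(W_state))   # tanh(i1) in the Composite
        h_new = candidate * update + h_prev * (1.0 - update)          # (tanh(i1)*i2 + i3*(1 - i2))
        # i0 / i5 in the Composite: a per-sequence mask column keeps h_prev where mask == 0
        return mask * h_new + (1.0 - mask) * h_prev

With batch size 10, dim 100 and a (10, 1) mask this reproduces the (10, 100) and (10, 200) shapes printed for the Apply nodes above.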
Scan Op profiling ( generator_generate_scan&generator_generate_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 6.135115e+00s
Total time spent in calling the VM 5.882160e+00s (95.877%)
Total overhead (computing slices..) 2.529552e-01s (4.123%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
26.5% 26.5% 0.788s 2.02e-05s C 39000 26 theano.sandbox.cuda.basic_ops.GpuElemwise
22.6% 49.0% 0.672s 4.48e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm
19.1% 68.1% 0.569s 3.79e-05s C 15000 10 theano.sandbox.cuda.blas.GpuDot22
10.9% 79.0% 0.325s 2.16e-05s C 15000 10 theano.sandbox.cuda.basic_ops.GpuCAReduce
4.5% 83.6% 0.135s 4.51e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
4.0% 87.5% 0.118s 3.93e-05s C 3000 2 theano.sandbox.rng_mrg.GPU_mrg_uniform
3.8% 91.3% 0.113s 1.89e-05s C 6000 4 theano.sandbox.cuda.basic_ops.HostFromGpu
1.8% 93.1% 0.052s 2.49e-06s C 21000 14 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.7% 94.8% 0.050s 1.67e-05s C 3000 2 theano.tensor.basic.MaxAndArgmax
1.1% 95.9% 0.034s 2.25e-06s C 15000 10 theano.compile.ops.Shape_i
1.0% 96.9% 0.029s 3.17e-06s C 9000 6 theano.sandbox.cuda.basic_ops.GpuReshape
0.8% 97.7% 0.023s 1.55e-05s C 1500 1 theano.sandbox.cuda.basic_ops.GpuFromHost
0.8% 98.4% 0.023s 1.90e-06s C 12000 8 theano.tensor.opt.MakeVector
0.7% 99.1% 0.020s 3.31e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.4% 99.5% 0.013s 4.27e-06s C 3000 2 theano.sandbox.multinomial.MultinomialFromUniform
0.3% 99.8% 0.009s 2.11e-06s C 4500 3 theano.tensor.elemwise.Elemwise
0.2% 100.0% 0.005s 3.28e-06s C 1500 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.6% 22.6% 0.672s 4.48e-05s C 15000 10 GpuGemm{inplace}
19.1% 41.7% 0.569s 3.79e-05s C 15000 10 GpuDot22
4.5% 46.2% 0.135s 4.51e-05s C 3000 2 GpuAdvancedSubtensor1
4.2% 50.4% 0.124s 2.07e-05s C 6000 4 GpuElemwise{mul,no_inplace}
4.0% 54.3% 0.118s 3.93e-05s C 3000 2 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}
3.8% 58.1% 0.113s 1.89e-05s C 6000 4 HostFromGpu
2.6% 60.7% 0.076s 2.54e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
2.5% 63.2% 0.076s 2.52e-05s C 3000 2 GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}
2.3% 65.6% 0.070s 2.33e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
2.1% 67.7% 0.063s 2.10e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
2.1% 69.8% 0.062s 2.07e-05s C 3000 2 GpuElemwise{add,no_inplace}
2.1% 71.8% 0.061s 2.04e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
2.0% 73.9% 0.061s 2.02e-05s C 3000 2 GpuCAReduce{maximum}{0,1}
2.0% 75.9% 0.060s 1.99e-05s C 3000 2 GpuCAReduce{add}{1,0}
2.0% 77.9% 0.059s 1.96e-05s C 3000 2 GpuElemwise{Composite{exp((i0 - i1))},no_inplace}
2.0% 79.8% 0.059s 1.96e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
1.9% 81.8% 0.058s 1.93e-05s C 3000 2 GpuCAReduce{add}{0,1}
1.9% 83.7% 0.057s 1.91e-05s C 3000 2 GpuElemwise{Add}[(0, 1)]
1.9% 85.6% 0.057s 1.91e-05s C 3000 2 GpuElemwise{Add}[(0, 0)]
1.9% 87.5% 0.057s 1.89e-05s C 3000 2 GpuElemwise{Composite{(i0 + log(i1))}}[(0, 0)]
... (remaining 21 Ops account for 12.45%(0.37s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
2.4% 2.4% 0.073s 4.84e-05s 1500 20 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.4% 4.9% 0.072s 4.82e-05s 1500 21 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.4% 7.3% 0.072s 4.81e-05s 1500 75 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 9.7% 0.072s 4.80e-05s 1500 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.4% 12.1% 0.071s 4.75e-05s 1500 16 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.4% 14.5% 0.071s 4.74e-05s 1500 18 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 44), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 44), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.3% 16.8% 0.069s 4.59e-05s 1500 57 GpuAdvancedSubtensor1(W_copy01[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(10,), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.2% 19.0% 0.067s 4.44e-05s 1500 59 GpuAdvancedSubtensor1(W_copy01[cuda], argmax)
input 0: dtype=float32, shape=(45, 100), strides=c
input 1: dtype=int64, shape=(10,), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.1% 21.2% 0.064s 4.25e-05s 1500 1 GpuDot22(generator_initial_states_states[t-1]01[cuda], W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
2.0% 23.2% 0.061s 4.05e-05s 1500 64 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.0% 25.3% 0.061s 4.04e-05s 1500 63 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
2.0% 27.3% 0.059s 3.96e-05s 1500 77 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.0% 29.2% 0.059s 3.95e-05s 1500 78 GpuGemm{inplace}(GpuGemm{inplace}.0, TensorConstant{1.0}, GpuReshape{2}.0, W_copy01[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.0% 31.2% 0.059s 3.94e-05s 1500 28 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(10,), strides=c
2.0% 33.2% 0.059s 3.92e-05s 1500 26 GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(92160,), strides=c
input 1: dtype=int64, shape=(1,), strides=c
output 0: dtype=float32, shape=(92160,), strides=c
output 1: dtype=float32, shape=(10,), strides=c
1.9% 35.1% 0.058s 3.84e-05s 1500 8 GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.9% 37.1% 0.057s 3.83e-05s 1500 3 GpuDot22(generator_initial_states_states[t-1]01[cuda], W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 44), strides=c
output 0: dtype=float32, shape=(10, 44), strides=c
1.9% 39.0% 0.057s 3.82e-05s 1500 9 GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
1.9% 40.9% 0.056s 3.72e-05s 1500 82 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.9% 42.7% 0.056s 3.72e-05s 1500 81 GpuDot22(GpuElemwise{Composite{((tanh((i0 + i1)) * i2) + (i3 * (i4 - i2)))},no_inplace}.0, W_copy01[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
... (remaining 95 Apply instances account for 57.27%(1.71s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 9KB (9KB)
GPU: 837KB (923KB)
CPU + GPU: 846KB (932KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 9KB (9KB)
GPU: 837KB (923KB)
CPU + GPU: 846KB (932KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 11KB
GPU: 1080KB
CPU + GPU: 1091KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
368680B [(92160,), (10,)] c c GPU_mrg_uniform{CudaNdarrayType(float32, vector),no_inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace01[cuda])
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuElemwise{TrueDiv}[(0, 0)].0, cont_att_compute_weighted_averages_attended_replace01[cuda])
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace01[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace01[cuda])
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, generator_initial_states_weighted_averages[t-1]01[cuda], W_copy01[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuDot22(generator_initial_states_states[t-1]01[cuda], state_to_gates_copy01[cuda])
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](<CudaNdarrayType(float32, row)>, GpuGemm{inplace}.0)
... (remaining 95 Apply account for 138458B/1515818B ((9.13%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
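The generator_generate_scan above mixes the readout with sampling machinery: GpuCAReduce{maximum} / Composite{exp((i0 - i1))} / Composite{(i0 + log(i1))} are the pieces of a numerically stable softmax and log-probability, GPU_mrg_uniform draws uniform numbers, and MultinomialFromUniform turns them into one-hot samples whose indices feed the GpuAdvancedSubtensor1 embedding lookup. A minimal NumPy sketch of that sampling path, under that reading (function and argument names are illustrative only):

    import numpy as np

    def sample_outputs(energies, u):
        # stable softmax over the class axis (the maximum / exp(i0 - i1) / add kernels)
        shifted = energies - energies.max(axis=1, keepdims=True)
        expd = np.exp(shifted)
        probs = expd / expd.sum(axis=1, keepdims=True)
        # MultinomialFromUniform: one uniform per row picks a class from the CDF
        cdf = np.cumsum(probs, axis=1)
        idx = (u[:, None] >= cdf).sum(axis=1)          # index of the sampled class
        onehot = np.zeros_like(probs)
        onehot[np.arange(len(u)), idx] = 1.0
        return onehot, idx                             # idx then indexes the embedding matrix

With a (10, 44) readout and 10 uniforms per step this matches the shapes printed for the multinomial and GpuAdvancedSubtensor1 nodes.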
Scan Op profiling ( attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 3.657264e+00s
Total time spent in calling the VM 3.536357e+00s (96.694%)
Total overhead (computing slices..) 1.209071e-01s (3.306%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
34.1% 34.1% 0.573s 2.01e-05s C 28500 19 theano.sandbox.cuda.basic_ops.GpuElemwise
26.7% 60.8% 0.448s 3.73e-05s C 12000 8 theano.sandbox.cuda.blas.GpuDot22
17.2% 77.9% 0.289s 4.81e-05s C 6000 4 theano.sandbox.cuda.blas.GpuGemm
11.4% 89.3% 0.191s 2.13e-05s C 9000 6 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.7% 92.1% 0.046s 1.53e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuFromHost
2.5% 94.6% 0.042s 2.33e-06s C 18000 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.2% 95.7% 0.020s 2.20e-06s C 9000 6 theano.compile.ops.Shape_i
1.1% 96.9% 0.019s 3.20e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
1.1% 98.0% 0.019s 3.11e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuReshape
0.8% 98.8% 0.013s 2.15e-06s C 6000 4 theano.tensor.elemwise.Elemwise
0.7% 99.4% 0.011s 1.84e-06s C 6000 4 theano.tensor.opt.MakeVector
0.6% 100.0% 0.010s 3.21e-06s C 3000 2 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
26.7% 26.7% 0.448s 3.73e-05s C 12000 8 GpuDot22
17.2% 43.8% 0.289s 4.81e-05s C 6000 4 GpuGemm{inplace}
7.3% 51.1% 0.122s 2.04e-05s C 6000 4 GpuElemwise{mul,no_inplace}
4.3% 55.4% 0.072s 2.42e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
4.2% 59.7% 0.071s 2.36e-05s C 3000 2 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}
3.6% 63.3% 0.061s 2.02e-05s C 3000 2 GpuElemwise{add,no_inplace}
3.6% 66.9% 0.060s 2.01e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
3.6% 70.4% 0.060s 2.00e-05s C 3000 2 GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)]
3.5% 74.0% 0.059s 1.97e-05s C 3000 2 GpuElemwise{Composite{(exp((i0 - i1)) * i2)}}[(0, 0)]
3.5% 77.4% 0.059s 1.96e-05s C 3000 2 GpuCAReduce{add}{1,0}
3.4% 80.9% 0.058s 1.92e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
3.3% 84.2% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Add}[(0, 0)]
3.3% 87.5% 0.056s 1.86e-05s C 3000 2 GpuElemwise{Tanh}[(0, 0)]
2.7% 90.3% 0.046s 1.53e-05s C 3000 2 GpuFromHost
1.8% 92.1% 0.030s 2.03e-05s C 1500 1 GpuElemwise{sub,no_inplace}
1.1% 93.2% 0.019s 3.11e-06s C 6000 4 GpuReshape{2}
0.8% 94.0% 0.014s 2.35e-06s C 6000 4 GpuDimShuffle{x,0}
0.7% 94.7% 0.011s 1.84e-06s C 6000 4 MakeVector{dtype='int64'}
0.6% 95.3% 0.011s 3.54e-06s C 3000 2 GpuSubtensor{::, :int64:}
0.6% 95.9% 0.010s 3.21e-06s C 3000 2 InplaceDimShuffle{x,0}
... (remaining 10 Ops account for 4.11%(0.07s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.3% 4.3% 0.073s 4.86e-05s 1500 14 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.3% 8.6% 0.072s 4.82e-05s 1500 17 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
4.3% 12.9% 0.072s 4.78e-05s 1500 35 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
4.3% 17.2% 0.072s 4.78e-05s 1500 36 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.5% 20.6% 0.058s 3.87e-05s 1500 11 GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.4% 24.0% 0.057s 3.82e-05s 1500 5 GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=c
3.3% 27.4% 0.056s 3.74e-05s 1500 39 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 30.7% 0.056s 3.71e-05s 1500 33 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 34.0% 0.056s 3.70e-05s 1500 34 GpuDot22(GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 37.3% 0.055s 3.69e-05s 1500 40 GpuDot22(GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}.0, W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
3.3% 40.6% 0.055s 3.67e-05s 1500 50 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=c
3.3% 43.8% 0.055s 3.66e-05s 1500 49 GpuDot22(GpuElemwise{Tanh}[(0, 0)].0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=c
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=c
2.2% 46.0% 0.036s 2.43e-05s 1500 53 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=float32, shape=(10,), strides=c
2.1% 48.2% 0.036s 2.40e-05s 1500 54 GpuCAReduce{maximum}{1,0}(GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=float32, shape=(10,), strides=c
2.1% 50.3% 0.036s 2.37e-05s 1500 37 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
2.1% 52.4% 0.035s 2.35e-05s 1500 38 GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace0[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]0[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
input 0: dtype=float32, shape=(10, 1), strides=c
input 1: dtype=float32, shape=(10, 100), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 100), strides=c
input 5: dtype=float32, shape=(1, 1), strides=c
input 6: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=c
1.9% 54.3% 0.032s 2.13e-05s 1500 71 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.9% 56.2% 0.032s 2.11e-05s 1500 72 GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
input 0: dtype=float32, shape=(12, 10, 1), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
output 0: dtype=float32, shape=(12, 10, 200), strides=c
1.8% 58.0% 0.031s 2.04e-05s 1500 43 GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
input 0: dtype=float32, shape=(12, 10, 100), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
1.8% 59.8% 0.030s 2.03e-05s 1500 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[[ 1.]]}, <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 1), strides=c
... (remaining 55 Apply instances account for 40.20%(0.68s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 118KB (204KB)
CPU + GPU: 118KB (204KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 118KB (204KB)
CPU + GPU: 118KB (204KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 345KB
CPU + GPU: 345KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,x}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace1[cuda])
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuDimShuffle{0,1,2}.0, GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(cont_att_compute_energies_preprocessed_attended_replace0[cuda])
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)
48000B [(120, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuReshape{2}.0)
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]0[cuda], state_to_gates_copy0[cuda])
8000B [(10, 200)] c GpuDot22(attentionrecurrent_initial_states_states[t-1]1[cuda], state_to_gates_copy1[cuda])
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace0[cuda], GpuGemm{inplace}.0)
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]0[cuda], W_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuCAReduce{add}{1,0,0}(GpuElemwise{mul,no_inplace}.0)
8000B [(10, 200)] i GpuElemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](distribute_apply_gate_inputs_replace1[cuda], GpuGemm{inplace}.0)
8000B [(10, 200)] i GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, attentionrecurrent_initial_states_weighted_averages[t-1]1[cuda], W_copy1[cuda], TensorConstant{1.0})
4000B [(1, 10, 100)] v GpuDimShuffle{x,0,1}(GpuDot22.0)
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * ((tanh((i1 + i2)) * i3) + (i4 * (i5 - i3)))) + (i6 * i4))},no_inplace}(<CudaNdarrayType(float32, col)>, distribute_apply_inputs_replace1[cuda], GpuGemm{inplace}.0, GpuSubtensor{::, :int64:}.0, attentionrecurrent_initial_states_states[t-1]1[cuda], CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{sub,no_inplace}.0)
... (remaining 55 Apply account for 62508B/710508B ((8.80%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
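In the attentionrecurrent_do_apply_scan above, the GpuElemwise{add} / Tanh / GpuDot22 chain followed by GpuCAReduce{maximum}{1,0}, the exp and TrueDiv kernels, and GpuCAReduce{add}{1,0,0} is a content-based attention step: score every source position against the current state, softmax over time, then take a weighted average of the attended sequence. A minimal NumPy sketch of that step, under that reading (the (T=12, B=10) shapes come from the printout; the function itself is illustrative):

    import numpy as np

    def attention_step(preprocessed_attended, attended, state_proj, v):
        # preprocessed_attended: (T, B, dim), attended: (T, B, attended_dim),
        # state_proj: (B, dim) from the GpuGemm/GpuDot22 nodes, v: (dim,)
        energies = np.tanh(preprocessed_attended + state_proj[None]).dot(v)   # (T, B)
        energies -= energies.max(axis=0)          # GpuCAReduce{maximum}{1,0}: stable softmax
        weights = np.exp(energies)
        weights /= weights.sum(axis=0)            # the TrueDiv kernel
        # GpuElemwise{mul} + GpuCAReduce{add}{1,0,0}: weighted average over time
        return (weights[:, :, None] * attended).sum(axis=0)                   # (B, attended_dim)

With T=12, B=10, dim=100 and attended_dim=200 this reproduces the (12, 10, 200) and (10, 200) shapes in the memory profile.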
Scan Op profiling ( grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1500 steps) 9.275899e+00s
Total time spent in calling the VM 9.022675e+00s (97.270%)
Total overhead (computing slices..) 2.532237e-01s (2.730%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
40.7% 40.7% 1.753s 1.98e-05s C 88500 59 theano.sandbox.cuda.basic_ops.GpuElemwise
20.9% 61.6% 0.901s 3.76e-05s C 24000 16 theano.sandbox.cuda.blas.GpuDot22
17.1% 78.7% 0.735s 4.90e-05s C 15000 10 theano.sandbox.cuda.blas.GpuGemm
8.7% 87.4% 0.377s 2.09e-05s C 18000 12 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.8% 90.2% 0.122s 2.03e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
2.3% 92.6% 0.099s 2.37e-06s C 42000 28 theano.sandbox.cuda.basic_ops.GpuDimShuffle
2.0% 94.5% 0.085s 1.41e-05s C 6000 4 theano.sandbox.cuda.basic_ops.GpuFromHost
1.3% 95.8% 0.057s 1.90e-05s C 3000 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.1% 97.0% 0.049s 3.25e-06s C 15000 10 theano.sandbox.cuda.basic_ops.GpuReshape
0.8% 97.8% 0.036s 2.00e-06s C 18000 12 theano.compile.ops.Shape_i
0.7% 98.5% 0.028s 2.36e-06s C 12000 8 theano.tensor.elemwise.Elemwise
0.6% 99.0% 0.024s 1.99e-06s C 12000 8 theano.tensor.opt.MakeVector
0.5% 99.5% 0.022s 3.62e-06s C 6000 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.5% 100.0% 0.020s 3.38e-06s C 6000 4 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
20.9% 20.9% 0.901s 3.76e-05s C 24000 16 GpuDot22
13.2% 34.1% 0.568s 4.74e-05s C 12000 8 GpuGemm{inplace}
6.9% 41.1% 0.299s 1.99e-05s C 15000 10 GpuElemwise{mul,no_inplace}
4.1% 45.2% 0.176s 1.96e-05s C 9000 6 GpuCAReduce{add}{1,0}
3.9% 49.0% 0.168s 1.86e-05s C 9000 6 GpuElemwise{Add}[(0, 0)]
3.9% 52.9% 0.167s 5.55e-05s C 3000 2 GpuGemm{no_inplace}
2.8% 55.7% 0.119s 1.98e-05s C 6000 4 GpuElemwise{Add}[(0, 1)]
2.6% 58.3% 0.114s 1.90e-05s C 6000 4 GpuElemwise{add,no_inplace}
2.0% 60.3% 0.085s 1.41e-05s C 6000 4 GpuFromHost
1.8% 62.1% 0.078s 2.61e-05s C 3000 2 GpuCAReduce{maximum}{1,0}
1.6% 63.7% 0.070s 2.32e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
1.5% 65.3% 0.067s 2.23e-05s C 3000 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
1.5% 66.8% 0.067s 2.22e-05s C 3000 2 GpuElemwise{Composite{((((i0 / i1) + i2) * i3) * i4)}}[(0, 0)]
1.5% 68.3% 0.064s 2.14e-05s C 3000 2 GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}
1.5% 69.8% 0.063s 2.11e-05s C 3000 2 GpuIncSubtensor{InplaceInc;::, int64::}
1.4% 71.2% 0.062s 2.07e-05s C 3000 2 GpuElemwise{Composite{((-(i0 * i1)) / i2)},no_inplace}
1.4% 72.6% 0.061s 2.05e-05s C 3000 2 GpuCAReduce{add}{0,0,1}
1.4% 74.1% 0.061s 2.03e-05s C 3000 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
1.4% 75.5% 0.061s 2.02e-05s C 3000 2 GpuCAReduce{add}{1,0,0}
1.4% 76.9% 0.060s 2.02e-05s C 3000 2 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 33 Ops account for 23.13%(1.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
2.0% 2.0% 0.085s 5.69e-05s 1500 151 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states1[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.9% 3.9% 0.081s 5.41e-05s 1500 152 GpuGemm{no_inplace}(attentionrecurrent_do_apply_states0[cuda], TensorConstant{1.0}, GpuCAReduce{add}{1,0,0}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.8% 5.6% 0.077s 5.10e-05s 1500 172 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.8% 7.4% 0.076s 5.09e-05s 1500 174 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.7% 9.1% 0.073s 4.86e-05s 1500 83 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.7% 10.8% 0.073s 4.85e-05s 1500 27 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.7% 12.5% 0.073s 4.84e-05s 1500 36 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace0[cuda], W_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.7% 14.2% 0.072s 4.80e-05s 1500 81 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, fork_gate_inputs_apply_input__replace1[cuda], W_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=c
input 3: dtype=float32, shape=(200, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.6% 15.8% 0.070s 4.70e-05s 1500 171 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.6% 17.4% 0.070s 4.69e-05s 1500 173 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, W_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 200), strides=(1, 200)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.5% 18.9% 0.063s 4.20e-05s 1500 132 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(100, 120), strides=(1, 100)
input 1: dtype=float32, shape=(120, 1), strides=(1, 0)
output 0: dtype=float32, shape=(100, 1), strides=(1, 0)
1.5% 20.3% 0.063s 4.19e-05s 1500 2 GpuDot22(transition_apply_states_replace1[cuda], state_to_gates_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.5% 21.8% 0.063s 4.18e-05s 1500 177 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.5% 23.3% 0.063s 4.17e-05s 1500 175 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, W_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 200), strides=(1, 100)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
1.4% 24.6% 0.059s 3.95e-05s 1500 79 GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
input 0: dtype=float32, shape=(120, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(120, 1), strides=(1, 0)
1.4% 26.0% 0.058s 3.90e-05s 1500 160 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.4% 27.3% 0.058s 3.90e-05s 1500 159 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.3% 28.7% 0.058s 3.87e-05s 1500 9 GpuDot22(transform_states_apply_input__replace1[cuda], W_copy1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.3% 30.0% 0.058s 3.87e-05s 1500 22 GpuDot22(transform_states_apply_input__replace0[cuda], W_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
1.3% 31.4% 0.057s 3.83e-05s 1500 16 GpuDot22(transition_apply_states_replace0[cuda], state_to_gates_copy0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(100, 200), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
... (remaining 161 Apply instances account for 68.63%(2.96s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 369KB (376KB)
CPU + GPU: 369KB (377KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 393KB (377KB)
CPU + GPU: 393KB (378KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 0KB
GPU: 796KB
CPU + GPU: 796KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace1[cuda])
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
96000B [(12, 10, 200)] c GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,0,1}.0, cont_att_compute_weighted_averages_attended_replace0[cuda])
96000B [(12, 10, 200)] c GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}(GpuDimShuffle{x,0,1}.0, GpuElemwise{TrueDiv}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuReshape{3}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuReshape{2}.0)
48000B [(120, 100)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)](GpuDimShuffle{0,1,2}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] v GpuDimShuffle{0,1,2}(GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] i GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)](GpuDimShuffle{0,1,2}.0, CudaNdarrayConstant{[[[ 1.]]]}, GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}.0)
48000B [(12, 10, 100)] c GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}(cont_att_compute_energies_preprocessed_attended_replace1[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
48000B [(12, 10, 100)] c GpuElemwise{add,no_inplace}(GpuElemwise{Composite{(i0 * (i1 - sqr(i2)))}}[(0, 0)].0, <CudaNdarrayType(float32, 3D)>)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] c GpuElemwise{Composite{tanh((i0 + i1))},no_inplace}(cont_att_compute_energies_preprocessed_attended_replace0[cuda], GpuDimShuffle{x,0,1}.0)
48000B [(100, 120)] v GpuDimShuffle{1,0}(GpuReshape{2}.0)
48000B [(120, 100)] c GpuDot22(GpuReshape{2}.0, <CudaNdarrayType(float32, matrix)>)
48000B [(12, 10, 100)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
... (remaining 161 Apply account for 443232B/1595232B ((27.78%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
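
For reference, a "Function profiling" block like the one above is what Theano emits when a compiled function has profiling enabled. Below is a minimal sketch of producing one; the toy graph, variable names and shapes are illustrative only and have nothing to do with the actual model profiled in this log.

import numpy
import theano
import theano.tensor as T

# Toy graph standing in for the real model (illustrative only).
x = T.fmatrix('x')
W = theano.shared(numpy.zeros((100, 200), dtype='float32'), name='W')
y = T.tanh(T.dot(x, W)).sum()

# profile=True attaches a ProfileStats object to the compiled function;
# the Class / Ops / Apply tables above are printed from its summary.
f = theano.function([x], y, profile=True)
f(numpy.ones((10, 100), dtype='float32'))
f.profile.summary()  # prints a "Function profiling" report like the blocks in this log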
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.294225e+00s
Total time spent in calling the VM 2.171460e+00s (94.649%)
Total overhead (computing slices..) 1.227655e-01s (5.351%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
44.1% 44.1% 0.505s 1.91e-05s C 26400 22 theano.sandbox.cuda.basic_ops.GpuElemwise
32.8% 76.9% 0.375s 5.21e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
8.3% 85.2% 0.095s 1.97e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
8.1% 93.3% 0.093s 3.88e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
3.9% 97.2% 0.044s 1.84e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.4% 98.6% 0.016s 3.41e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.9% 99.5% 0.010s 2.16e-06s C 4800 4 theano.compile.ops.Shape_i
0.5% 100.0% 0.005s 2.19e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.1% 22.1% 0.253s 5.28e-05s C 4800 4 GpuGemm{no_inplace}
12.2% 34.4% 0.140s 1.94e-05s C 7200 6 GpuElemwise{mul,no_inplace}
10.7% 45.0% 0.122s 5.09e-05s C 2400 2 GpuGemm{inplace}
8.1% 53.2% 0.093s 3.88e-05s C 2400 2 GpuDot22
4.4% 57.6% 0.051s 2.11e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
4.3% 61.8% 0.049s 2.03e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
4.2% 66.1% 0.048s 2.02e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.1% 70.1% 0.047s 1.94e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.0% 74.2% 0.046s 1.92e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
3.9% 78.1% 0.045s 1.88e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
3.9% 82.0% 0.045s 1.86e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
3.9% 85.9% 0.044s 1.84e-05s C 2400 2 GpuAlloc{memset_0=True}
3.8% 89.7% 0.044s 1.82e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
3.8% 93.5% 0.044s 1.82e-05s C 2400 2 GpuElemwise{sub,no_inplace}
3.7% 97.2% 0.042s 1.77e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
0.7% 97.9% 0.008s 3.51e-06s C 2400 2 GpuSubtensor{::, int64::}
0.7% 98.6% 0.008s 3.31e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.5% 99.1% 0.006s 2.39e-06s C 2400 2 Shape_i{1}
0.5% 99.6% 0.005s 2.19e-06s C 2400 2 GpuDimShuffle{1,0}
0.4% 100.0% 0.005s 1.94e-06s C 2400 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.8% 5.8% 0.066s 5.50e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.5% 11.3% 0.063s 5.26e-05s 1200 7 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.4% 16.7% 0.062s 5.19e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.4% 22.1% 0.062s 5.15e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.3% 27.5% 0.061s 5.09e-05s 1200 42 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.3% 32.8% 0.061s 5.08e-05s 1200 43 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 36.9% 0.047s 3.89e-05s 1200 30 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 40.9% 0.046s 3.87e-05s 1200 31 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 43.1% 0.025s 2.12e-05s 1200 44 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 45.3% 0.025s 2.10e-05s 1200 45 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 47.5% 0.025s 2.05e-05s 1200 36 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 49.6% 0.025s 2.05e-05s 1200 26 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 51.8% 0.024s 2.02e-05s 1200 37 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 53.8% 0.024s 1.99e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 55.9% 0.024s 1.99e-05s 1200 27 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 58.0% 0.024s 1.98e-05s 1200 11 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 60.1% 0.024s 1.98e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 62.1% 0.023s 1.95e-05s 1200 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 64.1% 0.023s 1.93e-05s 1200 38 GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.0% 66.1% 0.023s 1.92e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
... (remaining 26 Apply instances account for 33.85%(0.39s) of the runtime)
Memory Profile
(Sparse variables are ignored)
    (For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
    Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
4000B [(10, 100)] c GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
4000B [(10, 100)] i GpuElemwise{Tanh}[(0, 0)](GpuGemm{no_inplace}.0)
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
4000B [(100, 10)] v GpuDimShuffle{1,0}(GpuElemwise{mul,no_inplace}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
... (remaining 26 Apply account for 80112B/208112B ((38.49%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
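
The "Scan Op profiling" blocks (this one and the near-identical one below) report the time spent inside the virtual machine that executes each step of a Scan node. Below is a minimal sketch of requesting such a per-scan profile; it assumes the usual name= and profile=True arguments of theano.scan, and the step function is a toy stand-in, not the actual gatedrecurrent_apply step.

import theano
import theano.tensor as T

xs = T.fmatrix('xs')   # (n_steps, dim) input sequence
h0 = T.fvector('h0')   # initial state
W = T.fmatrix('W')     # recurrent weights

def step(x_t, h_tm1, W):
    # Toy recurrence standing in for the real gated-recurrent step.
    return T.tanh(x_t + T.dot(h_tm1, W))

# name= labels the "Scan Op profiling ( ... )" header; profile=True asks
# the Scan op to time its inner VM, as in the report above.
hs, _ = theano.scan(step,
                    sequences=xs,
                    outputs_info=h0,
                    non_sequences=W,
                    name='gatedrecurrent_apply_scan',
                    profile=True)
f = theano.function([xs, h0, W], hs, profile=True)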
Scan Op profiling ( grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan )
==================
Message: None
Time in 100 calls of the op (for a total of 1200 steps) 2.288652e+00s
Total time spent in calling the VM 2.166391e+00s (94.658%)
Total overhead (computing slices..) 1.222615e-01s (5.342%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
44.0% 44.0% 0.503s 1.91e-05s C 26400 22 theano.sandbox.cuda.basic_ops.GpuElemwise
32.9% 76.9% 0.376s 5.22e-05s C 7200 6 theano.sandbox.cuda.blas.GpuGemm
8.3% 85.2% 0.094s 1.97e-05s C 4800 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
8.2% 93.4% 0.093s 3.89e-05s C 2400 2 theano.sandbox.cuda.blas.GpuDot22
3.9% 97.2% 0.044s 1.84e-05s C 2400 2 theano.sandbox.cuda.basic_ops.GpuAlloc
1.4% 98.7% 0.016s 3.43e-06s C 4800 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.9% 99.6% 0.010s 2.12e-06s C 4800 4 theano.compile.ops.Shape_i
0.4% 100.0% 0.005s 2.11e-06s C 2400 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
22.2% 22.2% 0.254s 5.28e-05s C 4800 4 GpuGemm{no_inplace}
12.2% 34.4% 0.140s 1.94e-05s C 7200 6 GpuElemwise{mul,no_inplace}
10.7% 45.1% 0.122s 5.09e-05s C 2400 2 GpuGemm{inplace}
8.2% 53.3% 0.093s 3.89e-05s C 2400 2 GpuDot22
4.4% 57.7% 0.051s 2.11e-05s C 2400 2 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)]
4.3% 62.0% 0.049s 2.03e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, int64::}
4.2% 66.2% 0.048s 2.01e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}
4.0% 70.3% 0.046s 1.93e-05s C 2400 2 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.0% 74.3% 0.046s 1.91e-05s C 2400 2 GpuIncSubtensor{InplaceInc;::, :int64:}
3.9% 78.2% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}
3.9% 82.0% 0.044s 1.85e-05s C 2400 2 GpuElemwise{Tanh}[(0, 0)]
3.9% 85.9% 0.044s 1.84e-05s C 2400 2 GpuAlloc{memset_0=True}
3.8% 89.7% 0.044s 1.82e-05s C 2400 2 GpuElemwise{sub,no_inplace}
3.8% 93.5% 0.044s 1.82e-05s C 2400 2 GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)]
3.7% 97.2% 0.042s 1.75e-05s C 2400 2 GpuElemwise{Mul}[(0, 0)]
0.7% 98.0% 0.008s 3.54e-06s C 2400 2 GpuSubtensor{::, int64::}
0.7% 98.7% 0.008s 3.32e-06s C 2400 2 GpuSubtensor{::, :int64:}
0.5% 99.2% 0.006s 2.38e-06s C 2400 2 Shape_i{1}
0.4% 99.6% 0.005s 2.11e-06s C 2400 2 GpuDimShuffle{1,0}
0.4% 100.0% 0.004s 1.87e-06s C 2400 2 Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
5.8% 5.8% 0.066s 5.52e-05s 1200 2 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.5% 11.3% 0.063s 5.27e-05s 1200 7 GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 200), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=c
input 3: dtype=float32, shape=(100, 200), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
5.4% 16.8% 0.062s 5.19e-05s 1200 20 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.4% 22.2% 0.062s 5.16e-05s 1200 22 GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace0[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(100, 1)
input 3: dtype=float32, shape=(100, 100), strides=c
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.4% 27.5% 0.061s 5.10e-05s 1200 43 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace0[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
5.3% 32.9% 0.061s 5.09e-05s 1200 42 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}.0, state_to_gates_copy.T_replace1[cuda], TensorConstant{1.0})
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(), strides=c
input 2: dtype=float32, shape=(10, 200), strides=(200, 1)
input 3: dtype=float32, shape=(200, 100), strides=(1, 200)
input 4: dtype=float32, shape=(), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 37.0% 0.047s 3.89e-05s 1200 30 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
4.1% 41.1% 0.047s 3.88e-05s 1200 31 GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace0[cuda])
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(100, 100), strides=(1, 100)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 43.3% 0.025s 2.12e-05s 1200 44 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states1[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states1[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 45.5% 0.025s 2.11e-05s 1200 45 GpuElemwise{Composite{((i0 * (i1 - i2)) + (i3 * i4) + i5 + i6)}}[(0, 5)](GpuElemwise{mul,no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}, GpuSubtensor{::, :int64:}.0, gatedrecurrent_apply_states0[cuda], GpuElemwise{sub,no_inplace}.0, gatedrecurrent_apply_states0[cuda], GpuGemm{inplace}.0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(10, 100), strides=(200, 1)
input 3: dtype=float32, shape=(10, 100), strides=c
input 4: dtype=float32, shape=(10, 1), strides=(1, 0)
input 5: dtype=float32, shape=(10, 100), strides=c
input 6: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.2% 47.6% 0.025s 2.05e-05s 1200 36 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 49.8% 0.024s 2.04e-05s 1200 26 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 51.9% 0.024s 2.00e-05s 1200 37 GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.1% 54.0% 0.024s 1.99e-05s 1200 27 GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuSubtensor{::, :int64:}.0, CudaNdarrayConstant{[[ 1.]]}, GpuElemwise{Tanh}[(0, 0)].0)
input 0: dtype=float32, shape=(10, 100), strides=(100, 1)
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
input 2: dtype=float32, shape=(1, 1), strides=c
input 3: dtype=float32, shape=(10, 100), strides=(100, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 56.1% 0.024s 1.97e-05s 1200 19 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 58.1% 0.024s 1.97e-05s 1200 18 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace1[cuda], GpuSubtensor{::, int64::}.0)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 100), strides=(200, 1)
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.1% 60.2% 0.023s 1.95e-05s 1200 11 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
2.0% 62.2% 0.023s 1.93e-05s 1200 4 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 64.2% 0.023s 1.93e-05s 1200 9 GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states0[cuda], <CudaNdarrayType(float32, col)>)
input 0: dtype=float32, shape=(10, 100), strides=c
input 1: dtype=float32, shape=(10, 1), strides=c
output 0: dtype=float32, shape=(10, 100), strides=(100, 1)
2.0% 66.3% 0.023s 1.92e-05s 1200 38 GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
input 0: dtype=float32, shape=(10, 200), strides=(200, 1)
input 1: dtype=float32, shape=(10, 100), strides=(100, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(10, 200), strides=(200, 1)
... (remaining 26 Apply instances account for 33.75%(0.39s) of the runtime)
Memory Profile
(Sparse variables are ignored)
    (For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 55KB (78KB)
CPU + GPU: 55KB (78KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 66KB (86KB)
CPU + GPU: 66KB (86KB)
    Max peak memory if allow_gc=False (the linker doesn't make a difference)
CPU: 0KB
GPU: 94KB
CPU + GPU: 94KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace1[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace1[cuda], state_to_gates_copy1[cuda], TensorConstant{1.0})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] c GpuElemwise{Composite{((i0 * i1) * (i2 - i1))},no_inplace}(GpuIncSubtensor{InplaceInc;::, :int64:}.0, GpuElemwise{ScalarSigmoid}[(0, 0)].0, CudaNdarrayConstant{[[ 1.]]})
8000B [(10, 200)] c GpuGemm{no_inplace}(gatedrecurrent_apply_gate_inputs_replace0[cuda], TensorConstant{1.0}, gatedrecurrent_apply_states_replace0[cuda], state_to_gates_copy0[cuda], TensorConstant{1.0})
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
8000B [(10, 200)] i GpuElemwise{ScalarSigmoid}[(0, 0)](GpuGemm{no_inplace}.0)
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, :int64:}(GpuIncSubtensor{InplaceInc;::, int64::}.0, GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)].0, Constant{100})
8000B [(10, 200)] i GpuIncSubtensor{InplaceInc;::, int64::}(GpuAlloc{memset_0=True}.0, GpuElemwise{Mul}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states_replace0[cuda], GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(gatedrecurrent_apply_states1[cuda], <CudaNdarrayType(float32, col)>)
4000B [(10, 100)] i GpuElemwise{Composite{((i0 * i1) + (-(i0 * i2)))}}[(0, 1)](GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, gatedrecurrent_apply_states_replace0[cuda])
4000B [(10, 100)] c GpuDot22(GpuElemwise{Composite{((i0 * i1) * (i2 - sqr(i3)))},no_inplace}.0, state_to_state_copy.T_replace1[cuda])
4000B [(10, 100)] v GpuSubtensor{::, :int64:}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
4000B [(10, 100)] c GpuElemwise{mul,no_inplace}(GpuDot22.0, GpuSubtensor{::, int64::}.0)
4000B [(10, 100)] c GpuGemm{no_inplace}(gatedrecurrent_apply_inputs_replace1[cuda], TensorConstant{1.0}, GpuElemwise{mul,no_inplace}.0, state_to_state_copy1[cuda], TensorConstant{1.0})
4000B [(10, 100)] v GpuSubtensor{::, int64::}(GpuElemwise{ScalarSigmoid}[(0, 0)].0, Constant{100})
... (remaining 26 Apply account for 80112B/208112B ((38.49%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
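
The memory profiles above compare the current configuration against optimizer_excluding=inplace and allow_gc=False, and the block that follows sums every profile printed at process exit. These behaviours are controlled by Theano configuration flags; the sketch below shows one way to set them before import. The flag names are standard Theano config, but the particular combination is only an example, not necessarily the configuration used for this log.

import os

# Flags must be set before theano is imported to take effect.
os.environ['THEANO_FLAGS'] = ','.join([
    'profile=True',                 # print "Function profiling" blocks at exit
    'profile_memory=True',          # include the "Memory Profile" sections
    'allow_gc=False',               # keep intermediates alive; compare the peak-memory line above
    'optimizer_excluding=inplace',  # the variant measured in the memory tables
])

import theano
print(theano.config.allow_gc)  # -> False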
Function profiling
==================
Message: Sum of all(17) printed profiles at exit excluding Scan op profile.
Time in 6938 calls to Function.__call__: 1.028157e+02s
Time in Function.fn.__call__: 1.024500e+02s (99.644%)
Time in thunks: 4.343875e+01s (42.249%)
Total compile time: 6.253434e+02s
Number of Apply nodes: 0
Theano Optimizer time: 2.134617e+02s
Theano validate time: 4.772263e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 2.980593e+02s
Import time 1.529284e+01s
Time in all call to theano.grad() 2.823545e+00s
Time since theano import 834.193s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
66.5% 66.5% 28.871s 3.42e-02s Py 844 11 theano.scan_module.scan_op.Scan
21.8% 88.2% 9.454s 5.87e-02s Py 161 2 lvsr.ops.EditDistanceOp
4.3% 92.5% 1.874s 2.16e-05s C 86731 877 theano.sandbox.cuda.basic_ops.GpuElemwise
1.8% 94.3% 0.779s 3.05e-05s C 25580 252 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.9% 95.2% 0.395s 4.64e-05s C 8505 86 theano.sandbox.cuda.blas.GpuDot22
0.8% 96.0% 0.340s 3.54e-06s C 96048 1098 theano.tensor.elemwise.Elemwise
0.7% 96.7% 0.313s 1.81e-05s C 17247 197 theano.sandbox.cuda.basic_ops.HostFromGpu
0.4% 97.2% 0.180s 2.53e-05s C 7127 75 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 97.6% 0.168s 2.24e-05s Py 7505 51 theano.ifelse.IfElse
0.3% 97.9% 0.151s 3.26e-06s C 46180 473 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 98.2% 0.150s 2.61e-05s C 5766 61 theano.sandbox.cuda.basic_ops.GpuAlloc
0.3% 98.6% 0.137s 7.11e-06s C 19212 205 theano.sandbox.cuda.basic_ops.GpuReshape
0.3% 98.9% 0.127s 7.95e-06s C 16013 116 theano.compile.ops.DeepCopyOp
0.1% 99.0% 0.056s 4.38e-05s C 1280 9 theano.sandbox.cuda.blas.GpuGemm
0.1% 99.1% 0.056s 1.55e-05s C 3593 31 theano.sandbox.cuda.basic_ops.GpuFromHost
0.1% 99.2% 0.054s 3.50e-06s C 15373 167 theano.tensor.opt.MakeVector
0.1% 99.4% 0.051s 4.25e-06s C 12067 128 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 99.5% 0.046s 3.25e-06s C 14041 157 theano.compile.ops.Shape_i
0.1% 99.5% 0.035s 7.33e-05s C 472 6 theano.sandbox.cuda.basic_ops.GpuJoin
0.1% 99.6% 0.033s 5.02e-05s C 648 7 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
... (remaining 24 Classes account for 0.39%(0.17s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
21.8% 21.8% 9.454s 5.87e-02s Py 161 2 EditDistanceOp
21.5% 43.2% 9.321s 9.32e-02s Py 100 1 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}
14.2% 57.4% 6.165s 6.16e-02s Py 100 1 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}
10.6% 68.0% 4.615s 2.31e-02s Py 200 2 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}
8.5% 76.5% 3.680s 3.68e-02s Py 100 1 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}
4.7% 81.2% 2.026s 3.32e-02s Py 61 1 forall_inplace,gpu,generator_generate_scan}
3.7% 84.9% 1.599s 1.60e-02s Py 100 1 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
3.2% 88.0% 1.380s 8.57e-03s Py 161 2 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}
0.9% 88.9% 0.395s 4.64e-05s C 8505 86 GpuDot22
0.7% 89.7% 0.319s 3.80e-05s C 8400 84 GpuCAReduce{pre=sqr,red=add}{1,1}
0.7% 90.4% 0.313s 1.81e-05s C 17247 197 HostFromGpu
0.5% 90.9% 0.207s 2.13e-05s C 9700 97 GpuElemwise{add,no_inplace}
0.4% 91.3% 0.173s 2.20e-05s C 7861 79 GpuElemwise{sub,no_inplace}
0.4% 91.7% 0.171s 3.57e-05s C 4800 48 GpuCAReduce{add}{1,1}
0.4% 92.0% 0.155s 2.38e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) / (sqrt(i2) + i3))},no_inplace}
0.4% 92.4% 0.154s 2.49e-05s Py 6200 39 if{gpu}
0.3% 92.7% 0.150s 2.34e-05s C 6400 64 GpuElemwise{Composite{((i0 * sqrt((i1 - (i2 ** i3)))) / (i1 - (i4 ** i3)))},no_inplace}
0.3% 93.0% 0.134s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * i1) + (i2 * i3))}}[(0, 3)]
0.3% 93.3% 0.133s 2.05e-05s C 6500 65 GpuElemwise{Composite{((i0 * sqr(i1)) + (i2 * i3))}}[(0, 3)]
0.3% 93.6% 0.133s 2.29e-05s C 5800 58 GpuElemwise{Switch,no_inplace}
... (remaining 328 Ops account for 6.36%(2.76s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
21.5% 21.5% 9.321s 9.32e-02s 100 2406 forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 12), strides=c
input 2: dtype=float32, shape=(15, 10, 200), strides=c
input 3: dtype=float32, shape=(15, 10, 100), strides=c
input 4: dtype=float32, shape=(15, 10, 100), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(15, 10, 1), strides=c
input 7: dtype=float32, shape=(15, 10, 200), strides=c
input 8: dtype=float32, shape=(15, 10, 12), strides=c
input 9: dtype=float32, shape=(15, 10, 200), strides=c
input 10: dtype=float32, shape=(15, 10, 100), strides=c
input 11: dtype=float32, shape=(15, 10, 100), strides=c
input 12: dtype=float32, shape=(15, 10, 100), strides=c
input 13: dtype=float32, shape=(15, 10, 200), strides=c
input 14: dtype=float32, shape=(16, 10, 100), strides=c
input 15: dtype=float32, shape=(16, 10, 200), strides=c
input 16: dtype=float32, shape=(16, 10, 12), strides=c
input 17: dtype=float32, shape=(16, 10, 100), strides=c
input 18: dtype=float32, shape=(16, 10, 200), strides=c
input 19: dtype=float32, shape=(16, 10, 12), strides=c
input 20: dtype=float32, shape=(2, 100, 1), strides=c
input 21: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 22: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 23: dtype=float32, shape=(2, 100, 1), strides=c
input 24: dtype=float32, shape=(2, 12, 10, 200), strides=c
input 25: dtype=float32, shape=(2, 12, 10, 100), strides=c
input 26: dtype=int64, shape=(), strides=c
input 27: dtype=int64, shape=(), strides=c
input 28: dtype=int64, shape=(), strides=c
input 29: dtype=int64, shape=(), strides=c
input 30: dtype=int64, shape=(), strides=c
input 31: dtype=int64, shape=(), strides=c
input 32: dtype=int64, shape=(), strides=c
input 33: dtype=int64, shape=(), strides=c
input 34: dtype=float32, shape=(100, 200), strides=c
input 35: dtype=float32, shape=(200, 200), strides=c
input 36: dtype=float32, shape=(100, 100), strides=c
input 37: dtype=float32, shape=(200, 100), strides=c
input 38: dtype=float32, shape=(100, 100), strides=c
input 39: dtype=float32, shape=(200, 200), strides=c
input 40: dtype=float32, shape=(200, 100), strides=c
input 41: dtype=float32, shape=(100, 100), strides=c
input 42: dtype=float32, shape=(100, 200), strides=c
input 43: dtype=float32, shape=(100, 100), strides=c
input 44: dtype=int64, shape=(2,), strides=c
input 45: dtype=float32, shape=(12, 10, 100), strides=c
input 46: dtype=int64, shape=(1,), strides=c
input 47: dtype=float32, shape=(12, 10), strides=c
input 48: dtype=float32, shape=(12, 10, 200), strides=c
input 49: dtype=float32, shape=(100, 1), strides=c
input 50: dtype=int8, shape=(10,), strides=c
input 51: dtype=float32, shape=(1, 100), strides=c
input 52: dtype=float32, shape=(100, 200), strides=c
input 53: dtype=float32, shape=(200, 200), strides=c
input 54: dtype=float32, shape=(100, 100), strides=c
input 55: dtype=float32, shape=(200, 100), strides=c
input 56: dtype=float32, shape=(100, 100), strides=c
input 57: dtype=float32, shape=(200, 200), strides=c
input 58: dtype=float32, shape=(200, 100), strides=c
input 59: dtype=float32, shape=(100, 100), strides=c
input 60: dtype=float32, shape=(100, 200), strides=c
input 61: dtype=float32, shape=(100, 100), strides=c
input 62: dtype=int64, shape=(2,), strides=c
input 63: dtype=float32, shape=(12, 10, 100), strides=c
input 64: dtype=int64, shape=(1,), strides=c
input 65: dtype=float32, shape=(12, 10), strides=c
input 66: dtype=float32, shape=(12, 10, 200), strides=c
input 67: dtype=float32, shape=(100, 1), strides=c
input 68: dtype=int8, shape=(10,), strides=c
input 69: dtype=float32, shape=(1, 100), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
output 6: dtype=float32, shape=(2, 100, 1), strides=c
output 7: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 8: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 9: dtype=float32, shape=(2, 100, 1), strides=c
output 10: dtype=float32, shape=(2, 12, 10, 200), strides=c
output 11: dtype=float32, shape=(2, 12, 10, 100), strides=c
output 12: dtype=float32, shape=(15, 10, 100), strides=c
output 13: dtype=float32, shape=(15, 10, 200), strides=c
output 14: dtype=float32, shape=(15, 10, 100), strides=c
output 15: dtype=float32, shape=(15, 100, 10), strides=c
output 16: dtype=float32, shape=(15, 10, 100), strides=c
output 17: dtype=float32, shape=(15, 10, 200), strides=c
output 18: dtype=float32, shape=(15, 10, 100), strides=c
output 19: dtype=float32, shape=(15, 100, 10), strides=c
19.5% 40.9% 8.452s 1.39e-01s 61 279 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask)
input 0: dtype=int64, shape=(15, 75), strides=c
input 1: dtype=float32, shape=(15, 75), strides=c
input 2: dtype=int64, shape=(12, 75), strides=c
input 3: dtype=float32, shape=(12, 75), strides=c
output 0: dtype=int64, shape=(15, 75, 1), strides=c
14.2% 55.1% 6.165s 6.16e-02s 100 1795 forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuD
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 10, 100), strides=c
input 2: dtype=float32, shape=(1, 10, 200), strides=c
input 3: dtype=float32, shape=(1, 92160), strides=c
input 4: dtype=float32, shape=(1, 10, 100), strides=c
input 5: dtype=float32, shape=(1, 10, 200), strides=c
input 6: dtype=float32, shape=(2, 92160), strides=c
input 7: dtype=int64, shape=(), strides=c
input 8: dtype=int64, shape=(), strides=c
input 9: dtype=float32, shape=(100, 44), strides=c
input 10: dtype=float32, shape=(200, 44), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(200, 200), strides=c
input 13: dtype=float32, shape=(45, 100), strides=c
input 14: dtype=float32, shape=(100, 200), strides=c
input 15: dtype=float32, shape=(100, 100), strides=c
input 16: dtype=float32, shape=(200, 100), strides=c
input 17: dtype=float32, shape=(100, 100), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(1, 44), strides=c
input 20: dtype=float32, shape=(1, 200), strides=c
input 21: dtype=float32, shape=(1, 100), strides=c
input 22: dtype=int64, shape=(1,), strides=c
input 23: dtype=float32, shape=(12, 10), strides=c
input 24: dtype=float32, shape=(12, 10, 200), strides=c
input 25: dtype=float32, shape=(100, 1), strides=c
input 26: dtype=int8, shape=(10,), strides=c
input 27: dtype=float32, shape=(12, 10, 100), strides=c
input 28: dtype=float32, shape=(12, 10, 200), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
output 0: dtype=float32, shape=(1, 10, 100), strides=c
output 1: dtype=float32, shape=(1, 10, 200), strides=c
output 2: dtype=float32, shape=(1, 92160), strides=c
output 3: dtype=float32, shape=(1, 10, 100), strides=c
output 4: dtype=float32, shape=(1, 10, 200), strides=c
output 5: dtype=float32, shape=(2, 92160), strides=c
output 6: dtype=int64, shape=(15, 10), strides=c
output 7: dtype=int64, shape=(15, 10), strides=c
8.5% 63.6% 3.680s 3.68e-02s 100 2157 forall_inplace,gpu,attentionrecurrent_do_apply_scan&attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(15, 10, 200), strides=c
input 2: dtype=float32, shape=(15, 10, 100), strides=c
input 3: dtype=float32, shape=(15, 10, 1), strides=c
input 4: dtype=float32, shape=(15, 10, 200), strides=c
input 5: dtype=float32, shape=(15, 10, 100), strides=c
input 6: dtype=float32, shape=(16, 10, 100), strides=c
input 7: dtype=float32, shape=(16, 10, 200), strides=c
input 8: dtype=float32, shape=(16, 10, 12), strides=c
input 9: dtype=float32, shape=(16, 10, 100), strides=c
input 10: dtype=float32, shape=(16, 10, 200), strides=c
input 11: dtype=float32, shape=(16, 10, 12), strides=c
input 12: dtype=float32, shape=(100, 200), strides=c
input 13: dtype=float32, shape=(200, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(200, 100), strides=c
input 16: dtype=float32, shape=(100, 100), strides=c
input 17: dtype=float32, shape=(12, 10), strides=c
input 18: dtype=float32, shape=(12, 10, 100), strides=c
input 19: dtype=int64, shape=(1,), strides=c
input 20: dtype=float32, shape=(12, 10, 200), strides=c
input 21: dtype=int8, shape=(10,), strides=c
input 22: dtype=float32, shape=(100, 1), strides=c
input 23: dtype=float32, shape=(100, 200), strides=c
input 24: dtype=float32, shape=(200, 200), strides=c
input 25: dtype=float32, shape=(100, 100), strides=c
input 26: dtype=float32, shape=(200, 100), strides=c
input 27: dtype=float32, shape=(100, 100), strides=c
input 28: dtype=float32, shape=(12, 10), strides=c
input 29: dtype=float32, shape=(12, 10, 100), strides=c
input 30: dtype=int64, shape=(1,), strides=c
input 31: dtype=float32, shape=(12, 10, 200), strides=c
input 32: dtype=int8, shape=(10,), strides=c
input 33: dtype=float32, shape=(100, 1), strides=c
output 0: dtype=float32, shape=(16, 10, 100), strides=c
output 1: dtype=float32, shape=(16, 10, 200), strides=c
output 2: dtype=float32, shape=(16, 10, 12), strides=c
output 3: dtype=float32, shape=(16, 10, 100), strides=c
output 4: dtype=float32, shape=(16, 10, 200), strides=c
output 5: dtype=float32, shape=(16, 10, 12), strides=c
5.3% 68.9% 2.311s 2.31e-02s 100 2602 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0,
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
5.3% 74.2% 2.305s 2.30e-02s 100 2603 forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, Shape_i{0}.0, Shape_i{0
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 100), strides=c
input 4: dtype=float32, shape=(12, 10, 1), strides=c
input 5: dtype=float32, shape=(12, 10, 200), strides=c
input 6: dtype=float32, shape=(12, 10, 100), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(12, 10, 1), strides=c
input 9: dtype=float32, shape=(13, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=int64, shape=(), strides=c
input 12: dtype=int64, shape=(), strides=c
input 13: dtype=int64, shape=(), strides=c
input 14: dtype=int64, shape=(), strides=c
input 15: dtype=int64, shape=(), strides=c
input 16: dtype=int64, shape=(), strides=c
input 17: dtype=float32, shape=(100, 200), strides=c
input 18: dtype=float32, shape=(100, 100), strides=c
input 19: dtype=float32, shape=(200, 100), strides=c
input 20: dtype=float32, shape=(100, 100), strides=c
input 21: dtype=float32, shape=(100, 200), strides=c
input 22: dtype=float32, shape=(100, 100), strides=c
input 23: dtype=float32, shape=(200, 100), strides=c
input 24: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(12, 10, 200), strides=c
output 4: dtype=float32, shape=(12, 100, 10), strides=c
output 5: dtype=float32, shape=(12, 10, 100), strides=c
output 6: dtype=float32, shape=(12, 10, 200), strides=c
output 7: dtype=float32, shape=(12, 100, 10), strides=c
4.7% 78.9% 2.026s 3.32e-02s 61 268 forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwis
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
input 2: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
input 3: dtype=float32, shape=(2, 92160), strides=(92160, 1)
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=float32, shape=(100, 44), strides=c
input 6: dtype=float32, shape=(200, 44), strides=c
input 7: dtype=float32, shape=(100, 200), strides=c
input 8: dtype=float32, shape=(200, 200), strides=c
input 9: dtype=float32, shape=(45, 100), strides=c
input 10: dtype=float32, shape=(100, 200), strides=c
input 11: dtype=float32, shape=(100, 100), strides=c
input 12: dtype=float32, shape=(200, 100), strides=c
input 13: dtype=float32, shape=(100, 100), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
input 15: dtype=float32, shape=(1, 44), strides=(0, 1)
input 16: dtype=float32, shape=(1, 200), strides=(0, 1)
input 17: dtype=float32, shape=(1, 100), strides=(0, 1)
input 18: dtype=int64, shape=(1,), strides=c
input 19: dtype=float32, shape=(12, 75), strides=(75, 1)
input 20: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 21: dtype=float32, shape=(100, 1), strides=(1, 0)
input 22: dtype=int8, shape=(75,), strides=c
input 23: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 0: dtype=float32, shape=(1, 75, 100), strides=(0, 100, 1)
output 1: dtype=float32, shape=(1, 75, 200), strides=(0, 200, 1)
output 2: dtype=float32, shape=(2, 92160), strides=(92160, 1)
output 3: dtype=int64, shape=(15, 75), strides=c
3.7% 82.5% 1.599s 1.60e-02s 100 1601 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncS
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(12, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(12, 10, 100), strides=c
input 10: dtype=float32, shape=(13, 10, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
input 13: dtype=float32, shape=(100, 200), strides=c
input 14: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
output 2: dtype=float32, shape=(12, 10, 100), strides=c
output 3: dtype=float32, shape=(13, 10, 100), strides=c
2.3% 84.9% 1.002s 1.00e-02s 100 1861 EditDistanceOp(generator_generate_samples, recognizer_mask_for_prediction_output_0, labels, labels_mask10)
input 0: dtype=int64, shape=(15, 10), strides=c
input 1: dtype=float32, shape=(15, 10), strides=c
input 2: dtype=int64, shape=(12, 10), strides=c
input 3: dtype=float32, shape=(12, 10), strides=c
output 0: dtype=int64, shape=(15, 10, 1), strides=c
2.0% 86.8% 0.851s 8.51e-03s 100 1611 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 10, 200), strides=c
input 2: dtype=float32, shape=(12, 10, 100), strides=c
input 3: dtype=float32, shape=(12, 10, 1), strides=c
input 4: dtype=float32, shape=(12, 10, 200), strides=c
input 5: dtype=float32, shape=(12, 10, 100), strides=c
input 6: dtype=float32, shape=(12, 10, 1), strides=c
input 7: dtype=float32, shape=(13, 10, 100), strides=c
input 8: dtype=float32, shape=(13, 10, 100), strides=c
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(13, 10, 100), strides=c
output 1: dtype=float32, shape=(13, 10, 100), strides=c
1.2% 88.0% 0.528s 8.66e-03s 61 254 forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 75, 200), strides=(15000, 200, 1)
input 2: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 3: dtype=float32, shape=(12, 75, 1), strides=(75, 1, 0)
input 4: dtype=float32, shape=(12, 75, 200), strides=(-15000, 200, 1)
input 5: dtype=float32, shape=(12, 75, 100), strides=(-7500, 100, 1)
input 6: dtype=float32, shape=(12, 75, 1), strides=(-75, 1, 0)
input 7: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 8: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
input 9: dtype=float32, shape=(100, 200), strides=c
input 10: dtype=float32, shape=(100, 100), strides=c
input 11: dtype=float32, shape=(100, 200), strides=c
input 12: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
output 1: dtype=float32, shape=(12, 75, 100), strides=(7500, 100, 1)
0.1% 88.1% 0.043s 3.88e-03s 11 140 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Switch}[(0, 2)].0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 88.2% 0.042s 3.79e-03s 11 182 forall_inplace,gpu,gatedrecurrent_apply_scan}(Elemwise{Maximum}[(0, 0)].0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state)
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(12, 1, 200), strides=(-200, 0, 1)
input 2: dtype=float32, shape=(12, 1, 100), strides=(-100, 0, 1)
input 3: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
input 4: dtype=float32, shape=(100, 200), strides=c
input 5: dtype=float32, shape=(100, 100), strides=c
output 0: dtype=float32, shape=(12, 1, 100), strides=(100, 0, 1)
0.1% 88.3% 0.023s 3.81e-06s 6075 0 DeepCopyOp(labels)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 88.3% 0.016s 2.59e-06s 6075 1 DeepCopyOp(inputs)
input 0: dtype=int64, shape=(12,), strides=c
output 0: dtype=int64, shape=(12,), strides=c
0.0% 88.3% 0.011s 1.11e-04s 100 2572 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 88.4% 0.010s 1.05e-04s 100 2573 GpuSplit{2}(GpuIncSubtensor{InplaceInc;::int64}.0, TensorConstant{2}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(12, 10, 200), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(12, 10, 100), strides=c
output 1: dtype=float32, shape=(12, 10, 100), strides=c
0.0% 88.4% 0.010s 9.82e-05s 100 0 DeepCopyOp(shared_recognizer_costs_prediction)
input 0: dtype=int64, shape=(15, 10), strides=c
output 0: dtype=int64, shape=(15, 10), strides=c
0.0% 88.4% 0.009s 4.94e-05s 176 37 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
input 0: dtype=float32, shape=(12, 200), strides=(200, 1)
input 1: dtype=float32, shape=(200, 100), strides=(100, 1)
output 0: dtype=float32, shape=(12, 100), strides=(100, 1)
0.0% 88.4% 0.008s 8.06e-05s 100 2356 GpuSplit{2}(GpuElemwise{mul,no_inplace}.0, TensorConstant{0}, MakeVector{dtype='int64'}.0)
input 0: dtype=float32, shape=(15, 10), strides=c
input 1: dtype=int8, shape=(), strides=c
input 2: dtype=int64, shape=(2,), strides=c
output 0: dtype=float32, shape=(14, 10), strides=c
output 1: dtype=float32, shape=(1, 10), strides=c
... (remaining 4271 Apply instances account for 11.57%(5.03s) of the runtime)
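The Apply table above is dominated by a few scan nodes (generator_generate_scan and the fused gatedrecurrent_apply_scan ops). Below is a minimal, hypothetical sketch of how such a hot scan can be given its own timing report; it assumes a plain Theano setup and uses a toy recurrence, not the recognizer graph profiled here.

# Minimal, hypothetical sketch: toy recurrence, not the recognizer profiled above.
# When scan nodes dominate the Apply table like this, passing profile=True to
# theano.scan gives the scan's inner function its own profile, printed together
# with the outer one.
import numpy
import theano
import theano.tensor as tt

floatX = theano.config.floatX
x = tt.tensor3('x')                                    # (n_steps, batch, dim)
w = theano.shared(numpy.eye(100, dtype=floatX), 'w')   # toy recurrent weights

def step(x_t, h_tm1, w):
    # simplified transition; the real graph is a GRU stack with attention
    return tt.tanh(x_t + tt.dot(h_tm1, w))

h0 = tt.zeros((x.shape[1], 100), dtype=floatX)
hs, _ = theano.scan(step, sequences=x, outputs_info=h0,
                    non_sequences=w, profile=True)     # profile the inner graph

f = theano.function([x], hs[-1].sum(), profile=True)   # profile the outer graph
f(numpy.zeros((12, 10, 100), dtype=floatX))
f.profile.summary()                                    # same report format as above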
Memory Profile (the max between all functions in that profile)
(Sparse variables are ignored)
(Values in brackets are for linker = c|py)
---
Max peak memory with current setting
CPU: 57KB (61KB)
GPU: 4979KB (6661KB)
CPU + GPU: 5035KB (6721KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 56KB (61KB)
GPU: 6160KB (7107KB)
CPU + GPU: 6216KB (7167KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 115KB
GPU: 16958KB
CPU + GPU: 17073KB
---
This list is based on all functions in the profile
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
1576960B [(16, 10, 100), (16, 10, 200), (16, 10, 12), (16, 10, 100), (16, 10, 200), (16, 10, 12), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (2, 100, 1), (2, 12, 10, 200), (2, 12, 10, 100), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10), (15, 10, 100), (15, 10, 200), (15, 10, 100), (15, 100, 10)] i i i i i i i i i i i i c c c c c c c c forall_inplace,gpu,grad_of_attentionrecurrent_do_apply_scan&grad_of_attentionrecurrent_do_apply_scan}(Subtensor{int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuElemwise{second,no_inplace}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{:int64:}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{int64}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0, state_to_gates, W, state_to_state, W, W, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, GpuElemwise{Add}[(0, 0)].0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuDimShuffle{1,0}.0)
1132320B [(1, 10, 100), (1, 10, 200), (1, 92160), (1, 10, 100), (1, 10, 200), (2, 92160), (15, 10), (15, 10)] i i i i i i c c forall_inplace,gpu,generator_generate_scan&generator_generate_scan}(recognizer_generate_n_steps0011, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps0011, recognizer_generate_n_steps0011, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0, GpuJoin.0, GpuElemwise{Add}[(0, 0)].0)
836280B [(1, 75, 100), (1, 75, 200), (2, 92160), (15, 75)] i i i c forall_inplace,gpu,generator_generate_scan}(recognizer_generate_n_steps, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, recognizer_generate_n_steps, W, W, state_to_gates, W, W, W, state_to_state, W, W, W, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, GpuFromHost.0, GpuJoin.0, GpuReshape{2}.0, All{0}.0, GpuElemwise{Add}[(0, 0)].0)
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}[(0, 0)].0, Shape_i{0}.0)
737280B [(2, 92160)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i3), (maximum(i0, i1) - i3)) + i3)}}.0, Shape_i{0}.0)
720000B [(12, 75, 200)] v GpuDimShuffle{0,1,2}(GpuJoin.0)
720000B [(900, 200)] v GpuReshape{2}(GpuDimShuffle{0,1,2}.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int64}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{-1})
720000B [(900, 200)] c GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
720000B [(12, 75, 200)] v GpuSubtensor{int64:int64:int8}(GpuElemwise{Add}[(0, 0)].0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] c GpuJoin(TensorConstant{2}, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int64}.0)
720000B [(12, 75, 100), (12, 75, 100)] i i forall_inplace,gpu,gatedrecurrent_apply_scan&gatedrecurrent_apply_scan}(Shape_i{0}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, state_to_gates, state_to_state, state_to_gates, state_to_state)
720000B [(12, 75, 200)] v GpuReshape{3}(GpuDot22.0, MakeVector{dtype='int64'}.0)
720000B [(12, 75, 200)] i GpuElemwise{Add}[(0, 0)](GpuReshape{3}.0, GpuDimShuffle{x,x,0}.0)
488000B [(13, 10, 100), (13, 10, 100), (12, 10, 100), (12, 10, 200), (12, 100, 10), (12, 10, 100), (12, 10, 200), (12, 100, 10)] i i c c c c c c forall_inplace,gpu,grad_of_gatedrecurrent_apply_scan&grad_of_gatedrecurrent_apply_scan}(max_attended_mask_length000000111111, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{::int64}.0, GpuSubtensor{::int64}.0, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, max_attended_mask_length000000111111, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, state_to_gates, state_to_state, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
   ... (remaining 4271 Apply nodes account for 67003141B/82625821B (81.09%) of the Apply nodes with dense output sizes)
<created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory; this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
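For reference, here is a minimal sketch of how a report like this one, including the memory comparisons against allow_gc=False and optimizer_excluding=inplace above, can be produced. The tiny graph is illustrative and unrelated to the recognizer; the flags named are standard Theano configuration options.

# Minimal sketch, assuming a vanilla Theano install; the graph below is a stand-in
# for the real recognizer. Profiling is usually enabled via flags before startup:
#   THEANO_FLAGS='profile=True,profile_memory=True' python run.py
# profile_memory adds the "Memory Profile" section; the report then also estimates
# the peak under allow_gc=False and optimizer_excluding=inplace, which can be
# measured for real by recompiling with those flags set.
import numpy
import theano
import theano.tensor as tt

x = tt.matrix('x')
w = theano.shared(numpy.ones((100, 100), dtype=theano.config.floatX), 'w')

# profile=True on theano.function collects per-Apply timings for this function only
f = theano.function([x], tt.dot(x, w).sum(), profile=True)
f(numpy.ones((10, 100), dtype=theano.config.floatX))

f.profile.summary()   # prints Class/Ops/Apply tables in the same format as this gist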
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.