GSoC 2017 - Final Submission

Google Summer of Code 2017 - Final Report

Author: João Victor Tozatti Risso

Date of Submission: 2017-08-28

Organization: Python Software Foundation

Suborganization: Theano

Project Title: Extend usage of optimized GPU libraries in Theano

Initial Proposal

In the initial proposal [4], the goal was to implement wrappers for functions that can be executed on the GPU, so as to accelerate the computation of models in Theano. More specifically, the goal was to implement the following functionalities:

  1. A wrapper for the warp-ctc library, to provide fast CTC computations on both the CPU and the GPU. There were two existing wrappers on GitHub; however, they were neither complete nor compatible with Theano's gpuarray library.
  2. A wrapper for a symmetric eigenvalue solver, using the cuSolver library, in order to obtain eigenvalues on the GPU.
  3. A wrapper for a QR factorization function, also using the cuSolver library, which would enable fast eigenvalue computations and factorizations.
  4. Spatial Transformer Network Ops from cuDNN, which allow neural networks to handle distorted inputs and to learn the transformation parameters that better extract features from images.

Changes to the Proposal

However, the Theano developers had already been working on wrappers for functions from the MAGMA library, which implements linear algebra routines very efficiently, with support for multi-core CPUs and GPUs. The issue tracking which operations have been implemented includes both items 2 and 3, that is, the QR factorization and the eigenvalue solver for symmetric (Hermitian) matrices.

In order to avoid duplicating that work, it was discussed with the mentors and decided that a CPU implementation of the Spatial Transformer would replace items 2 and 3 of the proposal, since it would also be interesting to have that functionality available on the CPU.

Hence, the project was divided into three parts: implementation of the CTC wrapper, spatial transformer using cuDNN, and the CPU spatial transformer.

Contributions

In this section, I will describe the contributions made to Theano during the course of the project.

Connectionist Temporal Classification Loss

The first part of the project consisted of implementing a wrapper for Theano that makes use of warp-ctc [1], a fast implementation of the CTC loss function by Baidu Research. Their implementation works both on multi-core processors (using OpenMP threads) and on GPUs, using CUDA kernels to compute the CTC function. A more detailed explanation of how warp-ctc works is provided in the paper that accompanied the release [2].

Outputs of a CTC network are given by a softmax layer, whose results are interpreted as a probability distribution over all possible label sequences, conditioned on a given input sequence. Given that distribution, an objective function was derived to maximize the probabilities of correct labellings. Since the objective function is differentiable, the network can be trained with backpropagation through time.
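For concreteness, the objective alluded to above can be written as follows (this is the standard CTC formulation, stated here for reference rather than reproduced from the report): given per-timestep softmax outputs y^t_k for label k at time t, the probability of a labelling l is the sum over all alignment paths π that collapse to l under the many-to-one mapping B, and the loss is its negative log-likelihood:

$$
p(\mathbf{l} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} \prod_{t=1}^{T} y^{t}_{\pi_t},
\qquad
\mathcal{L}_{\mathrm{CTC}}(\mathbf{l}, \mathbf{x}) = -\ln p(\mathbf{l} \mid \mathbf{x})
$$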

Implementations of the CTC functionality for the CPU and the GPU can be found in Theano's theano.tensor.nnet.ctc and theano.gpuarray.ctc modules, respectively. Furthermore, graph optimizations were implemented so that the user can call the single CPU function and have it 'lifted' for execution on the GPU, depending on their configuration. Finally, wrappers for the CTC gradients were also implemented for both CPU and GPU.

Below is a brief description of each Op, with links to where it is located in Theano's codebase:

In the COp classes, one can find the paths to the C wrappers, which form the interface between Theano and the warp-ctc library.
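As an illustration of how the wrapper is meant to be used, below is a minimal sketch that builds a graph with the CPU Op. It assumes warp-ctc is installed and that the module exposes a ctc(activations, labels, input_lengths) helper with that signature (assumed here from the module description above; check the linked code for the exact interface).

```python
import theano
import theano.tensor as T
from theano.tensor.nnet.ctc import ctc  # CPU Op; may be lifted to the GPU

# Symbolic inputs: activations have shape (time, batch, num_labels + 1),
# labels hold the padded target sequences for each example in the batch,
# and input_lengths gives the number of valid timesteps per example.
activations = T.tensor3("activations")
labels = T.imatrix("labels")
input_lengths = T.ivector("input_lengths")

# CTC cost per example, and its gradient w.r.t. the activations
costs = ctc(activations, labels, input_lengths)
grad = T.grad(T.mean(costs), activations)

# With device=cuda, the optimizer may replace the CPU Op with the GPU one
f = theano.function([activations, labels, input_lengths], [costs, grad])
```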

Spatial Transformer

In the second and third parts of the project, I have worked on implementing a spatial transformer, at first only on the GPU, and then on the CPU. A Spatial Transformer is a component of a neural network that can provide spatial manipulation of data within the network. Spatial manipulation can improve models by introducing invariance to affine transformations, such as translation, scaling and rotation. This kind of invariance improves classification performance, since the networks become able to recognize samples that have distortions or are rotated, for example.

[Figure: Spatial Transformer representation, from Jaderberg et al. [3]]

There are three main components in a Spatial Transformer, as shown in the figure above (taken from the paper by Jaderberg et al. [3]):

  • Localisation network: a neural network that receives the input feature map U, a tensor spanning the width, height and channel dimensions, and outputs the parameters of the transformation to be applied to the feature map. In 2D, the parameters take the form of a 2x3 matrix (i.e. an affine transformation matrix). The localisation network can be any neural network, but it should include a final regression layer that produces the transformation parameters.
  • Grid generator: generates a normalized grid of coordinates over the input feature map, mapping the original coordinate system of the input to the interval [-1, 1], and applies the transformation in that normalized space (see the sketch after this list).
  • Sampler: takes the set of sampling points produced by the grid generator, along with the input feature map U, and produces the sampled output feature map V.
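To make the grid generator step concrete, here is a small NumPy sketch of the idea (illustrative only, not the Theano implementation): a normalized target grid in [-1, 1] is pushed through a 2x3 affine matrix to obtain the source coordinates at which the sampler will interpolate.

```python
import numpy as np

def affine_grid(theta, height, width):
    """Map a normalized [-1, 1] target grid through the 2x3 affine matrix
    `theta`, returning the source sampling coordinates (x_s, y_s)."""
    xs = np.linspace(-1.0, 1.0, width)
    ys = np.linspace(-1.0, 1.0, height)
    xt, yt = np.meshgrid(xs, ys)
    # Homogeneous target coordinates: rows are (x_t, y_t, 1)
    grid = np.stack([xt.ravel(), yt.ravel(), np.ones(height * width)])
    # (2, 3) x (3, H*W) -> (2, H*W) source coordinates
    src = theta.dot(grid)
    return src[0].reshape(height, width), src[1].reshape(height, width)

# The identity transformation leaves the sampling grid unchanged
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
x_s, y_s = affine_grid(theta, 8, 8)
```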

Spatial Transformer using cuDNN

cuDNN has provided spatial transformer functions since version 6, and those functions were used to implement the second part of the project. There are two types of functions: forward and backward. The forward functions implement the grid generator and the sampler. The backward functions compute the gradients of each forward operation, that is, one computes the gradients with respect to the inputs and another the gradients with respect to the affine transformation parameters, so that they can be backpropagated through the neural network during training.

The spatial transformer functions from cuDNN were wrapped as Theano Ops in Theano's gpuarray.dnn module. In order to wrap the required functions, I implemented C wrappers that interface Theano's PyGpuArrayObject structures with the cuDNN functions.

Below is a brief description of each Op, with links to where it is located in Theano's codebase:

In the COp classes, one can find the paths to the C wrappers, which form the interface between Theano and cuDNN.
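To illustrate the intended usage, here is a minimal sketch of building a graph with the GPU transformer. It assumes a CUDA-enabled Theano installation with cuDNN >= 6 and that the gpuarray.dnn module exposes a dnn_spatialtf(img, theta) helper wrapping the grid generator and sampler Ops (the helper name and signature are assumed here; see the linked pull request for the actual interface).

```python
import theano
import theano.tensor as T
from theano.gpuarray.dnn import dnn_spatialtf  # helper name assumed

# img: (batch, channels, height, width) input feature map U
# theta: (batch, 2, 3) affine parameters from the localisation network
img = T.tensor4("img")
theta = T.tensor3("theta")

# Sampled output feature map V; the backward Ops provide gradients with
# respect to both img and theta, so the transformer is fully trainable.
out = dnn_spatialtf(img, theta)
loss = T.mean(out ** 2)  # placeholder loss, for illustration only
g_img, g_theta = T.grad(loss, [img, theta])

f = theano.function([img, theta], [out, g_img, g_theta])
```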

Spatial Transformer on the CPU

Based on an implementation in Lasagne, a Spatial Transformer was implemented on the CPU as well. The work on this third part of the project consisted of adapting the Lasagne implementation, which uses Theano symbolic variables to perform the computations, into Theano Ops.

However, Lasagne does not provide implementations for the gradients, and neither does cuDNN, so these had to be implemented based on the equations in the paper by Jaderberg et al. [3]. Furthermore, it is necessary to provide a concrete implementation (e.g. using NumPy) of each of those Ops, in order to enable users to debug code that uses the functionalities provided by the spatial transformer.
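As an example of what such a concrete NumPy implementation can look like, here is a small sketch of bilinear sampling (illustrative only, not the code from the pull request): it takes the source coordinates produced by the grid generator and interpolates the input feature map at those positions.

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample a 2D feature map U at normalized source coordinates
    (x_s, y_s) in [-1, 1] using bilinear interpolation."""
    height, width = U.shape
    # Map normalized coordinates to pixel positions
    x = (x_s + 1.0) * (width - 1) / 2.0
    y = (y_s + 1.0) * (height - 1) / 2.0
    x0 = np.clip(np.floor(x).astype(int), 0, width - 1)
    x1 = np.clip(x0 + 1, 0, width - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, height - 1)
    y1 = np.clip(y0 + 1, 0, height - 1)
    # Fractional offsets used as interpolation weights
    wx = x - np.floor(x)
    wy = y - np.floor(y)
    # Weighted sum of the four neighbouring pixels
    return (U[y0, x0] * (1 - wx) * (1 - wy) +
            U[y0, x1] * wx * (1 - wy) +
            U[y1, x0] * (1 - wx) * wy +
            U[y1, x1] * wx * wy)

# Identity sampling grid: V equals U up to floating point error
xs, ys = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 8))
U = np.arange(64.0).reshape(8, 8)
V = bilinear_sample(U, xs, ys)
```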

Most of the implementation is complete, including the gradients with respect to the inputs; only the gradients with respect to the affine transformation are currently failing the gradient tests. Fixing that computation is the last step required to finish the implementation of the Spatial Transformer on the CPU.

Pull Requests

Most of the discussions with the mentors during the course of the project were carried out on the public Theano developers mailing list and in GitHub pull requests. Furthermore, all implementations went through unit testing and peer review by the mentors.

The first and second parts of the project were successfully completed and are already merged into Theano (see the pull requests below). However, the third part is not yet complete, for the reasons explained above.

Links to the original Pull Requests are provided below:

  1. Connectionist Temporal Classification Loss with warp-ctc: Pull Request
  2. Spatial Transformer using cuDNN: Pull Request
  3. Spatial Transformer on the CPU (WIP): Pull Request

You can also see my commits in Theano, here for the CTC and Spatial Transformer with cuDNN, and here for the Spatial Transformer on the CPU.

Conclusion

In this project, I have implemented wrappers for GPU functions in Theano, in order to accelerate the computation of deep learning models. Two of the three parts of the project have been merged into Theano, with the third only requiring a fix to the computation of the gradients of the affine transformation.

During this summer, I have learned a lot about the inner workings of Theano. I have also considerably improved my knowledge of Python, as I come from a strong C/C++ background.

What's Next?

I'll start getting deeper into machine learning, and Theano will be a great tool for the job. With some knowledge of the internals, I can implement my own models, as well as suggest and add new functionalities.

Acknowledgements

I would like to thank Steven Bocco, my mentor, for guiding me in the execution of the project, providing feedback and reviewing the code. I would also like to thank Frédéric Bastien and Arnaud Bergeron, for helping with organizational aspects, and code reviewing.

Finally, I would like to thank the Python Software Foundation, the staff of GSoC, and Google, for this incredible experience that is GSoC.

References

[1] "Accelerating Machine Learning with Open Source Warp-CTC". 2016. Accessed: 2017-08-28.

[2] Amodei et al. "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin". 2015. Accessed: 2017-08-28.

[3] Jaderberg et al. "Spatial Transformer Networks". 2015. Accessed: 2017-08-28.

[4] Risso, J. V. T. "Extend usage of optimized GPU libraries in Theano". 2017. Accessed: 2017-08-28.
