I have been selected for Google Summer of Code (GSoC) 2018! :D
Google Summer of Code is a global program focused on introducing students to open source software development. Students work on a three-month programming project with an open source organization during their break from university.
I'll be working on PyMC3, which is registered under the NumFOCUS umbrella organization. My project is to explore Alternative Computational Backends for PyMC. I will be mentored by Colin Carroll and Chris Fonnesbeck.
I am a 2nd year undergrad at BITS Pilani, India, majoring in Computer Science and Mathematics. I am very passionate about machine learning. I recently came across probabilistic machine learning and found it fascinating. This project gives me a unique opportunity to learn and apply what I learn at the same time, and that is why I wanted to take it up.
My mathematics curriculum has mostly been pure/abstract, so I do not come from a statistical background. I took an introductory course in Probability and Statistics in my freshman year. I intend to bridge this gap over the course of this project.
PyMC3 is a Python library for probabilistic programming. It is built on Theano, which it uses to create and compute the graph that makes up the probabilistic model. Since support for Theano has been discontinued, we are exploring alternative libraries such as TensorFlow Probability for PyMC4, the successor. We aim to port or re-implement some of the distributions currently present in PyMC3 using the selected framework while keeping the API, output and performance consistent.
The first month will be spent mostly on rebuilding the model class* with TensorFlow distributions in mind. I will start with basic distributions such as the Normal distribution.
*All models in PyMC3 are defined using such a class.
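To make the idea of a model class concrete, here is a minimal sketch in plain Python of how a context-manager-based model can collect the random variables declared inside a `with` block. This is my own toy illustration, not PyMC3's actual implementation: the `Model` and `Normal` classes below are hypothetical stand-ins that only mimic the general shape of the API.

```python
class Model:
    """Toy stand-in for a model class (hypothetical, not pymc3.Model)."""

    # Stack of models whose `with` blocks are currently open.
    _context_stack = []

    def __init__(self, name):
        self.name = name
        self.variables = {}

    def __enter__(self):
        # Entering the `with` block makes this model the current one.
        Model._context_stack.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the block restores the previous model (if any).
        Model._context_stack.pop()
        return False  # do not swallow exceptions

    @classmethod
    def current(cls):
        if not cls._context_stack:
            raise TypeError("No model on the context stack; use 'with Model(...):'")
        return cls._context_stack[-1]


class Normal:
    """Toy distribution that registers itself with the current model."""

    def __init__(self, name, mu=0.0, sigma=1.0):
        self.name = name
        self.mu = mu
        self.sigma = sigma
        Model.current().variables[name] = self


with Model("demo") as model:
    mu = Normal("mu", mu=0.0, sigma=10.0)
    x = Normal("x", mu=mu, sigma=1.0)

print(sorted(model.variables))
```

The user never passes the model to `Normal` explicitly. The distribution looks up whichever model is currently open, which is what keeps this style of API short to write.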
This blog is one of GSoC's requirements, but I couldn't find any specific information on what exactly it should contain. So I am taking the liberty of making this blog serve multiple purposes:
- As a journal to record the progress of the project, the resources used, the mentoring which helped me accomplish weekly tasks, the successes and the failures.
- As a technical blog to introduce newcomers like me to the area of Probabilistic Machine Learning. This will contain introductory coding projects as well as explanations of some key concepts. Colin has offered to proofread these posts for technical accuracy.
- As a study guide for newcomers to this field. This will consist of learning resources, the first set of which is included in today's post.
The PyMC core developers have chosen TensorFlow as the prime candidate for the future development of PyMC. TF already provides ready-made distributions that PyMC can build on. So we need to find a way to keep the simple, user-friendly API of PyMC3 as far as possible while switching completely to TensorFlow and making use of its already-implemented distributions.
- Will TensorFlow's session-based API work well with the context-manager-based API of PyMC3?
- Implementation of log probability with TF models.
- Computing the gradient of this log probability.
- Trade-off between keeping the existing API and making use of TensorFlow's full capabilities.
- Will the existing samplers behave well when used on TensorFlow models?
These are some of the questions this project aims to answer.
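To show what "log probability" and "its gradient" mean for the simplest case, here is a small sketch for independent Normal observations, written in plain Python. The function names are my own and purely illustrative. A backend such as Theano or TensorFlow would derive the gradient by automatic differentiation; here I write it out by hand and check it against a finite-difference approximation instead.

```python
import math


def normal_logp(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z


def model_logp(data, mu, sigma):
    """Joint log probability of independent observations."""
    return sum(normal_logp(x, mu, sigma) for x in data)


def model_logp_grad(data, mu, sigma):
    """Hand-derived partial derivatives of model_logp w.r.t. mu and sigma."""
    d_mu = sum((x - mu) / sigma ** 2 for x in data)
    d_sigma = sum(-1.0 / sigma + (x - mu) ** 2 / sigma ** 3 for x in data)
    return d_mu, d_sigma


def finite_difference(f, value, eps=1e-6):
    """Central-difference approximation of df/dvalue, used only as a check."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)


data = [0.3, -1.2, 0.8, 2.1]  # made-up observations for illustration
mu, sigma = 0.5, 1.5

d_mu, d_sigma = model_logp_grad(data, mu, sigma)
num_d_mu = finite_difference(lambda m: model_logp(data, m, sigma), mu)
num_d_sigma = finite_difference(lambda s: model_logp(data, mu, s), sigma)

print("analytic:", d_mu, d_sigma)
print("numeric: ", num_d_mu, num_d_sigma)
```

Gradient-based samplers such as Hamiltonian Monte Carlo need exactly these two ingredients, which is why the choice of backend matters so much.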
Note: Since my knowledge of Bayesian Statistics is limited, my mentor, Colin Carroll, has been kind enough to provide learning resources on a weekly basis. I feel it is my duty to share this valuable information. I will also include any additional resources that I find useful as supplements.
- Sections 11.1 and 11.2 of "Pattern Recognition and Machine Learning"
- Part 1 of Ian Murray's lectures on MCMC
- Betancourt's video "Efficient Bayesian Inference with Hamiltonian Monte Carlo" (watch after learning the Metropolis-Hastings algorithm)
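Since the Metropolis-Hastings algorithm is a prerequisite for the last resource, here is a minimal sketch of its simplest form, random-walk Metropolis (the special case with a symmetric proposal), in plain Python. The target density, step size and sample count are arbitrary choices I made for illustration, and the code is not taken from any of the resources above.

```python
import math
import random


def log_target(x):
    """Unnormalised log density of a standard Normal (illustrative target)."""
    return -0.5 * x * x


def metropolis_hastings(log_density, n_samples, step_size=1.0, start=0.0, seed=0):
    """Random-walk Metropolis sampler with a Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_logp = log_density(current)
    samples = []
    accepted = 0
    for _ in range(n_samples):
        # Propose a move by adding symmetric Gaussian noise.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_logp = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        log_ratio = proposal_logp - current_logp
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_logp = proposal, proposal_logp
            accepted += 1
        # A rejected proposal repeats the current state in the chain.
        samples.append(current)
    return samples, accepted / n_samples


samples, acceptance_rate = metropolis_hastings(log_target, n_samples=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

print("acceptance rate:", acceptance_rate)
print("sample mean:", mean)
print("sample variance:", variance)
```

For a standard Normal target, the sample mean should come out close to 0 and the sample variance close to 1. Note that the sampler only ever needs the log density up to an additive constant.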