The purpose of this notebook is to demonstrate the ChainerX linear algebra routines using the example of Gaussian Process Regression.
import chainerx
# ChainerX mimics the NumPy API, so we also alias it as np for convenience.
import chainerx as np
import matplotlib.pyplot as plt
%matplotlib inline
chainerx.set_default_device('cuda:0')
Unknown functions can be represented using Gaussian Processes (GPs). GPs can be used for regression and classification tasks; here, we focus on a simple GP regression case. For a thorough treatment, see the book Gaussian Processes for Machine Learning.
We are interested in learning a function f(\mathbf{x}) from data \{ (x_i, y_i)\; |\; i=1,\ldots,n\}, where y_i\in\mathbb{R} is a noisy observation of f(x_i).
A Gaussian process is a random process where any point \mathbf{x} \in \mathbb{R}^d is assigned a random variable f(\mathbf{x}) and where the joint distribution of a finite number of these variables p(f(\mathbf{x}_1),...,f(\mathbf{x}_N)) is itself Gaussian:
p(\mathbf{f} | \mathbf{X}) = \mathcal{N}(\mathbf{f} | \boldsymbol\mu, \mathbf{K})
Here, \mathbf{f} = (f(x_1),...,f(x_N)), \boldsymbol\mu = (m(x_1),...,m(x_N)) and K_{ij} = \kappa(x_i,x_j). m is the mean function, usually taken to be the zero function, and \kappa is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (i.e. smoothness) is defined by \mathbf{K}. If points x_i and x_j are considered similar by the kernel, the function values at these points, f(x_i) and f(x_j), can be expected to be similar too. The kernel function \kappa has free hyper-parameters \theta.
A GP prior p(\mathbf{f} | \mathbf{X}) can be converted into a GP posterior p(\mathbf{f} | \mathbf{X},\mathbf{y}) after having observed some data \mathbf{y}. The posterior can then be used to make predictions \mathbf{f}_* given new input \mathbf{X}_*:
p(\mathbf{f}_* | \mathbf{X}_*,\mathbf{X},\mathbf{y}) = \int{p(\mathbf{f}_* | \mathbf{X}_*,\mathbf{f})p(\mathbf{f} | \mathbf{X},\mathbf{y})}\ d\mathbf{f} = \mathcal{N}(\mathbf{f}_* | \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)
So the posterior predictive distribution is also Gaussian, with mean \boldsymbol{\mu}_* and covariance \boldsymbol{\Sigma}_*. By definition of the GP, the joint distribution of observed data \mathbf{y} and predictions \mathbf{f}_* is
\begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N} \left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix} \right)
With N training points and N_* new inputs, \mathbf{K}_y = \kappa(\mathbf{X},\mathbf{X}) + \sigma_y^2\mathbf{I} = \mathbf{K} + \sigma_y^2\mathbf{I} is N \times N, \mathbf{K}_* = \kappa(\mathbf{X},\mathbf{X}_*) is N \times N_* and \mathbf{K}_{**} = \kappa(\mathbf{X}_*,\mathbf{X}_*) is N_* \times N_*. \sigma_y^2 is the noise term on the diagonal of \mathbf{K}_y. It is the noise of the target data \mathbf{y}: it is set to zero if the training targets are noise-free, and to a value greater than zero if the observations are noisy. The mean is set to \boldsymbol{0}. The sufficient statistics of the posterior predictive distribution, \boldsymbol{\mu}_* and \boldsymbol{\Sigma}_*, can be computed with
\boldsymbol{\mu_*} = \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}
\boldsymbol{\Sigma_*} = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*
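These expressions are obtained by applying the standard rule for conditioning a joint Gaussian distribution on a subset of its variables, stated here for reference: if
\begin{pmatrix}\mathbf{a} \\ \mathbf{b}\end{pmatrix} \sim \mathcal{N} \left(\boldsymbol{0}, \begin{pmatrix}\mathbf{A} & \mathbf{C} \\ \mathbf{C}^T & \mathbf{B}\end{pmatrix} \right), \quad \text{then} \quad \mathbf{b} \,|\, \mathbf{a} \sim \mathcal{N}\left(\mathbf{C}^T \mathbf{A}^{-1} \mathbf{a},\; \mathbf{B} - \mathbf{C}^T \mathbf{A}^{-1} \mathbf{C}\right),
with \mathbf{a} = \mathbf{y}, \mathbf{A} = \mathbf{K}_y, \mathbf{C} = \mathbf{K}_* and \mathbf{B} = \mathbf{K}_{**}.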
Given this model, there are two basic problems to be solved. First, we need to learn the hyper-parameters \theta and \sigma_y^2. Second, we need to predict mean and variance of f(\mathbf{x}_{*}) at test input points \mathbf{X}_{*}.
The squared exponential covariance function is given by
k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\ell^2}\right)
where \sigma_f > 0 and \ell > 0 are hyperparameters of the kernel. There are many other kernels that can be used for Gaussian processes, but this specific covariance function is perhaps the most common one in statistics and machine learning. It is also known as the radial basis function (RBF) kernel, the Gaussian kernel, or the exponentiated quadratic kernel.
def gaussian_rbf(x1, x2, l=1, sigma_f=1):
    """Returns the NxM kernel matrix between the two sets of inputs x1 and x2.
    Args:
        x1: NxD matrix
        x2: MxD matrix
        l: length-scale parameter
        sigma_f: vertical-variation parameter
    Returns:
        NxM kernel matrix
    """
    # squared Euclidean distance between every pair of rows of x1 and x2
    dist_matrix = np.sum(np.square(x1), axis=1).reshape(-1, 1) + np.sum(np.square(x2), axis=1) - 2 * np.dot(x1, x2.T)
    return sigma_f**2 * np.exp(-1 / (2 * l**2) * dist_matrix)
kernel = gaussian_rbf
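As a quick sanity check (a small example added here, not part of the original notebook), the kernel of a point with itself equals \sigma_f^2, and entries decay as points move further apart:
# Hypothetical example: k(x, x) = sigma_f**2 on the diagonal,
# and off-diagonal entries shrink as the distance between points grows.
x_check = np.array([[0.0], [1.0], [3.0]])
print(kernel(x_check, x_check, l=1, sigma_f=1))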
Let’s visualize the entries of the kernel matrix.
# create an Nx1 vector of equidistant points in [-3, 9]
N = 50
X = np.linspace(-3, 9, N).reshape(-1, 1)
cov = kernel(X, X)
plt.pcolor(chainerx.to_numpy(cov))
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f04781b4470>
Let’s generate samples from a Gaussian process prior. Specifically, we will consider a zero-mean Gaussian process prior for functions of the form f: \mathbb{R} \rightarrow \mathbb{R} using the squared exponential kernel. That is,
f(x) \sim \mathcal{GP}\left(0 \, , \, k\left(x, x'\right)\right).
Let f_n = f(x_n) \in \mathbb{R} be the value of the function f evaluated at a point x_n \in \mathbb{R}. Furthermore, let \mathbf{f} \in \mathbb{R}^{N} be the vector of function values for each of the points of \mathbf{X} .
The Gaussian process prior for \mathbf{f} becomes
\mathbf{f} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{K}\right),
where \mathbf{K} is the kernel matrix.
To draw random functions from the GP, we draw random samples from the corresponding multivariate normal distribution. The common way of generating such samples uses the Cholesky decomposition; the algorithm is described on Wikipedia.
def generate_samples(mean, K, S):
    """Returns S samples from a Gaussian process with mean vector `mean` and kernel matrix K.
    Args:
        mean: Nx1 mean vector
        K: NxN kernel matrix
        S: number of samples
    Returns:
        NxS matrix of samples
    """
    z = np.random.normal(size=(mean.shape[0], S))
    # add 1e-6 * I to the covariance matrix to avoid numerical non-positive-definiteness
    L = np.linalg.cholesky(K + 1e-6 * np.eye(mean.shape[0]))
    w = mean + L.dot(z)
    return w
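As a quick usage example (a sketch added here, not part of the original notebook), we can draw a few sample functions from the zero-mean prior using the kernel matrix cov computed above:
# Hypothetical example: draw 3 sample functions from the prior N(0, cov),
# where cov is the 50x50 kernel matrix computed earlier.
prior_samples = generate_samples(np.zeros((N, 1), dtype=cov.dtype), cov, 3)
print(prior_samples.shape)  # one column per sampled function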
Next, let’s define a function for plotting a Gaussian process with specified mean and 95.45% confidence interval (2\sigma).
def plot_gp(Xp, mu, Sigma, title="", num_samples=0):
mean, std = mu.ravel(), np.sqrt(np.diag(Sigma))
# plot distribution
plt.plot(Xp, mean, color='k', label='Mean')
plt.plot(Xp, mean + 2*std, color='k', linestyle='--')
plt.plot(Xp, mean - 2*std, color='k', linestyle='--')
# plt.fill_between does not work well with chainerx arrays so we are converting the input to numpy arrays
plt.fill_between(*map(chainerx.to_numpy, (Xp.ravel(), mean - 2*std, mean + 2*std)), alpha=0.1)
# generate samples
if num_samples > 0:
fs = generate_samples(mu, Sigma, num_samples)
plt.plot(Xp, fs, color='b', alpha=.25)
plt.title(title)
plt.legend()
# Finite number of points
X = np.arange(-1, 7, 0.025).reshape(-1, 1).astype('float64')
# Mean and covariance of the prior
mu = np.zeros(X.shape, dtype='float64')
cov = kernel(X, X)
# Plot GP mean, confidence interval and samples
plot_gp(X, mu, cov, num_samples = 5)
Let’s implement computation of \boldsymbol{\mu_*} and \boldsymbol{\Sigma_*} using previously defined formulas.
def posterior_predictive(X, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
'''Computes the suffifient statistics of the GP posterior predictive distribution
from m training data X_train and Y_train and n new inputs X.
Args:
X: New input locations (N x D).
X_train: Training locations (M x D).
Y_train: Training targets (M x 1).
l: Kernel length parameter.
sigma_f: Kernel vertical variation parameter.
sigma_y: Noise parameter.
Returns:
Posterior mean vector (N x D) and covariance matrix (N x N).
'''
K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
K_s = kernel(X_train, X, l, sigma_f)
K_ss = kernel(X, X, l, sigma_f)
z = np.linalg.solve(K, K_s).T
mu_s = z.dot(Y_train)
cov_s = K_ss - z.dot(K_s)
return mu_s, cov_s
and apply it to noise-free training data X_train and Y_train.
The following example draws five samples from the posterior predictive distribution and plots them along with the mean, confidence interval and training data. In a noise-free model, the variance at the training points is zero and all random functions drawn from the posterior pass through the training points.
# Noise free training data
X_train = np.linspace(0,6,5).reshape(-1,1)
Y_train = np.sin(0.07*X_train**3)
# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train)
plot_gp(X, mu_s, cov_s, num_samples = 5)
plt.plot(X_train, Y_train, 'ro', label='Training data', markersize=5)
plt.legend()
<matplotlib.legend.Legend at 0x7f0478005518>
Now let’s add some noise to the model. The training points are then only approximated, and the variance at the training points is non-zero.
# Noisy training data
noise = 0.3
X_train = np.linspace(0,6,100).reshape(-1,1)
Y_train = np.sin(0.07*X_train**3) + noise * np.random.normal(size=X_train.shape)
# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, sigma_y=noise)
plot_gp(X, mu_s, cov_s, num_samples = 10)
plt.plot(X_train, Y_train, 'rx')
[<matplotlib.lines.Line2D at 0x7f047801ba58>]
The following example shows the effect of the kernel parameters l and \sigma_f as well as the noise parameter \sigma_y. Higher l values lead to smoother functions and therefore to coarser approximations of the training data. Lower l values make functions more wiggly, with wide confidence intervals between training data points. \sigma_f controls the vertical variation of functions drawn from the GP. This can be seen in the wide confidence intervals outside the training data region in the right figure of the second row. \sigma_y represents the amount of noise in the training data. Higher \sigma_y values lead to coarser approximations, which avoids overfitting to noisy data.
params = [
(0.3, 1.0, 0.2),
(3.0, 1.0, 0.2),
(1.0, 0.3, 0.2),
(1.0, 3.0, 0.2),
(1.0, 1.0, 0.05),
(1.0, 1.0, 1.5),
]
plt.figure(figsize=(12, 5))
for i, (l, sigma_f, sigma_y) in enumerate(params):
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l,
sigma_f=sigma_f,
sigma_y=sigma_y)
plt.subplot(3, 2, i + 1)
plt.subplots_adjust(top=2)
plt.title(f'l = {l}, sigma_f = {sigma_f}, sigma_y = {sigma_y}')
plot_gp(X, mu_s, cov_s)
plt.plot(X_train, Y_train, 'rx')
plt.legend()
We can now define a learning criterion and estimate optimal values for these parameters. Optimal values can be estimated by maximizing the marginal log-likelihood, which is given by
\log p(\mathbf{y} | \mathbf{X}) = \log \mathcal{N}(\mathbf{y} | \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} -\frac{1}{2} \log \begin{vmatrix}\mathbf{K}_y\end{vmatrix} -\frac{N}{2} \log(2\pi)
Instead of solving the maximization problem, we turn it into a minimization problem: we minimize the negative marginal log-likelihood w.r.t. the parameters l and \sigma_f, while \sigma_y is set to the known noise level of the data. If the noise level is unknown, \sigma_y can be estimated as well along with the other parameters. The minimization problem is solved using a gradient-based algorithm.
We avoid computing the determinant \begin{vmatrix}\mathbf{K}_y\end{vmatrix} directly by using the fact that the sum of the logarithms of the diagonal entries of the Cholesky factor of a matrix S equals half of the logarithm of the determinant of S, i.e. 0.5 * log(det(S)) and sum(log(diag(chol(S)))) are equal up to numerical precision.
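To make this concrete, here is a small numerical check (added as a sketch; plain NumPy is used only to compute the determinant for comparison):
import numpy  # plain NumPy, used here only for the determinant

# Build a small positive definite matrix S and compare the two expressions.
A = np.random.normal(size=(4, 4))
S = A.dot(A.T) + 4 * np.eye(4)  # A A^T + 4 I is positive definite
lhs = 0.5 * numpy.log(numpy.linalg.det(chainerx.to_numpy(S)))
rhs = np.sum(np.log(np.diag(np.linalg.cholesky(S))))
print(lhs, rhs)  # these agree up to numerical precision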
from scipy.optimize import minimize
def nll_fn(X_train, Y_train, noise):
'''Returns a function that computes the negative marginal log-
likelihood for training data X_train and Y_train and given
noise level.
Args:
X_train: training locations (M x D).
Y_train: training targets (M x 1).
noise: known noise level of Y_train.
Returns:
Minimization objective.
'''
def step(theta):
K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
noise**2 * np.eye(len(X_train))
return np.sum(np.log(np.diag(np.linalg.cholesky(K)))) + \
0.5 * Y_train.T.dot(np.linalg.inv(K)).dot(Y_train) + \
0.5 * len(X_train) * np.log(np.array(2*np.pi, dtype='float64'))
return step
def objective_and_grad(params):
    """Returns both the objective value and its gradient, as expected by
    scipy.optimize.minimize with jac=True."""
    # Convert the parameter vector coming from scipy into a ChainerX array
    # and enable gradient computation on it.
    theta = np.array(params, dtype='float64').require_grad()
    fun = nll_fn(X_train, Y_train, noise)
    out = fun(theta)
    out.backward()
    return chainerx.to_numpy(out), chainerx.to_numpy(theta.grad)
init_params = 0.1*np.random.normal(size=2)
# Minimize the negative log-likelihood w.r.t. parameters l and sigma_f.
res = minimize(objective_and_grad, init_params, method='L-BFGS-B', jac=True, options={'gtol': 1e-9, 'ftol': 0})
# Store the optimization results in global variables so that we can
# compare them later with the results from other implementations.
l_opt, sigma_f_opt = res.x
# Compute the posterior predictive statistics with optimized kernel parameters and plot the results
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l_opt, sigma_f=sigma_f_opt, sigma_y=noise)
plot_gp(X, mu_s, cov_s)
plt.plot(X_train, Y_train, 'rx')
[<matplotlib.lines.Line2D at 0x7f04716afbe0>]
res
      fun: array([[66.3643472]])
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 76.02868139, 160.05121638])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 8
      nit: 3
   status: 0
  success: True
        x: array([0.40666028, 0.90399203])
plt.figure(figsize=(12, 8))
plt.plot(X_train, Y_train, 'rx')
plt.plot(X, np.sin(0.07*X**3), label='true')
plot_gp(X, mu_s, cov_s)
plt.legend()
<matplotlib.legend.Legend at 0x7f0471976400>