### Summary

We are going to parallelize the training and prediction phases of Gaussian Process Regression on a GPU.

### Background

Gaussian Process (GP) regression models have become very popular in recent years in the machine learning community. This can be attributed to the flexibility offered by these models to allow users to choose the appropriate covariance function to control the degree of smoothness. GP regression is basically a Bayesian nonparametric model (where an underlying finite-dimensional random variable is replaced by a stochastic process). Formally, a GP is a stochastic process where any finite number of random variables have a joint Gaussian distribution [RW06].

A GP is completely specified by a mean function ($m(\vec{x})$) and a covariance function ($\kappa (\vec{x}_1, \vec{x}_2)$), which for a real process $f(\vec{x})$ can be defined as:

$$ m(\vec{x}) = \mathbf{E}[f(\vec{x})] \ \kappa (\vec{x}_1, \vec{x}_2) = \mathbf{E}[(f(\vec{x}_1) - m(\vec{x}_1))( f(\vec{x}_2) - m(\vec{x}_2) )] $$

Generally, the mean function is taken as zero. In that case, the covariance function ($\kappa (\vec{x}_1, \vec{x}_2)$) basically turns into the covariance between $f(\vec{x}_1)$ and $f(\vec{x}_2)$. The squared exponential (SE) kernel is a common choice of covariance function (amongst a wide range of available choices for the kernel), and is defined as follows: $$ \kappa (\vec{x}_1, \vec{x}_2) = \exp( -\frac{1}{2l^2} | \vec{x}_1 - \vec{x}_2 |^2 ),$$ where, $l$ is the lengthscale parameter that needs to be estimated.

In order to perform nonparametric regression, a GP prior is placed over the unknown function. The posterior distribution obtained by multiplying the prior with the likelihood is again a GP. The predictive distribution for a new test point ( $ \vec{x}_{t} $ ), given the entire dataset ($n$ tuples of the form $(\vec{x}, y)$) can be shown to be a Gaussian with the following mean ( $ \hat{f_{t}} $ ) and variance ( $ \mathbf{V}[f_{t}] $ ): $$ \hat{f_{t}} = {\vec{k}_{t}}^{T} (K + \sigma_{n}^2I)^{-1} \vec{y} $$ $$ \mathbf{V}[f_{t}] = \kappa (\vec{x}_{t}, \vec{x}_{t}) - {\vec{k}_{t}}^{T} (K + \sigma_{n}^2I)^{-1} \vec{k}_{t} $$ where,

- $K$ is a $n \times n$ matrix where the $ij$th entry is computed as $K_{ij} = \kappa (\vec{x}_i, \vec{x}_j)$,
- $\vec{y}$ is a $n \times 1$ vector formed by stacking the $n$ target values together from the dataset,
- $I$ is a $n \times n$ identity matrix,
- $\sigma^2$ is the noise variance,
- $ \vec{k}_{t}$ is a $n \times 1$ vector where each entry is computed as $ \vec{k}_{t_{i}} = \kappa (\vec{x}_i, \vec{x}_{t})$

In order to learn the different parameters (lengthscale(s) of the covariance kernel, the noise and signal variances), a common technique is to use the marginal likelihood maximization framework (this step can be thought of as the 'training' phase). The log marginal likelihood has the following form:

$$ LML = \log p(\vec{y} | X) = -\frac{1}{2} \vec{y}^T (K + \sigma_{n}^2I)^{-1}\vec{y} - \frac{1}{2} \log |K + \sigma_{n}^2I | - \frac{n}{2} \log 2\pi $$

If we consider $\vec{\theta}$ as the vector of parameters to estimate, gradient based methods (conjugate gradient of L-BFGS) can be employed after obtaining suitable analytic gradients $ \frac{\partial LML}{ \partial \vec{\theta}_i}$.

The main bottleneck for both training and prediction phases is the matrix inversion step which requires $\mathcal{O}(n^3)$ flops, thus limiting the application of GPs for large datasets having $ n > 10000$ samples.

Our main task in this project is to exploit the data parallelism available while constructing the $K$ matrix, as well as efficiently computing the matrix inverse. If time permits we would like to look at recent approximate approaches that might be more suitable for distributed GPU setting. [DN15, WN15]

### Challenges

- Understand Gaussian Processes and then design an efficient implementation for performing parallel Regression on the same.
- Gaussian Processes rely heavily on matrix operations, such as matrix multiplication matrix inversion. Thus, our main challenge will be to ensure that these bandwidth-bound operations are done efficiently on a GPU.
- We hope to leverage currently available matrix operation routines provided by Magma [reference]. However, the Magma routines are built for general purpose matrix operations, and may have to sacrifice on application-specific optimizations. Therefore, another challenge for us is to try and exploit data access patterns and data locality to squeeze out performance while using general-purpose libraries.
- We understand that achieving linear speedup as the number of cores on the GPU increases may be difficult to achieve due to the inherent nature of the algorithm to update some global state during each phase of the training phase. Therefore, much of our focus will be on achieving maximum acceleration on each individual iteration of the training phase.
- Another challenge will be to ensure that the system scales as the size of the input and the processing power increase. Now-a-days, the main focus is not on raw speedup obtained, but speedup obtained per dollar spent. Therefore, another challenge will be to explore whether it is truly better to implement Gaussian Processes on a GPU, or are we better off on a CPU.
- The limitations of resources such as shared memory will limit the scaling that we may able to achieve as the problem size grows. However, given the resources that we have, we should be able to achieve good speedups for workloads upto a certain point beyond which the resources will make it tough to keep going at the same level of performance.

### Plan to achieve

- Implement a correct, parallel, and efficient implementation of Gaussian Processes on a wide variety of data sets.
- Implement at least 3 covariance kernels.
- Implement code to report metrics of the system. For instance, total job time, statistics about the utilization of compute resources (including divergence statistics), memory bandwidth achieved, accuracy obtained, etc. We will also compare the performance achieved by using different kernels which can potentially help the user to tune the choice of using kernels. For instance, the Squared Exponential Kernel may work well on a certain type of workload, or the Rational Quadratic Kernel may work well on certain other types of workloads etc.
- Compare the system to an efficient serial version that runs on a CPU. Also, we understand that just calculating speedups when code is run on CPU vs GPU is not the correct way of judging how well a system has been designed. Therefore, as discussed previously, we will do a “throughput per dollar” comparison for both platforms and draw conclusions from the same.
- Design and implement several test suites. These will be used to measure the performance of the code and to compare it to the baseline models.

### Presentation

We will show the performance of the system under several workloads, and we will compare it to a serial version implemented on a CPU. We will show several graphs for different test data sets.

### Platform Choice

We will use the C++ programming language for this project along with the Nvidia CUDA library. We will develop our code on the latedays cluster, since the cluster has the fastest Nvidia GPUs (Nvidia TitanX GPU) on campus, and we would be focusing on obtaining maximum performance for the hardware on latedays.

### Resources

We will start our code base from scratch.

### Schedule

- 2nd April
- Install Magma, Lapack, Atlas, ACML, BLAS, CBLAS. -
**DONE (but not really needed for us)** - 8th April
- Get a sequential code for GP working. Implement at least 2 Covariance Kernels -
**DONE** - Test it on chosen workloads to get a fair estimate of how it performs -
**DONE for sine dataset** - Work on Design of the GPU version of the system, and discuss how to port the code to run on GPU -
**DONE** - 15th April (checkpoint)
- Implement one more covariance kernel -
**PUSHED TO LOWER PRIORITY** - Should have a naively parallelized code running on GPU -
**DONE** - Keep performing regression tests to ensure that the porting of code is correct -
**DONE** - 22nd April
- Reevaluate design of parallel code and optimize for performance, locality, access patterns -
**IN PROGRESS** - Read research papers to extend our current design and possibly implement different strategies -
**IN PROGRESS** - Start work on performance evaluation to find places where the framework takes a long time. Find bottlenecks -
**IN PROGRESS** - 29th April
- Investigate scalability of the framework, both - with increasing number of cores, and with increasing data sets. Does the framework scale?
- Investigate effects of different covariance kernels. Find reasons for why they perform as they do.
- 4th May
- Finish all work
- Evaluate system performance and fine tune different parameters
- 7th May
- Code drop in
- Finish writeup