Checkpoint Update
Schedule and Progress
Please find our progress below (at the end of the page). We would say that we are in sync with our proposed schedule :D.
Implementing Gaussian Process Regression in C++: As planned, we have successfully implemented Gaussian Process Regression in C++. This proved to be harder than we had expected, since many implementation intricacies cropped up as we started coding each step. We have tested our code against an implementation of Gaussian Process Regression in Matlab (the GPML toolbox by Rasmussen and Nickisch), and we have achieved full accuracy in all stages of the computation.
Implementing 2 Covariance Kernels: We have implemented two covariance kernels: the Squared Exponential (SE) kernel and the Noise kernel. For now, we have pushed the third covariance kernel down to a lower-priority task, since we don't think it is crucial to our goal of parallelizing Gaussian Processes and it would be better suited to a primarily ML-focused project. Moreover, the SE kernel plus the Noise kernel is sufficient for implementing the whole training and regression task.
Naively parallelizing Gaussian Process in CUDA: As planned, we have also successfully ported a naive version of the Gaussian Process regression code to CUDA. This involved breaking many steps of the computation down into modular, reusable pieces of code that are amenable to running on CUDA. We will now briefly describe the different kernels that we have implemented as part of the whole stack. For all kernels, X is the input training matrix, N is the number of training samples, and DIM is the dimensionality (number of features) of each training sample:
compute_K_train: We launch N*N threads, one for each element of the N*N covariance matrix K. Each thread reads its required working set from the X matrix and fills in the (i, j)th entry of K. This can be improved further by collaboratively loading X into shared memory.
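For illustration, a minimal sketch of such a one-thread-per-entry kernel is below, launched with a 2D grid (e.g., 16x16 thread blocks) tiling the N*N output. The row-major layout, float precision, and the hyperparameter names (sigma_f2 for the signal variance, len2 for the squared length scale, sigma_n2 for the noise variance) are our assumptions for this sketch, not necessarily the exact cuGP interface.

```cuda
// One thread per entry (i, j) of the N x N covariance matrix K.
// Covariance: squared exponential kernel, plus noise on the diagonal.
__global__ void compute_K_train(const float *X, float *K, int N, int DIM,
                                float sigma_f2, float len2, float sigma_n2) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of K
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of K
    if (i >= N || j >= N) return;

    // Squared distance between training samples i and j.
    float sq_dist = 0.0f;
    for (int d = 0; d < DIM; d++) {
        float diff = X[i * DIM + d] - X[j * DIM + d];
        sq_dist += diff * diff;
    }

    // SE kernel value, with the noise variance added on the diagonal.
    float k_ij = sigma_f2 * expf(-0.5f * sq_dist / len2);
    if (i == j) k_ij += sigma_n2;
    K[i * N + j] = k_ij;
}
```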
compute_cholesky: We use Cholesky factorization to factorize the K matrix into a lower triangular matrix L (and the corresponding upper triangular matrix U = L^T). The factorization is further broken down into a series of kernels such as offsetted transpose, rectangular forward substitution, generic matrix transpose, matrix multiplication, and offsetted matrix subtraction. We implemented the parallel Cholesky by following the algorithm given in [DA14]. Currently, for the pure Cholesky taken in the first step of the recursive process, we have fixed our b parameter to 2. Going forward, we hope to try the same with b = 3 and b = 4.
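To make the recursion concrete, here is a CPU-side sketch of one common blocked-recursive Cholesky formulation. It is an illustration of the structure only, not the [DA14] algorithm or our CUDA code, and we read b here as the base-case block size, which may not match the parameter's exact meaning in [DA14]. In the GPU version, the L21 and trailing-update steps are where the rectangular forward substitution, transpose, matrix multiplication, and offsetted subtraction kernels come in.

```cuda
#include <cmath>
#include <cstddef>

// CPU sketch: factor the n x n block of A starting at (lo, lo) in place,
// overwriting its lower triangle with L. A is N x N, row-major, double.
static void cholesky_recursive(double *A, std::size_t N, std::size_t lo,
                               std::size_t n, std::size_t b) {
    if (n <= b) {
        // Base case: unblocked Cholesky on the small diagonal block.
        for (std::size_t j = lo; j < lo + n; j++) {
            double d = A[j * N + j];
            for (std::size_t k = lo; k < j; k++) d -= A[j * N + k] * A[j * N + k];
            A[j * N + j] = std::sqrt(d);
            for (std::size_t i = j + 1; i < lo + n; i++) {
                double s = A[i * N + j];
                for (std::size_t k = lo; k < j; k++) s -= A[i * N + k] * A[j * N + k];
                A[i * N + j] = s / A[j * N + j];
            }
        }
        return;
    }
    std::size_t nb = n / 2;                 // split into 2x2 blocks
    cholesky_recursive(A, N, lo, nb, b);    // L11 = chol(A11)
    // L21 = A21 * L11^{-T}: rectangular forward substitution, row by row.
    for (std::size_t i = lo + nb; i < lo + n; i++)
        for (std::size_t j = lo; j < lo + nb; j++) {
            double s = A[i * N + j];
            for (std::size_t k = lo; k < j; k++) s -= A[i * N + k] * A[j * N + k];
            A[i * N + j] = s / A[j * N + j];
        }
    // A22 := A22 - L21 * L21^T (offsetted matmul + subtraction, lower part only).
    for (std::size_t i = lo + nb; i < lo + n; i++)
        for (std::size_t j = lo + nb; j <= i; j++) {
            double s = 0.0;
            for (std::size_t k = lo; k < lo + nb; k++) s += A[i * N + k] * A[j * N + k];
            A[i * N + j] -= s;
        }
    cholesky_recursive(A, N, lo + nb, n - nb, b);  // recurse on trailing block
}
```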
matrix_transpose: Matrix transpose is needed at several points during training, so we have written a naive CUDA implementation in which the (i, j)th thread fills in the (j, i)th entry of the transposed matrix. We are not sure whether there are better ways to do this, and we will be on the lookout for a better implementation.
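A sketch of that naive transpose is below (row-major float storage is our assumption); the writes here are uncoalesced, which is why a tiled shared-memory version is the usual next step.

```cuda
// Naive transpose: thread (i, j) reads in[i][j] and writes out[j][i].
__global__ void matrix_transpose(const float *in, float *out, int rows, int cols) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row in the input
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column in the input
    if (i < rows && j < cols)
        out[j * rows + i] = in[i * cols + j];       // uncoalesced write pattern
}
```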
matrix_multiplication: Matrix multiplication is another important piece of our implementation. We currently have a naively parallel version in which we launch one thread for every element of the output matrix. Going forward, we plan to implement the same computation with blocking, collaboratively loading tiles of the input matrices into shared memory.
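A sketch of the one-thread-per-output-element version, under the same row-major float assumptions:

```cuda
// Naive matrix multiply: C = A * B, one thread per element of C.
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void matrix_multiplication(const float *A, const float *B, float *C,
                                      int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < K; k++)          // walk a row of A and a column of B
        sum += A[row * K + k] * B[k * N + col];
    C[row * N + col] = sum;
}
```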
rectangular_forward_substitution: Forward substitution lets us solve equations of the form AX = B, where A, X, and B are matrices, A (lower triangular) and B are known, and X is unknown. For now, we launch one thread per column of X, since the columns can be solved independently of one another.
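A sketch of the one-thread-per-column idea, assuming we are solving L X = B with L lower triangular (N x N), B of size N x M, and row-major float storage:

```cuda
// Rectangular forward substitution: one thread solves one column of X.
__global__ void rectangular_forward_substitution(const float *L, const float *B,
                                                 float *X, int N, int M) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of X / B
    if (col >= M) return;

    for (int i = 0; i < N; i++) {
        float s = B[i * M + col];
        for (int k = 0; k < i; k++)       // subtract already-solved entries
            s -= L[i * N + k] * X[k * M + col];
        X[i * M + col] = s / L[i * N + i];
    }
}
```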
determinant_lower_triangular_matrix: We need the determinant of the covariance matrix K. We thought we might have to implement a parallel routine for finding the determinant, but later realized that, since K = L L^T, the determinant is simply the square of the product of the diagonal elements of L! We launch a single thread for taking the determinant, which sequentially walks the whole diagonal and calculates the product of its elements.
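A sketch of that single-thread kernel is below (row-major float storage assumed).

```cuda
// Single-thread diagonal product: det(L) = product of L's diagonal, and
// det(K) = det(L)^2 since K = L * L^T. In practice, accumulating log(L[i][i])
// to get log|K| directly is safer, because the raw product under/overflows
// quickly for large N.
__global__ void determinant_lower_triangular_matrix(const float *L, int N,
                                                    float *det_L) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        float prod = 1.0f;
        for (int i = 0; i < N; i++)
            prod *= L[i * N + i];
        *det_L = prod;              // square this to obtain det(K)
    }
}
```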
vector_forward_substitution and vector_backward_substitution: This is the most sequential part of the whole system and is currently a major bottleneck. For now, we have an extremely naive CUDA version that does not perform well. We had expected this, and we will now implement a different algorithm for these two kernels that is more amenable to parallelization on CUDA.
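For reference, the current naive version essentially looks like the sketch below (a single thread doing the whole triangular solve, row-major float storage assumed), which makes the sequential dependency of x[i] on all earlier x[k] explicit:

```cuda
// Naive vector forward substitution: solve L * x = b for one right-hand side.
// The chain of dependencies forces the straightforward version onto one thread.
__global__ void vector_forward_substitution(const float *L, const float *b,
                                            float *x, int N) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int i = 0; i < N; i++) {
            float s = b[i];
            for (int k = 0; k < i; k++)
                s -= L[i * N + k] * x[k];
            x[i] = s / L[i * N + i];
        }
    }
}
```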
compute_squared_distance, elementwise_matrix_mult, subtract_matrices, get_outer_product: All of these kernels have been naively parallelized by launching one thread per output element, since there are no dependencies between elements.
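These all follow the same pattern; a representative sketch for elementwise_matrix_mult (row-major float storage assumed):

```cuda
// Elementwise multiply: one thread per output element, no dependencies.
// The other elementwise kernels differ only in the per-element operation.
__global__ void elementwise_matrix_mult(const float *A, const float *B, float *C,
                                        int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols)
        C[idx] = A[idx] * B[idx];
}
```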
Updated Schedule
- 22nd April
  - Improve performance of all kernels except vector forward and backward substitution. Use blocking and shared memory as much as possible.
  - Start work on performance evaluation to find places where the framework takes a long time. Find bottlenecks.
- 26th April
  - Improve performance of vector forward and backward substitution.
- 30th April
  - Start work on a distributed version of cuGP if possible. Otherwise, continue work on forward and backward substitution optimization. (This is exam week, so we don't want to go overboard on our tasks.)
- 4th May
  - Finish all work.
  - Evaluate system performance and fine-tune different parameters.
- 7th May
  - Code drop.
  - Finish writeup.
Final Demo
For the final demo, we will show graphs of our system's performance and the speedups against the serial version.
Preliminary Results
The figure below shows the time it takes to obtain the Cholesky decomposition for two different dataset sizes (N = 500 and N = 1000 samples) for the serial GP implementation vs. the naive GPU implementation. We see that the speedup for N = 1000 is around 10x.
The figure below shows the time taken to compute the log marginal likelihood for the two implementations; this includes the time taken by the Cholesky and the forward/backward substitution routines. The poorer performance of our naive GPU implementation is expected, because our forward substitution and backward substitution modules are run by a single CUDA thread (due to the inherently sequential nature of the algorithm itself).
Current Challenges
We have identified 2 key challenges that have to be addressed for an efficient final CUDA implementation:
- Cholesky decomposition
- Triangular Matrix Inversion
In the computation of the log marginal likelihood we need to perform a Cholesky decomposition of the covariance matrix. As of now, we have implemented a blocked-recursive algorithm for computing it. However, we think that better use of shared memory and a hybrid algorithm would be beneficial.
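For reference, the quantity involved is the standard GP log marginal likelihood (as in the GPML book); with the Cholesky factorization K = L L^T, both the determinant and the quadratic term reduce to the triangular routines described above:

```latex
% Standard GP log marginal likelihood; with K = L L^T the log determinant
% becomes a diagonal sum and K^{-1} y becomes two triangular solves.
\log p(\mathbf{y} \mid X)
  = -\tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1} \mathbf{y}
    - \tfrac{1}{2}\log\lvert K \rvert
    - \tfrac{N}{2}\log 2\pi,
\qquad
\log\lvert K \rvert = 2 \sum_{i=1}^{N} \log L_{ii},
\qquad
K^{-1}\mathbf{y} = L^{-\top}\!\left(L^{-1}\mathbf{y}\right).
```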
Triangular matrix inversion (TMI) comprises the forward-substitution routines for both vectors and matrices. Due to the inherently sequential algorithm for solving forward/backward substitution for a vector, our current CUDA kernel implementation of it is quite naive. We have gone through some recent papers and have found a recursive algorithm that seems to be more CUDA-friendly, so one of our primary goals is to implement an efficient triangular matrix inversion module.
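We have not settled on the exact scheme yet, but one standard recursive formulation of lower-triangular inversion (which may or may not be the one from those papers) splits the matrix into blocks and recovers the off-diagonal block with matrix multiplications, which map well to CUDA:

```latex
% Recursive block inversion of a lower triangular matrix: invert the two
% diagonal blocks recursively, then form the off-diagonal block with two
% matrix multiplications.
L =
\begin{pmatrix}
  L_{11} & 0 \\
  L_{21} & L_{22}
\end{pmatrix}
\quad\Longrightarrow\quad
L^{-1} =
\begin{pmatrix}
  L_{11}^{-1} & 0 \\
  -\,L_{22}^{-1} L_{21} L_{11}^{-1} & L_{22}^{-1}
\end{pmatrix}.
```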