Checkpoint Update

Updated Schedule

Final Demo

For the final demo, we will show graphs of our system's performance, highlighting the speedups achieved over the serial version.

Preliminary Results

The figure below shows the time it takes to obtain the Cholesky decomposition for 2 different dataset sizes (N = 500 and N = 1000 samples) for the serial GP implementation vs the naive GPU implementation. We see that the speedup for N = 1000 is around 10x.

[Figure: comparison of Cholesky decomposition times]

The next figure shows the time taken to compute the log marginal likelihood for the 2 implementations; this time includes both the Cholesky decomposition and the forward-backward substitution routines. The worse performance of our naive GPU implementation here is expected: our forward-substitution and backward-substitution modules are each run by a single CUDA thread, due to the inherently sequential nature of the algorithm itself.

[Figure: comparison of log marginal likelihood times]
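To illustrate why this step is a bottleneck, below is a minimal sketch of a forward-substitution kernel run by a single thread, solving L z = y for a dense, row-major, lower-triangular L. The kernel name, argument names, and memory layout are assumptions for illustration, not details of our actual implementation.

```cuda
#include <cuda_runtime.h>

// Sketch: forward substitution L z = y done entirely by one thread.
// Each z[i] depends on z[0..i-1], so the outer loop cannot be split
// across threads in this naive form.
__global__ void forward_subst_kernel(const float *L, const float *y,
                                     float *z, int N) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < N; ++i) {
            float sum = y[i];
            for (int j = 0; j < i; ++j) {
                sum -= L[i * N + j] * z[j];  // subtract already-solved terms
            }
            z[i] = sum / L[i * N + i];       // divide by the diagonal entry
        }
    }
}

// Launched with a single thread, e.g.:
//   forward_subst_kernel<<<1, 1>>>(d_L, d_y, d_z, N);
// Backward substitution (solving L^T x = z) has the same structure,
// iterating from i = N-1 down to 0.
```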

Current Challenges

We have identified 2 key challenges that have to be addressed for an efficient final CUDA implementation:

Nice-to-have

We are considering moving to a distributed environment once we have optimized our single-node performance, which would let us scale across multiple GPUs. However, this is definitely a stretch goal for us, and would be undertaken only after we feel confident about our single-node performance.