This repository contains selected solutions for Aalto University's Programming Parallel Computers course. Both solutions hold first place on the performance contest leaderboard.
The cp5 exercise involves computing the Pearson correlation coefficient between every pair of row vectors of the input matrix.
We normalize each row of the input matrix in two passes. In the first pass, we compute the row's mean and sum of squared deviations with a warp-collective, single-precision Welford's algorithm and merge the per-lane statistics across the warp with XOR shuffles. In the second pass, we go over the row again to normalize it. The normalization kernel is grid-strided and launched as a single wave to reduce wave quantization effects.
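The two passes can be illustrated with a sequential C++ sketch. This is not the repository's kernel: the names `Stats`, `welford_push`, `welford_merge`, `row_stats`, `normalize_row`, and `kLanes` are hypothetical, the 32 "lanes" are simulated with a plain array, and the lane-to-lane exchange that a GPU would perform with a XOR shuffle is modeled as indexing `lane[i ^ offset]`. The sketch also assumes that normalization means zero mean and unit Euclidean norm.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Running statistics for Welford's algorithm: sample count, mean, and
// the sum of squared deviations from the mean (often called M2).
struct Stats {
    float n = 0.0f;
    float mean = 0.0f;
    float m2 = 0.0f;
};

// Welford update: fold one sample into the running statistics.
inline void welford_push(Stats& s, float x) {
    s.n += 1.0f;
    float delta = x - s.mean;
    s.mean += delta / s.n;
    s.m2 += delta * (x - s.mean);
}

// Merge two partial results (pairwise combination of means and M2 values).
inline Stats welford_merge(const Stats& a, const Stats& b) {
    Stats r;
    r.n = a.n + b.n;
    if (r.n == 0.0f) return r;  // both inputs empty
    float delta = b.mean - a.mean;
    r.mean = a.mean + delta * (b.n / r.n);
    r.m2 = a.m2 + b.m2 + delta * delta * (a.n * b.n / r.n);
    return r;
}

constexpr int kLanes = 32;  // assumed warp width

// First pass: each simulated lane accumulates a strided slice of the row,
// then the lanes are combined in a butterfly pattern where lane i pairs
// with lane i ^ offset. After log2(kLanes) steps every lane holds the
// statistics of the whole row.
Stats row_stats(const std::vector<float>& row) {
    Stats lane[kLanes];
    for (std::size_t i = 0; i < row.size(); ++i)
        welford_push(lane[i % kLanes], row[i]);
    for (int offset = kLanes / 2; offset > 0; offset /= 2) {
        Stats next[kLanes];
        for (int i = 0; i < kLanes; ++i)
            next[i] = welford_merge(lane[i], lane[i ^ offset]);
        for (int i = 0; i < kLanes; ++i) lane[i] = next[i];
    }
    return lane[0];
}

// Second pass: subtract the mean and scale to unit Euclidean norm.
// A constant row (M2 == 0) is mapped to all zeros.
void normalize_row(std::vector<float>& row) {
    Stats s = row_stats(row);
    float scale = s.m2 > 0.0f ? 1.0f / std::sqrt(s.m2) : 0.0f;
    for (float& x : row) x = (x - s.mean) * scale;
}
```

Under this assumed normalization, the Pearson correlation of two rows equals the dot product of their normalized versions, so the remaining work reduces to products between normalized rows.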
For the exercise, it is only necessary to compute the upper triangular part of
For the so6 exercise, the task is to implement an efficient parallel algorithm for sorting 64-bit unsigned integers. We implement the Onesweep algorithm by Adinets and Merrill [1], which combines LSD radix sort with efficient inter-block communication. Our implementation achieves 5.67 GB/s or 711.8 Melem/s on-device throughput on the target hardware (NVIDIA Quadro RTX 4000).
[1]: Adinets, A., & Merrill, D. (2022). Onesweep: A faster least significant digit radix sort for GPUs. arXiv:2206.01784.