CUDA High-Performance Demos is a collection of CUDA‐accelerated vector and matrix routines showcasing advanced optimization techniques:
- Shared memory tiling for coalesced global loads
- Warp‐level optimization to minimize divergence
- Tiling strategies to maximize data reuse and throughput
Each demo comes in a baseline version and a shared‐memory‐optimized (SH_) variant, so you can compare performance gains side by side.
- NVIDIA GPU with Compute Capability ≥ 5.0
- CUDA Toolkit (≥ 10.0) installed and on your PATH
- Docker (for the containerized workflow)
Clone the repository:

```bash
git clone https://github.com/Orlando275/CUDA-high-performance-demos.git
cd CUDA-high-performance-demos
```

You can pull and run the latest version of CUDA-high-performance-demos from Docker Hub:

```bash
docker pull orlando2705/cuda-high-perf-demos:v1.1
docker run --rm --gpus all orlando2705/cuda-high-perf-demos:v1.1
```

Build and run the matrix demo:

```bash
nvcc SH_matrix_multiplication.cu -o matrix_multiplication
./matrix_multiplication 200 20 120   # M x N, N x P
```

Run the vector demos:

```bash
./sum_of_vectors
10000000
./SH_total_vector_sum
10000000
```

- SH_ variants use shared memory tiling to reduce global memory traffic.
- Warp‐level primitives for fast reductions and minimized divergence.
- Parameterizable block/grid sizes for auto‐tuning.
- Side‐by‐side baseline vs. optimized implementations for performance comparison.
```
CUDA-high-performance-demos/
├── Vectors/
│   ├── normalize_vector.cu
│   ├── SH_normalize_vector.cu
│   ├── SH_total_vector_sum.cu
│   └── sum_of_vectors.cu
├── Matrices/
│   ├── matrix_multiplication.cu
│   └── SH_matrix_multiplication.cu
├── .gitignore
├── Dockerfile
└── README.md
```
**Vector Addition – Naive**
- Input size: 33,554,432 elements
- Kernel time: 54.28 ms
- Description: Each thread processes a single element, so the grid must grow with the input. This approach suffers from poor scalability and inefficient memory usage (sketch below).
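A minimal sketch of the naive pattern (kernel name and launch configuration are illustrative, not the repo's exact code):

```cuda
// Naive pattern: one thread per element; threads past n do nothing.
__global__ void vecAddNaive(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch: the grid must cover all n elements, so grid size scales with n.
// vecAddNaive<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```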
**Vector Addition – Grid-Stride Loop**
- Input size: 33,554,432 elements
- Kernel time: 1.61 ms
- Description: Using the grid-stride loop pattern, each thread handles multiple elements. This drastically improves resource utilization and cuts execution time by a factor of ~34 compared to the naive version (sketch below).
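For comparison, a sketch of the grid-stride pattern (illustrative; the demo kernels may differ in detail):

```cuda
// Grid-stride loop: each thread strides through the array, so a fixed-size
// grid (sized to saturate the GPU, not the problem) handles any input length.
__global__ void vecAddGridStride(const float* a, const float* b, float* c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Launch with a device-sized grid, e.g. (numSMs is illustrative):
// vecAddGridStride<<<numSMs * 32, 256>>>(d_a, d_b, d_c, n);
```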
**Vector Normalization – Shared Memory + Warp Optimization**
- Input size: 33,554,432 elements
- Kernel time (all stages): 2.98 ms
- Description: Stages the reduction in shared memory and finishes it with warp-level primitives across multiple kernel stages, achieving roughly an 18× speedup over the naive baseline.
| Implementation | Input Size | Kernel Time | Speedup vs Naive |
|---|---|---|---|
| Vector Addition – Naive | 33,554,432 | 54.28 ms | 1× |
| Vector Addition – Grid-Stride Loop | 33,554,432 | 1.61 ms | ~34× |
| Vector Normalization – Shared Memory + Warp Optimization | 33,554,432 | 2.98 ms | ~18× |
- Grid-stride loops are a simple yet powerful optimization for memory-bound kernels.
- Shared memory + warp-level primitives are essential for high-performance reductions.
- Even with the same input size, kernel design alone can yield order-of-magnitude performance differences.
- Baseline Execution: Runs kernels that read from and write to global memory directly, without using shared memory.
- Shared Memory Tiling: Optimized versions split data into tiles staged in shared memory, process them cooperatively within a thread block, then write results back to global memory (see the tiled sketch after this list).
- Warp-Level Optimization: Uses warp-level primitives like `__shfl_down_sync` to perform fast intra-warp reductions without shared memory (see the reduction sketch after this list).
- Performance Comparison: Each demo includes both baseline and optimized versions to measure execution time, throughput, and the impact of shared memory and warp-level operations.
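A condensed sketch of the shared-memory tiling pattern, assuming a square 16×16 tile (names and bounds handling are illustrative; see SH_matrix_multiplication.cu for the actual implementation):

```cuda
#define TILE 16

// Each block computes one TILE x TILE output tile of C = A (M x N) * B (N x P),
// staging tiles of A and B in shared memory so each loaded value is reused
// TILE times instead of being re-fetched from global memory.
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int P) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative, coalesced loads of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = (row < M && t * TILE + threadIdx.x < N)
            ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < N && col < P)
            ? B[(t * TILE + threadIdx.y) * P + col] : 0.0f;
        __syncthreads();

        // Multiply the two tiles entirely out of shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < P) C[row * P + col] = acc;
}
```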
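And a sketch of the warp-level reduction building block (the helper name is illustrative):

```cuda
// Intra-warp sum via shuffle: each step halves the number of active lanes,
// so after log2(32) = 5 steps lane 0 holds the sum of all 32 lanes --
// no shared memory or __syncthreads() required within the warp.
__device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}
```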
- CUDA C/C++ – Core language for implementing high‑performance GPU kernels.
- NVIDIA CUDA Toolkit – Provides the compiler (`nvcc`), runtime libraries, and development utilities.
- Shared Memory & Warp-Level Primitives – GPU optimization techniques for reduced latency and higher throughput.
- CUDA Events – For precise kernel execution timing and performance measurement (see the timing sketch after this list).
- Docker – Containerization for consistent, portable builds and environment setup across systems.
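A minimal, self-contained example of the CUDA-events timing pattern (the kernel here is a placeholder, not one of the repo's demos):

```cuda
#include <cstdio>

__global__ void dummyKernel() {}  // stand-in for any demo kernel

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<1, 1>>>();        // launch the kernel being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("Kernel time: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```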
- Optimize kernels using LLVM and custom PTX tuning for low‑level performance gains.
- Implement multi‑GPU synchronization and collective operations via NVIDIA NCCL for distributed execution.
- Add support for advanced AI‑related kernels such as softmax and common loss functions (e.g., cross‑entropy, MSE).
- Extend profiling and benchmarking suite to measure scalability across multiple GPUs.
- Provide Docker setup for reproducible, portable GPU development environments.

