This repository contains multiple CUDA implementations of Bitonic Sort with various optimizations, along with a Radix Sort implementation for performance comparison. The project demonstrates GPU-accelerated sorting techniques, highlights optimization trade-offs, and analyzes performance using CUDA events and the NVIDIA Nsight profiler.
- Introduction
- Setup and Requirements
- Compilation and Usage
- Directory Structure
- Algorithm Versions
- Performance Metrics
- Debugging and Verification
- Makefile Targets
- Cleanup
This project implements four optimized versions of Bitonic Sort and a Radix Sort on NVIDIA GPUs using CUDA:
| Algorithm | Optimizations Applied |
|---|---|
| Bitonic Sort V0 | Baseline implementation using global memory |
| Bitonic Sort V1 | Kernel fusion for intra-block communication |
| Bitonic Sort V2 | Shared memory for efficient intra-block operations |
| Bitonic Sort V3 | Reduced thread spawn, warp shuffling, pinned memory |
| Radix Sort | Uses CUDA CUB library for comparison |
✅ Hardware:
- NVIDIA GPU with Compute Capability ≥ 3.5
✅ Software:
- CUDA Toolkit ≥ 11.0
- NVCC Compiler (part of CUDA Toolkit)
- Make build system
✅ Tested On:
- NVIDIA T4 (Compute 7.5)
- NVIDIA Tesla P100 (Compute 6.0)
- NVIDIA Ampere A100 (Compute 8.0)
Use the provided Makefile to build all implementations:
make allTo build specific versions:
make bitonic v=[0|1|2|3] # Build a specific Bitonic Sort version
make radix # Build Radix Sort✅ Executables are stored in the bin/ directory:
bitonic_v0,bitonic_v1,bitonic_v2,bitonic_v3(Bitonic Sort versions)radix_sort(Radix Sort)
Run with q specifying log₂(number of elements):
make run v=[0|1|2|3|radix] q=<value>Example (Sort 2²⁰ elements using Bitonic Sort V2):
make run v=2 q=20📦 Project Root
┣ 📂 src/ # Source code for Bitonic and Radix Sort implementations
┣ 📂 bin/ # Compiled executables
┣ 📂 results/ # Performance metrics and verification results
┣ 📂 docs/ # Documentation and report files
┣ 📜 Makefile # Build system
- Uses global memory for all operations.
- Simple one-thread-per-element approach.
- Serves as a performance baseline for optimizations.
- Fuses multiple sorting steps into a single kernel launch.
- Minimizes global synchronization overhead.
- Memory access pattern is still global memory dependent.
- Performance Improvement: Achieves a 30% speedup over V0.
- Utilizes fast shared memory for intra-block operations.
- Reduces global memory transactions, improving performance.
- Performance Improvement: Achieves a 20% speedup over V1.
- Minimized thread spawn in main sorting kernel.
- Optimized warp shuffle operations to reduce global synchronization.
- Pinned Memory Usage for efficient host-device transfers.
- Performance Improvement: Achieves a 20% speedup over V2.
- Uses CUB library for Radix Sort.
- Included for performance comparison with Bitonic Sort.
Performance is measured using CUDA events (cudaEventElapsedTime). Each test runs multiple times to ensure accurate measurement.
🔍 Profiling Tools Used:
- NVIDIA Nsight Profiler for kernel analysis
- CUDA event timers for execution time measurement
📈 Performance Graph (A100 GPU):

To enable/disable debugging and verification options, modify the following #define statements in the beginning of each source file:
#define DEBUG // Enables detailed output for debugging
#define VERIFY_SEQUENTIAL // Enables comparison with sequential CPU sort (for verification)| Command | Description |
|---|---|
make all |
Builds all Bitonic Sort versions and Radix Sort |
make bitonic v=[0|1|2|3] |
Builds a specific Bitonic Sort version |
make radix |
Builds Radix Sort |
make run v=[0|1|2|3|radix] q=<value> |
Runs the specified sorting algorithm with 2^q elements |
make clean |
Removes all compiled binaries and object files |
make help |
Displays help message |
To remove all compiled executables:
make clean