Date: May 2025
Author: Saurav Verma
Course Project: High Performance Computing, University of Arizona (CUDA implementation and analysis)
This project investigates GPU-based parallelization of the Gallager-B (GaB) bit-flipping decoder for LDPC codes, comparing serial CPU and multiple CUDA GPU implementations.
We systematically benchmarked memory hierarchy strategies (Global, Shared, Constant) and control flows (Host-iterations, Device-iterations, Streaming, Batched Streaming).
Key Result: Batched Streaming achieved a ~460× throughput improvement over baseline GPU decoding; see the Final Report (PDF) for details.
root/
├── data/
│   └── [Static data files for the GaB algorithm: data structures, codeword sizes, etc.]
├── libwb/
│   └── [Standard libwb classes (taken from assignment code)]
├── results/
│   └── [Sample outputs and results generated when running the Slurm script]
├── src/
│   └── [CUDA and C implementations (serial + GPU variants)]
├── docs/
│   └── [Report, Proposal, Slides]
└── videos/
    └── [Screen recordings of compilation & execution]
- Baseline: Serial CPU GaB decoder.
- GPU Approaches (a minimal kernel sketch follows this list):
- Global memory kernels
- Shared memory variant
- Constant memory variant
- Streaming with multiple CUDA streams
- Batched streaming (best-performing)
- Metrics:
- Time per frame (latency)
- Total runtime (throughput)
- Bit Error Rate (BER) and Frame Error Rate (FER) for verification
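To make the decoder's kernel structure concrete, below is a minimal sketch of one GaB iteration using global-memory kernels (the baseline GPU variant), with a simplified hard-decision bit-flip rule. All identifiers (`checkNodeKernel`, `varNodeKernel`, `checkToVar`, `varToCheck`, `dc`, `dv`) and the adjacency layout are illustrative assumptions, not the project's actual source:

```cuda
// Minimal sketch of one Gallager-B iteration with global-memory kernels.
// checkToVar[c*dc + j] lists the dc variables in check c; varToCheck[v*dv + j]
// lists the dv checks touching variable v (names and layout are assumptions).

// One thread per check node: compute the parity of its dc participating bits.
__global__ void checkNodeKernel(const int *bits, const int *checkToVar,
                                int *syndrome, int numChecks, int dc) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChecks) return;
    int parity = 0;
    for (int j = 0; j < dc; ++j)
        parity ^= bits[checkToVar[c * dc + j]];
    syndrome[c] = parity;            // 1 means the check is unsatisfied
}

// One thread per variable node: flip the bit if a majority of its dv
// checks are unsatisfied (a hard-decision bit-flip rule).
__global__ void varNodeKernel(int *bits, const int *varToCheck,
                              const int *syndrome, int numVars, int dv) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVars) return;
    int unsat = 0;
    for (int j = 0; j < dv; ++j)
        unsat += syndrome[varToCheck[v * dv + j]];
    if (2 * unsat > dv)              // strict majority of failing checks
        bits[v] ^= 1;
}
```

The host (or, in the device-iterations variant, a wrapper kernel) alternates these two phases until the syndrome is all zero or an iteration cap is reached.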
- Streaming improved throughput by ~3× compared to non-streamed GPU versions.
- Batched Streaming (batch=100, 3–5 streams) achieved a ~460× speedup in total runtime; a host-side sketch of this control flow follows this list.
- Shared and constant memory variants did not yield improvements due to their access patterns.
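As a rough sketch of the batched-streaming control flow (not the project's actual API; `decodeAll`, `decodeBatchKernel`, and the buffer layout are assumptions), the host cycles batches of frames over a few CUDA streams so host-device copies overlap kernel execution:

```cuda
#include <cuda_runtime.h>

#define NUM_STREAMS 3      // 3-5 streams performed best in the experiments
#define THREADS     128

// Placeholder for the batched GaB decoder: one block per frame.
__global__ void decodeBatchKernel(int *frames, int frameLen) {
    // ... run GaB iterations on frames[blockIdx.x * frameLen + ...] ...
}

// Decode numBatches batches of batchSize frames each. hostIn must be
// pinned (cudaMallocHost) for the async copies to actually overlap, and
// devIn must hold NUM_STREAMS batch-sized slots so in-flight batches
// on different streams never share a buffer.
void decodeAll(int *hostIn, int *devIn,
               int numBatches, int batchSize, int frameLen) {
    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    size_t bytes = (size_t)batchSize * frameLen * sizeof(int);
    for (int b = 0; b < numBatches; ++b) {
        cudaStream_t st = streams[b % NUM_STREAMS];
        int *h = hostIn + (size_t)b * batchSize * frameLen;
        int *d = devIn + (size_t)(b % NUM_STREAMS) * batchSize * frameLen;

        // Copy in, decode, copy out: all asynchronous on this stream, so
        // work queued on the other streams can overlap each stage.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, st);
        decodeBatchKernel<<<batchSize, THREADS, 0, st>>>(d, frameLen);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, st);
    }
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

Because a later batch reuses a stream (and its buffer slot) only after the earlier batch on that stream has drained, the per-stream slots stay consistent while transfers and compute from different streams overlap.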
Figure: Tanner Graph representation of LDPC code
Figure: Overview of global, shared, streaming, and batched-streaming implementations
Figure: Batched Streaming achieves lowest latency across α values
Figure: Batched Streaming ~460× faster than baseline GPU implementation
Figure: Compute time vs. data transfer time across CUDA variants
- videos/run_demo.mp4: compilation + execution
- videos/validation.mp4: CUDA profiling + validation session
Example run:

./batchedStreaming ../data/IRISC_dv4_R050_L54_N1296_Dform 3 100

(The trailing arguments presumably select the number of CUDA streams (3) and the batch size (100), matching the best-performing configuration above.)

- Shared/constant memory did not yield speedups due to access divergence. Keep constant memory data structures small (see the sketch below).
- Reducing kernel launches is not always faster; large intra-kernel loops can add synchronization overhead.
- Combining stream concurrency with batching gave the best results.
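To illustrate the constant-memory caveat above: constant memory is capped at 64 KB and its cache broadcasts efficiently only when all threads in a warp read the same address, so it suits small, read-only, uniformly indexed tables. A minimal sketch follows (sizes and names are assumptions, not the project's code):

```cuda
#include <cuda_runtime.h>

// A small, read-only adjacency table is a reasonable constant-memory
// candidate; anything approaching the 64 KB limit, or indexed divergently
// across a warp, is not. MAX_EDGES is an illustrative bound.
#define MAX_EDGES 8192
__constant__ int c_checkToVar[MAX_EDGES];

// Upload the host-side table once, before launching decode kernels.
void uploadGraph(const int *h_checkToVar, int numEdges) {
    cudaMemcpyToSymbol(c_checkToVar, h_checkToVar,
                       (size_t)numEdges * sizeof(int));
}
```

When threads in a warp index such a table divergently, constant-cache reads serialize, which matches the access-divergence behavior reported above.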