Date: May 2025
Author: Saurav Verma
Course Project: High Performance Computing, University of Arizona (CUDA implementation and analysis)
This project investigates GPU-based parallelization of the Gallager-B (GaB) bit-flipping decoder for LDPC codes, comparing serial CPU and multiple CUDA GPU implementations.
We systematically benchmarked memory hierarchy strategies (Global, Shared, Constant) and control flows (Host-iterations, Device-iterations, Streaming, Batched Streaming).
Key Result: Batched Streaming achieved a ~460× throughput improvement over baseline GPU decoding; see the Final Report (PDF) for details.
root/
├── data/
│   └── [Static data files for the GaB algorithm: data structures, codeword sizes, etc.]
├── libwb/
│   └── [Standard libwb classes (taken from assignment code)]
├── results/
│   └── [Sample outputs and results generated when running the Slurm script]
├── src/
│   └── [CUDA and C implementations (serial + GPU variants)]
├── docs/
│   └── [Report, Proposal, Slides]
└── videos/
    └── [Screen recordings of compilation & execution]
- Baseline: Serial CPU GaB decoder.
- GPU Approaches (a minimal kernel sketch follows this list):
- Global memory kernels
- Shared memory variant
- Constant memory variant
- Streaming with multiple CUDA streams
- Batched streaming (best-performing)
- Metrics:
- Time per frame (latency)
- Total runtime (throughput)
- Bit Error Rate (BER) and Frame Error Rate (FER) for verification
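To make the decoder's kernel structure concrete, below is a minimal sketch of one GaB iteration using global-memory kernels (the baseline GPU variant), with a simplified hard-decision bit-flip rule. All identifiers (`checkNodeKernel`, `varNodeKernel`, `checkToVar`, `varToCheck`, `dc`, `dv`) and the adjacency layout are illustrative assumptions, not the project's actual source:

```cuda
// Minimal sketch of one Gallager-B iteration with global-memory kernels.
// checkToVar[c*dc + j] lists the dc variables in check c; varToCheck[v*dv + j]
// lists the dv checks touching variable v (names and layout are assumptions).

// One thread per check node: compute the parity of its dc participating bits.
__global__ void checkNodeKernel(const int *bits, const int *checkToVar,
                                int *syndrome, int numChecks, int dc) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChecks) return;
    int parity = 0;
    for (int j = 0; j < dc; ++j)
        parity ^= bits[checkToVar[c * dc + j]];
    syndrome[c] = parity;            // 1 means the check is unsatisfied
}

// One thread per variable node: flip the bit if a majority of its dv
// checks are unsatisfied (a hard-decision bit-flip rule).
__global__ void varNodeKernel(int *bits, const int *varToCheck,
                              const int *syndrome, int numVars, int dv) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVars) return;
    int unsat = 0;
    for (int j = 0; j < dv; ++j)
        unsat += syndrome[varToCheck[v * dv + j]];
    if (2 * unsat > dv)              // strict majority of failing checks
        bits[v] ^= 1;
}
```

The host (or, in the device-iterations variant, a wrapper kernel) alternates these two phases until the syndrome is all zero or an iteration cap is reached.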
- Streaming improved throughput by ~3× compared to non-streamed GPU versions.
- Batched Streaming (batch=100, 3–5 streams) achieved a ~460× speedup in total runtime; a host-side sketch of this control flow follows this list.
- Shared and constant memory variants did not yield improvements due to their access patterns.
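As a rough sketch of the batched-streaming control flow (not the project's actual API; `decodeAll`, `decodeBatchKernel`, and the buffer layout are assumptions), the host cycles batches of frames over a few CUDA streams so host-device copies overlap kernel execution:

```cuda
#include <cuda_runtime.h>

#define NUM_STREAMS 3      // 3-5 streams performed best in the experiments
#define THREADS     128

// Placeholder for the batched GaB decoder: one block per frame.
__global__ void decodeBatchKernel(int *frames, int frameLen) {
    // ... run GaB iterations on frames[blockIdx.x * frameLen + ...] ...
}

// Decode numBatches batches of batchSize frames each. hostIn must be
// pinned (cudaMallocHost) for the async copies to actually overlap, and
// devIn must hold NUM_STREAMS batch-sized slots so in-flight batches
// on different streams never share a buffer.
void decodeAll(int *hostIn, int *devIn,
               int numBatches, int batchSize, int frameLen) {
    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    size_t bytes = (size_t)batchSize * frameLen * sizeof(int);
    for (int b = 0; b < numBatches; ++b) {
        cudaStream_t st = streams[b % NUM_STREAMS];
        int *h = hostIn + (size_t)b * batchSize * frameLen;
        int *d = devIn + (size_t)(b % NUM_STREAMS) * batchSize * frameLen;

        // Copy in, decode, copy out: all asynchronous on this stream, so
        // work queued on the other streams can overlap each stage.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, st);
        decodeBatchKernel<<<batchSize, THREADS, 0, st>>>(d, frameLen);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, st);
    }
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

Because a later batch reuses a stream (and its buffer slot) only after the earlier batch on that stream has drained, the per-stream slots stay consistent while transfers and compute from different streams overlap.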
Figure: Tanner Graph representation of LDPC code
Figure: Overview of global, shared, streaming, and batched-streaming implementations
Figure: Batched Streaming achieves lowest latency across α values
Figure: Batched Streaming ~460× faster than baseline GPU implementation
Figure: Compute time vs. data transfer time across CUDA variants
- videos/run_demo.mp4: compilation + execution
- videos/validation.mp4: CUDA profiling + validation session
Example run:

./batchedStreaming ../data/IRISC_dv4_R050_L54_N1296_Dform 3 100

(The trailing arguments presumably select the number of CUDA streams (3) and the batch size (100), matching the best-performing configuration above.)

- Shared/constant memory did not yield speedups due to access divergence. Keep constant memory data structures small (see the sketch below).
- Reducing kernel launches is not always faster; large intra-kernel loops can add synchronization overhead.
- Combining stream concurrency with batching gave the best results.
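To illustrate the constant-memory caveat above: constant memory is capped at 64 KB and its cache broadcasts efficiently only when all threads in a warp read the same address, so it suits small, read-only, uniformly indexed tables. A minimal sketch follows (sizes and names are assumptions, not the project's code):

```cuda
#include <cuda_runtime.h>

// A small, read-only adjacency table is a reasonable constant-memory
// candidate; anything approaching the 64 KB limit, or indexed divergently
// across a warp, is not. MAX_EDGES is an illustrative bound.
#define MAX_EDGES 8192
__constant__ int c_checkToVar[MAX_EDGES];

// Upload the host-side table once, before launching decode kernels.
void uploadGraph(const int *h_checkToVar, int numEdges) {
    cudaMemcpyToSymbol(c_checkToVar, h_checkToVar,
                       (size_t)numEdges * sizeof(int));
}
```

When threads in a warp index such a table divergently, constant-cache reads serialize, which matches the access-divergence behavior reported above.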