Parallel Gallager-B LDPC Decoding on CUDA

Date: May 2025
Author: Saurav Verma
Course Project: High Performance Computing, University of Arizona (CUDA implementation and analysis)

This project investigates GPU-based parallelization of the Gallager-B (GaB) bit-flipping decoder for LDPC codes, comparing serial CPU and multiple CUDA GPU implementations.

We systematically benchmarked memory hierarchy strategies (Global, Shared, Constant) and control flows (Host-iterations, Device-iterations, Streaming, Batched Streaming).

Key Result: Batched Streaming achieved a ~460× throughput improvement over the baseline GPU decoder; see the Final Report (PDF) in docs/ for details.
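
To make the algorithm concrete, below is a minimal serial sketch of one Gallager-B iteration in C: check nodes compute parity over their attached bits, and each bit flips when a majority of its checks are unsatisfied. The adjacency layout, check degree (DC), and all names are illustrative only; the repository's Dform data files use their own representation.

/* Minimal serial sketch of one Gallager-B iteration (hard-decision
 * majority voting). Layout is illustrative, not the repo's Dform format:
 * chk_to_var[c] lists the variable nodes of check c, and
 * var_to_chk[v] lists the check nodes of variable v. */
#define N  1296   /* codeword length (matches the N1296 test code) */
#define M  648    /* number of parity checks (rate 1/2) */
#define DV 4      /* variable-node degree (dv4) */
#define DC 8      /* check-node degree, assumed for illustration */

void gab_iteration(const int chk_to_var[M][DC],
                   const int var_to_chk[N][DV],
                   unsigned char bits[N])
{
    unsigned char syndrome[M];

    /* Check-node pass: a check is unsatisfied if its bits XOR to 1. */
    for (int c = 0; c < M; c++) {
        unsigned char s = 0;
        for (int j = 0; j < DC; j++)
            s ^= bits[chk_to_var[c][j]];
        syndrome[c] = s;
    }

    /* Variable-node pass: flip a bit when a strict majority of its
     * checks are unsatisfied (bit-flipping form of Gallager-B). */
    for (int v = 0; v < N; v++) {
        int unsat = 0;
        for (int j = 0; j < DV; j++)
            unsat += syndrome[var_to_chk[v][j]];
        if (2 * unsat > DV)
            bits[v] ^= 1;
    }
}

Decoding repeats this iteration until the syndrome is all-zero or an iteration limit is reached.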


Repository Structure

root/
├── data/
│    └── Static data files for the GaB algorithm (data structures, codeword sizes, etc.)
├── libwb/
│    └── Standard libwb classes (taken from assignment code)
├── results/
│    └── Sample outputs and results generated when running the Slurm script
├── src/
│    └── CUDA and C implementations (serial + GPU variants)
├── docs/
│    └── Report, proposal, and slides
└── videos/
     └── Screen recordings of compilation & execution


🧮 Methodology

  • Baseline: Serial CPU GaB decoder.
  • GPU Approaches (a minimal global-memory kernel sketch follows this list):
    • Global memory kernels
    • Shared memory variant
    • Constant memory variant
    • Streaming with multiple CUDA streams
    • Batched streaming (best-performing)
  • Metrics:
    • Time per frame (latency)
    • Total runtime (throughput)
    • Bit Error Rate (BER) and Frame Error Rate (FER) for verification
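
The sketch below shows how the two passes of the serial iteration map naturally onto global-memory CUDA kernels, with one thread per check node and then one per variable node. Kernel and array names are hypothetical, not the repository's.

/* Illustrative CUDA port of the two passes (global-memory variant). */
__global__ void check_pass(const int *chk_to_var, const unsigned char *bits,
                           unsigned char *syndrome, int m, int dc)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= m) return;
    unsigned char s = 0;
    for (int j = 0; j < dc; j++)
        s ^= bits[chk_to_var[c * dc + j]];
    syndrome[c] = s;
}

__global__ void var_pass(const int *var_to_chk, const unsigned char *syndrome,
                         unsigned char *bits, int n, int dv)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int unsat = 0;
    for (int j = 0; j < dv; j++)
        unsat += syndrome[var_to_chk[v * dv + j]];
    if (2 * unsat > dv)
        bits[v] ^= 1;   /* majority of checks unsatisfied: flip */
}

In the host-iterations control flow, the CPU launches these kernels once per decoding iteration; the device-iterations variant instead moves the iteration loop inside a single kernel, trading launch overhead for intra-kernel synchronization.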

📊 Results Highlights

  • Streaming improved throughput by ~3× compared to non-streamed GPU versions.
  • Batched Streaming (batch=100, 3–5 streams) achieved a ~460× speedup in total runtime (pattern sketched below).
  • Shared/constant memory did not yield improvements due to access patterns.
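
The batched-streaming pattern is sketched below under stated assumptions: pinned host buffers, one batch-sized device buffer per stream, and a stand-in decode_batch kernel. The actual kernels, buffer sizes, and names in src/ differ. Each stream asynchronously copies a batch of frames in, decodes it, and copies results back, so transfers for one batch overlap compute for batches in other streams.

#include <cuda_runtime.h>

#define NUM_STREAMS 3       /* best results reported with 3-5 streams */
#define BATCH       100     /* frames per batch */
#define FRAME_BYTES 1296    /* one byte per bit, N = 1296 */

/* Stand-in for the real decoding kernel(s); one block per frame. */
__global__ void decode_batch(unsigned char *frames) { /* GaB iterations here */ }

void run_batched(unsigned char *h_frames,  /* pinned via cudaHostAlloc */
                 unsigned char *d_frames,  /* NUM_STREAMS batch-sized buffers */
                 int num_batches)
{
    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; s++)
        cudaStreamCreate(&streams[s]);

    const size_t bytes = (size_t)BATCH * FRAME_BYTES;
    for (int b = 0; b < num_batches; b++) {
        int s = b % NUM_STREAMS;             /* round-robin over streams */
        unsigned char *h = h_frames + (size_t)b * bytes;
        unsigned char *d = d_frames + (size_t)s * bytes;

        /* Async copy-in, decode, copy-out on the same stream. */
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]);
        decode_batch<<<BATCH, 128, 0, streams[s]>>>(d);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NUM_STREAMS; s++)
        cudaStreamDestroy(streams[s]);
}

Because operations issued to the same stream execute in order, reusing a stream's device buffer for a later batch is safe without extra synchronization.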

Tanner Graph (LDPC structure)


Figure: Tanner Graph representation of LDPC code


Explored Optimization Strategies


Figure: Overview of global, shared, streaming, and batched-streaming implementations


Time per Frame vs Noise (α)


Figure: Batched Streaming achieves lowest latency across α values


Combined Analysis – Total Runtime


Figure: Batched Streaming ~460× faster than baseline GPU implementation

Combined Analysis – Compute vs. Data Transfer


Figure: Compute time vs. data-transfer time across CUDA variants


Demonstration

Example run:

./batchedStreaming ../data/IRISC_dv4_R050_L54_N1296_Dform 3 100
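
The arguments appear to be the parity-check data file, the number of CUDA streams (3), and the batch size in frames (100), matching the best-performing configuration noted above.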

Reports and slides are in docs/; screen recordings of compilation and execution are in videos/.


Lessons Learned

  • Shared/constant memory did not yield speedups due to access divergence; keep constant-memory data structures small.
  • Reducing kernel launches isn't always faster: large intra-kernel loops can add synchronization overhead.
  • Combining concurrency (streams) with batching gave the best results.
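
To make the first lesson concrete: the constant cache broadcasts efficiently only when all threads of a warp read the same address, but in the variable-node pass each thread indexes a different entry of the adjacency table, so the reads serialize. A hedged sketch follows; names mirror the earlier illustrative kernels, and the 64 KB constant-memory limit bounds what fits.

/* Illustrative constant-memory placement of the adjacency table. */
#define N  1296
#define DV 4
__constant__ int c_var_to_chk[N * DV];   /* ~20 KB, under the 64 KB limit */

__global__ void var_pass_const(const unsigned char *syndrome,
                               unsigned char *bits, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int unsat = 0;
    for (int j = 0; j < DV; j++)
        unsat += syndrome[c_var_to_chk[v * DV + j]];  /* divergent index:
                                                         reads serialize */
    if (2 * unsat > DV)
        bits[v] ^= 1;
}

The host fills c_var_to_chk with cudaMemcpyToSymbol before launch; the table fits comfortably at this code size, but the divergent indexing erases most of the cache benefit, consistent with the results above.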

About

Parallel batched Gallager-B decoder with CUDA kernels, streams, and single-block optimization.
