A High-Performance CUDA GEMM Optimization Learning Project
A lightweight neural network inference engine focused on GEMM (General Matrix Multiply) optimization. This project demonstrates progressive optimization techniques to achieve high-performance matrix multiplication on NVIDIA GPUs, serving as an excellent learning resource for GPU programming and deep learning system optimization.
- Educational: Learn GPU optimization techniques step by step
- Practical: Achieve 70-80% of cuBLAS performance
- Complete: Full inference pipeline from weights to predictions
| Level | Technique | Description | Expected Speedup |
|---|---|---|---|
| 1 | Naive | Baseline implementation | 1x |
| 2 | Tiled | Shared memory tiling | 5-10x |
| 3 | Coalesced | Memory coalescing | +20% |
| 4 | Double Buffer | Latency hiding | +15% |
| 5 | Register Blocked | Register-level tiling | +50% |
| 6 | Fused | Kernel fusion | +30% |
| 7 | Vectorized | float4 loads | +10% |
- Half Precision (FP16): Mixed precision with FP32 accumulation
- INT8 Quantization: Weight compression and calibration
- Auto-Tuner: Automatic kernel selection
- Stream Manager: Multi-stream concurrent execution
- Batched GEMM: Parallel matrix multiplications
- Memory Pool: Efficient GPU memory management
- Profiler: Roofline model analysis
- CUDA Toolkit 11.0+ (tested with 13.1)
- CMake 3.18+
- C++17 compatible compiler
- NVIDIA GPU with Compute Capability 7.5+ (Turing, Ampere, Ada, Hopper)
```bash
# Clone the repository
git clone https://github.com/yourusername/mini-inference-engine.git
cd mini-inference-engine

# Build with tests
mkdir build && cd build
cmake ..
make -j$(nproc)

# Build without tests (faster, no network required)
cmake -DBUILD_TESTS=OFF ..
make -j$(nproc)
```

Run the benchmark:

```bash
./build/benchmark
```

Sample output:
```
════════════════════════════════════════════════════════════════
             Mini-Inference Engine GEMM Benchmark
════════════════════════════════════════════════════════════════
GPU: NVIDIA GeForce RTX 3060
────────────────────────────────────────────────────────────────
Matrix Size: M=1024, N=1024, K=1024 (2.15 GFLOPs)
────────────────────────────────────────────────────────────────
Kernel     |   Time    |   Performance    | vs cuBLAS
────────────────────────────────────────────────────────────────
cuBLAS     | 0.120 ms  | 17916.67 GFLOPS  |  100.0%
Naive      | 2.450 ms  |   877.55 GFLOPS  |    4.9%
Tiled      | 0.890 ms  |  2415.73 GFLOPS  |   13.5%
Optimized  | 0.165 ms  | 13030.30 GFLOPS  |   72.7%
────────────────────────────────────────────────────────────────
```
```bash
# Run the MNIST demo
./build/mnist_demo

# Detailed per-kernel analysis
./build/detailed_benchmark

# Run the test suite
./build/tests
```

```
mini-inference-engine/
├── include/
│   ├── common.h              # Core data structures
│   ├── kernels.cuh           # CUDA kernel declarations
│   ├── inference_engine.h    # Inference engine interface
│   ├── tensor.h              # N-dimensional tensor class
│   ├── memory_pool.h         # GPU memory pool
│   ├── stream_manager.h      # CUDA stream management
│   ├── batch_gemm.h          # Batched GEMM operations
│   ├── quantization.h        # INT8 quantization
│   ├── vectorized_gemm.cuh   # Vectorized GEMM
│   ├── half_gemm.cuh         # FP16 GEMM
│   ├── profiler.h            # Performance profiler
│   ├── autotuner.h           # Auto-tuner
│   ├── logger.h              # Logging system
│   └── config.h              # Configuration management
├── src/
│   ├── naive_matmul.cu       # Level 1: Naive
│   ├── tiled_gemm.cu         # Level 2: Tiled
│   ├── coalesced_gemm.cu     # Level 3: Coalesced
│   ├── double_buffer_gemm.cu # Level 4: Double buffer
│   ├── optimized_gemm.cu     # Level 5: Register blocked
│   ├── fused_gemm.cu         # Level 6: Fused
│   ├── vectorized_gemm.cu    # Level 7: Vectorized
│   ├── half_gemm.cu          # FP16 implementation
│   ├── tensor.cu             # Tensor operations
│   ├── benchmark.cu          # Benchmark utilities
│   └── inference_engine.cpp  # Inference engine
├── tests/
│   ├── test_gemm.cpp         # GEMM correctness
│   ├── test_fusion.cpp       # Fusion tests
│   ├── test_tensor.cpp       # Tensor tests
│   ├── test_memory_pool.cpp  # Memory pool tests
│   ├── test_quantization.cpp # Quantization tests
│   └── ...
├── benchmarks/
│   ├── benchmark.cpp         # Main benchmark
│   ├── detailed_benchmark.cu # Detailed analysis
│   └── mnist_demo.cpp        # MNIST demo
├── scripts/
│   └── export_mnist_weights.py
└── CMakeLists.txt
```
Global Memory → Shared Memory → Registers → Compute

Reduces global memory accesses from O(M×N×K) to O(M×N×K/TILE_SIZE).
Ensures threads in a warp access consecutive memory addresses:
```
Thread 0 → Address 0
Thread 1 → Address 4
Thread 2 → Address 8
...
```
```
Step i:   Buffer A: load tile[i+1]    | Buffer B: compute tile[i]
Step i+1: Buffer A: compute tile[i+1] | Buffer B: load tile[i+2]
```
Overlaps computation with memory transfers.
Each thread computes a TM×TN tile:

```
Thread registers: [TM×TN output values]
                  [TM values from A]
                  [TN values from B]
```
```
Separate: GEMM → Store → Load → Bias → Store → Load → ReLU
Fused:    GEMM → Bias → ReLU (single kernel)
```
Eliminates intermediate memory traffic.
```cuda
float4 a = *reinterpret_cast<float4*>(&A[idx]); // 128-bit load
```

4x fewer memory transactions.
| Matrix Size | Recommended BM×BN×BK |
|---|---|
| Small (<512) | 64×64×8 |
| Medium (512-2048) | 128×128×8 |
| Large (>2048) | 128×256×16 |
| Architecture | Recommended Settings |
|---|---|
| Volta (SM 7.0) | BM=128, BN=128, use_tensor_cores=false |
| Turing (SM 7.5) | BM=128, BN=128, use_tensor_cores=true |
| Ampere (SM 8.0) | BM=128, BN=256, use_async_copy=true |
| Ada (SM 8.9) | BM=128, BN=256, use_async_copy=true |
- Alignment: Ensure 128-byte alignment for vectorized loads
- Bank Conflicts: Pad shared memory to avoid conflicts
- Occupancy: Balance registers vs threads per block
```bash
# Profile with NVIDIA Nsight Systems
nsys profile ./build/benchmark

# Detailed kernel analysis with Nsight Compute
ncu --set full ./build/benchmark
```

Contributions are welcome! Areas for improvement:
- Tensor Core support (WMMA)
- Multi-GPU support
- More activation functions
- ONNX model loading
- INT4 quantization
- Quick Start Guide
- Architecture
- GEMM Optimization
- Performance Tuning
- API Reference
- Contributing
MIT License
This project is inspired by:
- NVIDIA CUTLASS library
- Simon Boehm's CUDA optimization tutorials
- Various GPU programming courses and papers