Mini-Inference Engine

A High-Performance CUDA GEMM Optimization Learning Project


A lightweight neural network inference engine focused on GEMM (General Matrix Multiply) optimization. This project demonstrates progressive optimization techniques to achieve high-performance matrix multiplication on NVIDIA GPUs, serving as an excellent learning resource for GPU programming and deep learning system optimization.

🎯 Project Goals

  • Educational: Learn GPU optimization techniques step by step
  • Practical: Achieve 70-80% of cuBLAS performance
  • Complete: Full inference pipeline from weights to predictions

✨ Features

GEMM Optimization Levels

Level | Technique        | Description              | Expected Speedup
------|------------------|--------------------------|-----------------
1     | Naive            | Baseline implementation  | 1x
2     | Tiled            | Shared memory tiling     | 5-10x
3     | Coalesced        | Memory coalescing        | +20%
4     | Double Buffer    | Latency hiding           | +15%
5     | Register Blocked | Register-level tiling    | +50%
6     | Fused            | Kernel fusion            | +30%
7     | Vectorized       | float4 loads             | +10%
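For reference, the Level 1 baseline amounts to one thread per output element. The sketch below is illustrative (kernel name and signature are assumptions, not necessarily the repository's actual symbols); it computes C[M×N] = A[M×K] · B[K×N] in row-major layout.

```cuda
// Level 1 (naive) sketch: one thread per C element, K global loads from
// each of A and B per output value -- the baseline every later level beats.
__global__ void naive_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```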

Advanced Features

  • Half Precision (FP16): Mixed precision with FP32 accumulation
  • INT8 Quantization: Weight compression and calibration
  • Auto-Tuner: Automatic kernel selection
  • Stream Manager: Multi-stream concurrent execution
  • Batched GEMM: Parallel matrix multiplications
  • Memory Pool: Efficient GPU memory management
  • Profiler: Roofline model analysis
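The core idea behind the FP16 feature, mixed precision with FP32 accumulation, can be sketched as a minimal kernel (illustrative only; the repository's `half_gemm.cu` is the real implementation):

```cuda
#include <cuda_fp16.h>

// Mixed-precision sketch: operands stored as __half (half the memory
// traffic), but the dot product accumulates in a float register so fp16
// rounding error does not compound over K terms.
__global__ void half_gemm_naive(const __half* A, const __half* B, __half* C,
                                int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;  // FP32 accumulator
        for (int k = 0; k < K; ++k)
            acc += __half2float(A[row * K + k]) * __half2float(B[k * N + col]);
        C[row * N + col] = __float2half(acc);  // round back to half once
    }
}
```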

📋 Requirements

  • CUDA Toolkit 11.0+ (tested with 13.1)
  • CMake 3.18+
  • C++17 compatible compiler
  • NVIDIA GPU with Compute Capability 7.5+ (Turing, Ampere, Ada, Hopper)

🔧 Building

# Clone the repository
git clone https://github.com/yourusername/mini-inference-engine.git
cd mini-inference-engine

# Build with tests
mkdir build && cd build
cmake ..
make -j$(nproc)

# Build without tests (faster, no network required)
cmake -DBUILD_TESTS=OFF ..
make -j$(nproc)

🚀 Running

Performance Benchmark

./build/benchmark

Sample output:

╔══════════════════════════════════════════════════════════════╗
║            Mini-Inference Engine GEMM Benchmark              ║
╚══════════════════════════════════════════════════════════════╝
GPU: NVIDIA GeForce RTX 3060
┌──────────────────────────────────────────────────────────────┐
│ Matrix Size: M=1024, N=1024, K=1024 (2.15 GFLOPs)            │
├──────────────────────────────────────────────────────────────┤
│          Kernel |       Time |     Performance   | vs cuBLAS │
├──────────────────────────────────────────────────────────────┤
│          cuBLAS |   0.120 ms | 17916.67 GFLOPS   |    100.0% │
│           Naive |   2.450 ms |   877.55 GFLOPS   |      4.9% │
│           Tiled |   0.890 ms |  2415.73 GFLOPS   |     13.5% │
│       Optimized |   0.165 ms | 13030.30 GFLOPS   |     72.7% │
└──────────────────────────────────────────────────────────────┘

MNIST Demo

./build/mnist_demo

Detailed Analysis

./build/detailed_benchmark

Run Tests

./build/tests

๐Ÿ“ Project Structure

mini-inference-engine/
├── include/
│   ├── common.h              # Core data structures
│   ├── kernels.cuh           # CUDA kernel declarations
│   ├── inference_engine.h    # Inference engine interface
│   ├── tensor.h              # N-dimensional tensor class
│   ├── memory_pool.h         # GPU memory pool
│   ├── stream_manager.h      # CUDA stream management
│   ├── batch_gemm.h          # Batched GEMM operations
│   ├── quantization.h        # INT8 quantization
│   ├── vectorized_gemm.cuh   # Vectorized GEMM
│   ├── half_gemm.cuh         # FP16 GEMM
│   ├── profiler.h            # Performance profiler
│   ├── autotuner.h           # Auto-tuner
│   ├── logger.h              # Logging system
│   └── config.h              # Configuration management
├── src/
│   ├── naive_matmul.cu       # Level 1: Naive
│   ├── tiled_gemm.cu         # Level 2: Tiled
│   ├── coalesced_gemm.cu     # Level 3: Coalesced
│   ├── double_buffer_gemm.cu # Level 4: Double buffer
│   ├── optimized_gemm.cu     # Level 5: Register blocked
│   ├── fused_gemm.cu         # Level 6: Fused
│   ├── vectorized_gemm.cu    # Level 7: Vectorized
│   ├── half_gemm.cu          # FP16 implementation
│   ├── tensor.cu             # Tensor operations
│   ├── benchmark.cu          # Benchmark utilities
│   └── inference_engine.cpp  # Inference engine
├── tests/
│   ├── test_gemm.cpp         # GEMM correctness
│   ├── test_fusion.cpp       # Fusion tests
│   ├── test_tensor.cpp       # Tensor tests
│   ├── test_memory_pool.cpp  # Memory pool tests
│   ├── test_quantization.cpp # Quantization tests
│   └── ...
├── benchmarks/
│   ├── benchmark.cpp         # Main benchmark
│   ├── detailed_benchmark.cu # Detailed analysis
│   └── mnist_demo.cpp        # MNIST demo
├── scripts/
│   └── export_mnist_weights.py
└── CMakeLists.txt

📚 Optimization Techniques Explained

1. Tiling with Shared Memory

Global Memory → Shared Memory → Registers → Compute

Reduces global memory accesses from O(M×N×K) to O(M×N×K/TILE_SIZE).
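A compact sketch of the idea (illustrative; the repository's `tiled_gemm.cu` is the real kernel, and TILE=32 is an assumed size): each block cooperatively stages one TILE×TILE tile of A and of B into shared memory, so every global element is read once per tile rather than once per output element.

```cuda
#define TILE 32

// Shared-memory tiling sketch: stage tiles, sync, compute, sync, repeat.
__global__ void tiled_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // tile fully staged before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before it is overwritten
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```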

2. Memory Coalescing

Ensures threads in a warp access consecutive memory addresses:

Thread 0 → Address 0
Thread 1 → Address 4
Thread 2 → Address 8
...
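The contrast is easiest to see in a copy kernel (an illustrative sketch, not repository code): in the coalesced version a warp's 32 loads fall in consecutive addresses and merge into a few 128-byte transactions, while in the strided version each thread touches its own cache line.

```cuda
// Coalesced: thread t reads address 4*t -- one transaction per warp sector.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread t reads address 4*t*stride -- up to 32 transactions per warp.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```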

3. Double Buffering

Buffer A: Load tile[i+1] | Buffer B: Compute tile[i]
Buffer A: Compute tile[i+1] | Buffer B: Load tile[i+2]

Overlaps computation with memory transfers.
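A skeleton of the ping-pong pattern (illustrative; the repository's `double_buffer_gemm.cu` is the real kernel, TILE=32 assumed). Two shared-memory buffers alternate roles so the loads for tile i+1 are issued before the compute on tile i; on Ampere and later the same structure is typically expressed with `cp.async` for true hardware overlap.

```cuda
#define TILE 32

__global__ void double_buffer_gemm(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    int buf = 0;
    // Prologue: stage tile 0 into buffer 0.
    As[0][threadIdx.y][threadIdx.x] =
        (row < M && threadIdx.x < K) ? A[row * K + threadIdx.x] : 0.0f;
    Bs[0][threadIdx.y][threadIdx.x] =
        (threadIdx.y < K && col < N) ? B[threadIdx.y * N + col] : 0.0f;
    __syncthreads();
    for (int t = TILE; t < K; t += TILE) {
        int next = buf ^ 1;
        // Issue loads for the NEXT tile into the idle buffer...
        As[next][threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[next][threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        // ...while computing on the buffer staged last iteration.
        for (int k = 0; k < TILE; ++k)
            acc += As[buf][threadIdx.y][k] * Bs[buf][k][threadIdx.x];
        __syncthreads();  // next buffer fully staged before the swap
        buf = next;
    }
    // Epilogue: consume the final staged tile.
    for (int k = 0; k < TILE; ++k)
        acc += As[buf][threadIdx.y][k] * Bs[buf][k][threadIdx.x];
    if (row < M && col < N) C[row * N + col] = acc;
}
```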

4. Register Blocking

Each thread computes a TM×TN tile:

Thread registers: [TM×TN output values]
                  [TM values from A]
                  [TN values from B]
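The inner loop then looks roughly like the fragment below (a sketch, not repository code; `As`, `Bs`, `BK`, `thread_row`, and `thread_col` are assumed shared-memory tiles and thread coordinates from the surrounding kernel). Per k-step a thread loads TM + TN values but performs TM×TN FMAs, raising the compute-to-load ratio from 1 to TM·TN/(TM+TN).

```cuda
// Register-blocked inner loop fragment (inside a tiled GEMM kernel).
const int TM = 8, TN = 8;
float acc[TM][TN] = {{0.0f}};   // per-thread output tile, kept in registers
float a_reg[TM], b_reg[TN];

for (int k = 0; k < BK; ++k) {
    // Stage one column of A's tile and one row of B's tile into registers.
    for (int i = 0; i < TM; ++i) a_reg[i] = As[thread_row * TM + i][k];
    for (int j = 0; j < TN; ++j) b_reg[j] = Bs[k][thread_col * TN + j];
    // TM*TN FMAs fed entirely from registers.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            acc[i][j] += a_reg[i] * b_reg[j];
}
```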

5. Kernel Fusion

Separate: GEMM → Store → Load → Bias → Store → Load → ReLU
Fused:    GEMM → Bias → ReLU (single kernel)

Eliminates intermediate memory traffic.
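Sketched on the naive kernel for clarity (illustrative; the repository's `fused_gemm.cu` fuses the epilogue into the optimized kernel), the point is that bias and activation are applied to the accumulator while it is still in a register, so there is exactly one global store per output element:

```cuda
// Fused GEMM + bias + ReLU sketch: the epilogue never round-trips to memory.
__global__ void gemm_bias_relu(const float* A, const float* B,
                               const float* bias, float* C,
                               int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        acc += bias[col];                     // fused bias add (in registers)
        C[row * N + col] = fmaxf(acc, 0.0f);  // fused ReLU, then single store
    }
}
```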

6. Vectorized Loads

float4 a = *reinterpret_cast<float4*>(&A[idx]);  // 128-bit load

4x fewer memory transactions.
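In kernel form (an illustrative sketch, assuming n is divisible by 4 and the pointers are 16-byte aligned, which `cudaMalloc` guarantees for the base pointer):

```cuda
// float4 copy sketch: one 128-bit transaction where four 32-bit ones were.
__global__ void copy_float4(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 4 < n) {  // assumes n % 4 == 0 and 16-byte-aligned pointers
        float4 v = *reinterpret_cast<const float4*>(in + i * 4);
        *reinterpret_cast<float4*>(out + i * 4) = v;
    }
}
```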

🔬 Performance Tuning Guide

Choosing Block Sizes

Matrix Size       | Recommended BM×BN×BK
------------------|---------------------
Small (<512)      | 64×64×8
Medium (512-2048) | 128×128×8
Large (>2048)     | 128×256×16

GPU Architecture Considerations

Architecture    | Recommended Settings
----------------|----------------------------------------
Volta (SM 7.0)  | BM=128, BN=128, use_tensor_cores=false
Turing (SM 7.5) | BM=128, BN=128, use_tensor_cores=true
Ampere (SM 8.0) | BM=128, BN=256, use_async_copy=true
Ada (SM 8.9)    | BM=128, BN=256, use_async_copy=true

Memory Optimization

  1. Alignment: Ensure 16-byte (128-bit) alignment for float4 vectorized loads
  2. Bank Conflicts: Pad shared memory to avoid conflicts
  3. Occupancy: Balance registers vs. threads per block
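Point 2 is worth a concrete sketch (illustrative, not repository code). Shared memory has 32 banks of 4 bytes; in a 32×32 float tile, a warp reading one column (`tile[tx][k]` with tx varying) hits the same bank 32 times. Padding each row to 33 floats skews consecutive rows across banks, as in this classic tile-transpose:

```cuda
// Bank-conflict padding sketch: the "+1" column makes column reads
// conflict-free. Assumes a 32x32 thread block and an n x n matrix.
__global__ void transpose_tile(const float* in, float* out, int n) {
    __shared__ float tile[32][33];  // 33, not 32: skews rows across banks
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    // Column read below would be a 32-way conflict without the padding.
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```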

📊 Benchmarking Tips

# Profile with NVIDIA Nsight
nsys profile ./build/benchmark

# Detailed kernel analysis
ncu --set full ./build/benchmark

๐Ÿค Contributing

Contributions are welcome! Areas for improvement:

  • Tensor Core support (WMMA)
  • Multi-GPU support
  • More activation functions
  • ONNX model loading
  • INT4 quantization

📄 License

MIT License

🙏 Acknowledgments

This project is inspired by:

  • NVIDIA CUTLASS library
  • Simon Boehm's CUDA optimization tutorials
  • Various GPU programming courses and papers
