A High-Performance CUDA GEMM Optimization Learning Project
A lightweight neural network inference engine focused on GEMM (General Matrix Multiply) optimization. This project demonstrates progressive optimization techniques to achieve high-performance matrix multiplication on NVIDIA GPUs, serving as an excellent learning resource for GPU programming and deep learning system optimization.
- Educational: Learn GPU optimization techniques step by step
- Practical: Achieve 70-80% of cuBLAS performance
- Complete: Full inference pipeline from weights to predictions
| Level | Technique | Description | Expected Speedup |
|---|---|---|---|
| 1 | Naive | Baseline implementation | 1x |
| 2 | Tiled | Shared memory tiling | 5-10x |
| 3 | Coalesced | Memory coalescing | +20% |
| 4 | Double Buffer | Latency hiding | +15% |
| 5 | Register Blocked | Register-level tiling | +50% |
| 6 | Fused | Kernel fusion | +30% |
| 7 | Vectorized | float4 loads | +10% |
- Half Precision (FP16): Mixed precision with FP32 accumulation
- INT8 Quantization: Weight compression and calibration
- Auto-Tuner: Automatic kernel selection
- Stream Manager: Multi-stream concurrent execution
- Batched GEMM: Parallel matrix multiplications
- Memory Pool: Efficient GPU memory management
- Profiler: Roofline model analysis
- CUDA Toolkit 11.0+ (tested with 13.1)
- CMake 3.18+
- C++17 compatible compiler
- NVIDIA GPU with Compute Capability 7.5+ (Turing, Ampere, Ada, Hopper)
```bash
# Clone the repository
git clone https://github.com/yourusername/mini-inference-engine.git
cd mini-inference-engine

# Build with tests
mkdir build && cd build
cmake ..
make -j$(nproc)

# Build without tests (faster, no network required)
cmake -DBUILD_TESTS=OFF ..
make -j$(nproc)
```

Run the benchmark:

```bash
./build/benchmark
```

Sample output:
```
════════════════════════════════════════════════════════════════
             Mini-Inference Engine GEMM Benchmark
════════════════════════════════════════════════════════════════
GPU: NVIDIA GeForce RTX 3060
────────────────────────────────────────────────────────────────
Matrix Size: M=1024, N=1024, K=1024 (2.15 GFLOPs)
────────────────────────────────────────────────────────────────
Kernel     |   Time    |   Performance    | vs cuBLAS
────────────────────────────────────────────────────────────────
cuBLAS     | 0.120 ms  | 17916.67 GFLOPS  |  100.0%
Naive      | 2.450 ms  |   877.55 GFLOPS  |    4.9%
Tiled      | 0.890 ms  |  2415.73 GFLOPS  |   13.5%
Optimized  | 0.165 ms  | 13030.30 GFLOPS  |   72.7%
────────────────────────────────────────────────────────────────
```
```bash
# Run the MNIST demo
./build/mnist_demo

# Detailed per-kernel analysis
./build/detailed_benchmark

# Run the test suite
./build/tests
```

```
mini-inference-engine/
├── include/
│   ├── common.h              # Core data structures
│   ├── kernels.cuh           # CUDA kernel declarations
│   ├── inference_engine.h    # Inference engine interface
│   ├── tensor.h              # N-dimensional tensor class
│   ├── memory_pool.h         # GPU memory pool
│   ├── stream_manager.h      # CUDA stream management
│   ├── batch_gemm.h          # Batched GEMM operations
│   ├── quantization.h        # INT8 quantization
│   ├── vectorized_gemm.cuh   # Vectorized GEMM
│   ├── half_gemm.cuh         # FP16 GEMM
│   ├── profiler.h            # Performance profiler
│   ├── autotuner.h           # Auto-tuner
│   ├── logger.h              # Logging system
│   └── config.h              # Configuration management
├── src/
│   ├── naive_matmul.cu       # Level 1: Naive
│   ├── tiled_gemm.cu         # Level 2: Tiled
│   ├── coalesced_gemm.cu     # Level 3: Coalesced
│   ├── double_buffer_gemm.cu # Level 4: Double buffer
│   ├── optimized_gemm.cu     # Level 5: Register blocked
│   ├── fused_gemm.cu         # Level 6: Fused
│   ├── vectorized_gemm.cu    # Level 7: Vectorized
│   ├── half_gemm.cu          # FP16 implementation
│   ├── tensor.cu             # Tensor operations
│   ├── benchmark.cu          # Benchmark utilities
│   └── inference_engine.cpp  # Inference engine
├── tests/
│   ├── test_gemm.cpp         # GEMM correctness
│   ├── test_fusion.cpp       # Fusion tests
│   ├── test_tensor.cpp       # Tensor tests
│   ├── test_memory_pool.cpp  # Memory pool tests
│   ├── test_quantization.cpp # Quantization tests
│   └── ...
├── benchmarks/
│   ├── benchmark.cpp         # Main benchmark
│   ├── detailed_benchmark.cu # Detailed analysis
│   └── mnist_demo.cpp        # MNIST demo
├── scripts/
│   └── export_mnist_weights.py
└── CMakeLists.txt
```
Global Memory → Shared Memory → Registers → Compute

Reduces global memory accesses from O(M×N×K) to O(M×N×K/TILE_SIZE).
Ensures threads in a warp access consecutive memory addresses:
```
Thread 0 → Address 0
Thread 1 → Address 4
Thread 2 → Address 8
...
```
```
Step i:   Buffer A: load tile[i+1]    | Buffer B: compute tile[i]
Step i+1: Buffer A: compute tile[i+1] | Buffer B: load tile[i+2]
```
Overlaps computation with memory transfers.
Each thread computes a TM×TN tile:

```
Thread registers: [TM×TN output values]
                  [TM values from A]
                  [TN values from B]
```
```
Separate: GEMM → Store → Load → Bias → Store → Load → ReLU
Fused:    GEMM → Bias → ReLU (single kernel)
```
Eliminates intermediate memory traffic.
```cuda
float4 a = *reinterpret_cast<float4*>(&A[idx]); // 128-bit load
```

4x fewer memory transactions.
| Matrix Size | Recommended BM×BN×BK |
|---|---|
| Small (<512) | 64×64×8 |
| Medium (512-2048) | 128×128×8 |
| Large (>2048) | 128×256×16 |
| Architecture | Recommended Settings |
|---|---|
| Volta (SM 7.0) | BM=128, BN=128, use_tensor_cores=false |
| Turing (SM 7.5) | BM=128, BN=128, use_tensor_cores=true |
| Ampere (SM 8.0) | BM=128, BN=256, use_async_copy=true |
| Ada (SM 8.9) | BM=128, BN=256, use_async_copy=true |
- Alignment: Ensure 128-byte alignment for vectorized loads
- Bank Conflicts: Pad shared memory to avoid conflicts
- Occupancy: Balance registers vs threads per block
```bash
# Profile with NVIDIA Nsight Systems
nsys profile ./build/benchmark

# Detailed kernel analysis with Nsight Compute
ncu --set full ./build/benchmark
```

Contributions are welcome! Areas for improvement:
- Tensor Core support (WMMA)
- Multi-GPU support
- More activation functions
- ONNX model loading
- INT4 quantization
- Quick Start Guide
- Architecture
- GEMM Optimization
- Performance Tuning
- API Reference
- Contributing
MIT License
This project is inspired by:
- NVIDIA CUTLASS library
- Simon Boehm's CUDA optimization tutorials
- Various GPU programming courses and papers