[Draft] Add GPU BFloat16 (BF16) support with device-side conversion #25
Overview
This PR adds native BFloat16 (BF16) support to Tiled-MM for GPU backends (CUDA and ROCm), enabling mixed-precision GEMM operations with hardware-accelerated Tensor Core/Matrix Core execution.
Motivation
Modern GPUs (NVIDIA Ampere and newer, AMD CDNA2 and newer) provide hardware-accelerated BF16 compute with 2-8× throughput improvements over FP32.
Tiled-MM currently supports only FP32/FP64 GEMM on GPU. This PR enables BF16 input/output with FP32 accumulation, matching the industry-standard mixed-precision pattern used by PyTorch, TensorFlow, and other frameworks.
Changes Summary
1. BF16 Conversion Kernels (bf16_convert.{hpp,cu,hip})
New files:
bf16_convert.hpp (69 lines): Cross-platform API for FP32 ↔ BF16 conversion
bf16_convert.cu (104 lines): CUDA implementation using the __float2bfloat16 intrinsic
bf16_convert.hip (109 lines): ROCm implementation using the float_to_bfloat16 intrinsic
Key features:
API:
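The concrete signatures live in bf16_convert.hpp and are not reproduced in this extract. Below is a minimal sketch of what the CUDA side could look like; the kernel and launcher names are illustrative, and only the __float2bfloat16 intrinsic is named in the PR (the reverse __bfloat162float intrinsic is its standard CUDA counterpart):

```cuda
#include <cuda_bf16.h>
#include <cstddef>

// FP32 -> BF16, one element per thread (round-to-nearest-even intrinsic).
__global__ void fp32_to_bf16_kernel(const float* in, __nv_bfloat16* out, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = __float2bfloat16(in[i]);
}

// BF16 -> FP32, used e.g. to up-convert data before FP32 accumulation.
__global__ void bf16_to_fp32_kernel(const __nv_bfloat16* in, float* out, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = __bfloat162float(in[i]);
}

// Host-side launchers: one thread per element on the caller's stream.
inline void convert_fp32_to_bf16(const float* d_in, __nv_bfloat16* d_out,
                                 std::size_t n, cudaStream_t stream) {
    const int threads = 256;
    const int blocks  = static_cast<int>((n + threads - 1) / threads);
    fp32_to_bf16_kernel<<<blocks, threads, 0, stream>>>(d_in, d_out, n);
}

inline void convert_bf16_to_fp32(const __nv_bfloat16* d_in, float* d_out,
                                 std::size_t n, cudaStream_t stream) {
    const int threads = 256;
    const int blocks  = static_cast<int>((n + threads - 1) / threads);
    bf16_to_fp32_kernel<<<blocks, threads, 0, stream>>>(d_in, d_out, n);
}
```

The ROCm path in bf16_convert.hip would mirror these kernels with the float_to_bfloat16 intrinsic and hip_bfloat16 type.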
2. GEMM Wrapper Integration (tiled_mm.cpp)
New wrapper function:
Execution flow (see the sketch below):
cublas_gemm_wrapper_bf16 (BF16 × BF16 → FP32 via Tensor Cores)
Template instantiation:
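The wrapper and instantiation listings are not reproduced in this extract. As a non-authoritative sketch of the documented flow (BF16 operands, a temporary FP32 output buffer, FP32 accumulation on Tensor Cores), one way to realize this with the public cuBLAS API is cublasGemmEx; whether the PR uses this exact entry point is not shown here, and the helper names and column-major contiguous layout below are assumptions:

```cuda
#include <cublas_v2.h>
#include <cuda_bf16.h>
#include <cstddef>

// Conversion launchers as sketched in the kernel section above (illustrative names).
void convert_fp32_to_bf16(const float*, __nv_bfloat16*, std::size_t, cudaStream_t);
void convert_bf16_to_fp32(const __nv_bfloat16*, float*, std::size_t, cudaStream_t);

// BF16 x BF16 -> FP32 GEMM, result written back to a BF16 C buffer.
inline void bf16_gemm_flow_sketch(cublasHandle_t handle, cudaStream_t stream,
                                  int m, int n, int k, float alpha, float beta,
                                  const __nv_bfloat16* d_a,   // m x k, lda = m
                                  const __nv_bfloat16* d_b,   // k x n, ldb = k
                                  __nv_bfloat16* d_c) {       // m x n, ldc = m
    cublasSetStream(handle, stream);

    // Per-call temporary FP32 output buffer (see Memory Management below).
    float* c_fp32_device = nullptr;
    cudaMalloc(&c_fp32_device, std::size_t(m) * n * sizeof(float));

    // If beta != 0, the existing C must be up-converted to take part in FP32 accumulation.
    if (beta != 0.0f)
        convert_bf16_to_fp32(d_c, c_fp32_device, std::size_t(m) * n, stream);

    // Mixed-precision GEMM on Tensor Cores: BF16 inputs, FP32 output and compute type.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 d_a, CUDA_R_16BF, m,
                 d_b, CUDA_R_16BF, k,
                 &beta,
                 c_fp32_device, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    // Down-convert the FP32 result into the caller's BF16 buffer and release the temp.
    convert_fp32_to_bf16(c_fp32_device, d_c, std::size_t(m) * n, stream);
    cudaStreamSynchronize(stream);
    cudaFree(c_fp32_device);
}
```

On ROCm, the analogous call would presumably go through rocblas_gemm_ex with BF16 input types and FP32 output/compute types.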
3. Build System Integration (CMakeLists.txt)
Conditional compilation:
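The actual CMake changes are not shown in this extract. For illustration only, code that depends on the new path can be gated on the compile definition the build is expected to export, here assumed to be the TILED_MM_HAS_BF16_SUPPORT flag mentioned under Breaking Changes:

```cuda
// Hypothetical guard: the BF16 path compiles only when the build defines
// TILED_MM_HAS_BF16_SUPPORT; otherwise the existing FP32 path is used unchanged.
#ifdef TILED_MM_HAS_BF16_SUPPORT
  #include <cuda_bf16.h>
  using gemm_scalar_t = __nv_bfloat16;   // BF16 storage when support is built in
#else
  using gemm_scalar_t = float;           // fall back to the existing FP32 path
#endif
```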
4. GPU BLAS API Header (gpu_blas_api.hpp)
Unified type definitions:
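A sketch of what a unified alias might look like; the TILED_MM_CUDA / TILED_MM_ROCM guards are assumptions for illustration, and the alias name follows the bf16_convert::BF16Type spelling used later in this description:

```cuda
// One BF16 alias that maps to the vendor type on each backend (guards are illustrative).
#if defined(TILED_MM_CUDA)
  #include <cuda_bf16.h>
  namespace bf16_convert { using BF16Type = __nv_bfloat16; }
#elif defined(TILED_MM_ROCM)
  #include <hip/hip_bfloat16.h>
  namespace bf16_convert { using BF16Type = hip_bfloat16; }
#endif
```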
Technical Details
Mixed Precision Pattern
Why this pattern:
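For reference, each output element follows the usual GEMM update, with products of BF16 operands accumulated in FP32:

$$
C_{ij} \;=\; \beta\,C_{ij} \;+\; \alpha \sum_{k} \mathrm{fp32}\!\left(A^{\mathrm{bf16}}_{ik}\right)\,\mathrm{fp32}\!\left(B^{\mathrm{bf16}}_{kj}\right)
$$

so the operands are stored and moved at half the size of FP32 while the summation itself keeps FP32 precision.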
Memory Management
Current implementation:
A temporary FP32 output buffer is allocated per call: cudaMalloc(&c_fp32_device, m * n * sizeof(float))
Future optimization:
Reuse pre-allocated device buffers held by mm_handle<bf16_convert::BF16Type>
Hardware Requirements
NVIDIA: Ampere or newer (e.g. A100)
AMD: CDNA2 or newer (e.g. MI200 series)
Performance Characteristics
Expected Speedup
Memory Savings
Permanent storage: 50% reduction (BF16 vs FP32)
Temporary during GEMM: 2× overhead (FP32 output buffer)
Net benefit: 17% memory savings during computation, 50% at rest
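One way to arrive at these figures (an assumption about the underlying arithmetic, counting bytes per element for square operands with A, B, C stored in BF16 plus the temporary FP32 copy of C during the GEMM):

$$
\frac{2+2+2+4}{4+4+4} = \frac{10}{12} \approx 83\% \ \text{(peak, during GEMM)}, \qquad
\frac{2+2+2}{4+4+4} = \frac{6}{12} = 50\% \ \text{(at rest)}
$$

i.e. roughly 17% savings at the computational peak, and 50% once the temporary buffer is freed.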
Conversion Overhead
Kernel launch: ~5-10 μs (negligible)
Throughput: ~1 TB/s on A100/MI200
8192×8192 matrix: 256 MB → ~0.25 ms conversion time
GEMM time: ~10-50 ms (matrix size dependent)
Overhead: <1% for large matrices
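The conversion estimate is simply the buffer size divided by the conversion throughput, e.g. for the 8192×8192 FP32 case above:

$$
t_{\mathrm{conv}} \approx \frac{256\ \mathrm{MB}}{1\ \mathrm{TB/s}} \approx 0.25\ \mathrm{ms}
$$

which is small compared with the tens of milliseconds the GEMM itself takes at that size.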
Integration with Downstream Projects
COSMA Integration
This PR is part of a broader effort to add BF16 support to COSMA. The integration flow:
Build Integration
Downstream projects enable BF16 support via CMake:
Testing Status
Requires GPU hardware (NVIDIA Ampere+ or AMD CDNA2+)
Planned tests:
Integration tests:
Known Limitations
Memory allocation: Per-call allocation (not optimal for small matrices); buffers are not yet reused via mm_handle
Complex types: No complex<bfloat16> support
Hardware detection: No runtime check for Tensor Core availability
Error handling: Basic CUDA error checks
Breaking Changes
None. This PR is purely additive:
BF16 code paths are opt-in (guarded by the TILED_MM_HAS_BF16_SUPPORT flag)
Checklist
Related Work
COSMA BF16 Support:
Industry References:
Request for Review
This PR is marked as DRAFT pending:
Questions for reviewers:
Author
David Sanftenberg (@dbsanfte)
Email: david.sanftenberg@gmail.com
Status: 🚧 DRAFT - Implementation complete, testing pending GPU hardware access