A lightweight C++ implementation of the Brain Floating Point (bfloat16) format.
bfloat16 is a 16-bit floating-point format developed by Google Brain for machine learning applications. It preserves the dynamic range of 32-bit floating point (float32) by using the same 8-bit exponent, while reducing precision to 7 stored mantissa bits (compared to 23 in float32).
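To illustrate the trade-off, the standalone sketch below converts a float32 to bfloat16 by simply keeping the upper 16 bits and then widening it back. This is only an illustration of the format, not necessarily the conversion policy used by this library (which may round to nearest even rather than truncate):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Truncate a float32 to bfloat16 by keeping only the upper 16 bits.
// (Truncation keeps the example short; round-to-nearest-even is common in practice.)
std::uint16_t float_to_bfloat16_truncate(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));   // reinterpret the float's bit pattern
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widen a bfloat16 bit pattern back to float32 by zero-filling the low 16 bits.
float bfloat16_to_float(std::uint16_t b) {
    std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159f;
    float roundtrip = bfloat16_to_float(float_to_bfloat16_truncate(x));
    // Prints roughly "3.141590 -> 3.140625": the 7-bit mantissa keeps
    // only about 2-3 decimal digits of precision, but the exponent
    // (and therefore the representable range) is unchanged.
    std::printf("%f -> %f\n", x, roundtrip);
    return 0;
}
```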
- Header-only C++ implementation
- Conversions between bfloat16 and float32
- Basic arithmetic operations
- Standard mathematical functions
- IEEE 754 compatibility
This implementation represents bfloat16 as:
- 1 sign bit
- 8 exponent bits (same as float32)
- 7 mantissa bits
The format provides a balance between range and precision that is particularly suitable for neural network training and inference.
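To make the 1/8/7 layout concrete, here is a small standalone sketch (not part of the library's API) that decodes a bfloat16 bit pattern into its sign, exponent, and mantissa fields and reconstructs the value of a normal number:

```cpp
#include <cstdint>
#include <cmath>
#include <cstdio>

// Split a bfloat16 bit pattern into its three fields:
// 1 sign bit, 8 exponent bits, 7 mantissa bits.
void decode_bfloat16(std::uint16_t b) {
    std::uint16_t sign     = (b >> 15) & 0x1;
    std::uint16_t exponent = (b >> 7)  & 0xFF;  // same 8-bit field as float32
    std::uint16_t mantissa =  b        & 0x7F;  // 7 explicit fraction bits

    // For normal numbers: value = (-1)^sign * 2^(exponent - 127) * (1 + mantissa/128).
    double value = (sign ? -1.0 : 1.0) *
                   std::ldexp(1.0 + mantissa / 128.0, exponent - 127);

    std::printf("sign=%u exponent=%u mantissa=%u value=%g\n",
                sign, exponent, mantissa, value);
}

int main() {
    // 0x4049 encodes 3.140625 (pi truncated to 7 mantissa bits).
    decode_bfloat16(0x4049);
    return 0;
}
```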
This is a header-only library: add the header files to your include path and include them in your project; no separate compilation or linking step is required.
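A minimal usage sketch is shown below. The header name `bfloat16.h`, the type name `bfloat16`, and the implicit float conversions are assumptions for illustration and may differ from the actual API in this repository:

```cpp
// Hypothetical usage sketch: header and type names are assumed, not confirmed.
#include "bfloat16.h"

#include <cstdio>

int main() {
    bfloat16 a = 1.5f;                       // construct from float32 (assumed)
    bfloat16 b = 2.25f;

    bfloat16 sum = a + b;                    // basic arithmetic in bfloat16
    float back   = static_cast<float>(sum);  // widen back to float32

    std::printf("sum = %f\n", back);
    return 0;
}
```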
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Brain bfloat16 specification
- TensorFlow bfloat16 implementation
- NVIDIA bfloat16 implementation
Contributions are welcome! Please feel free to submit a Pull Request.