Needed for fp8 training. It also adds fp16/bf16 optimizations for Ampere and newer architectures, which we can use whether or not fp8 is enabled. https://github.com/EleutherAI/TransformerEngine