# LLaMA-Factory FP8 Training

Production-ready FP8 training for LLaMA-Factory on NVIDIA Hopper/Blackwell GPUs (H100, H200, B200).
## Quick Start (Docker)

```bash
# Build
docker build -t llamafactory-fp8:latest .

# Run
docker run --gpus all --ipc=host \
  -v $(pwd)/checkpoints:/workspace/checkpoints \
  -v $(pwd)/configs:/workspace/configs \
  -v /tmp:/tmp \
  -it llamafactory-fp8:latest bash

# Inside the container, train with FP8
bash /workspace/scripts/train_fp8_llamafactory.sh /workspace/configs/qwen_7b_fp8_b200.yaml
```

## Quick Start (Remote Server)

```bash
# One-command install
wget -O ~/install_fp8.sh https://raw.githubusercontent.com/sbhavani/llamafactory-fp8-hopper/master/install_fp8_fixed.sh
bash ~/install_fp8.sh
# Use it
source ~/llamafactory-fp8/setup.sh
cd ~/llamafactory-fp8/LLaMA-Factory
bash scripts/train_fp8.sh configs/qwen_7b_fp8_b200.yaml
```

See `REMOTE_SETUP.md` for details.
## How It Works

FP8 is enabled via Accelerate config files (the official Hugging Face approach):
```bash
accelerate launch --config_file configs/accelerate_fp8.yaml \
  llamafactory-cli train configs/my_model.yaml
```

Your LLaMA-Factory config only needs:
```yaml
fp8: true
fp8_backend: te
bf16: true  # BF16 for non-FP8 ops
```

## Configs

Accelerate configs:
- `configs/accelerate_fp8.yaml` - FP8 configuration (sketched below)
- `configs/accelerate_bf16.yaml` - BF16 baseline
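The FP8 config is a standard `accelerate launch` config file. The sketch below is illustrative only, not the shipped file: the sub-keys of `fp8_config` vary across Accelerate versions, so treat the repo's `configs/accelerate_fp8.yaml` as authoritative.

```yaml
# Illustrative sketch -- NOT the shipped configs/accelerate_fp8.yaml.
# FP8 sub-keys vary by Accelerate version; check the repo file.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp8    # ask Accelerate for FP8 mixed precision
fp8_config:
  backend: TE           # TransformerEngine kernels (matches fp8_backend: te)
  fp8_format: HYBRID    # E4M3 in the forward pass, E5M2 in the backward pass
num_machines: 1
num_processes: 8        # one process per GPU
```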
Training configs:

- `configs/qwen_7b_fp8_b200.yaml` - optimized for B200 (192GB); a sketch follows this list
- `configs/qwen_7b_bf16_b200.yaml` - BF16 baseline
- see `configs/` for more examples
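For orientation, here is a hedged sketch of a `qwen_7b_fp8_b200.yaml`-style LLaMA-Factory training config. The model id, dataset, and hyperparameters are placeholders (assumptions, not the shipped values); only the precision block at the bottom is what this setup actually requires.

```yaml
# Hypothetical sketch of a qwen_7b_fp8_b200.yaml-style config.
# Everything except the precision block is a placeholder.
model_name_or_path: Qwen/Qwen2-7B-Instruct  # assumed model id
stage: sft
do_train: true
finetuning_type: full
dataset: alpaca_en_demo                     # placeholder dataset
template: qwen
cutoff_len: 2048
output_dir: /workspace/checkpoints/qwen7b_fp8

per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 1.0
logging_steps: 10
save_steps: 500

# Precision: the keys this setup requires
fp8: true
fp8_backend: te
bf16: true                                  # BF16 for non-FP8 ops
```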
## Performance

Expected results on H100/B200 for 7B+ models:
| Metric | BF16 | FP8 | Improvement |
|---|---|---|---|
| Speed | 1.0x | 1.3-1.5x | 30-50% faster |
| Memory | 100% | 70-80% | 20-30% saved |
## Fork

Uses a fork with Accelerate config support: `sbhavani/LLaMA-Factory`, branch `fix/accelerate-config-support`.

Key change: the fork detects when Accelerate is already configured and skips the conflicting setup.
## Documentation

- `REMOTE_SETUP.md` - complete remote server setup guide
- `LLAMAFACTORY_USER_GUIDE.md` - guide for existing LLaMA-Factory users
- `AUTHENTICATION.md` - Hugging Face authentication
See the `.llm/` directory for:
- Bug reports and feature requests
- Implementation details
- Testing procedures
## Requirements

- NVIDIA H100, H200, B200, or RTX 4090
- CUDA 12.1+
- Python 3.10+
- 80GB+ VRAM (for 7B models)
## License

Same as LLaMA-Factory (Apache 2.0).