Fast-LLM is a new open-source library for training large language models, built on PyTorch and Triton. It is extremely fast, scales to large clusters, supports a wide range of model architectures, and is easy to use. Unlike frameworks such as Megatron-LM, whose development is largely closed off and fragmented across forks, Fast-LLM is fully open-source and encourages community-driven development. Researchers can freely customize and optimize it as needed, making it a flexible, hackable alternative that combines the speed of specialized tools with the openness of libraries like Hugging Face Transformers.
Note
Fast-LLM is not affiliated with Fast.AI, FastHTML, FastAPI, FastText, or other similarly named projects. Our library's name refers to its speed and efficiency in language model training.
- 🚀 Fast-LLM is Blazingly Fast:
    - ⚡️ Optimized kernel efficiency and reduced overheads.
    - 🔋 Optimized memory usage for best performance.
    - ⏳ Minimizes training time and cost.
- 📈 Fast-LLM is Highly Scalable:
    - 📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
    - 🔗 Supports sequence length parallelism to handle longer sequences effectively.
    - 🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency.
    - 🎛️ Mixed precision training support for better performance.
    - 🏋️‍♂️ Large batch training and gradient accumulation support.
    - 🔄 Reproducible training with deterministic behavior.
- 🎨 Fast-LLM is Incredibly Flexible:
    - 🤖 Compatible with all common language model architectures in a unified class.
    - ⚡ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance.
    - 🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress).
    - 🤗 Seamless integration with Hugging Face Transformers.
- 🎯 Fast-LLM is Super Easy to Use:
    - 📦 Pre-built Docker images for quick deployment.
    - 📝 Simple YAML configuration for hassle-free setup.
    - 💻 Command-line interface for easy launches.
    - 📊 Detailed logging and real-time monitoring features.
    - 📚 Extensive documentation and practical tutorials (in progress).
- 🌐 Fast-LLM is Truly Open Source:
    - ⚖️ Licensed under Apache 2.0 for maximum freedom to use Fast-LLM at work, in your projects, or for research.
    - 💻 Fully developed on GitHub with a public roadmap and transparent issue tracking.
    - 🤝 Contributions and collaboration are always welcome!
We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs, with example setups for both a Slurm cluster and a Kubernetes cluster.
For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
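To give a sense of what such a config contains, here is a minimal, purely illustrative sketch assembled only from the details of this demo. The field names below are assumptions, not the actual Fast-LLM schema; `examples/mistral-4-node-benchmark.yaml` is the authoritative reference.

```yaml
# Illustrative sketch only -- field names are assumptions, not the real Fast-LLM schema.
# See examples/mistral-4-node-benchmark.yaml for the actual, pre-configured settings.
model:
  name: mistral-7b        # Mistral-7B, trained from scratch
training:
  train_steps: 100        # the demo trains for 100 steps
batch:
  batch_size: 32          # batch size used in the benchmark below
  sequence_length: 8192   # 8k-token sequences
data:
  source: random          # the demo uses randomly generated data
```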
Note
Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
Expect a significant reduction in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of 9,800 tokens/s/H100 (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
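To put that number in perspective: 9,800 tokens/s per GPU across 32 H100s works out to roughly 313,600 tokens/s for the whole cluster, so a batch of 32 sequences at 8k tokens (about 262k tokens, assuming the batch size is global) should complete in under a second.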
- A Slurm cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
- CUDA 12.1 or higher.
- Dependencies: PyTorch, Triton, and Apex installed on all nodes.
- Deploy the `nvcr.io/nvidia/pytorch:24.07-py3` Docker image to all nodes (recommended), because it contains all the necessary dependencies.
- Install Fast-LLM on all nodes:

      sbatch <<EOF
      #!/bin/bash
      # One task per node, so the install runs once on every node.
      # The node count is filled in by your shell at submission time.
      #SBATCH --nodes=$(scontrol show node | grep -c NodeName)
      #SBATCH --ntasks-per-node=1
      #SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
      #SBATCH --exclusive
      srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
      EOF
- Use the example Slurm job script `examples/fast-llm.sbat` to submit the job to the cluster:

      sbatch examples/fast-llm.sbat
- Monitor the job's progress:
    - Logs: Follow `job_output.log` and `job_error.log` in your working directory.
    - Status: Use `squeue -u $USER` to see the job status.
Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
- A Kubernetes cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
- Kubeflow installed.
- Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
- Create a Kubernetes PersistentVolumeClaim (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container, using `examples/fast-llm-pvc.yaml`:

      kubectl apply -f examples/fast-llm-pvc.yaml
- Create a PyTorchJob resource using the example configuration file `examples/fast-llm.pytorchjob.yaml` (a rough sketch of this manifest and the PVC follows this list):

      kubectl apply -f examples/fast-llm.pytorchjob.yaml
- Monitor the job status:
    - Use `kubectl get pytorchjobs` to see the job status.
    - Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
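For orientation, here is a rough sketch of what the two manifests might look like, assembled only from the details above: a PVC named `fast-llm-home` mounted at `/home/fast-llm`, 4 nodes with 8 GPUs each, and a container named `pytorch` in a job named `fast-llm`. It is not the content of the actual example files: the storage size, access mode, image, and launch command are assumptions, and `examples/fast-llm-pvc.yaml` and `examples/fast-llm.pytorchjob.yaml` remain the source of truth.

```yaml
# Illustrative sketch only -- see examples/fast-llm-pvc.yaml and
# examples/fast-llm.pytorchjob.yaml for the actual, tested manifests.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-llm-home                    # PVC name referenced by the job below
spec:
  accessModes:
    - ReadWriteMany                      # assumption: shared by all pods of the job
  resources:
    requests:
      storage: 100Gi                     # assumption: size it for your datasets and checkpoints
---
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: fast-llm                         # matches the fast-llm-master-0 pod name used above
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch              # container name expected by the kubectl logs command
              image: nvcr.io/nvidia/pytorch:24.07-py3  # assumption: same image as the Slurm setup
              # command/args for launching Fast-LLM are omitted here -- see the actual example file
              resources:
                limits:
                  nvidia.com/gpu: 8      # 8 GPUs per DGX node
              volumeMounts:
                - name: home
                  mountPath: /home/fast-llm
          volumes:
            - name: home
              persistentVolumeClaim:
                claimName: fast-llm-home
    Worker:
      replicas: 3                        # 4 nodes total: 1 master + 3 workers
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.07-py3  # assumption
              resources:
                limits:
                  nvidia.com/gpu: 8
              volumeMounts:
                - name: home
                  mountPath: /home/fast-llm
          volumes:
            - name: home
              persistentVolumeClaim:
                claimName: fast-llm-home
```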
That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
📖 Want to learn more? Check out our documentation for detailed guides on using Fast-LLM.
🔨 We welcome contributions to Fast-LLM! Have a look at our contribution guidelines.
🐞 Something doesn't work? Open an issue!
Fast-LLM is licensed by ServiceNow, Inc. under the Apache 2.0 License. See LICENSE for more information.
For security issues, email disclosure@servicenow.com. See our security policy.