Continuation from #49 #50

Merged
103 commits merged on Dec 2, 2024

Changes from 1 commit

Commits (103)
10084f5
misc updates
tscholak Oct 22, 2024
91cd526
revamp landing page
tscholak Oct 22, 2024
df5e09d
add about us section
tscholak Oct 22, 2024
585bb89
add developers corner
tscholak Oct 22, 2024
c6c2a29
Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/i…
tscholak Oct 22, 2024
62a3b22
add docs README
tscholak Oct 22, 2024
257ba2d
improve landing page
tscholak Oct 22, 2024
a99d56d
improve landing page
tscholak Oct 22, 2024
dc8e1e6
improve landing page
tscholak Oct 22, 2024
4b47d51
add cost-efficiency comparison
tscholak Oct 22, 2024
fb7d3cc
refinements
tscholak Oct 22, 2024
91262bb
refinements
tscholak Oct 22, 2024
293d3b6
refinements
tscholak Oct 23, 2024
efde9b1
refinements
tscholak Oct 23, 2024
74b8ea7
linting
tscholak Oct 23, 2024
82adb54
rework cost efficiency comparison
tscholak Oct 23, 2024
f2beb4c
rework cost efficiency comparison
tscholak Oct 23, 2024
9a2397d
add devenv
tscholak Oct 23, 2024
e4230ea
add devenv
tscholak Oct 23, 2024
a1fa251
revamp structure
tscholak Oct 23, 2024
8903632
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Oct 23, 2024
9f887c1
add quick-start guide
tscholak Oct 26, 2024
4f40ba9
Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/i…
tscholak Oct 26, 2024
0ab5b62
add prepare-dataset script
tscholak Oct 27, 2024
79b8401
rewrite quick-start guide
tscholak Oct 27, 2024
8b6ef7b
rewrite quick-start guide
tscholak Oct 27, 2024
e04d5eb
add support for distributed data preparation
tscholak Oct 27, 2024
4f68378
rewrite quick-start guide
tscholak Oct 27, 2024
5cb0754
add help page
tscholak Oct 27, 2024
7b93dce
add help page
tscholak Oct 28, 2024
b79129c
add starcoder2 success story
tscholak Oct 28, 2024
4369066
add starcoder2 success story
tscholak Oct 28, 2024
2262c65
rewrite quick-start guide
tscholak Oct 28, 2024
60cc57a
add disclaimer
tscholak Oct 29, 2024
6a76020
add build instructions
tscholak Oct 29, 2024
13e29bc
Update README.md
hughesthe1st Oct 30, 2024
f72ac0e
Update index.md
hughesthe1st Oct 31, 2024
5a43ed5
Update index.md
hughesthe1st Oct 31, 2024
bff6506
Update index.md
hughesthe1st Oct 31, 2024
271d9d1
Update index.md
hughesthe1st Oct 31, 2024
7c9bc15
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Oct 31, 2024
8216e58
Merge branch 'tscholak/improve-docs' of github.com:ServiceNow/Fast-LL…
tscholak Oct 31, 2024
c0b8959
remove unused blog
tscholak Oct 31, 2024
431eefa
add markdownlint
tscholak Oct 31, 2024
f11bbdf
add markdownlint
tscholak Oct 31, 2024
045dcca
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Nov 3, 2024
43869e5
separate md linting for / and /docs
tscholak Nov 3, 2024
dd35c50
wip
tscholak Nov 5, 2024
f6e163f
wip
tscholak Nov 6, 2024
8acb4f3
wip
tscholak Nov 7, 2024
1ea3422
wip
tscholak Nov 7, 2024
f3eb0d6
wip
tscholak Nov 7, 2024
127cd43
wip
tscholak Nov 8, 2024
c382298
add datasets as dependency
tscholak Nov 8, 2024
8b49dfc
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Nov 9, 2024
7304119
fix GPTMemmapDataset
tscholak Nov 9, 2024
47d453b
fix GPTMemmapDataset
tscholak Nov 9, 2024
bef3a72
add prepare-dataset command
tscholak Nov 10, 2024
0ffc75c
add prepare-dataset command
tscholak Nov 10, 2024
fda6386
add prepare-dataset command
tscholak Nov 10, 2024
acae7d9
add prepare-dataset command
tscholak Nov 10, 2024
eb7da59
add prepare-dataset command
tscholak Nov 10, 2024
b5ed2f0
add prepare-dataset command
tscholak Nov 10, 2024
c8f746a
only push latest tag for commits to main
tscholak Nov 10, 2024
0f80b76
add V100
tscholak Nov 10, 2024
e0f813c
use older generics syntax
tscholak Nov 10, 2024
b88c9d3
remove user and install Fast-LLM globally
tscholak Nov 10, 2024
4df12d9
simplify Dockerfile
tscholak Nov 11, 2024
3c5d4d9
wip
tscholak Nov 11, 2024
54af690
Merge remote-tracking branch 'origin/tscholak/prepare-dataset' into t…
tscholak Nov 11, 2024
3737bc0
improvements
tscholak Nov 11, 2024
4b6b195
add docstring
tscholak Nov 11, 2024
52a6f0b
use full imports
tscholak Nov 11, 2024
55b0b88
use full imports
tscholak Nov 11, 2024
1f975d2
use full imports
tscholak Nov 11, 2024
b665e91
don't load tokenizer during validation
tscholak Nov 11, 2024
af1439e
Merge remote-tracking branch 'origin/main' into tscholak/prepare-dataset
tscholak Nov 11, 2024
e51677f
simplify
tscholak Nov 12, 2024
1f447bb
simplify
tscholak Nov 12, 2024
fb50c13
address comments
tscholak Nov 12, 2024
33067c8
address comments
tscholak Nov 12, 2024
dbc221c
address comments
tscholak Nov 12, 2024
a2ae051
address comments
tscholak Nov 12, 2024
5107302
Merge remote-tracking branch 'origin/tscholak/prepare-dataset' into t…
tscholak Nov 12, 2024
d68ce82
fix link
tscholak Nov 12, 2024
b2675a3
resolve merge conflicts
tscholak Nov 13, 2024
2fad03c
clean up
tscholak Nov 13, 2024
223bab0
clean up
tscholak Nov 13, 2024
94008ea
wip
tscholak Nov 13, 2024
9706971
Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/i…
tscholak Nov 13, 2024
763d843
update dependencies
tscholak Nov 13, 2024
af3f1f0
wip
tscholak Nov 14, 2024
cc6ae8b
revert changes
tscholak Nov 14, 2024
e96c411
wip
tscholak Nov 14, 2024
77c7416
wip
tscholak Nov 14, 2024
08acf67
Improve quickstart guide
jlamypoirier Nov 19, 2024
4bd8fff
restore original structure
tscholak Nov 19, 2024
8031e9f
restore original structure
tscholak Nov 19, 2024
25277a1
resolve merge conflicts
tscholak Nov 27, 2024
9eaf01f
Merge branch 'tscholak/improve-docs' into torsten/improve_quickstart
tscholak Nov 27, 2024
48137d4
Merge remote-tracking branch 'origin/main' into torsten/improve_quick…
tscholak Nov 30, 2024
2ec96b4
wip
tscholak Dec 2, 2024
10deb5b
wip
tscholak Dec 2, 2024
add starcoder2 success story
tscholak committed Oct 28, 2024
commit 4369066de0a84c9a0390c0b42d9d956719308713
12 changes: 2 additions & 10 deletions docs/about-us.md
@@ -1,5 +1,7 @@
---
title: About Us
hide:
- navigation
---

Welcome to Fast-LLM! We are a global team of engineers, researchers, and AI professionals led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), dedicated to advancing large language models (LLMs) and providing the highest-performance tools for serious users. Designed with professionals, research institutions, and enterprises in mind, Fast-LLM offers the speed, scalability, and flexibility needed to train the biggest and most complex models. Our commitment to open-source ensures that you have full control over your workflows, without the limitations or compromises of commercial frameworks.
@@ -30,13 +32,3 @@ Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://ww
- [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.

Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models.

## Get Involved

Fast-LLM is an open-source project that thrives on collaboration. If you're a professional or researcher looking to contribute, there are many ways to get involved:

- **Code Contributions:** Dive into our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md) to learn how you can help improve Fast-LLM.
- **Discussion and Ideas:** Join us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions) to share your insights, ask questions, or discuss new features.
- **Documentation and Tutorials:** Help us expand our [documentation](https://servicenow.github.io/Fast-LLM/), making it even more valuable for other professionals.

If you're serious about training large language models, Fast-LLM is here to help you push the limits. We look forward to your contributions and feedback as we continue to make LLM training faster and better.
4 changes: 2 additions & 2 deletions docs/help.md
@@ -44,9 +44,9 @@ If you're the type who loves configurations and tweaking every detail, the [**Co

We've got some excellent tutorials to help you get the most out of Fast-LLM:

- [**Quick-Start Guide**](quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
- [**Quick-Start Guide**](/quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.

- [**In-Action Guides**](in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
- [**In-Action Guides**](/in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.

---

4 changes: 3 additions & 1 deletion docs/index.md
@@ -1,5 +1,7 @@
---
title: "Fast-LLM: Train Large Language Models Faster Than Ever Before"
hide:
- navigation
---

Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
@@ -77,6 +79,6 @@ Fast-LLM is more than just software, it's a community. Get involved by exploring

## Getting Started

Ready to dive in? Check out our [quickstart guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
Ready to dive in? Check out our [quick-start guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.

For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions).
65 changes: 65 additions & 0 deletions docs/join-us.md
@@ -0,0 +1,65 @@
---
title: Join Us
hide:
- navigation
---

Fast-LLM is an open-source project driven by a community of passionate contributors. Whether you're a researcher, developer, or AI enthusiast, there's a place for you to make a real impact on the future of large-scale AI training. Join us, dive in, and help shape the tools that push the boundaries of language model training. Here's how you can get involved:

---

## Stay in the Loop 📬

Want to keep up with the latest Fast-LLM updates and new opportunities to get involved? **Star** the Fast-LLM repository on GitHub and **watch** the project for notifications on new releases, discussions, and updates. This way, you'll always know what's happening, from new features to community initiatives.

[Star](https://github.com/ServiceNow/Fast-LLM/stargazers) ⭐ and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) 👀 the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features.

---

## Code Contributions 🛠️

Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [**Good First Issues**](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:

- [**Contributing**](developers/contributing) – for setup instructions and contributing guidelines
- [**Best Practices**](developers/best-practices) – for tips on writing clean, maintainable code

Here's a quick overview of the process (a command sketch follows below):

1. **Fork & Clone**: Start by forking the repo and cloning it to your machine.
2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
3. **Write Awesome Code**: Make your changes, document them, and follow our best practices.
4. **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.

[Explore the Developer's Corner for everything you need to get started!](developers)
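
As a rough sketch, the usual sequence of commands looks something like this (replace `<your-username>` with your GitHub handle; the branch name is just an example):

```bash
# Clone your fork and keep the upstream repository as a second remote
git clone git@github.com:<your-username>/Fast-LLM.git
cd Fast-LLM
git remote add upstream https://github.com/ServiceNow/Fast-LLM.git

# Do your work on a topic branch
git checkout -b my-feature
# ... edit, test, and commit your changes ...
git push -u origin my-feature
# Then open a pull request against ServiceNow/Fast-LLM on GitHub
```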

---

## Feature Requests & Ideas 💡

Got a great idea? We want to hear it! Whether it's a new feature, an enhancement, or even a moonshot idea, head over to **GitHub Discussions** to share your thoughts. Community feedback drives Fast-LLM's evolution, and your ideas can help shape the future of the project.

[Share your thoughts on GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)

---

## Testing & Feedback 🔍

Your experience with Fast-LLM is invaluable, whether you're running it in production or experimenting at home. We rely on user feedback to find bugs, optimize performance, and improve documentation. Please share any bugs, performance quirks, or gaps you spot with us on GitHub Issues. This kind of feedback strengthens the entire project.

[Report issues and share feedback on GitHub](https://github.com/ServiceNow/Fast-LLM/issues)

---

## Help & Support 🤝

Love helping others? Join our **GitHub Discussions** to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too.

---

## Spread the Word 📣

If you're excited about Fast-LLM, let the world know! Share on social media, write a blog post, or give a talk at your next tech meetup. Spreading the word helps grow our community and brings new talent into the project.

---

Let's push the boundaries of large-scale AI training together. We're thrilled to have you here. Welcome to the Fast-LLM community!
58 changes: 32 additions & 26 deletions docs/quick-start.md
@@ -26,30 +26,7 @@ Let's create folders to store our input data and output results:
mkdir ~/inputs ~/results
```

## Step 3: Preparing the Training Data

For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!

We've got a script that'll download and preprocess the dataset for you. Run it like this:

!!! info inline end "What's Happening Here?"

This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.

```bash
python tools/prepare_dataset.py \
tokenizer_path_or_name="gpt2" \
dataset_name_or_path="openwebtext" \
dataset_split="train" \
dataset_field="text" \
output_dir="inputs" \
num_processes_load=4 \
num_processes_map=4 \
num_processes_save=4 \
num_tokens_per_shard=100000000
```

## Step 4: Choose Your Model
## Step 3: Choose Your Model

Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama, Mistral, and Qwen. For this tutorial, let's train the GPT-2 model from scratch with Fully Sharded Data Parallelism (FSDP). We'll grab a configuration file from Huggingface Hub and save it as `~/inputs/config.json`:

@@ -81,6 +58,34 @@ Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama

Smaller models like GPT-2 (124M) will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger models a shot!
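
For reference, a minimal sketch of the config-download step above (assuming the standard `gpt2` repository layout on the Hugging Face Hub; the guide's own command may differ):

```bash
# Illustrative only: fetch GPT-2's config.json from the Hugging Face Hub
curl -L https://huggingface.co/gpt2/raw/main/config.json -o ~/inputs/config.json
```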

## Step 4: Preparing the Training Data

For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!

We've got a script that'll download and preprocess the dataset for you. Run it like this:

```bash
docker run -it --rm \
-v ~/inputs:/app/inputs \
ghcr.io/servicenow/fast-llm:latest \
python tools/prepare_dataset.py \
tokenizer_path_or_name="gpt2" \
dataset_name_or_path="openwebtext" \
dataset_split="train" \
output_dir="inputs" \
num_processes_load=4 \
num_processes_map=4 \
num_processes_save=4 \
num_tokens_per_shard=100000000
```

!!! info "What's Happening Here?"

This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.

!!! warning "Tokenizer Mismatch"

If you chose a different model in Step 3, make sure to adjust the `tokenizer_path_or_name` parameter to match the model's tokenizer.
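
Once the script finishes, a quick sanity check of the output directory can save debugging time later (a rough sketch; the exact file names depend on the script's output format):

```bash
# Confirm that shards were written and check their total size
ls -lh ~/inputs | head
du -sh ~/inputs
```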

## Step 5: Set Up Your Training Configuration

Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
@@ -167,10 +172,11 @@ docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest \
-v ~/results:/app/results \
-e PYTHONHASHSEED=0 \
-e WANDB_API_KEY_PATH=/app/inputs/.wandb_api_key \
torchrun --nproc_per_node=8 --no_python fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
torchrun --nproc_per_node=8 --no_python \
fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
```
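
If your machine has fewer than 8 GPUs, you can adjust `--nproc_per_node` to match. For example, a sketch for a single-GPU box, launched through the same `docker run` wrapper as above:

```bash
torchrun --nproc_per_node=1 --no_python \
fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
```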

!!! note
!!! note "Python Hash Seed"

Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes, which is crucial for parallel computations.
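
To see what this controls, here's a tiny illustration you can run locally (standard CPython behavior, not specific to Fast-LLM):

```bash
# String hashes are randomized per process unless PYTHONHASHSEED is fixed
python -c "print(hash('fast-llm'))"                   # value changes from run to run
PYTHONHASHSEED=0 python -c "print(hash('fast-llm'))"  # same value on every run
```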

13 changes: 13 additions & 0 deletions docs/refs.bib
@@ -0,0 +1,13 @@
@article{li2023starcoder,
title={{StarCoder}: may the source be with you!},
author={Li, Raymond and Allal, Loubna Ben and Zi, Yangtian and Muennighoff, Niklas and Kocetkov, Denis and Mou, Chenghao and Marone, Marc and Akiki, Christopher and Li, Jia and Chim, Jenny and others},
journal={arXiv preprint arXiv:2305.06161},
year={2023}
}

@article{lozhkov2024starcoder,
title={{StarCoder} 2 and {The Stack} v2: The next generation},
author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
journal={arXiv preprint arXiv:2402.19173},
year={2024}
}