diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md index 6f40cfda..72e68d69 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -108,6 +108,7 @@ echo "=== END OF ENVIRONMENT INFORMATION ===" # πŸ“ Additional Context Include any other information that may help us understand the issue, such as: + - Recent changes to the configuration or code. - Whether the issue occurs consistently or intermittently. - Any troubleshooting steps you have already tried. diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 82f811b4..7eb522e1 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -25,40 +25,45 @@ List the key changes introduced in this PR: 1. Change A 2. Change B -# βœ… Checklist +## βœ… Checklist Make sure the following tasks are completed before submitting the PR: -### General: -- [ ] πŸ“œ I have read and followed the [contributing guidelines](CONTRIBUTING.md). +### General + +- [ ] πŸ“œ I have read and followed the [contributing guidelines](https://servicenow.github.io/Fast-LLM/developers/contributing). +- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced. - [ ] πŸŽ‰ The functionality is complete, and I have tested the changes. - [ ] πŸ“ I have updated the documentation if needed. - [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases). - [ ] 🧩 I have commented my code, especially in hard-to-understand areas. -### Dependencies and Configuration: +### Dependencies and Configuration + - [ ] πŸ‹ I have updated the Docker configuration or dependencies, if applicable. - [ ] πŸ”„ I have ensured compatibility with the existing setup after dependency changes. -### Testing: +### Testing + - [ ] πŸ§ͺ I have added or updated tests to cover my changes. - [ ] βœ”οΈ New and existing tests pass locally with my changes. - [ ] 🚦 I have tested these changes on GPUs and verified training stability. - [ ] πŸ‹οΈ I have tested the changes on realistic training workloads, if applicable. -### Performance Impact: +### Performance Impact + - [ ] πŸ“Š I have run benchmarks where applicable to evaluate the performance impact. - [ ] βœ… The benchmarks show no performance regression. - [ ] πŸš€ The benchmarks indicate a potential performance improvement. - [ ] ⚠️ The benchmarks indicate a potential performance degradation. - [ ] πŸ“ˆ I have provided benchmark results and detailed any performance impact below, if applicable. -# πŸ“Š Performance Impact Details +## πŸ“Š Performance Impact Details If there is any impact on performance, describe it and provide benchmark results, if applicable: --- -# πŸ“ Additional Notes +## πŸ—’οΈ Additional Notes Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns. 
diff --git a/.gitignore b/.gitignore index 41502c68..4f834433 100644 --- a/.gitignore +++ b/.gitignore @@ -8,6 +8,7 @@ __pycache__/ # Doc build .cache +site # Distribution / packaging *.egg-info/ @@ -27,3 +28,11 @@ venv.bak/ # Project specifics /.idea/ /.vscode/ + +# Devenv +.devenv* +devenv.local.nix +devenv.* + +# direnv +.direnv diff --git a/.markdownlint.yaml b/.markdownlint.yaml new file mode 100644 index 00000000..3b8bac64 --- /dev/null +++ b/.markdownlint.yaml @@ -0,0 +1,35 @@ +# See https://github.com/DavidAnson/markdownlint/blob/v0.32.1/schema/.markdownlint.yaml for schema documentation + +# Default state for all rules +default: true + +# MD007/ul-indent : Unordered list indentation : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md007.md +MD007: + # Spaces for indent + indent: 2 + +# MD010/no-hard-tabs : Hard tabs : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md010.md +MD010: + # Include code blocks + code_blocks: false + # Fenced code languages to ignore + ignore_code_languages: [] + # Number of spaces for each hard tab + spaces_per_tab: 2 + +# MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md +MD013: false + +# MD024/no-duplicate-heading : Multiple headings with the same content : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md024.md +MD024: false + +# MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md +MD030: + # Spaces for single-line unordered list items + ul_single: 1 + # Spaces for single-line ordered list items + ol_single: 1 + # Spaces for multi-line unordered list items + ul_multi: 1 + # Spaces for multi-line ordered list items + ol_multi: 1 diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index f8465c52..480b669b 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -48,3 +48,7 @@ repos: args: - "--config" - "./pyproject.toml" +- repo: https://github.com/markdownlint/markdownlint + rev: v0.11.0 + hooks: + - id: markdownlint diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index b3b61bc8..4e623f9f 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -1,8 +1,8 @@ -### ServiceNow Open Source Code-of-Conduct +# ServiceNow Open Source Code-of-Conduct This code of conduct provides guidelines for participation in ServiceNow-managed open-source communities and projects. -**Discussion forum guidelines** +## Discussion forum guidelines Communities thrive when members support each other and provide useful feedback. @@ -11,12 +11,12 @@ Communities thrive when members support each other and provide useful feedback. - User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable. - Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs. - Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message. -- Do not publish anyone’s private personal information without their explicit consent. +- Do not publish anyone's private personal information without their explicit consent. - Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning. 
- Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours. - Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments. -**Issue board guidelines** +## Issue board guidelines Many open-source projects provide an Issues board, with similar functionality to a Discussions forum. The same rules from the discussion forum guidelines apply to the Issues board. @@ -25,22 +25,22 @@ ServiceNow suggests the following technical support pathways for open-source pro 1. Clearly identify and document the issue or question you have. 2. View the Documentation. 3. Search the Discussions. -4. Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips. -5. Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated! -6. Log an Issue if it hasn’t already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users. +4. Search the project documentation for known errors, useful solutions, and troubleshooting tips. +5. Check the project contribution guidelines if you would like details on how you can submit a change. Community contributions are valued and appreciated! +6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users. 7. Contact the project team contributors of the project to see if they can help as a last resort only. -**Repositories** +## Repositories - Read and follow the license instructions -- Remember to include citations if you use someone else’s work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference. -- β€˜Star’ project repos to save for future reference. -- β€˜Watch’ project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely. +- Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference. +- β€˜Star' project repos to save for future reference. +- β€˜Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely. -**Enforcement and reporting** +## Enforcement and reporting -We encourage community members and users to help each other and to resolve issues amongst themselves as much as possible. 
If a matter cannot be resolved in good faith within the means available, please reach out to a team member or email fast-llm-team@servicenow.com. +We encourage community members and users to help each other and to resolve issues amongst themselves as much as possible. If a matter cannot be resolved in good faith within the means available, please reach out to a team member or email [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). -**ServiceNow Disclaimer.** +## ServiceNow Disclaimer We may, but are under no obligation to, monitor or censor comments made by users or content provided by contributors and we are not responsible for the accuracy, completeness, appropriateness or legality of anything posted, depicted or otherwise provided by third‑party users and we disclaim any and all liability relating thereto. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 6c6aece2..16580f7d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,62 +1,3 @@ -# Contributing to Fast-LLM πŸš€ +# Contributing to Fast-LLM -Thank you for your interest in contributing to Fast-LLM! We're thrilled to have you here, and your support is invaluable in helping us accelerate LLM training to full speed. This guide will walk you through the steps to contribute, from reporting issues to submitting changes and setting up your development environment. - -If you have questions or want to start a discussion, feel free to [open a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) on our GitHub page. - -## Getting Started - -To get started with contributing to Fast-LLM, follow these steps to set up your environment: - -1. **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/development/setup) for instructions on getting everything ready, including the development environment and dependencies. -2. **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/development/dev-practices/), which cover code style, pre-commit hooks, and testing strategies. -3. **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/development/launching) to get Fast-LLM up and running. - -## How to Report a Bug 🐞 - -Found a bug? Let's squash it together! [Open an issue](https://github.com/ServiceNow/Fast-LLM/issues/new/choose) and select "Bug report." Please include as much information as possible: - -- Steps to reproduce the issue. -- What you expected to happen versus what actually happened. -- Logs, Fast-LLM configuration, and error messages. -- Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version). - -If you're familiar with the codebase, consider adding a failing unit test to demonstrate the problem (optional, but helpful!). - -## Proposing Changes - -Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/issues) to discuss your proposal. This is especially important if you're planning significant changes or adding new dependencies. Once your idea is approved, follow these steps: - -1. **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account. -2. **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine. -3. 
**Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`. -4. **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed. -5. **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/development/pr-title-guidelines). This title will become the commit message for the squashed merge. -6. **Push to Your Fork**: Push the branch to your GitHub fork. -7. **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes. - -### Guidelines for a Successful Pull Request - -Here are some tips to ensure your pull request gets reviewed and merged promptly: - -- **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/development/dev-practices/) to keep the code clean and consistent. -- **Write tests**: Verify your changes with unit tests for new features or bug fixes. -- **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups. -- **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up. -- **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases. -- **Comment non-trivial code**: Make your code easy to understand for others. -- **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information. -- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit. - -## Seeking Help or Clarification - -If you're unsure about something or need help, you've got options: - -- **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat. -- **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance. - -## Contributors - -We're grateful for all the awesome contributors who help make Fast-LLM better. Join our contributors' list and make your first contribution! - -To learn more about the team and maintainers, visit our [About page](https://servicenow.github.io/Fast-LLM/about-us/). +Please refer to the [contributing guidelines](https://servicenow.github.io/Fast-LLM/developers/contributing) for more information on how to contribute to Fast-LLM. diff --git a/README.md b/README.md index c2324ad8..9da114bb 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,11 @@ Made with ❀️ by [ServiceNow Research][servicenow-research] ## Overview -Fast-LLM is a new open-source library for training large language models, built on [PyTorch][pytorch] and [Triton][triton]. It is extremely fast, scales to large clusters, supports a wide range of model architectures, and is easy to use. Unlike commercial frameworks like Megatron-LM, which are largely closed off and fragmented across forks, Fast-LLM is fully open-source and encourages community-driven development. 
Researchers can freely customize and optimize as needed, making it a flexible and hackable alternative that combines the speed of specialized tools with the openness of libraries like [Hugging Face Transformers][transformers]. +Fast-LLM is a cutting-edge open-source library for training large language models with exceptional speed, scalability, and flexibility. Built on [PyTorch][pytorch] and [Triton][triton], Fast-LLM empowers AI teams to push the limits of generative AI, from research to production. + +Optimized for training models of all sizesβ€”from small 1B-parameter models to massive clusters with 70B+ parametersβ€”Fast-LLM delivers faster training, lower costs, and seamless scalability. Its fine-tuned kernels, advanced parallelism techniques, and efficient memory management make it the go-to choice for diverse training needs. + +As a truly open-source project, Fast-LLM allows full customization and extension without proprietary restrictions. Developed transparently by a community of professionals on GitHub, the library benefits from collaborative innovation, with every change discussed and reviewed in the open to ensure trust and quality. Fast-LLM combines professional-grade tools with unified support for GPT-like architectures, offering the cost efficiency and flexibility that serious AI practitioners demand. > [!NOTE] > Fast-LLM is not affiliated with Fast.AI, FastHTML, FastAPI, FastText, or other similarly named projects. Our library's name refers to its speed and efficiency in language model training. @@ -25,7 +29,7 @@ Fast-LLM is a new open-source library for training large language models, built - ⚑️ Optimized kernel efficiency and reduced overheads. - πŸ”‹ Optimized memory usage for best performance. - ⏳ Minimizes training time and cost. - + 2. πŸ“ˆ **Fast-LLM is Highly Scalable**: - πŸ“‘ Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline). - πŸ”— Supports sequence length parallelism to handle longer sequences effectively. @@ -49,7 +53,7 @@ Fast-LLM is a new open-source library for training large language models, built 5. 🌐 **Fast-LLM is Truly Open Source**: - βš–οΈ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research. - - πŸ’» Fully developed on GitHub with a public [roadmap][roadmap] and transparent [issue tracking][issues]. + - πŸ’» Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues]. - 🀝 Contributions and collaboration are always welcome! ## Usage diff --git a/SECURITY.md b/SECURITY.md index 643b23f7..e3a80c5b 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -16,7 +16,7 @@ If you find a vulnerability in ServiceNow systems, products, or network infrastr If you find a vulnerability in this open-source project published by the ServiceNow Research team, please email [servicenow-research@servicenow.com](mailto:servicenow-research@servicenow.com) to report your findings. We will process your report as soon as possible, depending on the severity of your report. We appreciate everyone's help in disclosing vulnerabilities in a responsible manner. 
- + ## Guidelines Please follow the guidelines below when [disclosing vulnerabilities](https://www.servicenow.com/company/trust/privacy/responsible-disclosure.html): diff --git a/docs/.markdownlint.yaml b/docs/.markdownlint.yaml new file mode 100644 index 00000000..44d5cf91 --- /dev/null +++ b/docs/.markdownlint.yaml @@ -0,0 +1,32 @@ +# See https://github.com/DavidAnson/markdownlint/blob/v0.32.1/schema/.markdownlint.yaml for schema documentation + +# Default state for all rules +default: true + +# MD007/ul-indent : Unordered list indentation : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md007.md +MD007: + # Spaces for indent + indent: 4 + +# MD010/no-hard-tabs : Hard tabs : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md010.md +MD010: + # Include code blocks + code_blocks: false + # Fenced code languages to ignore + ignore_code_languages: [] + # Number of spaces for each hard tab + spaces_per_tab: 4 + +# MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md +MD013: false + +# MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md +MD030: + # Spaces for single-line unordered list items + ul_single: 3 + # Spaces for single-line ordered list items + ol_single: 2 + # Spaces for multi-line unordered list items + ul_multi: 3 + # Spaces for multi-line ordered list items + ol_multi: 2 diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..1db83b67 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,41 @@ +# Fast-LLM Documentation Sources + +This folder contains the source files for the Fast-LLM documentation. The contents here are used to generate the rendered documentation, which is automatically updated and published whenever changes are pushed to the `main` branch. + +## πŸ“š Access the Rendered Documentation + +To view the complete, rendered documentation, please visit the [Fast-LLM Documentation Site](https://servicenow.github.io/Fast-LLM). + +## Building and Serving the Documentation + +To build and preview the documentation locally, follow these simple steps: + +1. **Install the necessary dependencies:** + + ```bash + pip install --no-build-isolation -e ".[DOCS]" + ``` + + You also need to install `libcairo` for image processing on your system; a typical installation command is shown at the end of this page. + +2. **Build the documentation:** + + ```bash + mkdocs build + ``` + + This will generate the static documentation files in a `site/` folder. + +3. **Serve the documentation locally (with auto-reload):** + + ```bash + mkdocs serve + ``` + + The documentation site will be served locally at [http://127.0.0.1:8000](http://127.0.0.1:8000), and any changes made to the source files will automatically trigger a rebuild. + +## Contributing to the Documentation + +If you'd like to contribute to the Fast-LLM documentation, feel free to edit these source files and submit a pull request. The changes will be reflected on the rendered documentation site after they are merged into the `main` branch. + +Your contributions could be as simple as helping to correct typos and spelling errors, improving existing content to provide more details on how to approach a tricky step for novice users, or even to add new content that describes functionality with limited or no detailed coverage anywhere else. No matter how small, we value all contributions from the Fast-LLM community.
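As a pointer for step 1 of the build instructions above, `libcairo` normally comes from your system package manager rather than from pip. A minimal sketch for two common setups (assuming Debian/Ubuntu or macOS with Homebrew; package names may differ on other systems):

```bash
# Debian/Ubuntu: development headers for Cairo
sudo apt-get install libcairo2-dev

# macOS with Homebrew
brew install cairo
```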
diff --git a/docs/about-us.md b/docs/about-us.md new file mode 100644 index 00000000..eedd852a --- /dev/null +++ b/docs/about-us.md @@ -0,0 +1,34 @@ +--- +title: About Us +hide: + - navigation +--- + +Welcome to Fast-LLM! We are a global team of engineers, researchers, and AI professionals led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), dedicated to advancing large language models (LLMs) and providing the highest-performance tools for serious users. Designed with professionals, research institutions, and enterprises in mind, Fast-LLM offers the speed, scalability, and flexibility needed to train the biggest and most complex models. Our commitment to open-source ensures that you have full control over your workflows, without the limitations or compromises of commercial frameworks. + +## πŸš€ Our Mission + +Our mission is to deliver a best-in-class library for training large-scale language models, combining cutting-edge performance with robust, customizable features. Fast-LLM is built to meet the needs of researchers and organizations who push the boundaries of generative AI, enabling them to train state-of-the-art models more efficiently. By optimizing training workflows and scaling to massive compute clusters, we help professionals unlock the full potential of LLMs, reducing costs and time-to-deployment for ambitious AI projects. + +## 🌍 Our Vision + +We envision Fast-LLM as the go-to solution for serious AI practitioners who require more than what typical frameworks can offer. Our goal is to empower research institutions, corporate AI teams, and universities to train sophisticated models that exceed the capabilities of standard tools. By creating a highly performant and customizable library, we aim to be the backbone of cutting-edge AI research and development, equipping experts with the tools they need to tackle the toughest training challenges. + +## 🎯 Our Values + +At Fast-LLM, we adhere to a set of guiding principles that define our approach: + +- **Performance-Driven:** We are relentless in our pursuit of speed and efficiency. Fast-LLM is built to reduce training time and scale to the largest clusters, enabling our users to achieve breakthrough results faster. +- **Professional-Grade Customization:** We understand that serious AI work demands flexibility. Fast-LLM is designed for extensive customization, allowing users to tailor every aspect of the training process to their unique needs. +- **Open Innovation:** While we cater to advanced users, our commitment to open-source ensures that innovation remains accessible. We believe in building a community where professionals can collaborate and contribute to shaping the future of AI. +- **Reliability at Scale:** Fast-LLM is built with rigorous standards to support production-level workloads. We prioritize stability, reproducibility, and robustness, ensuring that your models can scale from research to real-world applications seamlessly. + +## πŸ‘₯ Meet the Team + +Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), with development driven by a dedicated group of professionals who bring extensive expertise in AI, machine learning, and distributed systems. While the project direction is guided by the Foundation Models Lab, contributions come from a growing network of researchers, developers, and industry experts worldwide. 
Here are some of the key members leading the project: + +- [**Joel Lamy Poirier**](https://www.servicenow.com/research/author/joel-lamy-poirier.html) - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability. +- [**Sean Hughes**](https://www.servicenow.com/research/author/sean-hughes.html) - Ecosystem Director, ServiceNow Research: Sean focuses on building partnerships and open scientific collaborations to advance Fast-LLM's capabilities and reach. +- [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training. + +Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models. diff --git a/docs/community/feedback.md b/docs/community/feedback.md deleted file mode 100644 index dcd70162..00000000 --- a/docs/community/feedback.md +++ /dev/null @@ -1,3 +0,0 @@ -# Feedback - -Coming soon... diff --git a/docs/community/index.md b/docs/community/index.md deleted file mode 100644 index 684e27f7..00000000 --- a/docs/community/index.md +++ /dev/null @@ -1 +0,0 @@ -Coming soon... diff --git a/docs/developers/contribute.md b/docs/developers/contribute.md deleted file mode 100644 index 2adb786c..00000000 --- a/docs/developers/contribute.md +++ /dev/null @@ -1,3 +0,0 @@ -# Contributing to Fast-LLM - -Coming soon... diff --git a/docs/developers/contributing.md b/docs/developers/contributing.md new file mode 100644 index 00000000..38b01868 --- /dev/null +++ b/docs/developers/contributing.md @@ -0,0 +1,63 @@ +--- +title: Contributing +--- + +Thank you for your interest in contributing to Fast-LLM! We're thrilled to have you here, and your support is invaluable in helping us accelerate LLM training to full speed. This guide will walk you through the steps to contribute, from reporting issues to submitting changes and setting up your development environment. + +If you have questions or want to start a discussion, feel free to [open a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) on our GitHub page. + +## πŸš€ Getting Started + +To get started with contributing to Fast-LLM, follow these steps to set up your environment: + +1. **Learn Our Development Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices), which cover development setup, testing, and benchmarking. +2. **Read the Style Guide**: Follow our [style guide](https://servicenow.github.io/Fast-LLM/developers/style-guide) to maintain consistency in code style, documentation, and commit messages. + +## 🐞 How to Report a Bug + +Found a bug? Let's squash it together! [Open an issue](https://github.com/ServiceNow/Fast-LLM/issues/new/choose) and select "Bug report." Please include as much information as possible: + +- Steps to reproduce the issue. +- What you expected to happen versus what actually happened. +- Logs, Fast-LLM configuration, and error messages. +- Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version). 
+ +If you're familiar with the codebase, consider adding a failing unit test to demonstrate the problem (optional, but helpful!). + +## πŸ› οΈ Proposing Changes + +Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/issues) to discuss your proposal. This is especially important if you're planning significant changes or adding new dependencies. Once your idea is approved, follow these steps: + +1. **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account. +2. **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine. +3. **Create a New Branch**: Name your branch descriptively, such as `fix/training-memory-leak` or `feature/rope-scaling`. +4. **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed. +5. **Push to Your Fork**: Push the branch to your GitHub fork. +6. **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes. + +## πŸ† Guidelines for a Successful Pull Request + +Here are some tips to ensure your pull request gets reviewed and merged promptly: + +- **Follow our coding standards**: Stick to our [style guide and conventions](https://servicenow.github.io/Fast-LLM/developers/style-guide) to keep the code clean and consistent. +- **Write tests**: Verify your changes with unit tests for new features or bug fixes. +- **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups. +- **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up. +- **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases. +- **Comment non-trivial code**: Make your code easy to understand for others. +- **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information. +- **Use a clear and descriptive title**: The PR title should summarize the key change or feature introduced. Avoid vague titles like "Fix bug" or "Update code." Start with a keyword like `[feat]`, `[fix]`, `[docs]`, etc. to categorize the change. Reference the issue number if applicable (e.g., `[fix] resolve #123 memory leak in training loop`). This title will become the commit message for the squashed merge. +- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit. + +## πŸ†˜ Seeking Help or Clarification + +If you're unsure about something or need help, you've got options: + +- **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat. +- **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance. + +## 🌟 Contributors + +We're grateful for all the awesome contributors who help make Fast-LLM better. Join our contributors' list and make your first contribution! + +To learn more about the team and maintainers, visit our [About page](https://servicenow.github.io/Fast-LLM/about-us). 
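As a quick reference for the "Proposing Changes" steps above, here is a minimal command-line sketch of the fork-and-branch workflow; `<your-username>` and the branch name are placeholders to adapt to your own fork and change:

```bash
# Clone your fork and keep a reference to the upstream repository
git clone https://github.com/<your-username>/Fast-LLM.git
cd Fast-LLM
git remote add upstream https://github.com/ServiceNow/Fast-LLM.git

# Create a descriptively named branch for your change
git checkout -b fix/training-memory-leak

# ...make and commit your changes, then push the branch to your fork
git push -u origin fix/training-memory-leak
```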
diff --git a/docs/developers/dev-practices.md b/docs/developers/dev-practices.md new file mode 100644 index 00000000..da71eca6 --- /dev/null +++ b/docs/developers/dev-practices.md @@ -0,0 +1,15 @@ +--- +title: Development Practices +--- + +!!! warning + + Work in progress! Check back soon for the updated content. + +## Recommended Development Setup + +Stay tuned... + +## Testing and Benchmarking + +Stay tuned... diff --git a/docs/developers/index.md b/docs/developers/index.md deleted file mode 100644 index 74081485..00000000 --- a/docs/developers/index.md +++ /dev/null @@ -1,3 +0,0 @@ -# Developer Guides - -* [Contributing](contribute.md): How to contribute to Fast-LLM. diff --git a/docs/developers/style-guide.md b/docs/developers/style-guide.md new file mode 100644 index 00000000..1d0a45f2 --- /dev/null +++ b/docs/developers/style-guide.md @@ -0,0 +1,7 @@ +--- +title: Style Guide +--- + +!!! warning + + This section is work in progress. Please check back soon for the updated content. diff --git a/docs/help.md b/docs/help.md new file mode 100644 index 00000000..6eed9b04 --- /dev/null +++ b/docs/help.md @@ -0,0 +1,65 @@ +--- +title: "Help" +--- + +Welcome to the Fast-LLM Help Center! Here, you'll find fixes for common hiccups, links to dig deeper, tutorials, and pointers for when you need some extra support. Remember, everyone hits a snag now and then. Let's sort them out together and get you back to training. + +--- + +## Common Issues & Gotchas 🚧 + +Let's stay one step ahead of those pesky gotchas. Here's a list of common issues and quick fixes: + +- **CUDA Out of Memory**: When the GPU throws a fit, a few tweaks can help. First, try lowering `micro_batch_size` or `sequence_length` in the configuration to fit within the available memory. Still stuck? Try setting the `mlp_recompute_level` option to `activation` or `full` to save memory in the backward pass, or experiment with higher ZeRO stages for reduced memory usage. And if that's not enough, tensor or model parallelism may be your friend. + +- **Python Hash Seed Sync Error**: Encountering an error like + + ```bash + RuntimeError: Desync detected for barrier train begin (66830148464 != 133042721120) + ``` + + points to a hashing inconsistency. To fix it, set `PYTHONHASHSEED=0` in your environment variables. This ensures that Python's hash seed is consistent across all processes. If these processes have different hash seeds, they'll generate different hash values, leading to desynchronization, as seen in the error message. + +- **`torchrun` Timeout Errors**: If you see timeout errors related to `torchrun` during rendezvous, it could be DNS resolution or a networking issue. Check that all worker nodes are communicating properly with the master node. + +- **NCCL Errors with Timeout Messages**: Oh, the joys of NCCL errors! If you see something like + + ```bash + Watchdog caught collective operation timeout: WorkNCCL(SeqNum=408951, OpType=_ALLGATHER_BASE, … , Timeout(ms)=600000) ran for 600351 milliseconds before timing out + ``` + + appearing across all GPU workers, it usually means one or more hosts failed to complete a NCCL operation, causing others to block. NCCL errors can be frustrating to diagnose since they rarely specify which node or GPU caused the issue. It is difficult to surface which messages and operations are in progress during these crashes. In most cases, the best we can do is to restart the training job and hope it doesn't happen again. If the issue persists, it might be because of network congestion or a problematic GPU. 
If the worker that crashed is consistent across multiple runs, it's likely a hardware issue. If you can't resolve it, open an issue on GitHub, and we'll help you troubleshoot. + +For more detailed solutions, check out our GitHub Issues page. Odds are someone's already tackled a similar problem, and you might find the exact fix you need. + +--- + +## Reference πŸ“š + +If you're the type who loves configurations and tweaking every detail, the [**Configuration Reference**](reference/configuration.md) is for you. It covers every config option you could imagine. From optimizer settings to batch sizes to distributed training parameters. It's all in there. + +--- + +## Tutorials πŸ‘¨β€πŸ« + +We've got some excellent tutorials to help you get the most out of Fast-LLM: + +- [**Quick-Start Guide**](quick-start.md): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues. + +- [**Cookbook**](recipes/train-llama-8b.md): Ready to go big? These recipes cover real-world scenarios like training big models from scratch, continuing training from a checkpoint, and more. This is where Fast-LLM really shows its power. + +--- + +## Still Stuck? Where to Find Help πŸ™‹ + +If Fast-LLM still isn't cooperating, here's where to look next: + +1. **GitHub [Issues](https://github.com/ServiceNow/Fast-LLM/issues) & [Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)**: This is your best resource. Use the search function to see if anyone has run into the same issue. The community and our team are pretty active, so you'll likely find a solution or get help quickly. + +2. **Email (last resort)**: As a final option, you can email us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). This is only for rare cases, though. GitHub is our go-to for answering questions, as it lets others benefit from the conversation too. + +Fast-LLM is a growing community, and your questions and contributions help make it better for everyone. Who knows, you might just solve the next person's roadblock! + +--- + +That's it! We're excited to see what you build with Fast-LLM. Happy training! diff --git a/docs/index.md b/docs/index.md index aedfb6ee..0171e79f 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,27 +1,84 @@ --- -title: Fast-LLM +title: "Fast-LLM: Train Large Language Models Faster Than Ever Before" hide: - navigation - - toc - - feedback --- -# Fast-LLM +Introducing **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI researchers, AI/ML engineers, academic and industrial research institutions, and enterprise product development teams pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.** -Welcome to Fast-LLM, an innovative library designed for training large language models with an emphasis on speed, flexibility, and convenience. Developed by ServiceNow Research's Foundation Models Lab, Fast-LLM is tailored to meet the rigorous demands of enterprise AI solutions, providing a foundation for our bespoke generative AI applications. +[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. 
Dive into [real-world use cases](recipes/train-llama-8b.md) to see how Fast-LLM can elevate your training workflows. -## Key Features +## Why Fast-LLM? -- **Speed**: Fast-LLM delivers unparalleled training throughput, achieving speeds up to 4,000 tokens/s/GPU for Mixtral-8x7B and nearly 9,000 tokens/s/GPU for Mistral-7B, facilitating rapid model development and iteration. -- **Flexibility**: The library supports a diverse array of model architectures including, but not limited to, GPT, StarCoder, Llama, Mistral, and Mixtral. It is designed to be adaptable, allowing for easy expansion and customization to a broad range of models and training scenarios. -- **Convenience**: Designed with the user in mind, Fast-LLM aims to be straightforward and intuitive, enabling researchers and developers to focus more on innovation and less on the complexities of the tooling. +Fast-LLM is designed for professionals who demand exceptional performance for efficient large-scale (FLOPS) language model training on GPUs. Fast-LLM integrates effortlessly into existing ML pipelines and goes beyond off-the-shelf commercial frameworks, like NVIDIA NeMo Megatron, to deliver a **robust, flexible, and high-performance open-source alternative**. Whether you're optimizing for speed, cost, or scalability, Fast-LLM helps you get the most out of your training infrastructure. + +### The Fast-LLM Advantage + +Fast-LLM isn't just another library: **it's a platform for powering the next generation of AI breakthroughs**. Here's what sets it apart: + +- **πŸš€ Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-parallelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism, making Fast-LLM the go-to choice for diverse training needs. + +- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single Python file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike Hugging Face Transformers, where every model has its own, mostly independent, implementation, Fast-LLM reduces coding effort and adapts effortlessly, even with custom architectures. + +- **πŸ’° Cost Efficiency That Sets Fast-LLM Apart:** + + - **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training models can be cheaper compared to other frameworks due to faster processing and better memory efficiency. + + - **More Tokens for Your Budget:** Train on more tokens for the same budget, leading to better-trained models without breaking your financial constraints. + + + +- **πŸ”“ Openness Without Compromise:** Fast-LLM's open-source approach ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of experts on GitHub, every change is **publicly discussed and vetted**, fostering **trust and collaboration** so you can innovate with confidence, knowing the entire development process and decision making is out in the open.
+ +- **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us.md) to help shape the future of large-scale AI training. + +### Key Features + +Fast-LLM offers all the capabilities you need to accelerate your LLM training and **push the boundaries of what's possible**: + +- **πŸš€ Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs (batch size 32, sequence length 8k). Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost. + +- **πŸ“‘ Unmatched Scalability:** Seamlessly scale from a single GPU to large compute clusters. Fast-LLM supports 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1,2,3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance. + +- **πŸŽ›οΈ Total Flexibility:** Compatible with all major language model architectures, including but not limited to Llama, Mistral, StarCoder, and Mixtral. Fast-LLM's modular design gives you full control over your training workflows. + +- **πŸ“¦ Seamless Integration:** Integrate smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers). Benefit from Fast-LLM's optimizations without disrupting your existing pipelines. + +- **πŸ› οΈ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface. + +[Get Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models in record time. [Join the Fast-LLM community](join-us.md) and collaborate with like-minded professionals to advance the state-of-the-art in AI research and development. + +## Use Cases and Success Stories + +Fast-LLM powers the world's most advanced AI projects: + +- **NLP Research and Development:** Train state-of-the-art language models for natural language understanding, summarization, and conversational AI. +- **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration. +- **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning. + +See how Fast-LLM has helped early adopters achieve faster results. [Explore use cases and success stories](success-stories/starcoder-2.md). ## Project Scope and Objectives -Fast-LLM seeks to provide a high-quality alternative to existing frameworks such as Megatron-LM and NeMo. It is compatible with 3D parallelism and is designed to integrate seamlessly with Huggingface Transformers, promoting not only efficient model training but also straightforward model deployment and inference. +Fast-LLM is designed to be the **go-to solution** for those training the most sophisticated language models. Our objectives include: + +- **Accelerating Training Workflows:** Deliver the fastest LLM training experience with optimized kernel efficiency, parallelism, and memory management. 
+- **Supporting a Broad Range of Architectures:** Offer built-in support for all major language model architectures, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models. +- **Enabling Seamless Integration and Deployment:** Integrate effortlessly into existing ML pipelines, including [Hugging Face Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters. +- **Advancing LLM Research and Production-Readiness:** Be suitable for both cutting-edge research and mission-critical production workloads. ## Collaboration and Contribution -The project is set for open-sourcing in Q2 2024, inviting contributions from the community in areas such as testing, bug fixes, new features, and documentation. We are especially interested in enhancements related to custom kernels using OpenAI's Triton JIT compiler and adaptations for alternative hardware platforms like AMD and Intel. +As Fast-LLM evolves, we invite the community to contribute and help shape its future. We welcome: + +- **Testing and Bug Fixes:** Help us identify issues and improve stability. +- **Feature Development:** Contribute new models, new training features, and new optimizations. +- **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our documentation and writing practical guides. + +Fast-LLM is more than just software, it's a community. Get involved by exploring our [contribution guidelines](developers/contributing.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions). + +## Getting Started + +Ready to dive in? Check out our [quick-start guide](quick-start.md) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments. -For more details on getting involved or using Fast-LLM, please refer to our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/CONTRIBUTING.md) and the subsequent sections of this documentation. +For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions). diff --git a/docs/join-us.md b/docs/join-us.md new file mode 100644 index 00000000..31ff49ab --- /dev/null +++ b/docs/join-us.md @@ -0,0 +1,57 @@ +--- +title: Join Us +hide: + - navigation +--- + +Fast-LLM is an open-source project driven by a community of passionate contributors. Whether you're a researcher, developer, or AI enthusiast, there's a place for you to make a real impact on the future of large-scale AI training. Join us, dive in, and help shape the tools that push the boundaries of language model training. Here's how you can get involved: + +## πŸ“¬ Stay in the Loop + +Want to keep up with the latest Fast-LLM updates and new opportunities to get involved? **Star** the Fast-LLM repository on GitHub and **watch** the project for notifications on new releases, discussions, and updates. This way, you'll always know what's happening, from new features to community initiatives. 
+ +[Star](https://github.com/ServiceNow/Fast-LLM/stargazers) and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features. + +## πŸ›  Code Contributions + +Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [Good First Issues](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need: + +- [**Contributing**](developers/contributing.md) – for setup instructions and contributing guidelines +- [**Best Practices**](developers/dev-practices.md) – for tips on writing clean, maintainable code + +Here's a quick overview of the process: + +1. **Fork & Clone**: Start by forking the repo and cloning it to your machine. +2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity. +3. **Write Awesome Code**: Make your changes, document them, and follow our best practices. +4. **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community. + +Explore our [Developer's Corner](developers/contributing.md) for everything you need to get started! + +## πŸ’‘ Feature Requests & Ideas + +Got a great idea? We want to hear it! Whether it's a new feature, an enhancement, or even a moonshot idea, head over to **GitHub Discussions** to share your thoughts. Community feedback drives Fast-LLM's evolution, and your ideas can help shape the future of the project. + +Share your thoughts on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions). + +## πŸ” Testing & Feedback + +Your experience with Fast-LLM is invaluable, whether you're running it in production or experimenting at home. We rely on user feedback to find bugs, optimize performance, and improve documentation. Please share any bugs, performance quirks, or gaps you spot with us on GitHub Issues. This kind of feedback strengthens the entire project. + +Report issues and share feedback on [GitHub Issues](https://github.com/ServiceNow/Fast-LLM/issues). + +## 🀝 Help & Support + +Love helping others? Join our [**GitHub Discussions**](https://github.com/ServiceNow/Fast-LLM/discussions) to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too. + +## πŸ“£ Spread the Word + +If you're excited about Fast-LLM, let the world know! Share on social media, write a blog post, or give a talk at your next tech meetup. Spreading the word helps grow our community and brings new talent into the project. + +## 🌟 Join Our Team + +Excited about contributing on a deeper level? The Foundation Models Lab at ServiceNow is at the forefront of large-scale AI training. We're looking for passionate individuals to push the boundaries of AI development with us. From research developers focusing on GPU optimization to visiting researchers refining our training frameworks, there's a role for everyone. Explore current opportunities and become a key player in shaping the future of AI at ServiceNow. 
+ +Check out our [Careers page](https://www.servicenow.com/research/careers.html) for more information. + +Let's push the boundaries of large-scale AI training together. We're thrilled to have you here. Welcome to the Fast-LLM community! diff --git a/docs/license.md b/docs/license.md index 58b5946b..eb39eeda 100644 --- a/docs/license.md +++ b/docs/license.md @@ -2,11 +2,9 @@ title: License --- -# License and citations - Fast-LLM is licenced under the Apache 2.0 license: -``` +```text Copyright 2024 ServiceNow, Inc. Licensed under the Apache License, Version 2.0 (the "License"); diff --git a/docs/quick-start.md b/docs/quick-start.md new file mode 100644 index 00000000..644399b9 --- /dev/null +++ b/docs/quick-start.md @@ -0,0 +1,842 @@ +--- +title: "Quick Start" +--- + +This guide will get you up and running with Fast-LLM on a single machine. Let's train a model and see some results! + +## Prerequisites + +To follow this guide, you'll need: + +- **Hardware**: At least one NVIDIA GPU with Volta architecture or newer. We wrote this guide with an 8-GPU machine of Ampere or Hopper architecture in mind. +- **Software**: + - **Docker** (if using the Docker setup), or + - **Local Environment**: PyTorch 2.2 or later, CUDA 12.1 or later, and APEX AMP (if building from source), or + - **Cluster Setup**: Access to a Kubernetes or Docker-enabled Slurm cluster. +- **Time**: The initial setup and training process requires a little patience. 😊 + +## πŸ— Step 1: Initial Setup + +First, choose your environment. You can use Docker, your local environment, Slurm, or Kubernetes. + +=== "Docker" + + You selected Docker for this tutorial. We'll use the Fast-LLM Docker image to train our model, which includes all the necessary dependencies. Grab the [pre-built Fast-LLM Docker image](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) from GitHub's container registry (GHCR). + + ```bash + docker pull ghcr.io/servicenow/fast-llm:latest + ``` + + Let's also create folders to store our input data and output results: + + ```bash + mkdir ~/inputs ~/results + ``` + +=== "Local Environment" + + You're setting up Fast-LLM in your machine's local environment. This means you'll need to install Fast-LLM and its dependencies. For simplicity and reproducibility, we recommend using the Fast-LLM Docker image instead. It's preconfigured with everything you need. But if you're set on a local installation, follow the steps below. + + Fast-LLM depends on [CUDA](https://developer.nvidia.com/about-cuda) 12.1 or later, [PyTorch](https://pytorch.org) 2.2 or later, [APEX](https://github.com/NVIDIA/apex?tab=readme-ov-file#installation), and [OpenAI Triton](https://github.com/triton-lang/triton). Follow the instructions on their respective websites to install them. If you use [conda](https://docs.conda.io/projects/conda/en/latest/index.html), you can create a new environment and install these dependencies in it. 
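For example, a minimal sketch of such a conda environment might look like the following; the environment name, Python version, and the CUDA 12.1 wheel index are assumptions to adjust for your setup, and the APEX build flags are abbreviated from the APEX README:

```bash
# Create and activate a dedicated environment (name and Python version are examples)
conda create -n fast-llm python=3.10 -y
conda activate fast-llm

# Install a CUDA-enabled PyTorch build (recent Linux wheels already bundle Triton)
pip install "torch>=2.2" --index-url https://download.pytorch.org/whl/cu121

# Build APEX from source with its C++ and CUDA extensions
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```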
+
+    Now, make sure PyTorch can access your GPU by running the following command:
+
+    ```bash
+    python -c "import torch; print(torch.cuda.is_available())"
+    ```
+
+    If APEX is correctly installed, the following command should run without errors:
+
+    ```bash
+    python -c "from amp_C import *"
+    ```
+
+    For Triton, you can verify the installation by running:
+
+    ```bash
+    python -c "import triton; print(triton.__version__)"
+    ```
+
+    Fast-LLM also depends on [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), which will be installed automatically when you install Fast-LLM:
+
+    ```bash
+    pip install --no-build-isolation "git+https://github.com/ServiceNow/Fast-LLM.git#egg=fast_llm[CORE,OPTIONAL,DEV]"
+    ```
+
+    You can verify the installation by running:
+
+    ```bash
+    python -c "import flash_attn; print(flash_attn.__version__)"
+    ```
+
+    and
+
+    ```bash
+    python -c "import fast_llm; print(fast_llm.__version__)"
+    ```
+
+    At this point, you should be ready to run Fast-LLM on your local environment.
+
+    Before we continue, let's create folders to store our input data and output results:
+
+    ```bash
+    mkdir /mnt/inputs /mnt/results
+    ```
+
+    If this location isn't writable, you can create the folders in your home directory:
+
+    ```bash
+    mkdir ~/inputs ~/results
+    ```
+
+    Make sure to update the paths in the following commands accordingly.
+
+=== "Slurm"
+
+    You've chosen Docker-enabled [Slurm](https://slurm.schedmd.com/) for this tutorial. Slurm will pull the `ghcr.io/servicenow/fast-llm:latest` Docker image to train the model. Just make sure there's a shared file system for both input data and output results. We'll assume your home directory is accessible across all nodes.
+
+    Let's create folders to store our input data and output results in the shared home directory:
+
+    ```bash
+    mkdir ~/inputs ~/results
+    ```
+
+=== "Kubernetes"
+
+    You selected [Kubernetes](https://kubernetes.io/) with [KubeFlow](https://www.kubeflow.org/) for this tutorial. We will use a `PyTorchJob` resource to train our model with the `ghcr.io/servicenow/fast-llm:latest` Docker image and store our input data and output results in shared [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVCs).
+
+    Let's now create two PVCs named `pvc-fast-llm-inputs` and `pvc-fast-llm-results` to store our input data and output results, respectively.
+
+    Create a file named `pvc-fast-llm-inputs.yaml` with the following content:
+
+    ```yaml
+    # Persistent volume claim for Fast-LLM inputs
+    apiVersion: "v1"
+    kind: "PersistentVolumeClaim"
+    metadata:
+      name: "pvc-fast-llm-inputs"
+    spec:
+      storageClassName: local-path
+      accessModes:
+        - ReadWriteMany
+      resources:
+        requests:
+          storage: 1000Gi
+    ```
+
+    Then, create a second file named `pvc-fast-llm-results.yaml` with these contents:
+
+    ```yaml
+    # Persistent volume claim for Fast-LLM results
+    apiVersion: "v1"
+    kind: "PersistentVolumeClaim"
+    metadata:
+      name: "pvc-fast-llm-results"
+    spec:
+      storageClassName: local-path
+      accessModes:
+        - ReadWriteMany
+      resources:
+        requests:
+          storage: 1000Gi
+    ```
+
+    Apply both PVCs to your Kubernetes cluster:
+
+    ```bash
+    kubectl apply -f pvc-fast-llm-inputs.yaml
+    kubectl apply -f pvc-fast-llm-results.yaml
+    ```
+
+    We also need to create a temporary pod that mounts the inputs PVC and allows us to copy files there.
Here's a basic YAML configuration for such a pod: + + ```yaml + # Temporary pod to manage input data and results + apiVersion: v1 + kind: Pod + metadata: + name: fast-llm-data-management + spec: + containers: + - name: fast-llm-data-management-container + image: ubuntu + command: ["sleep", "infinity"] + volumeMounts: + - mountPath: /mnt/inputs + name: inputs + - mountPath: /mnt/results + name: results + volumes: + - name: inputs + persistentVolumeClaim: + claimName: pvc-fast-llm-inputs + - name: results + persistentVolumeClaim: + claimName: pvc-fast-llm-results + ``` + + Save this configuration to a file named `pod-fast-llm-data-management.yaml`. Next, apply this configuration to your Kubernetes cluster to create the pod: + + ```bash + kubectl apply -f pod-fast-llm-data-management.yaml + ``` + + The pod will allow you to copy files to and from the inputs and results PVCs. You can access it by running: + + ```bash + kubectl exec -it fast-llm-data-management -- /bin/bash + ``` + + !!! note "Cleaning up unused resources" + + At the very end of this guide, you should clean up the data management pod to avoid unnecessary resource consumption by running + + ```bash + kubectl delete pod fast-llm-data-management + ``` + + Don't run this just yet, though. You'll need the pod throughout the guide. + +## πŸ€– Step 2: Choose Your Model + +Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, you can choose from two models: + +=== "SmolLM2-135M" + + SmolLM2 is a smaller, more manageable model with 135M parameters. It is similar to GPT-2 but with a few improvements. A perfect choice for testing and getting familiar with Fast-LLM. We'll grab the model from Huggingface Hub and save it to our inputs folder. + + === "Docker" + + ```bash + git lfs install + git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M ~/inputs/SmolLM2-135M + ``` + + === "Local Environment" + + ```bash + git lfs install + git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M /mnt/inputs/SmolLM2-135M + ``` + + === "Slurm" + + ```bash + git lfs install + git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M ~/inputs/SmolLM2-135M + ``` + + === "Kubernetes" + + ```bash + kubectl exec -it fast-llm-data-management -- /bin/bash + git lfs install + git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M /mnt/inputs/SmolLM2-135M + ``` + +=== "Llama-3.2-1B" + + Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Huggingface Hub and save it to our inputs folder. + + !!! note "Access Required" + + Meta gates access to their Llama models. You need to request access to the model from Meta before you can download it at https://huggingface.co/meta-llama/Llama-3.2-1B. 
+
+    === "Docker"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs/Llama-3.2-1B
+        ```
+
+    === "Local Environment"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs/Llama-3.2-1B
+        ```
+
+    === "Slurm"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs/Llama-3.2-1B
+        ```
+
+    === "Kubernetes"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        kubectl exec -it fast-llm-data-management -- /bin/bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs/Llama-3.2-1B
+        ```
+
+!!! tip "Model Size Matters"
+
+    Smaller models like SmolLM2-135M will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger Llama-3.2-1B a shot!
+
+## πŸ“š Step 3: Prepare the Training Data
+
+For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
+
+Create a configuration file for the dataset preparation. Copy the following content:
+
+=== "SmolLM2-135M"
+
+    ```yaml
+    output_path: /mnt/inputs/openwebtext-SmolLM2
+
+    loading_workers: 4
+    tokenize_workers: 4
+    saving_workers: 4
+
+    dataset:
+      path: openwebtext
+      trust_remote_code: true
+
+    tokenizer:
+      path: /mnt/inputs/SmolLM2-135M/tokenizer.json
+
+    remove_downloads: false
+    ```
+
+=== "Llama-3.2-1B"
+
+    ```yaml
+    output_path: /mnt/inputs/openwebtext-Llama
+
+    loading_workers: 4
+    tokenize_workers: 4
+    saving_workers: 4
+
+    dataset:
+      path: openwebtext
+      trust_remote_code: true
+
+    tokenizer:
+      path: /mnt/inputs/Llama-3.2-1B/tokenizer.json
+
+    remove_downloads: false
+    ```
+
+and save it as `prepare-config.yaml` in your inputs folder.
+
+Fast-LLM ships with a `prepare` command that'll download and preprocess the dataset for you. Run it like this:
+
+=== "Docker"
+
+    ```bash
+    docker run -it --rm \
+        -v ~/inputs:/mnt/inputs \
+        ghcr.io/servicenow/fast-llm:latest \
+        fast-llm prepare gpt_memmap --config /mnt/inputs/prepare-config.yaml
+    ```
+
+=== "Local Environment"
+
+    ```bash
+    fast-llm prepare gpt_memmap --config /mnt/inputs/prepare-config.yaml
+    ```
+
+=== "Slurm"
+
+    ```bash
+    sbatch <
-
-!!! warning "Don't Forget the Tokenizer"
-
-    Make sure to add a tokenizer file and its configuration to the output directory, since `convert_model.py` does not include these files in the conversion.
-
-
-
-
-You can then load and use the converted model
-[as you would with any Transformers model](https://huggingface.co/docs/transformers/index).
-For example: -```python -import torch -from transformers import AutoModelForCausalLM - -import transformers - -model = AutoModelForCausalLM.from_pretrained(converted_dir).to(device="cuda") -x = torch.randint(0, 32000, (1, 1024)) -y = model(x) -``` diff --git a/docs/tutorial/getting_started.md b/docs/tutorial/getting_started.md deleted file mode 100644 index c5f6bd66..00000000 --- a/docs/tutorial/getting_started.md +++ /dev/null @@ -1,75 +0,0 @@ -# Getting Started - - - -## Build the image - -!!! warning - - This guide is not yet working. - -The preferred way to run [Fast-LLM](https://github.com/ServiceNow/Fast-LLM) is through a docker image built with the provided Dockerfile. -For example, from a terminal running on a GPU node: - -```bash -git clone git@github.com:ServiceNow/Fast-LLM.git -cd Fast-LLM -docker build -t my_fast_llm_image . -docker run --rm -it --gpus all --net=host --ipc=host my_fast_llm_image bash -``` - -## First examples - -All training runs are launched throught the entry point [pretrain_fast_llm.py](https://github.com/ServiceNow/Fast-LLM/blob/main/pretrain_fast_llm.py). -We can run a minimalistic training example with: -```bash -python3 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random -``` -This will launch a short single-GPU training from scratch of a 180 M parameter model on a randomly generated dataset. - -To run distributed training, we run our training script through [torchrun](https://pytorch.org/docs/stable/elastic/run.html), -the PyTorch distributed launcher. For example, on 8 GPUs: -```bash -torchrun --nproc-per-node=8 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random -``` -Note that by default, Fast-LLM parallelizes over samples (data-parallel), so the number of GPUs should divide the batch size. - -Multi-node training also uses torchrun, and requires the same command to be run on each node, -with the additional specification of a rendez-vous endpoint, i.e., the address of one of the nodes. -For example, on four nodes: -```bash -torchrun --nproc-per-node=8 --nnodes=4 --rdzv-backend=c10d --rdzv-endpoint=$HOST_NODE_ADDR pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random -``` - -See the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html) for more details. -Note that if you are using cloud or managed hardware, there Now tutorial](servicenow.md) -may be a simpler automated method to launch multi-node jobs. -Please refer to your provider for more details. -The ServiceNow-specific method may be found in the [Service - -## More on training arguments - - - -The training script supports hundreds of arguments, though most of them are optional and/or have sensible defaults. -We already saw three arguments above, and we will see many important ones in this tutorial. - -At the beginning of training, Fast-LLM displays a list of arguments and their values: -``` ------------------------- arguments ------------------------ - activation_type ................................. gelu - adam_beta1 ...................................... 0.9 - adam_beta2 ...................................... 0.999 - adam_eps ........................................ 1e-08 - add_linear_biases ............................... True - attention_dropout ............................... 0.0 - batch_size ...................................... 1 - [...] 
--------------------- end of arguments --------------------- -``` -All of these arguments can be set as arguments of `pretrain_fast_llm.py`, in the form `--[name]=[value]`, -provided the values have the expected data type, and in some case satisfy extra constraints. -For example, we may enable attention dropout with `--attention_dropout=0.1`. -Note that booleans are set as integers (ex. `--add_linear_biases=0` to disable biases), -and that `None` cannot be represented. -Please refer to each parameter's definition for more details. diff --git a/docs/tutorial/index.md b/docs/tutorial/index.md deleted file mode 100644 index 00f38a7f..00000000 --- a/docs/tutorial/index.md +++ /dev/null @@ -1,28 +0,0 @@ -# Tutorial - - -This guide will teach how to pretrain and/or extend pretraining of language models such as Mistral-7B with Fast-LLM on multiple GPU nodes. -Such training requires a careful selection and optimization of: -- The training hardware: GPU node specs, count and interconnect. -- The model architecture: layer types, hidden sizes, activations, etc. -- The training dataset and its sampling. -- The training parameters: optimizer, learning rate schedule, training duration, etc. -- The training performance optimizations: distributed layout, activation recomputation, etc. - -When training a model with Fast-LLM (and other training libraries), -we generally assume the first four points to be predetermined as they are unrelated to the training framework, -and focus on the last one, i.e., we optimize a fixed training scheme for throughput. -(However, in practice the batch size may be adjusted together with the distributed layout, -which in turn affects the training schedule.) - -In this tutorial, we follow the extended pretraining for Mistral-7B over a corpus of 500 billion tokens using 16 DGX nodes, -each equipped with 8 A100 or H100 GPUs (totalling 128 GPUs). -We also explore some alternative settings such as training from scratch and the Mixtral-8x7B model. - - -- [Getting started](getting_started.md): Get started with Fast-LLM, set up and run a first training configuration. -- [Load Mistral-7B](prepare_mistral.md): Define the model architecture, download a checkpoint from the Huggingface Hub and load it in Fast-LLM. -- [Prepare and load the dataset](prepare_data.md): Prepare and configure the dataset. -- [Prepare the training configuration](prepare_training.md): Configure the optimizer, schedule, distributed layout, etc. -- [Launch and monitor training](launch_training.md): Launch training, configure and view experiment outputs. -- [Convert to Hugging Face](convert_to_huggingface.md): Convert to Hugging Face format and upload it to the Hugging Face model hub. diff --git a/docs/tutorial/launch_training.md b/docs/tutorial/launch_training.md deleted file mode 100644 index 823d2c97..00000000 --- a/docs/tutorial/launch_training.md +++ /dev/null @@ -1,93 +0,0 @@ -# Launch and monitor training - -## Requirements - -At this point, you should already have: - -- Access to a cluster with 16 DGX nodes with 8x A100/H100-80GB GPUs ([Or at least 4 GPUs](prepare_training.md)), -connected through an Infiniband (preferred) and/or Ethernet interconnect, -and sharing a common fast storage. -- A [docker image](getting_started.md) for Fast-LLM, available on all nodes. -- A local copy of the [Mistral weights](prepare_mistral.md) on the common storage -- A [preprocessed dataset](prepare_data.md) in json format on the common storage. -- (Optional) A Wandb account and API key. 
- - -## Launching the experiment - -To launch the experiment, we perform the following on each node, -or use a cluster-specific tool to automate the process: -1. Launch a docker container running our docker image, -ensuring access to all necessary hardware (GPUs, interconnects, etc.), -and mounting the pretrained weights, dataset and an experiment directory. - ```bash - docker run --rm -it --gpus all --net=host --ipc=host [-v ...] my_fast_llm_image bash - ``` -2. Note the mounted paths and host address: - ```bash - export PRETRAINED_MISTRAL_PATH=... - export JSON_DATA_PATH=... - export EXP_BASE_DIR=... - export HOST_NODE_ADDR=... - ``` -3. Set up the experiment configuration as described in the previous sections: - ```bash - - export ARCHITECTURE_ARGS_MISTRAL_PRETRAINED="\ - --pretrained_checkpoint_type=huggingface \ - --pretrained_checkpoint_path=$PRETRAINED_MISTRAL_PATH \ - " - - export MODEL_ARGS_MISTRAL_PRETRAINED="\ - $ARCHITECTURE_ARGS_MISTRAL_PRETRAINED \ - --window_size=4096 \ - " - - export DATA_ARGS="\ - --split=9998,2,0 \ - --dataset_source=file \ - --data_path=$JSON_DATA_PATH \ - " - - export TRAINING_ARGS="\ - --batch_size=128 \ - --sequence_length=8192 \ - --train_iters=500000 \ - --weight_decay=0.1 \ - --adam_beta1=0.9 \ - --adam_beta2=0.95 \ - --clip_grad=1.0 \ - --lr=0.0001 \ - --lr_warmup_iters=1000 \ - --lr_decay_style=cosine \ - --lr_decay_iters=500000 \ - --min_lr=0.000003 \ - " - - export PERFORMANCE_ARGS="\ - --training_dtype=bf16 \ - --num_workers=8 \ - " - - export MONITORING_ARGS="\ - --experiment_dir=$EXP_BASE_DIR \ - --validation_iters=25 \ - --validation_interval=1000 \ - --max_checkpoints=5 \ - --export_interval=25000 \ - --log_interval=10 \ - --log_offset=0 \ - --checkpoint_interval=500 \ - " - ``` -4. Launch the experiment: - ```bash - torchrun --nproc-per-node=8 --nnodes=16 --rdzv-backend=c10d --rdzv-endpoint=$HOST_NODE_ADDR pretrain_fast_llm.py \ - $MODEL_ARGS_MISTRAL_PRETRAINED $DATA_ARGS $TRAINING_ARGS $PERFORMANCE_ARGS $MONITORING_ARGS - ``` - -## Monitoring the experiment - -After launching the experiment, you may observe the progress through either stdout, -or the log file at `[EXP_BASE_DIR]/runs/0/logs/logs_rank_000.txt`. -If you set up Wandb logging, progress will also be reported there. diff --git a/docs/tutorial/prepare_data.md b/docs/tutorial/prepare_data.md deleted file mode 100644 index 28833e83..00000000 --- a/docs/tutorial/prepare_data.md +++ /dev/null @@ -1,100 +0,0 @@ -# Training Data Preparation - - - -## Prepare datasets - - - -The data processing of Fast-LLM is designed to closely match that of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM). -In particular, it requires datasets to be converted to the Megatron-LM binary format. -Please refer to [this guide](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#data-preprocessing) -for details on how to prepare the dataset(s). - -At the end of this process, each dataset should have a consist of a binary file `$DATA_PREFIX_[i].bin` and an index file `$DATA_PREFIX_[i].idx` - -## List configuration - -Datasets may be configured via a simple string in the `--data_path` argument. -(Again, in the exact same format as with Megatron-LM). -For a single dataset, we only need to specify its prefix: -```bash -export DATA_ARGS_SINGLE="\ ---split=9998,2,0 \ ---dataset_source=list \ ---data_path=$DATA_PREFIX_0 \ -" -``` -Note that we also specify a train/validation/test split for the dataset. -Fow multiple datasets, we specify the prefixes together with relative dataset sampling probabilities. 
-For examples -```bash -export DATA_ARGS_MULTIPLE="\ ---split=9998,2,0 \ ---dataset_source=list \ ---data_path=\"0.3 $DATA_PREFIX_0 0.5 $DATA_PREFIX_1 0.2 $DATA_PREFIX_2\" \ -" -``` - -!!! warning - - The same dataset split is used for every dataset. - This may cause problems for extremely small datasets, which we recommend avoiding. - (If needed, we suggest concatenating small datasets into larger ones.) - -!!! warning - - Make sure to dedicate enough data for validation and/or testing, and adjust the split according to you dataset. - Our setup assumes a dataset of 500 billion tokens, and requires 26 million tokens for each validation, - so allocating 0.02% of the total data (100 million tokens) - ensures sufficient data without excessively reducing the training set size. - - -## Json configuration - -While the list configuration is sufficient for a small number of datasets, -it becomes impractical when there are many of them. -For that purpose, Fast-LLM allows configuring a dataset from an external json file. - -A common use case concerns large datasets with hundreds of billions of tokens, -which need to be split into multiple ones to keep the file size reasonable. -We want to sample each dataset as if it was not split, i.e. with probability proportional to its document count. -In that case, the json configuration file can be generated automatically using the `concatenate_dataset.py` script: -```bash -python3 tools/concatenate_dataset.py --directory=$DATASET_DIR --output_name=$JSON_DATA_PATH -" -``` -This script will recursively scan `$DATASET_DIR` for datasets (`.idx` files), -and create a json dataset configuration at `$JSON_DATA_PATH` with the appropriate dataset prefixes and probabilities. -The resulting json file can be used to configure the datasets: -```bash -export DATA_ARGS="\ ---split=9998,2,0 \ ---dataset_source=file \ ---data_path=$JSON_DATA_PATH \ -" -``` - -??? question "More on the json dataset file" - - The json dataset file is a simple structure for holding the data prefixes and probabilities, - to avoid writing them explicitly in the Fast-LLM configuration. - It may be created manually or through a script such as `concatenate_dataset.py` - It may also contain metadata about the dataset contents, for example the total number of tokens and documents. - The file should be structured as: - ```json - { - "datasets": [ - { - "prefix": $RELATIVE_DATA_PREFIX_0" - "weight": 0.3 - "num_documents": 12345, - "num_tokens": 987654321, - ... - }, - ... - ] - } - ``` - Note that in the json format, paths are relative to the directory containing the json file - instead of the current working directory. diff --git a/docs/tutorial/prepare_mistral.md b/docs/tutorial/prepare_mistral.md deleted file mode 100644 index acfa391f..00000000 --- a/docs/tutorial/prepare_mistral.md +++ /dev/null @@ -1,102 +0,0 @@ -# Load Mistral-7B - -## Download pretrained weights - -Since we are interested in extending the pretraining of Mistral-7B, the first step is to obtain the pretrained weights. -We do so by downloading them from the [Huggingface Hub](https://huggingface.co/mistralai/Mistral-7B-v0.1). -This requires: - -- Git lfs (`git lfs install`). -- An account for the Huggingface Hub, together with an [access token](https://huggingface.co/docs/hub/security-tokens). -- Permission to use [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), obtained by accepting the terms and conditions. - -Then, clone the repository to download the weights (use the access token as password). 
-```bash -git clone https://huggingface.co/mistralai/Mistral-7B-v0.1 $PRETRAINED_CHECKPOINT_PATH -``` - - -## Load the model in Fast-LLM - -Fast-LLM may load the model architecture and pretrained weights of supported Huggingface models directly at the beginning of training. -To do so, we simply specify the pretrained checkpoint format and location, -which overrides the model architecture with Mistral-7B. -```bash -export ARCHITECTURE_ARGS_MISTRAL_PRETRAINED="\ ---pretrained_checkpoint_type=huggingface \ ---pretrained_checkpoint_path=$PRETRAINED_MISTRAL_PATH \ -" -``` - -To obtain the full model configuration, we also need to set the non-architecture parameters, -which are not imported during conversion. - -```bash -export MODEL_ARGS_MISTRAL_PRETRAINED="\ -$ARCHITECTURE_ARGS_MISTRAL_PRETRAINED \ ---window_size=4096 \ -" -``` - -!!! warning - - Make sure to check which model parameters are part of the architecture and which ones are not, - and set all required non-architecture parameters explicitly. - -!!! warning - - Make sure the downloaded checkpoint is accessible to every worker, and adjust the path as needed. - - -## (Optional) Train from scratch - -If we want to train a Mistral-7B model from scratch, we may still load the architecture from the Huggingface repo: -```bash -export ARCHITECTURE_ARGS_MISTRAL_FROM_SCRATCH="\ ---pretrained_checkpoint_type=huggingface \ ---pretrained_checkpoint_path=$PRETRAINED_CHECKPOINT_PATH \ ---load_pretrained_weights=0 \ -" -``` - -Alternatively, we may specify the architecture explicitly, which makes it easier to adjust the parameters. -```bash -export ARCHITECTURE_ARGS_MISTRAL="\ ---num_layers=32 \ ---hidden_size=4096 \ ---vocab_size=32000 \ ---num_attention_heads=32 \ ---head_groups=8 \ ---add_linear_biases=0 \ ---ffn_hidden_size=14336 \ ---kv_channels=128 \ ---use_rotary_embeddings=1 \ ---rotary_embedding_scale=-9.210340371976184 \ ---gated=1 \ ---activation_type=silu \ ---normalization_type=rms_norm \ ---tie_word_embeddings=0 \ -" -``` - -Please refer to the trainer config for additional extended pretraining options. - - -## (Optional) Train Mixtral-8x7B - - - -We may train Mixtral-8x7B instead, which simply requires pointing to a different checkpoint: - -```bash -git clone https://huggingface.co/mistralai/Mistral-7B-v0.1Mixtral-8x7B-v0.1 $PRETRAINED_CHECKPOINT_PATH -``` -Other than a small memory optimization, this tutorial can be run as-is with Mixtral-8x7B. -The architecture is a slight vatiation of Mistral-7B: -```bash -export ARCHITECTURE_ARGS_MIXTRAL="\ -$ARCHITECTURE_ARGS_MISTRAL \ ---num_experts=8 \ ---num_experts_per_token=2 \ -" -``` diff --git a/docs/tutorial/prepare_training.md b/docs/tutorial/prepare_training.md deleted file mode 100644 index dda8f74a..00000000 --- a/docs/tutorial/prepare_training.md +++ /dev/null @@ -1,172 +0,0 @@ -# Prepare the training configuration - -# Training parameters - -Our example training scheme is as follows: -1. We train over 500 K iteration, each made of 128 samples of 8192 tokens, for a total of 524 B training tokens. -2. We use the Adam optimizer with weight decay (Adamw), and gradient clipping. -3. We warm up the learning rate for the first 1000 steps, then use cosine decay from 1e-4 to 3e-6. 
- -This translates into the following Fast-LLM configuration: -```bash -export TRAINING_ARGS="\ ---batch_size=128 \ ---sequence_length=8192 \ ---train_iters=500000 \ ---weight_decay=0.1 \ ---adam_beta1=0.9 \ ---adam_beta2=0.95 \ ---clip_grad=1.0 \ ---lr=0.0001 \ ---lr_warmup_iters=1000 \ ---lr_decay_style=cosine \ ---lr_decay_iters=500000 \ ---min_lr=0.000003 \ -" -``` - -# Performance parameters - -Our training setup is simple enough that the default distributed configuration -(data parallel with [ZeRO stage 1](https://www.deepspeed.ai/tutorials/zero/)) -is sufficient for a near-optimal training throughput of around 9000 tokens/s/GPU on H100 GPUs (440 tflops/GPU). -We only need to specify the training dtype and the number of data loader workers. -```bash -export PERFORMANCE_ARGS="\ ---training_dtype=bf16 \ ---num_workers=8 \ -" -``` - -Note that this configuration requires exactly 16 nodes. -It may be adjusted run on fewer than 16 nodes, -by using gradient accumulation to keep the micro-batch size constant and adding some memory optimizations. -We suggest the following configuration for 4 to 64 GPUs (seet details the in next section): -```bash -export PERFORMANCE_ARGS_SMALL_CLUSTER="\ -$PERFORMANCE_ARGS \ ---micro_batch_size=1 \ ---zero_stage=2 \ -" -``` - -# (Optional) More on Mistral performance optimization - -The performance optimization of Mistral at the configuration level -is mainly determined through the following guidelines: - -- **Use larger micro-batches**: The GPU runs more efficiently with larger kernels, -so we want the micro-batches to be as large as allowed by memory and other constraints. -Our configuration requires 36 GiB of activation memory, -so a micro-batch or 8192 tokens per GPU is a reasonable choice. -A value of 16384 tokens per GPU is technically feasible, -but would require aggressive state memory optimizations and a higher batch size. -2 - **Reduce model parallelism**: Model parallelism (tensor or pipeline) comes with a large overhead, -so we should avoid or limit it whenever possible. -For Mistral, no model parallelism is needed. -3 - **Optimize the memory usage**: Additional memory optimizations are available to enable configurations that would -otherwise not be possible. We already saw the most important one, the ZeRO stage (`--zero_stage` see note below). -An additional one is the recomputation of the MLP activations `--mlp_recompute_level` , -which significantly lower the activation memory usage, for a small (`activation`) or moderate (`full`) overhead. -Note that Fast-LLM does not implement activation recomputation for the entire transformer layer, -as it comes with a large overhead (~33%) and it can be avoided in (almost) all practical scenario. - - -??? note "More on ZeRO stages" - - Fast-LLM provides a custom implementation of the training state partitioning - first described in the [ZeRO (Zero Redundancy Optimizer) paper](https://arxiv.org/abs/1910.02054). - The method comes in three "stages", which progressively reduce the memory footprint from the training state: - - - **Stage 1**: Partition the optimizer state and its update across the data-parallel GPUs. - This stage reduces the state memory by around 3x (for mixed precision training with full-precision gradients), - while simultanuously speeding up training through a faster weight update. - - - **Stage 2**: Extend partitioning to the (reduced) gradients. 
- This stage reduces the state memory by a further 3x, - but may come with a minor overhead (depending on the implementation), - and may require multiple reductions with gradient accumulation. - - - **Stage 3**: Extend partitioning to the weights. - This stage drops the vast majority of the remaining state memory, - but requires extra network communication. - - Fast-LLM implements all three of these stages, selected through the `--zero_stage` argument. - There is no option to disable ZeRO entirely, as it would be strictly worse in terms of performance. - In general, training configurations should use the lowest value allowed by other memory constraints. - -??? note "Recompute Level for MLPs" - - The MLP is the largest contributor to a transformer's activation memory (with Flash Attention), - so recomputing its activations is a natural way to save memory. - Fast-LLM offers three MLP recomputaton modes, set throught the `--mlp_recompute_level` argument: - - - **`none`** (default): All MLP activations are kept, - allowing for the highest throughput at the highest memory cost. - - - **`activation`**: The MLP activation layer output (gelu, silu, etc.) is dropped and recomputed in the backward pass. - This saves on activation memory (~20% for Mistral) with minimal impact on throughput. - - - **`full`**: Both the first dense layer and activation layer outputs are dropped and recomputed. - This saves more activation memory (~60% for Mistral), but has a noticeable impact on throughput . - - For quantitative comparison, here are benchmarks for Mistral (using 4x A100 GPUs): - - | Recompute Level | Act. Memory (MiB) | Tokens/s/GPU | Model TFLOP/s/GPU | - |-----------------|-------------------|--------------|---------------| - | `none` | 36515 | 4234.09 | 202.88 | - | `activation` | 29346 | 4218.63 | 202.14 | - | `full` | 15010 | 3804.49 | 182.29 | - - -# Monitoring and persistence parameters - -Finally, we set up experiment monitoring and persistence -```bash -export MONITORING_ARGS="\ ---experiment_dir=$EXP_BASE_DIR \ ---validation_iters=25 \ ---validation_interval=1000 \ ---max_checkpoints=5 \ ---export_interval=25000 \ ---log_interval=10 \ ---log_offset=0 \ ---checkpoint_interval=500 \ -" -``` -This setup includes: -- Creation of an experiment directory at `$EXP_BASE_DIR` to store checkpoints, logs, data cache and other artifacts. -- Validation for 25 steps every 1000 steps -- Logging of losses, metrics and other relevant quantities every 10 steps (from rank 0), - both to stdout and the log file. -- Saving of a temporary checkpoint every 500 steps, and of a permanent checkpoint every 25000 steps. - - -??? note "More on Fast-LLM checkpointing" - - Fast-LLM provides two types of checkpoints: - - - `checkpoint`: temporary checkpoint saved at `[--experiment_dir]/checkpoints/[iter]`, - to reload the experiment in case of a planned or unexpected shutdown. - Only the `--max_checkpoints` most recent ones are kept to limit disk usage. - Note that saving a checkpoint with Fast-LLM is relatively fast so can (and should) be done frequently. - - `export`: permanent checkpoint saved at `[--experiment_dir]/export/[iter]`. - This checkpoint type is typically intended for long-term storage, benchmarking, inference, etc. - It should be saved less often to limit disk usage. - - -# (Optional) Set up wandb - -Fast-LLM also support monitoring through [Weights and Biases](https://wandb.ai/). -This requires a valid API key, -passed through an environment variable rather than an explicit argument for security reasons. 
-It can be either contained in `$WANDB_API_KEY` or in a plain text file found at `$WANDB_API_KEY_PATH`. -Then, we set the Wandb username, project and version (Wandb group). -```bash -export WANDB_ARGS="\ ---wandb_entity_name=$WANDB_ENTITY_NAME \ ---wandb_project_name=$PROJECT_NAME \ ---wandb_group_name=$PROJECT_VERSION \ -" -``` -The Wandb run will be set as the directory name of `$EXP_BASE_DIR`, or can be overriden through `--experiment_name`. diff --git a/examples/fast-llm.pytorchjob.yaml b/examples/fast-llm.pytorchjob.yaml index 9decff91..13a7a4df 100644 --- a/examples/fast-llm.pytorchjob.yaml +++ b/examples/fast-llm.pytorchjob.yaml @@ -17,7 +17,7 @@ spec: effect: NoSchedule containers: - name: pytorch - image: servicenowdocker/fast-llm:latest + image: ghcr.io/servicenow/fast-llm:latest resources: limits: nvidia.com/gpu: 8 @@ -77,7 +77,7 @@ spec: effect: NoSchedule containers: - name: pytorch - image: servicenowdocker/fast-llm:latest + image: ghcr.io/servicenow/fast-llm:latest resources: limits: nvidia.com/gpu: 8 diff --git a/fast_llm/data/preparator/gpt_memmap/prepare.py b/fast_llm/data/preparator/gpt_memmap/prepare.py index c51bd4a7..fccb7945 100644 --- a/fast_llm/data/preparator/gpt_memmap/prepare.py +++ b/fast_llm/data/preparator/gpt_memmap/prepare.py @@ -34,7 +34,6 @@ def _tokenize_batch(self, batch): } def _save_shard(self, args) -> dict: - shard_idx, shard_dataset = args prefix = f"shard_{self._config.distributed.rank}_{shard_idx}" shard_output_path = self._config.output_path / prefix @@ -51,7 +50,6 @@ def _save_shard(self, args) -> dict: return dataset_dict def run(self): - # Set transformers logging verbosity transformers.logging.set_verbosity_error() @@ -159,4 +157,4 @@ def run(self): # Clean up downloaded dataset if self._config.remove_downloads and self._config.distributed.rank == 0: - shutil.rmtree(download_path) + shutil.rmtree(download_path, ignore_errors=True) diff --git a/mkdocs.yaml b/mkdocs.yaml index 061d9add..4a137fcf 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -30,7 +30,7 @@ theme: - content.code.copy # - content.code.select # - content.footnote.tooltips - # - content.tabs.link + - content.tabs.link - content.tooltips # - header.autohide # - navigation.expand @@ -138,7 +138,7 @@ markdown_extensions: - pymdownx.tilde plugins: - - blog + # - blog - mkdocstrings: default_handler: python handlers: @@ -149,30 +149,33 @@ plugins: - section-index - social: cards_layout_options: - color: #173a58 + color: "#173a58" - git-revision-date-localized: enable_creation_date: true - git-committers: repository: ServiceNow/Fast-LLM branch: main + - bibtex: + bib_file: "docs/refs.bib" nav: - - 🏠 Home: index.md - - πŸš€ Getting Started: - - πŸ“œ License: license.md - - 🍳 Tutorial: - - tutorial/index.md - - πŸš€ Getting started: tutorial/getting_started.md - - πŸ’¨ Load Mistral-7B: tutorial/prepare_mistral.md - - πŸ“Š Prepare and load the dataset: tutorial/prepare_data.md - - πŸ’¨ Prepare the training configuration: tutorial/prepare_training.md - - πŸŒͺ Launch and monitor training: tutorial/launch_training.md - - πŸ€— Convert to Huggingface: tutorial/convert_to_huggingface.md - - πŸ—‚οΈ Reference: - - reference/index.md - - πŸ§‘β€πŸ’» Developer Guide: - - developers/index.md - - πŸ› οΈ How to contribute: developers/contribute.md - - πŸ‘₯ Community: - - community/index.md - - 🫢 Feedback: community/feedback.md + - Welcome: index.md + - Get Started: + - Quick Start: quick-start.md + - Help: help.md + - Success Stories: + - StarCoder 2: success-stories/starcoder-2.md + - License: 
license.md + - Recipes: + - Data Preparation: recipes/data-preparation.md + - Train Llama 8B from scratch: recipes/train-llama-8b.md + - Continue training Llama 8B: recipes/continue-training-llama-8b.md + - Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md + - Reference: + - Configuration: reference/configuration.md + - Developers: + - Contributing: developers/contributing.md + - Style Guide: developers/style-guide.md + - Development Practices: developers/dev-practices.md + - About Us: about-us.md + - Join Us: join-us.md diff --git a/setup.cfg b/setup.cfg index 51f87ac5..5429dc91 100644 --- a/setup.cfg +++ b/setup.cfg @@ -55,6 +55,8 @@ DOCS = mkdocstrings[python] mkdocs-git-committers-plugin-2 mkdocs-git-revision-date-localized-plugin + pypandoc_binary + mkdocs-bibtex [options.entry_points] console_scripts =