Continuation from #49 #50

Merged
103 commits merged on Dec 2, 2024

Changes from 1 commit

Commits (103)
10084f5
misc updates
tscholak Oct 22, 2024
91cd526
revamp landing page
tscholak Oct 22, 2024
df5e09d
add about us section
tscholak Oct 22, 2024
585bb89
add developers corner
tscholak Oct 22, 2024
c6c2a29
Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/i…
tscholak Oct 22, 2024
62a3b22
add docs README
tscholak Oct 22, 2024
257ba2d
improve landing page
tscholak Oct 22, 2024
a99d56d
improve landing page
tscholak Oct 22, 2024
dc8e1e6
improve landing page
tscholak Oct 22, 2024
4b47d51
add cost-efficiency comparison
tscholak Oct 22, 2024
fb7d3cc
refinements
tscholak Oct 22, 2024
91262bb
refinements
tscholak Oct 22, 2024
293d3b6
refinements
tscholak Oct 23, 2024
efde9b1
refinements
tscholak Oct 23, 2024
74b8ea7
linting
tscholak Oct 23, 2024
82adb54
rework cost efficiency comparison
tscholak Oct 23, 2024
f2beb4c
rework cost efficiency comparison
tscholak Oct 23, 2024
9a2397d
add devenv
tscholak Oct 23, 2024
e4230ea
add devenv
tscholak Oct 23, 2024
a1fa251
revamp structure
tscholak Oct 23, 2024
8903632
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Oct 23, 2024
9f887c1
add quick-start guide
tscholak Oct 26, 2024
4f40ba9
Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/i…
tscholak Oct 26, 2024
0ab5b62
add prepare-dataset script
tscholak Oct 27, 2024
79b8401
rewrite quick-start guide
tscholak Oct 27, 2024
8b6ef7b
rewrite quick-start guide
tscholak Oct 27, 2024
e04d5eb
add support for distributed data preparation
tscholak Oct 27, 2024
4f68378
rewrite quick-start guide
tscholak Oct 27, 2024
5cb0754
add help page
tscholak Oct 27, 2024
7b93dce
add help page
tscholak Oct 28, 2024
b79129c
add starcoder2 success story
tscholak Oct 28, 2024
4369066
add starcoder2 success story
tscholak Oct 28, 2024
2262c65
rewrite quick-start guide
tscholak Oct 28, 2024
60cc57a
add disclaimer
tscholak Oct 29, 2024
6a76020
add build instructions
tscholak Oct 29, 2024
13e29bc
Update README.md
hughesthe1st Oct 30, 2024
f72ac0e
Update index.md
hughesthe1st Oct 31, 2024
5a43ed5
Update index.md
hughesthe1st Oct 31, 2024
bff6506
Update index.md
hughesthe1st Oct 31, 2024
271d9d1
Update index.md
hughesthe1st Oct 31, 2024
7c9bc15
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Oct 31, 2024
8216e58
Merge branch 'tscholak/improve-docs' of github.com:ServiceNow/Fast-LL…
tscholak Oct 31, 2024
c0b8959
remove unused blog
tscholak Oct 31, 2024
431eefa
add markdownlint
tscholak Oct 31, 2024
f11bbdf
add markdownlint
tscholak Oct 31, 2024
045dcca
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Nov 3, 2024
43869e5
separate md linting for / and /docs
tscholak Nov 3, 2024
dd35c50
wip
tscholak Nov 5, 2024
f6e163f
wip
tscholak Nov 6, 2024
8acb4f3
wip
tscholak Nov 7, 2024
1ea3422
wip
tscholak Nov 7, 2024
f3eb0d6
wip
tscholak Nov 7, 2024
127cd43
wip
tscholak Nov 8, 2024
c382298
add datasets as dependency
tscholak Nov 8, 2024
8b49dfc
Merge remote-tracking branch 'origin/main' into tscholak/improve-docs
tscholak Nov 9, 2024
7304119
fix GPTMemmapDataset
tscholak Nov 9, 2024
47d453b
fix GPTMemmapDataset
tscholak Nov 9, 2024
bef3a72
add prepare-dataset command
tscholak Nov 10, 2024
0ffc75c
add prepare-dataset command
tscholak Nov 10, 2024
fda6386
add prepare-dataset command
tscholak Nov 10, 2024
acae7d9
add prepare-dataset command
tscholak Nov 10, 2024
eb7da59
add prepare-dataset command
tscholak Nov 10, 2024
b5ed2f0
add prepare-dataset command
tscholak Nov 10, 2024
c8f746a
only push latest tag for commits to main
tscholak Nov 10, 2024
0f80b76
add V100
tscholak Nov 10, 2024
e0f813c
use older generics syntax
tscholak Nov 10, 2024
b88c9d3
remove user and install Fast-LLM globally
tscholak Nov 10, 2024
4df12d9
simplify Dockerfile
tscholak Nov 11, 2024
3c5d4d9
wip
tscholak Nov 11, 2024
54af690
Merge remote-tracking branch 'origin/tscholak/prepare-dataset' into t…
tscholak Nov 11, 2024
3737bc0
improvements
tscholak Nov 11, 2024
4b6b195
add docstring
tscholak Nov 11, 2024
52a6f0b
use full imports
tscholak Nov 11, 2024
55b0b88
use full imports
tscholak Nov 11, 2024
1f975d2
use full imports
tscholak Nov 11, 2024
b665e91
don't load tokenizer during validation
tscholak Nov 11, 2024
af1439e
Merge remote-tracking branch 'origin/main' into tscholak/prepare-dataset
tscholak Nov 11, 2024
e51677f
simplify
tscholak Nov 12, 2024
1f447bb
simplify
tscholak Nov 12, 2024
fb50c13
address comments
tscholak Nov 12, 2024
33067c8
address comments
tscholak Nov 12, 2024
dbc221c
address comments
tscholak Nov 12, 2024
a2ae051
address comments
tscholak Nov 12, 2024
5107302
Merge remote-tracking branch 'origin/tscholak/prepare-dataset' into t…
tscholak Nov 12, 2024
d68ce82
fix link
tscholak Nov 12, 2024
b2675a3
resolve merge conflicts
tscholak Nov 13, 2024
2fad03c
clean up
tscholak Nov 13, 2024
223bab0
clean up
tscholak Nov 13, 2024
94008ea
wip
tscholak Nov 13, 2024
9706971
Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/i…
tscholak Nov 13, 2024
763d843
update dependencies
tscholak Nov 13, 2024
af3f1f0
wip
tscholak Nov 14, 2024
cc6ae8b
revert changes
tscholak Nov 14, 2024
e96c411
wip
tscholak Nov 14, 2024
77c7416
wip
tscholak Nov 14, 2024
08acf67
Improve quickstart guide
jlamypoirier Nov 19, 2024
4bd8fff
restore original structure
tscholak Nov 19, 2024
8031e9f
restore original structure
tscholak Nov 19, 2024
25277a1
resolve merge conflicts
tscholak Nov 27, 2024
9eaf01f
Merge branch 'tscholak/improve-docs' into torsten/improve_quickstart
tscholak Nov 27, 2024
48137d4
Merge remote-tracking branch 'origin/main' into torsten/improve_quick…
tscholak Nov 30, 2024
2ec96b4
wip
tscholak Dec 2, 2024
10deb5b
wip
tscholak Dec 2, 2024
add starcoder2 success story
tscholak committed Oct 28, 2024
commit 4369066de0a84c9a0390c0b42d9d956719308713
12 changes: 2 additions & 10 deletions docs/about-us.md
@@ -1,5 +1,7 @@
---
title: About Us
hide:
- navigation
---

Welcome to Fast-LLM! We are a global team of engineers, researchers, and AI professionals led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), dedicated to advancing large language models (LLMs) and providing the highest-performance tools for serious users. Designed with professionals, research institutions, and enterprises in mind, Fast-LLM offers the speed, scalability, and flexibility needed to train the biggest and most complex models. Our commitment to open-source ensures that you have full control over your workflows, without the limitations or compromises of commercial frameworks.
@@ -30,13 +32,3 @@ Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://ww
- [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.

Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models.

## Get Involved

Fast-LLM is an open-source project that thrives on collaboration. If you're a professional or researcher looking to contribute, there are many ways to get involved:

- **Code Contributions:** Dive into our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md) to learn how you can help improve Fast-LLM.
- **Discussion and Ideas:** Join us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions) to share your insights, ask questions, or discuss new features.
- **Documentation and Tutorials:** Help us expand our [documentation](https://servicenow.github.io/Fast-LLM/), making it even more valuable for other professionals.

If you're serious about training large language models, Fast-LLM is here to help you push the limits. We look forward to your contributions and feedback as we continue to make LLM training faster and better.
4 changes: 2 additions & 2 deletions docs/help.md
@@ -44,9 +44,9 @@ If you're the type who loves configurations and tweaking every detail, the [**Co

We've got some excellent tutorials to help you get the most out of Fast-LLM:

- [**Quick-Start Guide**](quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
- [**Quick-Start Guide**](/quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.

- [**In-Action Guides**](in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
- [**In-Action Guides**](/in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.

---

4 changes: 3 additions & 1 deletion docs/index.md
@@ -1,5 +1,7 @@
---
title: "Fast-LLM: Train Large Language Models Faster Than Ever Before"
hide:
- navigation
---

Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
@@ -77,6 +79,6 @@ Fast-LLM is more than just software, it's a community. Get involved by exploring

## Getting Started

Ready to dive in? Check out our [quickstart guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
Ready to dive in? Check out our [quick-start guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.

For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions).
65 changes: 65 additions & 0 deletions docs/join-us.md
@@ -0,0 +1,65 @@
---
title: Join Us
hide:
- navigation
---

Fast-LLM is an open-source project driven by a community of passionate contributors. Whether you're a researcher, developer, or AI enthusiast, there's a place for you to make a real impact on the future of large-scale AI training. Join us, dive in, and help shape the tools that push the boundaries of language model training. Here's how you can get involved:

---

## Stay in the Loop 📬

Want to keep up with the latest Fast-LLM updates and new opportunities to get involved? **Star** the Fast-LLM repository on GitHub and **watch** the project for notifications on new releases, discussions, and updates. This way, you'll always know what's happening, from new features to community initiatives.

[Star](https://github.com/ServiceNow/Fast-LLM/stargazers) ⭐ and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) 👀 the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features.

---

## Code Contributions 🛠️

Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [**Good First Issues**](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:

- [**Contributing**](developers/contributing) – for setup instructions and contributing guidelines
- [**Best Practices**](developers/best-practices) – for tips on writing clean, maintainable code

Here's a quick overview of the process (a command sketch follows below):

1. **Fork & Clone**: Start by forking the repo and cloning it to your machine.
2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
3. **Write Awesome Code**: Make your changes, document them, and follow our best practices.
4. **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.

[Explore the Developer's Corner for everything you need to get started!](developers)
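
As a rough sketch, the usual sequence of commands looks something like this (replace `<your-username>` with your GitHub handle; the branch name is just an example):

```bash
# Clone your fork and keep the upstream repository as a second remote
git clone git@github.com:<your-username>/Fast-LLM.git
cd Fast-LLM
git remote add upstream https://github.com/ServiceNow/Fast-LLM.git

# Do your work on a topic branch
git checkout -b my-feature
# ... edit, test, and commit your changes ...
git push -u origin my-feature
# Then open a pull request against ServiceNow/Fast-LLM on GitHub
```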

---

## Feature Requests & Ideas 💡

Got a great idea? We want to hear it! Whether it's a new feature, an enhancement, or even a moonshot idea, head over to **GitHub Discussions** to share your thoughts. Community feedback drives Fast-LLM's evolution, and your ideas can help shape the future of the project.

[Share your thoughts on GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)

---

## Testing & Feedback 🔍

Your experience with Fast-LLM is invaluable, whether you're running it in production or experimenting at home. We rely on user feedback to find bugs, optimize performance, and improve documentation. Please share any bugs, performance quirks, or gaps you spot with us on GitHub Issues. This kind of feedback strengthens the entire project.

[Report issues and share feedback on GitHub](https://github.com/ServiceNow/Fast-LLM/issues)

---

## Help & Support 🤝

Love helping others? Join our **GitHub Discussions** to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too.

---

## Spread the Word 📣

If you're excited about Fast-LLM, let the world know! Share on social media, write a blog post, or give a talk at your next tech meetup. Spreading the word helps grow our community and brings new talent into the project.

---

Let's push the boundaries of large-scale AI training together. We're thrilled to have you here. Welcome to the Fast-LLM community!
58 changes: 32 additions & 26 deletions docs/quick-start.md
@@ -26,30 +26,7 @@ Let's create folders to store our input data and output results:
mkdir ~/inputs ~/results
```

## Step 3: Preparing the Training Data

For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!

We've got a script that'll download and preprocess the dataset for you. Run it like this:

!!! info inline end "What's Happening Here?"

This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.

```bash
python tools/prepare_dataset.py \
tokenizer_path_or_name="gpt2" \
dataset_name_or_path="openwebtext" \
dataset_split="train" \
dataset_field="text" \
output_dir="inputs" \
num_processes_load=4 \
num_processes_map=4 \
num_processes_save=4 \
num_tokens_per_shard=100000000
```

## Step 4: Choose Your Model
## Step 3: Choose Your Model

Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama, Mistral, and Qwen. For this tutorial, let's train the GPT-2 model from scratch with Fully Sharded Data Parallelism (FSDP). We'll grab a configuration file from Huggingface Hub and save it as `~/inputs/config.json`:

@@ -81,6 +58,34 @@ Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama

Smaller models like GPT-2 (124M) will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger models a shot!
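
For reference, a minimal sketch of the config-download step above (assuming the standard `gpt2` repository layout on the Hugging Face Hub; the guide's own command may differ):

```bash
# Illustrative only: fetch GPT-2's config.json from the Hugging Face Hub
curl -L https://huggingface.co/gpt2/raw/main/config.json -o ~/inputs/config.json
```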

## Step 4: Preparing the Training Data

For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!

We've got a script that'll download and preprocess the dataset for you. Run it like this:

```bash
docker run -it --rm \
-v ~/inputs:/app/inputs \
ghcr.io/servicenow/fast-llm:latest \
python tools/prepare_dataset.py \
tokenizer_path_or_name="gpt2" \
dataset_name_or_path="openwebtext" \
dataset_split="train" \
output_dir="inputs" \
num_processes_load=4 \
num_processes_map=4 \
num_processes_save=4 \
num_tokens_per_shard=100000000
```

!!! info "What's Happening Here?"

This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.

!!! warning "Tokenizer Mismatch"

If you chose a different model in Step 3, make sure to adjust the `tokenizer_path_or_name` parameter to match the model's tokenizer.
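
Once the script finishes, a quick sanity check of the output directory can save debugging time later (a rough sketch; the exact file names depend on the script's output format):

```bash
# Confirm that shards were written and check their total size
ls -lh ~/inputs | head
du -sh ~/inputs
```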

## Step 5: Set Up Your Training Configuration

Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
@@ -167,10 +172,11 @@ docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest \
-v ~/results:/app/results \
-e PYTHONHASHSEED=0 \
-e WANDB_API_KEY_PATH=/app/inputs/.wandb_api_key \
torchrun --nproc_per_node=8 --no_python fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
torchrun --nproc_per_node=8 --no_python \
fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
```
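
If your machine has fewer than 8 GPUs, you can adjust `--nproc_per_node` to match. For example, a sketch for a single-GPU box, launched through the same `docker run` wrapper as above:

```bash
torchrun --nproc_per_node=1 --no_python \
fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
```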

!!! note
!!! note "Python Hash Seed"

Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes, which is crucial for parallel computations.
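
To see what this controls, here's a tiny illustration you can run locally (standard CPython behavior, not specific to Fast-LLM):

```bash
# String hashes are randomized per process unless PYTHONHASHSEED is fixed
python -c "print(hash('fast-llm'))"                   # value changes from run to run
PYTHONHASHSEED=0 python -c "print(hash('fast-llm'))"  # same value on every run
```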

13 changes: 13 additions & 0 deletions docs/refs.bib
@@ -0,0 +1,13 @@
@article{li2023starcoder,
title={{StarCoder}: may the source be with you!},
author={Li, Raymond and Allal, Loubna Ben and Zi, Yangtian and Muennighoff, Niklas and Kocetkov, Denis and Mou, Chenghao and Marone, Marc and Akiki, Christopher and Li, Jia and Chim, Jenny and others},
journal={arXiv preprint arXiv:2305.06161},
year={2023}
}

@article{lozhkov2024starcoder,
title={{StarCoder} 2 and {The Stack} v2: The next generation},
author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
journal={arXiv preprint arXiv:2402.19173},
year={2024}
}