Clarify GradCache usage

mexicanamerican · Apr 19, 2024 · a122855 · a122855
1 parent 47b7fe6
commit a122855
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/README.md b/README.md
@@ -384,7 +384,7 @@ torchrun --nproc_per_node 1 \
 --attn cccc
 ```
 
-All arguments are explained in `training/arguments.py` or the [HF TrainingArguments documentation](https://hf.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) except for `nproc_per_node` which is the number of GPUs per node. For our actual training runs, we use accelerate to easily use multiple nodes and GPUs as well as slightly different settings (e.g. `--attn bbcc`). The scripts are all in `scripts/training`, for example `scripts/training/train_gritlm_8x7b.sh` was used for GritLM-8x7B. For models from the ablations, you can check their folder on the huggingface hub which contains a `training_args.bin` file with the arguments. You can also check all their arguments on the WandB: https://wandb.ai/muennighoff/gritlm.
+All arguments are explained in `training/arguments.py` or the [HF TrainingArguments documentation](https://hf.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) except for `nproc_per_node` which is the number of GPUs per node. For our actual training runs, we use accelerate to easily use multiple nodes and GPUs as well as slightly different settings (e.g. `--attn bbcc`). The scripts are all in `scripts/training`, for example `scripts/training/train_gritlm_8x7b.sh` was used for GritLM-8x7B. For models from the ablations, you can check their folder on the huggingface hub which contains a `training_args.bin` file with the arguments. You can also check all their arguments on the WandB: https://wandb.ai/muennighoff/gritlm. Also note that if using GradCache, you need to [use the one in this repository](https://github.com/ContextualAI/gritlm/tree/main/gritlm/training/GradCache).
 
 ### Evaluation