
Commit b7676d1

Fixed some typos and added small details about trackio to docs (#3965)
1 parent: 515e9eb

13 files changed (+29, -29 lines changed)

docs/source/customization.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Training customization

-TRL is designed with modularity in mind so that users to be able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques. Note: Although these examples use the DPOTrainer, the customization applies to most (if not all) trainers.
+TRL is designed with modularity in mind so that users are able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques. Note: Although these examples use the DPOTrainer, the customization applies to most (if not all) trainers.

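A minimal sketch of the kind of customization those examples cover, e.g. handing `DPOTrainer` a custom optimizer; the model, dataset, and hyperparameters below are illustrative assumptions, not taken from the page:

```python
# Sketch of one customization: passing a hand-built optimizer to DPOTrainer.
# Model, dataset, and learning rate are illustrative, not quoted from the docs.
from datasets import load_dataset
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

optimizer = AdamW(model.parameters(), lr=5e-6)  # custom optimizer instead of the trainer default

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-custom-optimizer"),
    train_dataset=dataset,
    processing_class=tokenizer,
    optimizers=(optimizer, None),  # (optimizer, lr_scheduler); scheduler left to the trainer
)
trainer.train()
```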
docs/source/detoxifying_a_lm.md

Lines changed: 1 addition & 1 deletion
@@ -174,7 +174,7 @@ The evaluation script can be found [here](https://github.com/huggingface/trl/blo

 ### Discussions

-The results are quite promising, as we can see that the models are able to reduce the toxicity score of the generated text by an interesting margin. The gap is clear for `gpt-neo-2B` model but we less so for the `gpt-j-6B` model. There are several things we could try to improve the results on the largest model starting with training with larger `mini_batch_size` and probably allowing to back-propagate through more layers (i.e. use less shared layers).
+The results are quite promising, as we can see that the models are able to reduce the toxicity score of the generated text by an interesting margin. The gap is clear for `gpt-neo-2B` model but we see less so for the `gpt-j-6B` model. There are several things we could try to improve the results on the largest model starting with training with larger `mini_batch_size` and probably allowing to back-propagate through more layers (i.e. use less shared layers).

 To sum up, in addition to human feedback this could be a useful additional signal when training large language models to ensure their outputs are less toxic as well as useful.

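A rough sketch of the two tweaks floated above (a larger `mini_batch_size` and fewer layers shared with the reference model), assuming the legacy PPO API the detoxification example builds on; the model name, config fields, and numbers are placeholders, not from the doc:

```python
# Rough sketch of the two suggested tweaks; model and numbers are placeholders,
# and the PPOConfig fields follow the legacy PPO API used by the detox example.
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, create_reference_model

model = AutoModelForCausalLMWithValueHead.from_pretrained("EleutherAI/gpt-neo-125M")

# Share fewer bottom layers with the reference model, so gradients reach more layers.
ref_model = create_reference_model(model, num_shared_layers=4)

# Use larger mini-batches during the PPO optimisation phase.
config = PPOConfig(batch_size=256, mini_batch_size=32)
```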
docs/source/distributing_training.md

Lines changed: 1 addition & 1 deletion
@@ -55,6 +55,6 @@ Having one model per GPU can lead to high memory usage, which may not be feasibl

 </Tip>

-## Multi-Nodes Training
+## Multi-Node Training

 We're working on a guide for multi-node training. Stay tuned! 🚀

docs/source/how_to_train.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ To address this, we recommend focusing on two key metrics first:
 **Mean Reward**: The primary goal is to maximize the reward achieved by the model during RL training.
 **Objective KL Divergence**: KL divergence (Kullback-Leibler divergence) measures the dissimilarity between two probability distributions. In the context of RL training, we use it to quantify the difference between the current model and a reference model. Ideally, we want to keep the KL divergence between 0 and 10 to ensure the model's generated text remains close to what the reference model produces.

-However, there are more metrics that can be useful for debugging, checkout the [logging section](logging).
+However, there are more metrics that can be useful for debugging, check out the [logging section](logging).

 ## Why Do We Use a Reference Model, and What's the Purpose of KL Divergence?

@@ -26,7 +26,7 @@ To address this issue, we add a penalty to the reward function based on the KL d

 ## What Is the Concern with Negative KL Divergence?

-If you generate text by purely sampling from the model distribution things work fine in general. But when you use the `generate` method there are a few caveats because it does not always purely sample depending on the settings which can cause KL-divergence to go negative. Essentially when the active model achieves `log_p_token_active < log_p_token_ref` we get negative KL-div. This can happen in a several cases:
+If you generate text by purely sampling from the model distribution things work fine in general. But when you use the `generate` method there are a few caveats because it does not always purely sample depending on the settings which can cause KL-divergence to go negative. Essentially when the active model achieves `log_p_token_active < log_p_token_ref` we get negative KL-div. This can happen in several cases:

 - **top-k sampling**: the model can smooth out the probability distribution causing the top-k tokens having a smaller probability than those of the reference model but they still are selected
 - **min_length**: this ignores the EOS token until `min_length` is reached. thus the model can assign a very low log prob to the EOS token and very high probs to all others until min_length is reached
@@ -50,7 +50,7 @@ generation_kwargs = {
 }
 ```

-With these settings we usually don't encounter any issues. You can also experiments with other settings but if you encounter issues with negative KL-divergence try to go back to these and see if they persist.
+With these settings we usually don't encounter any issues. You can also experiment with other settings but if you encounter issues with negative KL-divergence try to go back to these and see if they persist.

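The `generation_kwargs` block this hunk ends on is not fully visible in the diff; as a hedged illustration, the advice amounts to pure-sampling settings along these lines (the exact values are assumptions, not quoted from the file):

```python
# Illustrative pure-sampling settings that avoid the negative-KL pitfalls described above.
# The exact values in how_to_train.md are elided in this diff, so treat these as assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

generation_kwargs = {
    "min_length": -1,                        # don't suppress EOS to force a minimum length
    "top_k": 0,                              # no top-k filtering
    "top_p": 1.0,                            # no nucleus filtering
    "do_sample": True,                       # sample from the full model distribution
    "pad_token_id": tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    "max_new_tokens": 32,
}

query = tokenizer("The weather today is", return_tensors="pt")
response = model.generate(**query, **generation_kwargs)
print(tokenizer.decode(response[0]))
```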
 ## How can debug your own use-case?

@@ -60,6 +60,6 @@ Debugging the RL pipeline can be challenging due to its complexity. Here are som
 - **Start small, scale later**: Training large models can be very slow and take several hours or days until you see any improvement. For debugging this is not a convenient timescale so try to use small model variants during the development phase and scale up once that works. That being said you sometimes have to be careful as small models might not have the capacity to solve a complicated task either.
 - **Start simple**: Try to start with a minimal example and build complexity from there. Your use-case might require for example a complicated reward function consisting of many different rewards - try to use one signal first and see if you can optimize that and then add more complexity after that.
 - **Inspect the generations**: It's always a good idea to inspect what the model is generating. Maybe there is a bug in your post-processing or your prompt. Due to bad settings you might cut-off generations too soon. These things are very hard to see on the metrics but very obvious if you look at the generations.
-- **Inspect the reward model**: If you reward is not improving over time maybe there's an issue with the reward model. You can look at extreme cases to see if it does what it should: e.g. in the sentiment case you can check if simple positive and negative examples really get different rewards. And you can look at the distribution of your dataset. Finally, maybe the reward is dominated by the query which the model can't affect so you might need to normalize this (e.g. reward of query+response minus reward of the query).
+- **Inspect the reward model**: If your reward is not improving over time maybe there's an issue with the reward model. You can look at extreme cases to see if it does what it should: e.g. in the sentiment case you can check if simple positive and negative examples really get different rewards. And you can look at the distribution of your dataset. Finally, maybe the reward is dominated by the query which the model can't affect so you might need to normalize this (e.g. reward of query+response minus reward of the query).

 These are just a few tips that we find helpful - if you have more useful tricks feel free to open a PR to add them as well!

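The normalization mentioned in the "Inspect the reward model" tip (reward of query+response minus reward of the query) could be sketched like this, using a sentiment classifier as a stand-in reward model:

```python
# Sketch of query-reward normalization: score query+response, subtract the score of the
# query alone, so only the part the model can influence counts. The sentiment classifier
# below is a stand-in reward model, not the one prescribed by the docs.
from transformers import pipeline

reward_pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

def positive_score(text: str) -> float:
    out = reward_pipe(text)[0]  # top label with its probability
    return out["score"] if out["label"] == "POSITIVE" else 1.0 - out["score"]

def normalized_reward(query: str, response: str) -> float:
    return positive_score(query + response) - positive_score(query)

print(normalized_reward("The movie was", " absolutely wonderful and moving."))
```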
docs/source/installation.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ Install the library with pip or [uv](https://docs.astral.sh/uv/):
 <hfoptions id="install">
 <hfoption id="uv">

-uv is a fast Rust-based Python package and project manager. Refer to [Installation](https://docs.astral.sh/uv/getting-started/installation/) for installation instructions), .
+uv is a fast Rust-based Python package and project manager. Refer to [Installation](https://docs.astral.sh/uv/getting-started/installation/) for installation instructions).

 ```bash
 uv pip install trl

docs/source/logging.md

Lines changed: 4 additions & 4 deletions
@@ -1,21 +1,21 @@
 # Logging

 As reinforcement learning algorithms are historically challenging to debug, it's important to pay careful attention to logging.
-By default, TRL trainers like [`PPOTrainer`] and [`GRPOTrainer`] save a lot of relevant information to supported experiment trackers like Weights & Biases (wandb) or TensorBoard.
+By default, TRL trainers like [`PPOTrainer`] and [`GRPOTrainer`] save a lot of relevant information to supported experiment trackers like Trackio, Weights & Biases (wandb) or TensorBoard.

 Upon initialization, pass the `report_to` argument to the respective configuration object (e.g., [`PPOConfig`] for `PPOTrainer`, or [`GRPOConfig`] for `GRPOTrainer`):

 ```python
 # For PPOTrainer
 ppo_config = PPOConfig(
     # ...,
-    report_to="wandb" # or "tensorboard"
+    report_to="trackio" # or "wandb" or "tensorboard"
 )

 # For GRPOTrainer
-grpc_config = GRPOConfig(
+grpo_config = GRPOConfig(
     # ...,
-    report_to="wandb" # or "tensorboard"
+    report_to="trackio" # or "wandb" or "tensorboard"
 )
 ```

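A minimal end-to-end sketch of reporting a GRPO run to Trackio, the tracker this commit adds to the docs; the model, dataset, and toy reward function are illustrative choices, not prescribed by the page, and Trackio itself is installed separately:

```python
# Minimal sketch of logging a GRPO run to Trackio (install it first, e.g. `pip install trackio`).
# Model, dataset, and the toy length-based reward are illustrative, not from the docs.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # toy reward: prefer completions close to 20 characters
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="grpo-trackio-demo", report_to="trackio")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```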
docs/source/paper_index.md

Lines changed: 3 additions & 3 deletions
@@ -18,7 +18,7 @@ from trl import GRPOConfig
 training_args = GRPOConfig(
     importance_sampling_level="sequence",
     loss_type="grpo",
-    beta=0.0, # GSPO set kl regularization to zero: https://github.com/volcengine/verl/pull/2775#issuecomment-3131807306
+    beta=0.0, # GSPO set KL regularization to zero: https://github.com/volcengine/verl/pull/2775#issuecomment-3131807306
     epsilon=3e-4, # GSPO paper (v2), section 5.1
     epsilon_high=4e-4, # GSPO paper (v2), section 5.1
     gradient_accumulation_steps=1,

@@ -30,7 +30,7 @@ training_args = GRPOConfig(

 **📜 Paper**: https://huggingface.co/papers/2503.14476

-The DAPO algorithm, includes 5 key components:
+The DAPO algorithm includes 5 key components:

 - Overlong Filtering
 - Clip-Higher

@@ -165,7 +165,7 @@ training_args = GRPOConfig(
     temperature=0.99,
     num_completions=8, # = num_return_sequences in the paper
     num_iterations=1, # = ppo_epochs in the paper
-    per_device_train_batch_size=4
+    per_device_train_batch_size=4,
     gradient_accumulation_steps=32,
     steps_per_generation=8, # (rollout_batch_size*num_return_sequences) / (per_device_train_batch_size*gradient_accumulation_steps)
 )

docs/source/peft_integration.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Examples of using peft with trl to finetune 8-bit models with Low Rank Adaption (LoRA)

-The notebooks and scripts in this examples show how to use Low Rank Adaptation (LoRA) to fine-tune models in a memory efficient manner. Most of PEFT methods supported in peft library but note that some PEFT methods such as Prompt tuning are not supported.
+The notebooks and scripts in these examples show how to use Low Rank Adaptation (LoRA) to fine-tune models in a memory efficient manner. Most of PEFT methods supported in peft library but note that some PEFT methods such as Prompt tuning are not supported.
 For more information on LoRA, see the [original paper](https://huggingface.co/papers/2106.09685).

 Here's an overview of the `peft`-enabled notebooks and scripts in the [trl repository](https://github.com/huggingface/trl/tree/main/examples):

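The overview itself is not part of this diff; for orientation, a minimal sketch of the 8-bit + LoRA recipe those examples implement, with an illustrative model, dataset, and LoRA hyperparameters that are assumptions rather than the ones used in the repo:

```python
# Minimal sketch: 8-bit base model + LoRA adapters via peft, trained with SFTTrainer.
# Model, dataset, and LoRA hyperparameters are illustrative, not taken from the examples.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # load the base model in 8-bit
)
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="opt-350m-lora-8bit"),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    peft_config=peft_config,  # the trainer wraps the model with LoRA adapters
)
trainer.train()
```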
docs/source/quickstart.md

Lines changed: 9 additions & 9 deletions
@@ -69,21 +69,21 @@ trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \

 ### 📚 Learn More

-- [SFT Trainer](https://huggingface.co/docs/trl/sft_trainer) - Complete SFT guide
-- [DPO Trainer](https://huggingface.co/docs/trl/dpo_trainer) - Preference alignment
-- [GRPO Trainer](https://huggingface.co/docs/trl/grpo_trainer) - Group relative policy optimization
-- [Training FAQ](https://huggingface.co/docs/trl/how_to_train) - Common questions
+- [SFT Trainer](sft_trainer) - Complete SFT guide
+- [DPO Trainer](dpo_trainer) - Preference alignment
+- [GRPO Trainer](grpo_trainer) - Group relative policy optimization
+- [Training FAQ](how_to_train) - Common questions

 ### 🚀 Scale Up

-- [Distributed Training](https://huggingface.co/docs/trl/distributing_training) - Multi-GPU setups
-- [Memory Optimization](https://huggingface.co/docs/trl/reducing_memory_usage) - Efficient training
-- [PEFT Integration](https://huggingface.co/docs/trl/peft_integration) - LoRA and QLoRA
+- [Distributed Training](distributing_training) - Multi-GPU setups
+- [Memory Optimization](reducing_memory_usage) - Efficient training
+- [PEFT Integration](peft_integration) - LoRA and QLoRA

 ### 💡 Examples

 - [Example Scripts](https://github.com/huggingface/trl/tree/main/examples) - Production-ready code
-- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials) - External guides
+- [Community Tutorials](community_tutorials) - External guides

 ## Troubleshooting

@@ -122,4 +122,4 @@ Try adjusting the learning rate:
 training_args = SFTConfig(learning_rate=2e-5) # Good starting point
 ```

-For more help, see our [Training FAQ](how_to_train.md) or open an [issue on GitHub](https://github.com/huggingface/trl/issues).
+For more help, see our [Training FAQ](how_to_train) or open an [issue on GitHub](https://github.com/huggingface/trl/issues).

docs/source/reducing_memory_usage.md

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ Packing, introduced in [Raffel et al., 2020](https://huggingface.co/papers/1910.
 <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/packing_2.png" alt="Packing" width="600"/>
 </div>

-Packing reduces padding by merging several sequences in one row when possible. We use an advanced method to be near-optimal in the way we pack the dataset. To enable packing, use `packing=True` and in the [`SFTConfig`].
+Packing reduces padding by merging several sequences in one row when possible. We use an advanced method to be near-optimal in the way we pack the dataset. To enable packing, use `packing=True` in the [`SFTConfig`].

 <Tip>

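A minimal sketch of enabling packing as described above; the model and dataset are illustrative choices, not the ones used in the page:

```python
# Minimal sketch of packing in SFT: merge short sequences into full-length rows to cut padding.
# Model and dataset are illustrative, not taken from the docs.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(output_dir="sft-packed", packing=True)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```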