### This directory hosts an example of training an Instruct model into a Reasoning model on AML using Group Relative Policy Optimization (GRPO), an RFT technique

#### Introduction:
In this repo, you will find a notebook (launch_grpo_command_job-med-mcqa-commented.ipynb) that walks through an end-to-end process of fine-tuning the **Qwen2.5-7B-Instruct** model into a **reasoning model** using medical data on **Azure ML**. Qwen2.5-7B-Instruct is an instruction-tuned large language model developed by Alibaba Cloud, based on their Qwen2.5-7B foundation model. \
It is optimized for following human instructions across a wide range of tasks, such as question answering, code generation, and language understanding. \
In this walkthrough, you will learn how to enhance the model's reasoning capabilities using **Reinforcement Fine-Tuning (RFT)** techniques, with a focus on **GRPO** (**G**roup **R**elative **P**olicy **O**ptimization).

<img src="images/agenda.png" alt="image.png" width="1000"/>

#### About the GRPO Trainer
<div style="display: flex; align-items: flex-start; gap: 32px;">
<div style="flex: 1;">
<p>The reasoning model training process typically includes three key components:</p>
<ul>
<li><strong>Sampler</strong> – Generates multiple candidate responses from the model</li>
<li><strong>Reward Function</strong> – Evaluates and scores each response based on criteria like accuracy or structure</li>
<li><strong>Trainer</strong> – Updates the model to reinforce high-quality outputs</li>
</ul>
<p>
    In this example, the <strong>GRPO Trainer</strong> (an implementation from the TRL library) is used to train the Qwen2.5-7B-Instruct model into a reasoning model.
</p>
<br>
<p>
<strong>GRPO</strong> (<strong>G</strong>roup <strong>R</strong>elative <strong>P</strong>olicy <strong>O</strong>ptimization) is a reinforcement learning technique that:
</p>
<ul>
<li><em>Compares</em> multiple answers within a group</li>
<li><em>Rewards</em> the best-performing outputs</li>
<li><em>Penalizes</em> poor ones</li>
<li>Applies careful updates to <em>avoid sudden changes</em></li>
</ul>
</div>
<div style="flex: 1; display: flex; justify-content: center;">
<img src="images/training_loop.png" alt="Training Loop" style="max-width:100%; width: 600px;"/>
</div>
</div>
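
To make these pieces concrete, here is a minimal sketch of wiring the sampler, reward function, and trainer together with TRL's `GRPOTrainer`. The toy prompts, the placeholder brevity reward, and the hyperparameters below are assumptions for illustration only; the dataset, rewards, and job configuration actually used in this example are defined in the notebook and the supporting files described under the repo structure.

```python
# pip install trl datasets  (a GPU large enough for a 7B model is assumed)
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompts; the notebook uses a medical MCQA dataset registered in AzureML instead.
dataset = Dataset.from_dict({"prompt": [
    "What is the capital of France?",
    "What is 17 + 25?",
]})

def brevity_reward(completions, **kwargs):
    # Placeholder reward that favors shorter answers; the real rewards
    # (accuracy and format) live in grpo_trainer_rewards.py.
    return [-float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="qwen2.5-7b-grpo-sketch",
    num_generations=8,              # size of the sampled group per prompt
    per_device_train_batch_size=8,  # global batch must be divisible by num_generations
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # the sampler and the policy being updated
    reward_funcs=brevity_reward,       # scores each sampled completion
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

During training, TRL samples `num_generations` completions per prompt, scores each with the reward functions, and normalizes the scores within the group to form the advantage used for the clipped policy update.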


#### Repo structure:
- aml_setup.py: AzureML-specific code for creating the dataset, model, and environment. The workspace config is located here; update this file before running the notebook.
- grpo_trainer_callbacks.py: Code which converts the output models to MLflow model format. This conversion simplifies deployment, as AzureML automatically generates an inferencing environment and server for models in this format.
- grpo_trainer_rewards.py: Rewards for the training. These are Python functions which take an LLM response and grade it for accuracy and format (an illustrative sketch is shown below).
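
For illustration, here is a hedged sketch of what such reward functions can look like with the TRL reward-function interface. The `<think>`/`<answer>` tag convention and the `answer` dataset column are assumptions made for this sketch, not necessarily what grpo_trainer_rewards.py implements.

```python
import re

def format_reward(completions, **kwargs):
    """Illustrative format reward: 1.0 when the completion wraps its reasoning in
    <think>...</think> followed by <answer>...</answer>, else 0.0.
    (The tag convention here is an assumption for this sketch.)"""
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    """Illustrative accuracy reward: 1.0 when the text inside <answer> matches the
    reference 'answer' column passed through from the dataset, else 0.0."""
    scores = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        scores.append(1.0 if predicted == str(reference).strip() else 0.0)
    return scores
```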

Additionally, [this video](https://youtu.be/YOm_IQt3YWw?si=5nZzyy-PZyP9XFSU&t=1344) offers an overview based on the contents of this repository.