Add grpo job example #3589
Conversation
Compare: 7193c51 to 7e212f5
Pull Request Overview
This PR adds a GRPO job example with trainer scripts, configs, callbacks, environment setup, and sample datasets for a medical MCQA task in AzureML.
- Adds reward functions and trainer code for reasoning tasks
- Introduces YAML/JSON configs and Docker/requirements for environment
- Includes sample datasets (50 records each) and AML setup script
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
sdk/python/jobs/grpo/src/grpo_trainer_rewards.py | Implements format and accuracy reward functions |
sdk/python/jobs/grpo/src/BldDemo_Reasoning_Train.py | Main GRPO training script with CLI arg parsing |
sdk/python/jobs/grpo/aml_setup.py | AML workspace setup for dataset, model, compute, env |
sdk/python/jobs/grpo/src/grpo_trainer_callbacks.py | Callback to save HF-transformers models in MLflow |
sdk/python/jobs/grpo/src/grpo_trainer_config.yaml | Trainer configuration for vLLM, reward weighting |
sdk/python/jobs/grpo/environment/Dockerfile | Docker image build and package installs |
sdk/python/jobs/grpo/environment/requirements.txt | Python dependencies |
sdk/python/jobs/grpo/datasets/med_mcqa/*.jsonl | Sample dataset splits for training/eval/testing |
Comments suppressed due to low confidence (4)
sdk/python/jobs/grpo/aml_setup.py:13
- The `Model` import at line 13 duplicates the earlier import from `azure.ai.ml.entities`. Remove one of the imports to prevent confusion and improve code clarity.
from mlflow.models import Model
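If both classes are genuinely needed, an alternative to removing one import is to alias them. A minimal sketch (the AML-side usage in the comment is assumed for illustration, not taken from the script):

```python
# Sketch only: keep both imports but disambiguate the names so later code is explicit.
from azure.ai.ml.entities import Model as AmlModel   # AML model asset (per the review comment)
from mlflow.models import Model as MlflowModel       # MLflow model metadata, if actually needed

# Registration code can then refer to AmlModel unambiguously, e.g.:
# model_asset = AmlModel(path="...", name="qwen2_5-7b-instruct", type="custom_model")
```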
sdk/python/jobs/grpo/aml_setup.py:100
- Fix the typo in the comment: change "falsh" to "flash".
# This job requires falsh attention and needs A100 or H100 GPUs.
sdk/python/jobs/grpo/src/grpo_trainer_rewards.py:14
- There are no tests covering the new reward functions (`format_reward` and `_medmcqa_match_fn`). Consider adding unit tests to validate correct behavior and edge cases.
def format_reward(completions, **kwargs):
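For reference, a minimal pytest-style sketch of such tests. It assumes the TRL-style reward signature shown above (one score per completion, completions given as lists of chat messages) and an assumed `<think>...</think><answer>...</answer>` output format; both assumptions would need to be checked against the PR's actual implementation:

```python
# Sketch only: the completion structure and the expected tag format are assumptions.
from grpo_trainer_rewards import format_reward


def _completion(text):
    # Helper mirroring the TRL chat-completion structure assumed above.
    return [{"role": "assistant", "content": text}]


def test_format_reward_returns_one_score_per_completion():
    completions = [
        _completion("<think>reasoning</think><answer>B</answer>"),
        _completion("no tags at all"),
    ]
    rewards = format_reward(completions)
    assert len(rewards) == len(completions)
    assert all(isinstance(r, (int, float)) for r in rewards)


def test_format_reward_prefers_well_formed_output():
    good = format_reward([_completion("<think>x</think><answer>A</answer>")])[0]
    bad = format_reward([_completion("just an answer")])[0]
    assert good >= bad
```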
sdk/python/jobs/grpo/src/BldDemo_Reasoning_Train.py:186
- The `SaveMLflowModelCallback` constructor does not accept a `preprocessor` argument but is called with it here, leading to a runtime `TypeError`. Change the argument name or update the callback signature to match.
preprocessor=tokenizer,
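A hedged sketch of the second option (updating the callback signature), assuming the callback subclasses `transformers.TrainerCallback` and simply stores the preprocessor for later use; the other constructor parameter is illustrative:

```python
# Sketch only: accept the preprocessor keyword so the existing call site
# (preprocessor=tokenizer) stops raising a TypeError. Other parameters are illustrative.
from transformers import TrainerCallback


class SaveMLflowModelCallback(TrainerCallback):
    def __init__(self, mlflow_model_save_path=None, preprocessor=None):
        self.mlflow_model_save_path = mlflow_model_save_path
        # Tokenizer (or other preprocessor) to save alongside the model artifacts.
        self.preprocessor = preprocessor
```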
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
This PR adds a GRPO example for running a policy optimization job on AzureML, including new reward functions, callbacks, dataset samples, environment specs, and setup scripts.
- Introduce reward computation and accuracy matching functions.
- Provide a full training script and AML workspace setup.
- Add sample datasets, Dockerfile, and requirements for reproducible runs.
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
sdk/python/jobs/grpo/src/grpo_trainer_rewards.py | Implemented format and accuracy reward functions |
sdk/python/jobs/grpo/src/grpo_trainer_config.yaml | Added YAML config for GRPO training parameters |
sdk/python/jobs/grpo/src/grpo_trainer_callbacks.py | Added MLflow save callback for model artifact management |
sdk/python/jobs/grpo/src/BldDemo_Reasoning_Train.py | Example training script using GRPOTrainer |
sdk/python/jobs/grpo/environment/requirements.txt | Pinned Python package dependencies |
sdk/python/jobs/grpo/environment/Dockerfile | Docker image build instructions |
sdk/python/jobs/grpo/datasets/med_mcqa/validation.jsonl | Sample validation split for medical MCQA |
sdk/python/jobs/grpo/datasets/med_mcqa/train.jsonl | Sample training split for medical MCQA |
sdk/python/jobs/grpo/datasets/med_mcqa/test.jsonl | Sample test split for medical MCQA |
sdk/python/jobs/grpo/aml_setup.py | Script to register dataset, model, compute, environment in AML workspace |
Comments suppressed due to low confidence (2)
sdk/python/jobs/grpo/src/grpo_trainer_rewards.py:14
- These new reward functions lack associated unit tests; consider adding tests to verify that `format_reward`, `_medmcqa_match_fn`, and `accuracy` produce expected outputs for a variety of inputs.
def format_reward(completions, **kwargs):
sdk/python/jobs/grpo/datasets/med_mcqa/validation.jsonl:1
- The dataset prompt contains a typo (`Murphy&;s`); it should be `Murphy's sign` for clarity and correctness.
"Murphy&;s sign is seen in?"
It would be good to add the path to the CODEOWNERS file as well.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Done.
LGTM
…d.ipynb Co-authored-by: Gayatri Penumetsa <181455625+gpenumetsa-msft@users.noreply.github.com>
…d.ipynb Co-authored-by: Gayatri Penumetsa <181455625+gpenumetsa-msft@users.noreply.github.com>
…d.ipynb Co-authored-by: Gayatri Penumetsa <181455625+gpenumetsa-msft@users.noreply.github.com>
In `launch_grpo_command_job-med-mcqa-commented.ipynb`, when describing the setup (data/model) in Section 1, add some more details. Here is the snippet you can use:
The Azure Machine Learning (AML) **setup process is encapsulated** into a script that provisions all required resources in the workspace. By the end of the setup, the AML workspace will be fully configured with the following resources:
- **Dataset** : [MedMCQA](https://medmcqa.github.io): A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. We use a modified version of the MedMCQA dataset, restricting our experiments to question/answer pairs having only a single correct answer. The modified dataset used in the demo can be found in `datasets/med_mcqa`
- **Model** : [Qwen2_5-7B-Instruct_base](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- **Compute Cluster**: STANDARD_ND96ISR_H100_V5 cluster with at least 2 nodes
- **Environment**: Designed for GRPO-specific, large-scale, distributed training and inference of reasoning models using Azure Machine Learning, TRL, DeepSpeed, vLLM, and LoRA.
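For reference, a rough sketch of what such a setup script might do with the `azure-ai-ml` v2 SDK; the resource names, local paths, and cluster name below mirror the description above but are illustrative, not the actual contents of `aml_setup.py`:

```python
# Sketch only: provisions the data asset, model asset, compute cluster, and environment
# described above. Names and paths are assumptions; see aml_setup.py for the real script.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, BuildContext, Data, Environment, Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

dataset = ml_client.data.create_or_update(
    Data(name="med_mcqa", path="datasets/med_mcqa", type="uri_folder")
)
model = ml_client.models.create_or_update(
    Model(name="Qwen2_5-7B-Instruct_base", path="models/Qwen2.5-7B-Instruct", type="custom_model")
)
compute = ml_client.compute.begin_create_or_update(
    AmlCompute(name="h100-cluster", size="STANDARD_ND96ISR_H100_V5", min_instances=0, max_instances=2)
).result()
environment = ml_client.environments.create_or_update(
    Environment(name="grpo-training-env", build=BuildContext(path="environment"))
)
```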
Done.
Reviewed and suggested changes. LGTM.