support GRPO (#3022)
* init

* init

* update factory

* compute_loss

* fix args

* fix reward

* prepare_inputs

* fix prepare_inputs

* fix

* reward model

* remove unused columns

* fix reward funcs and training scripts

* update training script

* vllm

* vllm

* fix

* fix batch

* update trl

* fix vllm engine

* state_dict

* update

* update

* fix

* update

* update

* update

* fix ddp

* update

* update

* fix infer

* fix

* fix vllm

* fix

* update orms

* fix

* fix

* fix

* fix

* fix lint

* update

* update

* fix template

* fix vllm grpo

* fix device

* fix device

* fix device

* update

* support mllm

* doc

* fix

* update readme

* fix

* compat trl<0.15

* recover is_mp_ddp

* fix

* fix

* doc

* update

* fix

* log completions

* readme

* doc update

* update scripts

* readme

* fix grpo.py

---------

Co-authored-by: hjh <hjh@U-413PHRX2-2043.local>
Co-authored-by: hongzhang.hz <zh461848@alibaba-inc.com>
Co-authored-by: hjh <hujinghan.hjh@alibaba-inc.com>
Co-authored-by: Jintao Huang <huangjintao.hjt@alibaba-inc.com>
Co-authored-by: Jintao <huangjintao@mail.ustc.edu.cn>
6 people authored Feb 10, 2025
1 parent 47e0dd2 commit 9a77a8e
Showing 40 changed files with 1,006 additions and 94 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/lint.yaml
@@ -11,10 +11,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: '3.10'
- name: Install pre-commit hook
run: |
pip install pre-commit
4 changes: 2 additions & 2 deletions .github/workflows/publish.yaml
@@ -15,10 +15,10 @@ jobs:
#if: startsWith(github.event.ref, 'refs/tags')
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: '3.8'
python-version: '3.10'
- name: Install wheel
run: pip install wheel packaging
- name: Build ModelScope Swift
9 changes: 5 additions & 4 deletions README.md
@@ -78,7 +78,7 @@ You can contact us and communicate with us by adding our group:


## 🎉 News

- 🔥 2025.02.12: Support for the GRPO (Group Relative Policy Optimization) algorithm for LLMs and MLLMs; documentation can be found [here](docs/source_en/Instruction/GRPO.md)
- 🎁 2025.02.10: SWIFT supports fine-tuning of embedding models; please check the [training script](examples/train/embedding/train.sh)
- 🎁 2025.01.23: SWIFT supports the `sample` command, an important feature for complex CoT and RFT. Meanwhile, we provide a [Reinforced Fine-tuning script](docs/source_en/Instruction/Reinforced_Fine_tuning.md).
- 🎁 2024.12.04: **SWIFT3.0** major version update. Please check the [Release Notes and Changes](https://swift.readthedocs.io/en/latest/Instruction/ReleaseNote3.0.html).
@@ -108,13 +108,13 @@ Running Environment:

| | Range | Recommended | Notes |
| ------------ | -------------------- | ----------- | ----------------------------------------- |
| python | >=3.8 | 3.10 | |
| python | >=3.9 | 3.10 | |
| cuda | | cuda12 | No need to install if using CPU, NPU, MPS |
| torch | >=2.0 | | |
| transformers | >=4.33 | 4.48.2 | |
| transformers | >=4.33 | 4.48.3 | |
| modelscope | >=1.19 | | |
| peft | >=0.11.0,<0.15.0 | | |
| trl | >=0.13,<0.15 | 0.14.0 | RLHF |
| trl | >=0.13,<0.16 | 0.14.0 | RLHF |
| vllm | >=0.5.1 | 0.6.5 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
| deepspeed | | 0.14.5 | Training |
@@ -253,6 +253,7 @@ Supported Training Methods:
| Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) |||||
| Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
| DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
| GRPO Training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/grpo.sh) |||||
| Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) ||
| PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) ||
| KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |
32 changes: 17 additions & 15 deletions README_CN.md
@@ -74,6 +74,7 @@
- **Model Quantization**: Supports quantized export with AWQ, GPTQ, and BNB; the exported models support accelerated inference with vLLM/LmDeploy and can be trained further.

## 🎉 News
- 🔥 2025.02.12: Support for the GRPO (Group Relative Policy Optimization) training algorithm; the training documentation can be found [here](docs/source/Instruction/GRPO.md)
- 🎁 2025.02.10: SWIFT supports fine-tuning of embedding models; please check the [training script](examples/train/embedding/train.sh)
- 🎁 2025.01.23: SWIFT supports the `sample` command, an important feature for CoT and RFT. Meanwhile, we provide a [Reinforced Fine-tuning script](docs/source/Instruction/强化微调.md)
- 🎁 2024.12.04: Major **SWIFT 3.0** release. Please check the [release notes and changes](https://swift.readthedocs.io/zh-cn/latest/Instruction/ReleaseNote3.0.html)
@@ -102,13 +103,13 @@ pip install -e .

| | Range | Recommended | Notes |
| ------ | ----- | ---- | --|
| python | >=3.8 | 3.10 ||
| python | >=3.9 | 3.10 ||
| cuda | | cuda12 |No need to install if using CPU, NPU, MPS|
| torch | >=2.0 | ||
| transformers | >=4.33 | 4.48.2 ||
| transformers | >=4.33 | 4.48.3 ||
| modelscope | >=1.19 | ||
| peft | >=0.11.0,<0.15.0 | ||
| trl | >=0.13,<0.15 | 0.14.0 |RLHF|
| trl | >=0.13,<0.16 | 0.14.0 |RLHF|
| vllm | >=0.5.1 | 0.6.5 |Inference/Deployment/Evaluation|
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 |Inference/Deployment/Evaluation|
| deepspeed | | 0.14.5 |Training|
@@ -239,18 +240,19 @@ print(f'response: {resp_list[0].choices[0].message.content}')
### Training
Supported training methods:

| Method | Full-parameter | LoRA | QLoRA | Deepspeed | Multimodal |
|---------------| ------ |---------------------------------------------------------------------------------------------| ----- | ------ | --- |
| Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) |||||
| Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
| DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
| Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) ||
| PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) ||
| KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |
| CPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) ||
| SimPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) ||
| ORPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) ||
| Classification Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_5/sft.sh) ||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_vl/sft.sh) |
| Method | Full-parameter | LoRA | QLoRA | Deepspeed | Multimodal |
| ------ | ------ | ---- | ----- | ------ | ------ |
| Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) |||||
| Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
| DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
| GRPO Training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/grpo.sh) |||||
| Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) ||
| PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) ||
| KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |
| CPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) ||
| SimPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) ||
| ORPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) ||
| Classification Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_5/sft.sh) ||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_vl/sft.sh) |
| Embedding Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/embedding/train.sh) ||||


2 changes: 1 addition & 1 deletion docs/source/Customization/自定义数据集.md
@@ -69,7 +69,7 @@ query-response format:
{"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "label": true}
```

#### PPO
#### PPO & GRPO

```jsonl
{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "告诉我明天的天气"}]}
6 changes: 3 additions & 3 deletions docs/source/GetStarted/SWIFT安装.md
@@ -42,13 +42,13 @@ pip install ms-swift==2.*

| | Range | Recommended | Notes |
| ------ | ----- | ---- | --|
| python | >=3.8 | 3.10 ||
| python | >=3.9 | 3.10 ||
| cuda | | cuda12 |No need to install if using CPU, NPU, MPS|
| torch | >=2.0 | ||
| transformers | >=4.33 | 4.48.2 ||
| transformers | >=4.33 | 4.48.3 ||
| modelscope | >=1.19 | ||
| peft | >=0.11.0,<0.15.0 | ||
| trl | >=0.13,<0.15 | 0.14.0 |RLHF|
| trl | >=0.13,<0.16 | 0.14.0 |RLHF|
| vllm | >=0.5.1 | 0.6.5 |Inference/Deployment/Evaluation|
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 |Inference/Deployment/Evaluation|
| deepspeed | | 0.14.5 |Training|
81 changes: 81 additions & 0 deletions docs/source/Instruction/GRPO.md
@@ -0,0 +1,81 @@
# GRPO

Papers

[DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)
[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)

Environment setup
```bash
pip install math_verify # reward function
pip install git+https://github.com/huggingface/trl.git # trl >=0.15.0.dev0
```


Hyperparameters
- num_generations: the number of completions sampled per prompt, the G value in the paper; per_device_eval_batch_size * nproc_per_node must be divisible by this value (e.g., with 7 training processes and a per-device batch size of 2, the global batch of 14 works with num_generations = 7)
- max_completion_length: maximum length of sampled completions; defaults to 512
- reward_funcs: reward functions that score the model's generations; two rule-based functions, accuracy and format, are built in — see the swift/plugin/orm.py file for details (a minimal sketch follows this list)
- use_vllm: whether to use vLLM as the generation backend for sampling; defaults to False; recommended to speed up training
- vllm_device: device on which vLLM is deployed; defaults to `auto`, i.e., the first unused GPU; use `cuda:x` to pick a specific card.
- vllm_gpu_memory_utilization: passed through to vLLM
- vllm_max_model_len: passed through to vLLM
- reward_model: same format as model; use a reward model as the reward function; at least one of reward_model and reward_funcs must be specified
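
The built-in rule-based rewards live in swift/plugin/orm.py. As a rough, hedged sketch of what such a rule-based reward can look like — the class name, regex, and call signature below are illustrative assumptions, not the actual plugin interface — a format-style reward might be written as:

```python
import re
from typing import List


class FormatReward:
    """Toy rule-based reward: 1.0 if a completion follows a
    <think>...</think><answer>...</answer> layout, 0.0 otherwise.

    Hedged sketch only -- the real built-in rewards (accuracy, format)
    are defined in swift/plugin/orm.py and may use a different interface.
    """

    _pattern = re.compile(r'^<think>.*?</think>\s*<answer>.*?</answer>$', re.DOTALL)

    def __call__(self, completions: List[str], **kwargs) -> List[float]:
        # One scalar reward per sampled completion in the group.
        return [1.0 if self._pattern.match(c.strip()) else 0.0 for c in completions]


if __name__ == '__main__':
    reward_fn = FormatReward()
    print(reward_fn(['<think>1+1=2</think><answer>2</answer>', 'The answer is 2']))  # [1.0, 0.0]
```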

We recommend using vLLM as the sampling backend to speed up training. In a multi-GPU setup, dedicate one GPU to the vLLM deployment; the number of training processes should then equal the number of GPUs minus one.

## Training Scripts
Multi-GPU vLLM
```bash
# nproc_per_node is one less than the number of GPUs; by default vLLM is deployed on its own on the last card, i.e., GPU 7
nproc_per_node=7 \
MASTER_PORT=29500 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-Math-7B \
--reward_funcs accuracy format \
--vllm_device auto \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR' \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--learning_rate 2e-5 \
--gradient_accumulation_steps 8 \
--save_total_limit 2 \
--logging_steps 5 \
--dataset_num_proc 4 \
--num_generations 7 \
--use_vllm true \
--system 'swift/example/train/grpo/prompt.txt' \
--vllm_gpu_memory_utilization 0.8 \
--deepspeed zero3
```

Single-GPU vLLM
```bash
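# With a single GPU, vLLM shares the card with training: vllm_gpu_memory_utilization is lowered to 0.3 and num_generations is reduced to 2 to match the global batch size of 2.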
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-Math-7B \
--reward_funcs accuracy format \
--vllm_device auto \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR' \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--learning_rate 2e-5 \
--gradient_accumulation_steps 8 \
--save_total_limit 2 \
--logging_steps 5 \
--dataset_num_proc 4 \
--num_generations 2 \
--use_vllm true \
--system 'swift/example/train/grpo/prompt.txt' \
--vllm_gpu_memory_utilization 0.3
```
9 changes: 8 additions & 1 deletion docs/source/Instruction/命令行参数.md
@@ -46,6 +46,7 @@
- download_mode: dataset download mode, either `reuse_dataset_if_exists` or `force_redownload`; defaults to reuse_dataset_if_exists
- columns: column mapping applied to the dataset so that it matches a format AutoPreprocessor can handle; see [here](../Customization/自定义数据集.md) for details. You can pass a JSON string, e.g. `'{"text1": "query", "text2": "response"}'`; defaults to None.
- strict: if True, an error is raised as soon as any row of the dataset is malformed; otherwise faulty samples are discarded. Defaults to False
- remove_unused_columns: whether to remove unused columns from the dataset; defaults to True
- 🔥model_name: only used for self-cognition tasks and only effective for the `swift/self-cognition` dataset; replaces the `{{NAME}}` placeholder in the dataset. Pass the model's Chinese and English names separated by a space, e.g. `--model_name 小黄 'Xiao Huang'`; defaults to None
- 🔥model_author: only used for self-cognition tasks and only effective for the `swift/self-cognition` dataset; replaces the `{{AUTHOR}}` placeholder in the dataset. Pass the model author's Chinese and English names separated by a space, e.g. `--model_author '魔搭' 'ModelScope'`; defaults to None
- custom_dataset_info: path to the JSON file used to register custom datasets; see [Custom Dataset](../Customization/自定义数据集.md). Defaults to `[]`
@@ -110,7 +111,6 @@
- lr_scheduler_kwargs: additional lr_scheduler arguments; defaults to None
- 🔥gradient_checkpointing_kwargs: arguments passed to `torch.utils.checkpoint`, e.g. `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`; defaults to None
- report_to: defaults to `tensorboard`. You can also specify `--report_to tensorboard wandb` or `--report_to all`
- remove_unused_columns: whether to remove unused columns from the dataset; defaults to False
- logging_first_step: whether to log the first step; defaults to True
- logging_steps: logging interval; defaults to 5
- predict_with_generate: use generation for evaluation; defaults to False.
@@ -331,6 +331,13 @@ RLHF arguments inherit from the [training arguments](#训练参数)
- simpo_gamma: the reward margin term in the SimPO algorithm; the paper recommends 0.5–1.5; defaults to `1.`
- desirable_weight: loss weight $\lambda_D$ for desirable responses in the KTO algorithm; defaults to `1.`
- undesirable_weight: loss weight $\lambda_U$ for undesirable responses in the KTO algorithm; defaults to `1.`
- num_generations: the G value in the GRPO algorithm, i.e., the number of completions sampled per prompt; defaults to 8 (see the sketch after this list for how rewards are normalized within each group of G completions)
- max_completion_length: maximum generation length in the GRPO algorithm; defaults to 512
- reward_funcs: GRPO reward functions; available options are `accuracy` and `format`; see swift/plugin/orm.py
- use_vllm: whether to use vLLM as the generation backend for GRPO; defaults to False
- vllm_device: device on which vLLM is deployed, e.g. `cuda:1` to place it on GPU 1; defaults to `auto`, i.e., the last card is used
- vllm_gpu_memory_utilization: passed through to vLLM
- vllm_max_model_len: passed through to vLLM
- loss_scale: overrides the template argument; defaults to 'last_round'
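
As a rough illustration of what the G value controls: for each prompt, GRPO samples `num_generations` completions, scores them with the reward function(s), and normalizes each reward within its own group to obtain the advantage. The sketch below shows plain per-group mean/std normalization under that assumption; the function name and details are illustrative only, not the trainer's actual implementation.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, num_generations: int,
                              eps: float = 1e-4) -> torch.Tensor:
    """Illustrative GRPO-style advantages: normalize rewards within each group
    of `num_generations` completions that share the same prompt.

    rewards: shape (batch,), where batch is a multiple of num_generations.
    """
    grouped = rewards.view(-1, num_generations)      # (num_prompts, G)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    advantages = (grouped - mean) / (std + eps)      # group-relative normalization
    return advantages.view(-1)


if __name__ == '__main__':
    # Two prompts, G = 4 completions each; toy rewards (e.g. accuracy + format).
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0])
    print(group_relative_advantages(rewards, num_generations=4))
```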

#### PPO Arguments
