support GRPO (#3022)
* init

* init

* update factory

* compute_loss

* fix args

* fix reward

* prepare_inputs

* fix prepare_inputs

* fix

* reward model

* remove unused columns

* fix reward funcs and training scripts

* update training script

* vllm

* vllm

* fix

* fix batch

* update trl

* fix vllm engine

* state_dict

* update

* update

* fix

* update

* update

* update

* fix ddp

* update

* update

* fix infer

* fix

* fix vllm

* fix

* update orms

* fix

* fix

* fix

* fix

* fix lint

* update

* update

* fix template

* fix vllm grpo

* fix device

* fix device

* fix device

* update

* support mllm

* doc

* fix

* update readme

* fix

* compat trl<0.15

* recover is_mp_ddp

* fix

* fix

* doc

* update

* fix

* log completions

* readme

* doc update

* update scripts

* readme

* fix grpo.py

---------

Co-authored-by: hjh <hjh@U-413PHRX2-2043.local>
Co-authored-by: hongzhang.hz <zh461848@alibaba-inc.com>
Co-authored-by: hjh <hujinghan.hjh@alibaba-inc.com>
Co-authored-by: Jintao Huang <huangjintao.hjt@alibaba-inc.com>
Co-authored-by: Jintao <huangjintao@mail.ustc.edu.cn>
6 people authored Feb 10, 2025
1 parent 47e0dd2 commit 9a77a8e
Showing 40 changed files with 1,006 additions and 94 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/lint.yaml
@@ -11,10 +11,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: '3.10'
- name: Install pre-commit hook
run: |
pip install pre-commit
4 changes: 2 additions & 2 deletions .github/workflows/publish.yaml
@@ -15,10 +15,10 @@ jobs:
#if: startsWith(github.event.ref, 'refs/tags')
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: '3.8'
python-version: '3.10'
- name: Install wheel
run: pip install wheel packaging
- name: Build ModelScope Swift
9 changes: 5 additions & 4 deletions README.md
@@ -78,7 +78,7 @@ You can contact us and communicate with us by adding our group:


## 🎉 News

- 🔥 2025.02.12: Support for the GRPO (Group Relative Policy Optimization) algorithm for LLMs and MLLMs; documentation can be found [here](docs/source_en/Instruction/GRPO.md)
- 🎁 2025.02.10: SWIFT supports fine-tuning of embedding models; please check the [training script](examples/train/embedding/train.sh)
- 🎁 2025.01.23: SWIFT supports the `sample` command, an important feature for complex CoT and RFT. Meanwhile, we provide a [Reinforced Fine-tuning script](docs/source_en/Instruction/Reinforced_Fine_tuning.md).
- 🎁 2024.12.04: **SWIFT3.0** major version update. Please check the [Release Notes and Changes](https://swift.readthedocs.io/en/latest/Instruction/ReleaseNote3.0.html).
@@ -108,13 +108,13 @@ Running Environment:

| | Range | Recommended | Notes |
| ------------ | -------------------- | ----------- | ----------------------------------------- |
| python | >=3.8 | 3.10 | |
| python | >=3.9 | 3.10 | |
| cuda | | cuda12 | No need to install if using CPU, NPU, MPS |
| torch | >=2.0 | | |
| transformers | >=4.33 | 4.48.2 | |
| transformers | >=4.33 | 4.48.3 | |
| modelscope | >=1.19 | | |
| peft | >=0.11.0,<0.15.0 | | |
| trl | >=0.13,<0.15 | 0.14.0 | RLHF |
| trl | >=0.13,<0.16 | 0.14.0 | RLHF |
| vllm | >=0.5.1 | 0.6.5 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
| deepspeed | | 0.14.5 | Training |
@@ -253,6 +253,7 @@ Supported Training Methods:
| Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) |||||
| Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
| DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
| GRPO Training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/grpo.sh) |||||
| Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) ||
| PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) ||
| KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |
32 changes: 17 additions & 15 deletions README_CN.md
@@ -74,6 +74,7 @@
- **Model Quantization**: Supports quantized export with AWQ, GPTQ, and BNB; the exported models support accelerated inference with vLLM/LmDeploy and can be trained further.

## 🎉 News
- 🔥 2025.02.12: Support for the GRPO (Group Relative Policy Optimization) training algorithm; the training documentation can be found [here](docs/source/Instruction/GRPO.md)
- 🎁 2025.02.10: SWIFT supports fine-tuning of embedding models; please check the [training script](examples/train/embedding/train.sh)
- 🎁 2025.01.23: SWIFT supports the `sample` command, an important feature for CoT and RFT. Meanwhile, we provide a [Reinforced Fine-tuning script](docs/source/Instruction/强化微调.md)
- 🎁 2024.12.04: Major **SWIFT 3.0** release. Please check the [release notes and changes](https://swift.readthedocs.io/zh-cn/latest/Instruction/ReleaseNote3.0.html)
@@ -102,13 +103,13 @@ pip install -e .

| | Range | Recommended | Notes |
| ------ | ----- | ---- | --|
| python | >=3.8 | 3.10 ||
| python | >=3.9 | 3.10 ||
| cuda | | cuda12 |No need to install if using CPU, NPU, MPS|
| torch | >=2.0 | ||
| transformers | >=4.33 | 4.48.2 ||
| transformers | >=4.33 | 4.48.3 ||
| modelscope | >=1.19 | ||
| peft | >=0.11.0,<0.15.0 | ||
| trl | >=0.13,<0.15 | 0.14.0 |RLHF|
| trl | >=0.13,<0.16 | 0.14.0 |RLHF|
| vllm | >=0.5.1 | 0.6.5 |Inference/Deployment/Evaluation|
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 |Inference/Deployment/Evaluation|
| deepspeed | | 0.14.5 |Training|
@@ -239,18 +240,19 @@ print(f'response: {resp_list[0].choices[0].message.content}')
### Training
Supported training methods:

| Method | Full-parameter | LoRA | QLoRA | Deepspeed | Multimodal |
|---------------| ------ |---------------------------------------------------------------------------------------------| ----- | ------ | --- |
| Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) |||||
| Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
| DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
| Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) ||
| PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) ||
| KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |
| CPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) ||
| SimPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) ||
| ORPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) ||
| Classification Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_5/sft.sh) ||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_vl/sft.sh) |
| Method | Full-parameter | LoRA | QLoRA | Deepspeed | Multimodal |
| ------ | ------ | ---- | ----- | ------ | ------ |
| Pre-training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/pretrain/train.sh) |||||
| Instruction Supervised Fine-tuning | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/train.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/lora_sft.sh) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/deepspeed) | [](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal) |
| DPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/dpo.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/dpo.sh) |
| GRPO Training | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/grpo.sh) |||||
| Reward Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/rm.sh) ||
| PPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/ppo.sh) ||
| KTO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/kto.sh) | [](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/kto.sh) |
| CPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/cpo.sh) ||
| SimPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/simpo.sh) ||
| ORPO Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/orpo.sh) ||
| Classification Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_5/sft.sh) ||| [](https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_vl/sft.sh) |
| Embedding Model Training || [](https://github.com/modelscope/ms-swift/blob/main/examples/train/embedding/train.sh) ||||


2 changes: 1 addition & 1 deletion docs/source/Customization/自定义数据集.md
@@ -69,7 +69,7 @@ query-response format:
{"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "label": true}
```

#### PPO
#### PPO & GRPO

```jsonl
{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "告诉我明天的天气"}]}
6 changes: 3 additions & 3 deletions docs/source/GetStarted/SWIFT安装.md
@@ -42,13 +42,13 @@ pip install ms-swift==2.*

| | Range | Recommended | Notes |
| ------ | ----- | ---- | --|
| python | >=3.8 | 3.10 ||
| python | >=3.9 | 3.10 ||
| cuda | | cuda12 |No need to install if using CPU, NPU, MPS|
| torch | >=2.0 | ||
| transformers | >=4.33 | 4.48.2 ||
| transformers | >=4.33 | 4.48.3 ||
| modelscope | >=1.19 | ||
| peft | >=0.11.0,<0.15.0 | ||
| trl | >=0.13,<0.15 | 0.14.0 |RLHF|
| trl | >=0.13,<0.16 | 0.14.0 |RLHF|
| vllm | >=0.5.1 | 0.6.5 |Inference/Deployment/Evaluation|
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 |Inference/Deployment/Evaluation|
| deepspeed | | 0.14.5 |Training|
81 changes: 81 additions & 0 deletions docs/source/Instruction/GRPO.md
@@ -0,0 +1,81 @@
# GRPO

Papers

[DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)
[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)

Environment setup
```bash
pip install math_verify # reward function
pip install git+https://github.com/huggingface/trl.git # trl >=0.15.0.dev0
```


Hyperparameters
- num_generations: the number of completions sampled per prompt, the G value in the paper; per_device_eval_batch_size * nproc_per_node must be divisible by this value (e.g., with 7 training processes and a per-device batch size of 2, the global batch of 14 works with num_generations = 7)
- max_completion_length: maximum length of sampled completions; defaults to 512
- reward_funcs: reward functions that score the model's generations; two rule-based functions, accuracy and format, are built in — see the swift/plugin/orm.py file for details (a minimal sketch follows this list)
- use_vllm: whether to use vLLM as the generation backend for sampling; defaults to False; recommended to speed up training
- vllm_device: device on which vLLM is deployed; defaults to `auto`, i.e., the first unused GPU; use `cuda:x` to pick a specific card.
- vllm_gpu_memory_utilization: passed through to vLLM
- vllm_max_model_len: passed through to vLLM
- reward_model: same format as model; use a reward model as the reward function; at least one of reward_model and reward_funcs must be specified
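
The built-in rule-based rewards live in swift/plugin/orm.py. As a rough, hedged sketch of what such a rule-based reward can look like — the class name, regex, and call signature below are illustrative assumptions, not the actual plugin interface — a format-style reward might be written as:

```python
import re
from typing import List


class FormatReward:
    """Toy rule-based reward: 1.0 if a completion follows a
    <think>...</think><answer>...</answer> layout, 0.0 otherwise.

    Hedged sketch only -- the real built-in rewards (accuracy, format)
    are defined in swift/plugin/orm.py and may use a different interface.
    """

    _pattern = re.compile(r'^<think>.*?</think>\s*<answer>.*?</answer>$', re.DOTALL)

    def __call__(self, completions: List[str], **kwargs) -> List[float]:
        # One scalar reward per sampled completion in the group.
        return [1.0 if self._pattern.match(c.strip()) else 0.0 for c in completions]


if __name__ == '__main__':
    reward_fn = FormatReward()
    print(reward_fn(['<think>1+1=2</think><answer>2</answer>', 'The answer is 2']))  # [1.0, 0.0]
```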

We recommend using vLLM as the sampling backend to speed up training. In a multi-GPU setup, dedicate one GPU to the vLLM deployment; the number of training processes should then equal the number of GPUs minus one.

## Training Scripts
Multi-GPU vLLM
```bash
# nproc_per_node is one less than the number of GPUs; by default vLLM is deployed on its own on the last card, i.e., GPU 7
nproc_per_node=7 \
MASTER_PORT=29500 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-Math-7B \
--reward_funcs accuracy format \
--vllm_device auto \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR' \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--learning_rate 2e-5 \
--gradient_accumulation_steps 8 \
--save_total_limit 2 \
--logging_steps 5 \
--dataset_num_proc 4 \
--num_generations 7 \
--use_vllm true \
--system 'swift/example/train/grpo/prompt.txt' \
--vllm_gpu_memory_utilization 0.8 \
--deepspeed zero3
```

Single-GPU vLLM
```bash
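# With a single GPU, vLLM shares the card with training: vllm_gpu_memory_utilization is lowered to 0.3 and num_generations is reduced to 2 to match the global batch size of 2.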
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-Math-7B \
--reward_funcs accuracy format \
--vllm_device auto \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR' \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--learning_rate 2e-5 \
--gradient_accumulation_steps 8 \
--save_total_limit 2 \
--logging_steps 5 \
--dataset_num_proc 4 \
--num_generations 2 \
--use_vllm true \
--system 'swift/example/train/grpo/prompt.txt' \
--vllm_gpu_memory_utilization 0.3
```
9 changes: 8 additions & 1 deletion docs/source/Instruction/命令行参数.md
@@ -46,6 +46,7 @@
- download_mode: dataset download mode, either `reuse_dataset_if_exists` or `force_redownload`; defaults to reuse_dataset_if_exists
- columns: column mapping applied to the dataset so that it matches a format AutoPreprocessor can handle; see [here](../Customization/自定义数据集.md) for details. You can pass a JSON string, e.g. `'{"text1": "query", "text2": "response"}'`; defaults to None.
- strict: if True, an error is raised as soon as any row of the dataset is malformed; otherwise faulty samples are discarded. Defaults to False
- remove_unused_columns: whether to remove unused columns from the dataset; defaults to True
- 🔥model_name: only used for self-cognition tasks and only effective for the `swift/self-cognition` dataset; replaces the `{{NAME}}` placeholder in the dataset. Pass the model's Chinese and English names separated by a space, e.g. `--model_name 小黄 'Xiao Huang'`; defaults to None
- 🔥model_author: only used for self-cognition tasks and only effective for the `swift/self-cognition` dataset; replaces the `{{AUTHOR}}` placeholder in the dataset. Pass the model author's Chinese and English names separated by a space, e.g. `--model_author '魔搭' 'ModelScope'`; defaults to None
- custom_dataset_info: path to the JSON file used to register custom datasets; see [Custom Dataset](../Customization/自定义数据集.md). Defaults to `[]`
@@ -110,7 +111,6 @@
- lr_scheduler_kwargs: additional lr_scheduler arguments; defaults to None
- 🔥gradient_checkpointing_kwargs: arguments passed to `torch.utils.checkpoint`, e.g. `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`; defaults to None
- report_to: defaults to `tensorboard`. You can also specify `--report_to tensorboard wandb` or `--report_to all`
- remove_unused_columns: whether to remove unused columns from the dataset; defaults to False
- logging_first_step: whether to log the first step; defaults to True
- logging_steps: logging interval; defaults to 5
- predict_with_generate: use generation for evaluation; defaults to False.
@@ -331,6 +331,13 @@ RLHF arguments inherit from the [training arguments](#训练参数)
- simpo_gamma: the reward margin term in the SimPO algorithm; the paper recommends 0.5–1.5; defaults to `1.`
- desirable_weight: loss weight $\lambda_D$ for desirable responses in the KTO algorithm; defaults to `1.`
- undesirable_weight: loss weight $\lambda_U$ for undesirable responses in the KTO algorithm; defaults to `1.`
- num_generations: the G value in the GRPO algorithm, i.e., the number of completions sampled per prompt; defaults to 8 (see the sketch after this list for how rewards are normalized within each group of G completions)
- max_completion_length: maximum generation length in the GRPO algorithm; defaults to 512
- reward_funcs: GRPO reward functions; available options are `accuracy` and `format`; see swift/plugin/orm.py
- use_vllm: whether to use vLLM as the generation backend for GRPO; defaults to False
- vllm_device: device on which vLLM is deployed, e.g. `cuda:1` to place it on GPU 1; defaults to `auto`, i.e., the last card is used
- vllm_gpu_memory_utilization: passed through to vLLM
- vllm_max_model_len: passed through to vLLM
- loss_scale: overrides the template argument; defaults to 'last_round'
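
As a rough illustration of what the G value controls: for each prompt, GRPO samples `num_generations` completions, scores them with the reward function(s), and normalizes each reward within its own group to obtain the advantage. The sketch below shows plain per-group mean/std normalization under that assumption; the function name and details are illustrative only, not the trainer's actual implementation.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, num_generations: int,
                              eps: float = 1e-4) -> torch.Tensor:
    """Illustrative GRPO-style advantages: normalize rewards within each group
    of `num_generations` completions that share the same prompt.

    rewards: shape (batch,), where batch is a multiple of num_generations.
    """
    grouped = rewards.view(-1, num_generations)      # (num_prompts, G)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    advantages = (grouped - mean) / (std + eps)      # group-relative normalization
    return advantages.view(-1)


if __name__ == '__main__':
    # Two prompts, G = 4 completions each; toy rewards (e.g. accuracy + format).
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0])
    print(group_relative_advantages(rewards, num_generations=4))
```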

#### PPO Arguments
