update orpo

wj-Mcat · Jun 29, 2024 · 6b5c827
1 parent 5ee9883
commit 6b5c827
Showing 1 changed file with 39 additions and 1 deletion.
diff --git a/docs/04-paper-reading/2024/orpo.md b/docs/04-paper-reading/2024/orpo.md
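@@ -32,4 +32,42 @@ ORPO proposes a very innovative approach: combining the model alignment stage and the SFT stage
 2. Better results? (Still to be verified by the community; if nothing comes of it, the method is basically dead.)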
 
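 Drawbacks:
-1. It makes the SFT stage more complex and carries some adaptation cost.
+1. It makes the SFT stage more complex and carries some adaptation cost.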
+
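+## Source Code Analysis
+
+Why dig into the source code? I was curious how the SFT and DPO stages are actually merged, so I looked through the authors' official repository: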
+
+```python title="https://github.com/xfactlab/orpo/blob/main/trl/test_orpo_trainer_demo.py#L99"
+def build_dataset(tokenizer):
+    ds_train = load_dataset(
+        script_args.data_name, split="train", cache_dir=script_args.cache_dir
+    )
+
+    def chat_template_to_text(sample):
+        sample["prompt"] = [
+            tokenizer.apply_chat_template(
+                [{"role": "user", "content": item_prompt}],
+                tokenize=False,
+                add_generation_prompt=True,
+            )
+            for item_prompt in sample["prompt"]
+        ]
+        sample["chosen"] = [
+            item_chosen[1]["content"] for item_chosen in sample["chosen"]
+        ]
+        sample["rejected"] = [
+            item_rejected[1]["content"] for item_rejected in sample["rejected"]
+        ]
+        return sample
+
+    ds_train = ds_train.map(chat_template_to_text, batched=True, num_proc=8)  # type: ignore
+
+    return ds_train
+```
+
+Looking at the code above: isn't this just plain DPO data construction? It is basically the [prompt, chosen, rejected] format.
+
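+Honestly, at the data-construction level I only see DPO-style construction and no SFT data construction. In other words, the SFT stage can only be run on DPO-style data. Can that still be called SFT training?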
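+
+One possible reading, based on the objective in the ORPO paper rather than on the script above: the SFT term is the ordinary negative log-likelihood on the chosen response, so a [prompt, chosen, rejected] triple would already carry the SFT signal and no separate SFT dataset would be needed. The formulas below are my transcription of the paper's definitions, with $y_w$ the chosen response, $y_l$ the rejected response, $m$ the length of $y_w$, and $P_\theta(y \mid x)$ the length-normalized likelihood. I have not checked whether the trainer in the repository implements exactly this.
+
+```latex
+\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]
+
+\mathcal{L}_{SFT} = -\frac{1}{m} \sum_{t=1}^{m} \log P_\theta(y_{w,t} \mid x, y_{w,<t})
+
+\mathcal{L}_{OR} = -\log \sigma\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right)
+
+\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
+```
+
+Under this reading, "chosen" plays the role of the SFT target, while "rejected" only enters through the odds-ratio penalty weighted by $\lambda$.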
+
+My question may be naive, but it genuinely came up while reading the paper and the code, so I opened an issue to ask the authors: https://github.com/xfactlab/orpo/issues/33 .