
Commit 97b7af4

[RL] Fix flashmask reward training (#10443)
1 parent b3e99ef commit 97b7af4

15 files changed (+727, -726 lines)


docs/zh/llm/alignment/rm/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../../../../../llm/alignment/rm/README.md

llm/alignment/rm/flashmask/README.md renamed to llm/alignment/rm/README.md

Lines changed: 3 additions & 1 deletion
@@ -39,4 +39,6 @@ tar -zxvf ultrafeedback_binarized.tar.gz
 
 ```bash
 # Reference command for launching RM training
-python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./alignment/rm/flashmask/run_reward.py ./config/llama/rm_flashmask_argument.json
+cd llm/alignment/rm
+export PYTHONPATH=../../../:$PYTHONPATH
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_reward.py ../../config/llama/rm_flashmask_argument.json
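Note on the updated instructions: the launch is now expected to start with a `cd` into `llm/alignment/rm`, and `export PYTHONPATH=../../../:$PYTHONPATH` adds the path three levels up, which is the repository root, presumably so the in-tree `paddlenlp` package and `run_reward.py`'s local imports resolve without an installed wheel; the config path is now given relative to `llm/alignment/rm` rather than to the old `flashmask` directory.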

llm/alignment/rm/flashmask/data.py renamed to llm/alignment/rm/data.py

Lines changed: 3 additions & 3 deletions
@@ -130,7 +130,7 @@ def preference_collate_fn(batch, max_seq_len=None, pad_token_id=0):
 difference = max_seq_len - len(sequence["input_ids"])
 
 input_dict["input_ids"].append(sequence["input_ids"] + [pad_token_id] * difference)
-input_dict["position_ids"].append(sequence["position_ids"] + [pad_token_id] * difference)
+input_dict["position_ids"].append(sequence["position_ids"] + [0] * difference)
 if use_attn_mask_startend_row_indices:
 input_dict["attn_mask_startend_row_indices"].append(
 [

@@ -281,7 +281,7 @@ def zero_padding_process_collate_fn(batch, max_seq_len=None, pad_token_id=0):
 difference = max_seq_len - len(sequence["input_ids"])
 
 input_dict["input_ids"].append(sequence["input_ids"] + [pad_token_id] * difference)
-input_dict["position_ids"].append(sequence["position_ids"] + [pad_token_id] * difference)
+input_dict["position_ids"].append(sequence["position_ids"] + [0] * difference)
 input_dict["labels"].append(sequence["labels"] + [-100] * difference)
 if use_attn_mask_startend_row_indices:
 input_dict["attn_mask_startend_row_indices"].append(

@@ -334,7 +334,7 @@ def process_collate_fn(batch, pad_token_id=0):
 
 # input_ids: Tensor(seqL, ); position_ids: list, len(seqL); labels: Tensor(seqL, )
 input_dict["input_ids"].append(sequence["input_ids"].tolist() + [pad_token_id] * difference)
-input_dict["position_ids"].append(sequence["position_ids"] + [pad_token_id] * difference)
+input_dict["position_ids"].append(sequence["position_ids"] + [0] * difference)
 input_dict["labels"].append(sequence["labels"].tolist() + [-100] * difference)
 if use_attn_mask_startend_row_indices:
 input_dict["attn_mask_startend_row_indices"].append(
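All three hunks apply the same one-line fix: `position_ids` were being padded with `pad_token_id`, which can give padding slots a nonzero, tokenizer-dependent position id; they are now padded with `0`, like the rest of the padding values. A minimal, self-contained sketch of the padding rule after the fix (a hypothetical helper for illustration, not the repository code):

```python
# Hypothetical helper illustrating the padding rule after this commit:
# input_ids are padded with pad_token_id, position_ids are padded with 0.
def pad_to_max_len(sequence, max_seq_len, pad_token_id=0):
    difference = max_seq_len - len(sequence["input_ids"])
    return {
        "input_ids": sequence["input_ids"] + [pad_token_id] * difference,
        # Before this fix the pad value here was pad_token_id, not 0.
        "position_ids": sequence["position_ids"] + [0] * difference,
    }


example = {"input_ids": [11, 12, 13], "position_ids": [0, 1, 2]}
print(pad_to_max_len(example, max_seq_len=6, pad_token_id=2))
# {'input_ids': [11, 12, 13, 2, 2, 2], 'position_ids': [0, 1, 2, 0, 0, 0]}
# With the old code, position_ids would have come out as [0, 1, 2, 2, 2, 2].
```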

llm/alignment/rm/flashmask/reward_trainer.py

Lines changed: 0 additions & 115 deletions
This file was deleted.
