[BUG] Add 20 extra samples to prevent exceeding the index when building pretraining data (#8789)
JunnYu authored Jul 23, 2024
1 parent c4efddf commit 7c18d9d
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion paddlenlp/data/causal_dataset.py
@@ -94,10 +94,11 @@ def get_datasets_weights_and_num_samples(data_prefix, train_val_test_num_samples
     # Add 0.5% (the 1.005 factor) so in case the blending dataset does
     # not uniformly distribute the number of samples, we still have
     # samples left to feed to the network.
+    # (NOTE, yujun06): This is a workaround to avoid issues with indexing in the blending dataset. Therefore, we need to add 20 samples to each dataset.
     datasets_train_valid_test_num_samples = []
     for weight in weights:
         datasets_train_valid_test_num_samples.append(
-            [int(math.ceil(val * weight * 1.005)) for val in train_val_test_num_samples]
+            [int(math.ceil(val * weight * 1.005)) + 20 for val in train_val_test_num_samples]
         )
 
     return prefixes, weights, datasets_train_valid_test_num_samples
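For context, here is a minimal, self-contained sketch of the computation this hunk patches. The helper name `per_dataset_num_samples` and the example weights are assumptions for illustration only, not part of PaddleNLP: each dataset's quota per split is scaled by its blending weight, padded by 0.5% (the 1.005 factor), and, after this commit, padded by a further 20 samples so the blended dataset never indexes past the end of any component dataset.

```python
import math


def per_dataset_num_samples(weights, train_val_test_num_samples, extra=20):
    """Sketch of the patched quota computation (hypothetical standalone helper).

    For each dataset weight, scale the requested train/valid/test sample
    counts, round up, add the 0.5% safety margin via the 1.005 factor, and
    add `extra` (20 in the commit) samples as headroom against indexing
    past the end of the blended dataset.
    """
    per_dataset = []
    for weight in weights:
        per_dataset.append(
            [int(math.ceil(val * weight * 1.005)) + extra for val in train_val_test_num_samples]
        )
    return per_dataset


# Example: two datasets blended with weights 0.7 / 0.3, requesting
# 1000 train, 100 valid, and 10 test samples overall.
print(per_dataset_num_samples([0.7, 0.3], [1000, 100, 10]))
# -> [[724, 91, 28], [322, 51, 24]]
```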
