-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default dataset_text_field
to "text"
#2078
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@@ -445,9 +445,9 @@ def _prepare_dataset( | |||
dataset, | |||
tokenizer, | |||
packing, | |||
dataset_text_field, | |||
dataset_text_field: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not super important, but maybe we should type the rest of the function signature for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree in general, but I didn't want to spend to much time on it since I plan to refactor SFT data preprocessing, to be more aligned with what's done in other trainers (which is clearer IMO)
What does this PR do?
Most datasets use
“text”
for the text field. Moreover, this is the column we document in trl as [standard language modeling] (https://huggingface.co/docs/trl/en/dataset_formats#overview-of-the-dataset-formats-and-types).Consequently, it makes sense to use this column by default for
dataset_text_field
.On the other hand, it allows us to drop
# needs dummy text field
as here:trl/examples/scripts/sft_vlm.py
Line 57 in 78249d9
Implications
dataset_text_field
can no longer beNone
in theSFTConfig
. We then use the following logic inSFTTrainer
: ifformatting_func
is provided, use it, otherwise, usedataset_text_field
.ConstantLengthDataset
whereformatting_func
anddataset_text_field
areNone
by default, and only one of them can be supplied.Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.