
[feat] Trainer with prompts and prompt masking #2964

Merged
51 commits merged into UKPLab:master on Nov 8, 2024

Conversation

@ArthurCamara
Contributor

ArthurCamara commented Sep 27, 2024

Pull Request overview

  • Adds support for including prompts in the Trainer class
  • Supports masking the prompts in the Pooling module when training.

Details

Currently, the encode method of SentenceTransformer supports adding prompts (or instructions) dynamically to the sentences by passing either prompt or prompt_name. However, this is not supported when training, as mentioned in #2945, because training uses the forward method instead.
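
For reference, this is roughly how prompts already work at inference time (a minimal sketch; the model name and the prompt text are only examples):

from sentence_transformers import SentenceTransformer

# Example model with an example registered prompt named "query".
model = SentenceTransformer(
    "all-mpnet-base-v2",
    prompts={"query": "Represent this sentence for searching relevant passages: "},
)

# Pass the prompt text directly ...
embedding = model.encode(
    "when did richmond last play in a preliminary final",
    prompt="Represent this sentence for searching relevant passages: ",
)

# ... or refer to a registered prompt by name.
embedding = model.encode(
    "when did richmond last play in a preliminary final",
    prompt_name="query",
)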

This PR implements similar functionality in the Trainer by adding a prompt parameter that can be:

  • str: The prompt will be prepended to all sentences in the dataset
  • dict[str, str]: If the keys are column names, the prompt will be prepended to the respective column. If the training dataset is a dictionary of datasets and the dictionary keys are dataset names, the prompt will be added to all columns of the respective dataset.
  • dict[str, dict[str, str]]: Same as above, but assumes the first level holds the dataset names and the second level the column names.

As the prompts can be dynamic (changing for each dataset and column), they are injected into the sentences by the get_train|eval|test_dataloader methods, which call add_prompts_to_dataset to resolve, for each dataset and column, which prompt to inject.

Finally, add_prompts_to_dataset also adds <column_name>_prompt_length columns that, when passed to the Pooling module with include_prompt=False, properly mask the instructions as well. (Currently this is only done explicitly for Instructor models, but users can enable it by calling model.set_pooling_include_prompt(include_prompt=False).)
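
To make the intended usage concrete, here is a hedged sketch of a full training run with prompt masking (the model, dataset, prompt text, and column name are illustrative, and the parameter is shown under its final name, prompts):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-mpnet-base-v2")
# Exclude the prompt tokens from mean pooling, as Instructor-style models do.
model.set_pooling_include_prompt(include_prompt=False)

train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
    # Column name -> prompt; the prompt is prepended to the "query" column on the fly.
    prompts={"query": "Represent this sentence for searching relevant passages: "},
)
trainer.train()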

@ArthurCamara changed the title from "Trainer with prompt masking" to "[feat] Trainer with prompts and prompt masking" on Sep 27, 2024
@tomaarsen force-pushed the trainer-with-prompt-masking branch 2 times, most recently from 354cb65 to bf9eb80 on September 30, 2024 at 15:36
@tomaarsen
Collaborator

Hello!

Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here.
I have a few hesitations with the current approach, although I do quite like the idea of being able to specify prompts to use during training apart from manually adding them in your dataset(s). My current hesitations:

  1. Adding a column to the entire dataset before training will be incompatible with datasets IterableDataset (i.e., load_dataset("...", streaming=True)).
  2. I'm in theory okay with adding a ..._prompt_length column: I recognize that it's crucial to get this information if include_prompt is False in the Pooling. However, I have two notes:
    • Could we e.g. only add the information if the Pooling module (if it exists) actually has include_prompt=False?
    • Could we perhaps not add an entire column, but instead create a nested dictionary with dataset names mapping to column names mapping to prompt lengths? Dataset names should be a column if there's multiple datasets.

Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56

The data collator knows the dataset name, the column name (see the snippet), and should then be able to use that information to "on the fly" prepend the prompts. In a perfect world we could even only tokenize the prompts once, but that gets complicated with padding and truncation, so it's better to keep it simpler.
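
For illustration, a rough sketch of that collator-based idea (this is not the actual SentenceTransformerDataCollator; the class and field names below are made up), mirroring how the linked snippet prefixes tokenized keys with the column name:

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class PromptPrependingCollator:
    # Hypothetical collator: prepends a per-dataset, per-column prompt before tokenizing.
    tokenize_fn: Callable
    prompts: Dict[str, Dict[str, str]] = field(default_factory=dict)  # dataset -> column -> prompt

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = {}
        dataset_name = features[0].get("dataset_name")
        for column in features[0]:
            if column == "dataset_name":
                continue
            prompt = self.prompts.get(dataset_name, {}).get(column, "")
            texts = [prompt + row[column] for row in features]
            tokenized = self.tokenize_fn(texts)
            for key, value in tokenized.items():
                batch[f"{column}_{key}"] = value
        return batch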

I also like your idea that prompt can be multiple things: a single prompt, a prompt per column, or a prompt per column per dataset. I do think prompts is a bit better though, because that's what we use in the encode etc.

I'm curious to hear your thoughts on this.

  • Tom Aarsen

@ArthurCamara
Contributor Author

Hello!

Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here. I have a few hesitations with the current approach, although I do quite like the idea of being able to specify prompts to use during training apart from manually adding them in your dataset(s). My current hesitations:

  1. Adding a column to the entire dataset before training will be incompatible with datasets IterableDataset (i.e., load_dataset("...", streaming=True)).

  2. I'm in theory okay with adding a ..._prompt_length column: I recognize that it's crucial to get this information if include_prompt is False in the Pooling. However, I have two notes:

    • Could we e.g. only add the information if the Pooling module (if it exists) actually has include_prompt=False?
    • Could we perhaps not add an entire column, but instead create a nested dictionary with dataset names mapping to column names mapping to prompt lengths? Dataset names should be a column if there's multiple datasets.

Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56

The data collator knows the dataset name, the column name (see the snippet), and should then be able to use that information to "on the fly" prepend the prompts. In a perfect world we could even only tokenize the prompts once, but that gets complicated with padding and truncation, so it's better to keep it simpler.

This was one of the things I was considering: changing the Collator instead of the dataset itself. But I had issues with Accelerator and DDP before when the data was not exclusively tensors (e.g., strings), but I think we can work around it within the collator. I will give it a shot and let you know.

I also like your idea that prompt can be multiple things: a single prompt, a prompt per column, or a prompt per column per dataset. I do think prompts is a bit better though, because that's what we use in the encode etc.

Agreed. =)

I'm curious to hear your thoughts on this.

  • Tom Aarsen

@JosephGatto

Hi, thanks for implementing this. Is there any guide on how to fine-tune with prompts?

@tomaarsen
Collaborator

Hello!

Until this is integrated, I would recommend manually adding the prompts to your training datasets. E.g.:

from typing import Any, Dict, List, Optional

from datasets import load_dataset

def prepend_prompt(batch: Dict[str, List[Any]], prompts: Optional[Dict[str, str]] = None) -> Dict[str, List[Any]]:
    # Prepend the configured prompt to every value in the matching columns.
    if not prompts:
        return batch

    for column_name, prompt in prompts.items():
        batch[column_name] = [prompt + value for value in batch[column_name]]
    return batch

train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
train_dataset = train_dataset.map(
    prepend_prompt,
    batched=True,
    fn_kwargs={"prompts": {"query": "Represent this sentence for searching relevant passages: "}},
)
print(train_dataset[0])
# {'query': 'Represent this sentence for searching relevant passages: when did richmond last play in a preliminary final', 'answer': "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final since 1986. The Crows led at quarter time and led by as many as 13, but the Tigers took over the game as it progressed and scored seven straight goals at one point. They eventually would win by 48 points – 16.12 (108) to Adelaide's 8.12 (60) – to end their 37-year flag drought.[22] Dustin Martin also became the first player to win a Premiership medal, the Brownlow Medal and the Norm Smith Medal in the same season, while Damien Hardwick was named AFL Coaches Association Coach of the Year. Richmond's jump from 13th to premiers also marked the biggest jump from one AFL season to the next."}

And the rest is the same as the normal training: https://sbert.net/docs/sentence_transformer/training_overview.html

  • Tom Aarsen

@JosephGatto

Hey, thanks so much for the quick reply. My main concern here would be whether pooling is done on just the text (excluding the prompt). I believe in the INSTRUCTOR paper they do not include the embeddings of the prompt during mean pooling. Would this solution take care of that?

@tomaarsen
Collaborator

Indeed, my solution only works if you're including the prompt in the pooling. If you're not, i.e. if include_prompt is set to False in the Pooling module, then you must use this PR. You can use:

pip install git+https://github.com/ArthurCamara/sentence-transformers@trainer-with-prompt-masking

and then use the regular training with one extra parameter in the SentenceTransformerTrainer:

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    prompts={
        "query": "Represent this sentence for searching relevant passages: ",
    },
    evaluator=dev_evaluator,
)

The prompts can be:

  1. a prompt string
  2. a column name to prompt mapping
  3. a dataset to prompt mapping (if you use a dataset dict to train on multiple datasets simultaneously)
  4. a dataset to column name to prompt mapping (i.e. nested dicts, only if you use a dataset dict to train on multiple datasets simultaneously; see the sketch below)
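
To make the nested form (option 4) concrete, a hedged sketch (the second dataset, its column names, and the prompt strings are only illustrative):

from datasets import DatasetDict, load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-mpnet-base-v2")
loss = MultipleNegativesRankingLoss(model)

# Two datasets trained on simultaneously; the dict keys are the dataset names.
train_dataset = DatasetDict({
    "natural-questions": load_dataset("sentence-transformers/natural-questions", split="train"),
    "gooaq": load_dataset("sentence-transformers/gooaq", split="train"),
})

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
    # Dataset name -> column name -> prompt (option 4 above)
    prompts={
        "natural-questions": {"query": "Represent this sentence for searching relevant passages: "},
        "gooaq": {"question": "search_query: "},
    },
)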

I do want to warn you that I'm about to fully overhaul this PR, although the usage will remain the same.

  • Tom Aarsen

@JosephGatto

Thanks so much. And if I were interested in training with dynamic prompts (a unique prompt per sample), would that be possible with the methods you described?

@tomaarsen
Collaborator

Unique per sample is not possible here without subclassing the Trainer, no. You could use a unique prompt per dataset, if that helps. I didn't think that a unique prompt per sample was a notable use case, so I didn't think to integrate it.

@JosephGatto

Got it. Thank you!

@tomaarsen
Collaborator

tomaarsen commented Nov 5, 2024

Heya @ArthurCamara,

I've overhauled the prompt prepending once more, as I still had some slight concerns with the previous implementations after some experimentation. You have worked on 2 implementations, and I'm now proposing a third as well:

  1. 'Greedily' .map over each dataset to add the prompt string to each dataset.
  2. 'Lazily' prepends prompts in the data collator in the Trainer.
  3. Use .set_transform for Dataset(Dict) and .map for IterableDataset(Dict) to add the prompt string to each dataset.

I had concerns with the first two:

  1. I'm wary that this results in large memory usage and/or cache files.
  2. During model card generation, I sample from the datasets to use in the model card (example), I'd also like for the prompts to be included, which isn't the case if the prompt prepending only exists in the Trainer.

After getting some valuable recommendations by the Datasets team and @lhoestq in particular, I'm now using .set_transform and .map to lazily apply 1) the prompts (if provided), 2) prompt lengths (if needed for pooling), and 3) dataset name (if needed for determining which loss to use). The implementation now lives as a 1-time "update" of the provided train/eval datasets, so the model card can easily fetch samples that include the prompts.
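
For illustration, a rough sketch of the lazy set_transform idea (not the code from this PR; the prompt and column name are examples, and the prompt-length bookkeeping is omitted):

from datasets import load_dataset

def add_prompts(batch):
    # Applied lazily at access time, so no new cache files or copies of the data are created.
    prompt = "Represent this sentence for searching relevant passages: "
    batch["query"] = [prompt + text for text in batch["query"]]
    return batch

dataset = load_dataset("sentence-transformers/natural-questions", split="train")
dataset.set_transform(add_prompts)
print(dataset[0]["query"])
# Represent this sentence for searching relevant passages: when did richmond last play in a preliminary final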

I've also trained 2 near-identical models, one without prompts and one with prompts.

The former consistently performs slightly worse than the model with the prompts:
[image: evaluation results comparing the two models]

Also, the prompts model shows the prompts in the model card easily: https://huggingface.co/tomaarsen/mpnet-base-nq-prompts#natural-questions

Lastly, I built an extensive training suite for this feature because there are a LOT of moving parts between training, evaluation, iterable datasets, and the various prompt formats.

I'm curious about your thoughts on my proposal @ArthurCamara, as I know you're using this yourself too! And one final question:

  • Do you think that prompts should be a parameter in SentenceTransformerTrainer or in SentenceTransformerTrainingArguments?
  • Tom Aarsen

@ArthurCamara
Contributor Author

Heya @ArthurCamara,

I've overhauled the prompt prepending once more, as I still had some slight concerns with the previous implementations after some experimentation. You have worked on 2 implementations, and I'm now proposing a third as well:

  1. 'Greedily' .map over each dataset to add the prompt string to each dataset.
  2. 'Lazily' prepends prompts in the data collator in the Trainer.
  3. Use .set_transform for Dataset(Dict) and .map for IterableDataset(Dict) to add the prompt string to each dataset.

I had concerns with the first two:

  1. I'm wary that this results in large memory usage and/or cache files.
  2. During model card generation, I sample from the datasets to use in the model card (example), I'd also like for the prompts to be included, which isn't the case if the prompt prepending only exists in the Trainer.

Adding the prompts to the model card is something very useful that I hadn't thought of. Nice.

After getting some valuable recommendations by the Datasets team and @lhoestq in particular, I'm now using .set_transform and .map to lazily apply 1) the prompts (if provided), 2) prompt lengths (if needed for pooling), and 3) dataset name (if needed for determining which loss to use). The implementation now lives as a 1-time "update" of the provided train/eval datasets, so the model card can easily fetch samples that include the prompts.

Nice to learn something new, I didn't know about set_transform. This is a cleaner solution than doing multiple passes over the datasets.

I've also trained 2 near-identical models:

Also, the prompts model shows the prompts in the model card easily: https://huggingface.co/tomaarsen/mpnet-base-nq-prompts#natural-questions

Neat. I like the way prompting helps disentangle the representations of queries and documents, even in smaller models.

Lastly, I built an extensive training suite for this feature because there are a LOT of moving parts between training, evaluation, iterable datasets, and the various prompt formats.

I'm curious about your thoughts on my proposal @ArthurCamara, as I know you're using this yourself too! And one final question:

  • Do you think that prompts should be a parameter in SentenceTransformerTrainer or in SentenceTransformerTrainingArguments?

Good question. I want to say it should be in the Arguments, so it can be easily swapped out when testing different configurations. But I'm not sure how good the UX will be when passing a double-nested dictionary as an argument to a training script (of course, reading it from a JSON/YAML file is also an option).

  • Tom Aarsen

@tomaarsen merged commit 7be3eac into UKPLab:master on Nov 8, 2024
9 checks passed
@tomaarsen
Collaborator

Thanks a bunch for spearheading this. I didn't expect that the prompts would have such a notable impact (0.66% and 0.90% relative NDCG@10 across mpnet-base and bert-base-uncased, respectively), but I'm glad that they do.

This will be included as one of the 4 major features in Monday's v3.3 release, alongside the NanoBEIREvaluator, which will be another major feature. I really appreciate your work on these.

  • Tom Aarsen
