
[feat] Add query prompts to Information Retrieval Evaluator #2951

Merged
merged 12 commits into UKPLab:master
Sep 30, 2024

Conversation

ArthurCamara
Contributor

@ArthurCamara ArthurCamara commented Sep 23, 2024

Currently, there is no way to add prompts to the Evaluators. This PR adds this capability to the Information Retrieval Evaluator, as discussed in #2848 (comment)
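
For a sense of the intended usage, here is a minimal sketch (the `query_prompt` and `corpus_prompt` argument names are assumptions for illustration; the diff defines the exact signature):

```python
# Hypothetical usage sketch, not the PR's exact API.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")

queries = {"q1": "how do transformers work?"}
corpus = {"d1": "Transformers process all tokens in parallel with self-attention."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    query_prompt="query: ",     # prepended to every query before encoding
    corpus_prompt="passage: ",  # prepended to every corpus document
)
results = evaluator(model)
```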


Older description that needs to move to another PR:

Most models, especially those based on Mistral (e.g., LLM2Vec), are trained with a left-padding tokenizer rather than the right-padding commonly used when evaluating, but this is not covered by the current prompt_length strategy.
Currently, if a batch with mismatched lengths is left-padded, the masking will mostly mask out padding tokens. This PR fixes that by adding a mask_prompt argument to the tokenize function. When this flag is set, the code finds the first non-padding token in each sentence and masks everything between that position and prompt_length, adding a prompt_mask entry to the output dictionary.
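
A minimal sketch of that masking logic (an illustration, not the PR's code; `build_prompt_mask` is a hypothetical helper):

```python
import torch

def build_prompt_mask(attention_mask: torch.Tensor, prompt_length: int) -> torch.Tensor:
    """Mask the prompt tokens of a left-padded batch.

    Returns a (batch, seq_len) tensor that is 1 for the first
    `prompt_length` non-padding tokens of each sequence, 0 elsewhere.
    """
    # In a left-padded batch the attention mask looks like [0, 0, ..., 1, 1, 1];
    # argmax returns the index of the first 1, i.e. the first real token.
    first_token = attention_mask.argmax(dim=1, keepdim=True)  # (batch, 1)
    positions = torch.arange(attention_mask.size(1), device=attention_mask.device)
    # Prompt tokens occupy [first_token, first_token + prompt_length).
    return ((positions >= first_token) & (positions < first_token + prompt_length)).long()
```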

Perhaps a more "elegant" solution would be to replace prompt_length entirely, but this could break Instructor models.

@ArthurCamara ArthurCamara marked this pull request as draft September 23, 2024 09:01
@ArthurCamara ArthurCamara changed the title from "[feat] Add option to mask prompts with left-padded tokenizer" to "[feat] Add option to mask prompts with left-padded tokenizer and corpus and query prompts to IREvaluator" Sep 23, 2024
@ShengYun-Peng

Hi @ArthurCamara, I'm reading the description of the PR and am curious whether we need left padding in text embedding models. I was also thinking about adding a left_pad argument following the LLM training procedure. But we may not need left_pad, because text embedding is not language modeling: it only requires one forward pass to fetch the embedding, while language modeling typically requires multiple forward passes. In language modeling there shouldn't be any gap (pad tokens) between the prompt and the generated output, which is what motivates the left_pad operation. Text embedding, however, can take padding on either the left or the right, so I think right_pad should already satisfy current text embedding models. Let me know your thoughts and correct me if I am wrong.

@ArthurCamara
Contributor Author

Hi @ShengYun-Peng! I agree that, in theory, either left- or right-padding should be fine when training a text embedding model. However, in some cases, especially when training a decoder-only model that uses causal attention (like most LLMs today), we want to use the <EOS> token as the sentence representation, instead of the first token (e.g., the [CLS] token in BERT) or the average of all tokens. This is really hard (and inefficient) to do if the sentences in the batch are right-padded, as the last token then sits at a different position in each sequence.

This is the approach taken by, for instance, E5-Mistral and the NV-Retriever models.
If you are not using the last token as the sentence representation (say, you have modified your attention mechanism to be bi-directional, like the NV-Embed-v2 model), then you should be fine using right padding.
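
To illustrate the difference, a rough sketch (not library code):

```python
import torch

# hidden_states: (batch, seq_len, dim) last-layer output of the model
# attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding

def last_token_pool(hidden_states, attention_mask, left_padded: bool):
    if left_padded:
        # Every sequence ends at the final position, so the <EOS>
        # representation is just the last hidden state: one cheap slice.
        return hidden_states[:, -1]
    # Right padding: the last real token sits at a different index per
    # sequence, so it has to be gathered explicitly.
    last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]
```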

@ArthurCamara
Contributor Author

That being said, I've just noticed an issue in the code that will be harder to fix than I expected. So, I'm reverting these changes and keeping only the prompts to the evaluators in this PR for now.

@ArthurCamara ArthurCamara changed the title from "[feat] Add option to mask prompts with left-padded tokenizer and corpus and query prompts to IREvaluator" to "[feat] Add query prompts to Information Retrieval Evaluator" Sep 25, 2024
@tomaarsen
Collaborator

Sounds good!
I'm glad that most "LLM-converted embedding models" use last-token pooling, as it's comparatively easy to implement.

I think this PR looks good at a glance. Do you want to mark it ready for review?

  • Tom Aarsen

@ArthurCamara ArthurCamara marked this pull request as ready for review September 25, 2024 10:08
@ArthurCamara ArthurCamara changed the title from "[feat] Add query prompts to Information Retrieval Evaluator" to "[feat] Add query prompts to Information Retrieval Evaluator" Sep 27, 2024
@tomaarsen
Collaborator

Thanks a bunch for this! I appreciate it.

  • Tom Aarsen

@tomaarsen tomaarsen merged commit 4f43de6 into UKPLab:master Sep 30, 2024
11 checks passed
tomaarsen pushed a commit to ArthurCamara/sentence-transformers that referenced this pull request Sep 30, 2024
…#2951)

* Added the possibility of masking the prompts if the tokenizer is left-padded.

* Simplify code

* Remove unrelated changes

* Move prompt_mask into the Transformer model

* Added query and corpus prompts to Information Retrieval Evaluator

* Fix for failing test

* Fix for pooling when mask is not passed

* Fix device placement for prompt_mask

* Revert left-padding changes

* Revert left-padding changes
@ShengYun-Peng

Thank you so much for the explanation! I agree that if we are using the eos token for embedding, left padding will make it easier.

tomaarsen pushed a commit to alperctnkaya/sentence-transformers that referenced this pull request Oct 17, 2024