
[feat] Add query prompts to Information Retrieval Evaluator #2951

Merged
merged 12 commits into UKPLab:master
Sep 30, 2024

Conversation

ArthurCamara
Contributor

@ArthurCamara ArthurCamara commented Sep 23, 2024

Currently, there is no way to add prompts to the Evaluators. This PR adds this capability to the Information Retrieval Evaluator, as discussed in #2848 (comment)
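
For a sense of the intended usage, here is a minimal sketch (the `query_prompt` and `corpus_prompt` argument names are assumptions for illustration; the diff defines the exact signature):

```python
# Hypothetical usage sketch, not the PR's exact API.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")

queries = {"q1": "how do transformers work?"}
corpus = {"d1": "Transformers process all tokens in parallel with self-attention."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    query_prompt="query: ",     # prepended to every query before encoding
    corpus_prompt="passage: ",  # prepended to every corpus document
)
results = evaluator(model)
```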


Older description that needs to move to another PR:

Most models, especially those based on Mistral (e.g., LLM2Vec), are trained with a left-padding tokenizer rather than the right-padding commonly used when evaluating, but this is not covered by the current prompt_length strategy.
Currently, if a batch with mismatched lengths is left-padded, the masking will mostly mask out padding tokens. This PR fixes that by adding a mask_prompt argument to the tokenize function. When this flag is set, the code finds the first non-padding token in each sentence and masks everything between that position and prompt_length, adding a prompt_mask entry to the output dictionary.
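
A minimal sketch of that masking logic (an illustration, not the PR's code; `build_prompt_mask` is a hypothetical helper):

```python
import torch

def build_prompt_mask(attention_mask: torch.Tensor, prompt_length: int) -> torch.Tensor:
    """Mask the prompt tokens of a left-padded batch.

    Returns a (batch, seq_len) tensor that is 1 for the first
    `prompt_length` non-padding tokens of each sequence, 0 elsewhere.
    """
    # In a left-padded batch the attention mask looks like [0, 0, ..., 1, 1, 1];
    # argmax returns the index of the first 1, i.e. the first real token.
    first_token = attention_mask.argmax(dim=1, keepdim=True)  # (batch, 1)
    positions = torch.arange(attention_mask.size(1), device=attention_mask.device)
    # Prompt tokens occupy [first_token, first_token + prompt_length).
    return ((positions >= first_token) & (positions < first_token + prompt_length)).long()
```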

Perhaps a more "elegant" solution would be to replace prompt_length entirely, but this could break Instructor models.

@ArthurCamara ArthurCamara marked this pull request as draft September 23, 2024 09:01
@ArthurCamara ArthurCamara changed the title from "[feat] Add option to mask prompts with left-padded tokenizer" to "[feat] Add option to mask prompts with left-padded tokenizer and corpus and query prompts to IREvaluator" Sep 23, 2024
@ShengYun-Peng

Hi @ArthurCamara, I'm reading the description of the PR and am curious whether we need left padding in text embedding models. I was also thinking about adding a left_pad argument following the LLM training procedure. But we may not need left_pad, because text embedding is not language modeling: it only requires one forward pass to fetch the embedding, while language modeling typically requires multiple forward passes. In language modeling there shouldn't be any gap (pad tokens) between the prompt and the generated output, which is what motivates the left_pad operation. Text embedding, however, can take padding on either the left or the right, so I think right_pad should already satisfy current text embedding models. Let me know your thoughts and correct me if I am wrong.

@ArthurCamara
Contributor Author

Hi @ShengYun-Peng! I agree that, in theory, either left- or right-padding should be fine when training a text embedding model. However, in some cases, especially when training a decoder-only model that uses causal attention (like most LLMs today), we want to use the <EOS> token as the sentence representation, instead of the first token (e.g., the [CLS] token in BERT) or the average of all tokens. This is really hard (and inefficient) to do if the sentences in the batch are right-padded, as the last token then sits at a different position in each sequence.

This is the approach taken by, for instance, E5-Mistral and the NV-Retriever models.
If you are not using the last token as the sentence representation (say, you have modified your attention mechanism to be bi-directional, like the NV-Embed-v2 model), then you should be fine using right padding.
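
To illustrate the difference, a rough sketch (not library code):

```python
import torch

# hidden_states: (batch, seq_len, dim) last-layer output of the model
# attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding

def last_token_pool(hidden_states, attention_mask, left_padded: bool):
    if left_padded:
        # Every sequence ends at the final position, so the <EOS>
        # representation is just the last hidden state: one cheap slice.
        return hidden_states[:, -1]
    # Right padding: the last real token sits at a different index per
    # sequence, so it has to be gathered explicitly.
    last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]
```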

@ArthurCamara
Contributor Author

That being said, I've just noticed an issue in the code that will be harder to fix than I expected. So, I'm reverting these changes and keeping only the prompts to the evaluators in this PR for now.

@ArthurCamara ArthurCamara changed the title from "[feat] Add option to mask prompts with left-padded tokenizer and corpus and query prompts to IREvaluator" to "[feat] Add query prompts to Information Retrieval Evaluator" Sep 25, 2024
@tomaarsen
Collaborator

Sounds good!
I'm glad that most "LLM-converted embedding models" use last-token pooling, as it's comparatively easy to implement.

I think this PR looks good at a glance. Do you want to mark it ready for review?

  • Tom Aarsen

@ArthurCamara ArthurCamara marked this pull request as ready for review September 25, 2024 10:08
@ArthurCamara ArthurCamara changed the title from "[feat] Add query prompts to Information Retrieval Evaluator" to "[feat] Add query prompts to Information Retrieval Evaluator" Sep 27, 2024
@tomaarsen
Collaborator

Thanks a bunch for this! I appreciate it.

  • Tom Aarsen

@tomaarsen tomaarsen merged commit 4f43de6 into UKPLab:master Sep 30, 2024
11 checks passed
tomaarsen pushed a commit to ArthurCamara/sentence-transformers that referenced this pull request Sep 30, 2024
…#2951)

* Added the possibility of masking the prompts if the tokenizer is left-padded.

* Simplify code

* Remove unrelated changes

* Move prompt_mask into the Transformer model

* Added query and corpus prompts to Information Retrieval Evaluator

* Fix for failing test

* Fix for pooling when mask is not passed

* Fix device placement for prompt_mask

* Revert left-padding changes

* Revert left-padding changes
@ShengYun-Peng

Thank you so much for the explanation! I agree that if we are using the eos token for embedding, left padding will make it easier.

tomaarsen pushed a commit to alperctnkaya/sentence-transformers that referenced this pull request Oct 17, 2024