[feat] Add query prompts to Information Retrieval Evaluator #2951
Conversation
…tence-transformers into Prompting-on-evaluators
Hi @ArthurCamara, I'm reading the description of the PR and curious whether we need left padding in the text embedding models. I was also thinking about adding a left_pad argument following the LLM training procedure. But we may not need left_pad because text embeddings are not language modeling, and it only requires one forward pass to fetch the embedding, while language modeling typically requires multiple forward passes. Thus, there shouldn't be any gap (pad tokens) between the prompt and generated output in language modeling, leading to the left_pad operation. However, text embedding can take padding either on left or right, thus I think right_pad should already satisfy current text embedding models. Lmk your thoughts on this and correct me if I am wrong. |
Hi @ShengYun-Peng! I agree that, in theory, either left or right padding should be fine when training a text embedding model. However, in some cases, especially when training a decoder-only model that uses causal attention (like most LLMs today), we usually want to use the embedding of the last (EOS) token as the sentence embedding, and left padding keeps that token at a fixed final position. This is the approach taken by, for instance, E5-Mistral and the NV-Retriever models.
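For readers following along, here is a minimal sketch (not code from this PR or the sentence-transformers implementation) of why left padding is convenient for last-token pooling: with left padding the EOS embedding always sits at the final position, while right padding requires a per-sequence gather.

```python
import torch

# Minimal sketch: pooling the last-token (EOS) embedding from a causal
# decoder-only model, for left- and right-padded batches.
# token_embeddings: (batch, seq_len, dim)
# attention_mask:   (batch, seq_len), 1 = real token, 0 = padding
def last_token_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    if bool(attention_mask[:, -1].all()):
        # Left padding: every sequence ends at the final position, so the
        # EOS embedding can be read directly from index -1.
        return token_embeddings[:, -1]
    # Right padding: the last real token sits at a different index per
    # sequence, so it has to be gathered via the attention mask.
    last_idx = attention_mask.sum(dim=1) - 1                                   # (batch,)
    batch_idx = torch.arange(token_embeddings.size(0), device=token_embeddings.device)
    return token_embeddings[batch_idx, last_idx]
```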
That being said, I've just noticed an issue in the code that will be harder to fix than I expected. So, I'm reverting these changes and keeping only the prompts to the evaluators in this PR for now.
Sounds good! I think this PR looks good at a glance. Do you want to mark it ready for review?
[feat] Add query prompts to Information Retrieval Evaluator
Thanks a bunch for this! I appreciate it.
…#2951)

* Added the possibility of masking the prompts if the tokenizer is left-padded.
* Simplify code
* Remove unrelated changes
* Move prompt_mask into the Transformer model
* Added query and corpus prompts to Information Retrieval Evaluator
* Fix for failing test
* Fix for pooling when mask is not passed
* Fix device placement for prompt_mask
* Revert left-padding changes
* Revert left-padding changes
Thank you so much for the explanation! I agree that if we are using the eos token for embedding, left padding will make it easier.
Currently, there is no way to add prompts to the Evaluators. This PR adds this functionality to the Information Retrieval Evaluator, as discussed in #2848 (comment)
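For context, a hedged usage sketch of what this feature enables. The parameter names `query_prompt` and `corpus_prompt` are taken from this PR's description; the exact signature in a given release may differ.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Toy retrieval setup: ids mapped to texts, plus the relevance judgments.
queries = {"q1": "how does left padding affect last-token pooling?"}
corpus = {
    "d1": "Left padding keeps the EOS token at the final position.",
    "d2": "Right padding appends pad tokens after the sentence.",
}
relevant_docs = {"q1": {"d1"}}

# query_prompt is prepended to every query before encoding; the corpus is
# encoded without a prompt here (argument names follow the PR description).
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    query_prompt="Instruct: Given a question, retrieve relevant passages.\nQuery: ",
    name="toy-ir",
)
metrics = evaluator(model)
```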
Older description that needs to go into another PR:
Most models, especially those based on Mistral, are trained with left-padding in the tokenizer instead of the more common right-padding used when evaluating (e.g., LLM2Vec), but this is not covered by the current `prompt_length` strategy. Currently, if a batch with mismatched lengths is left-padded, the masking will mostly mask out padding tokens. This PR fixes that by adding a `mask_prompt` argument to the tokenize function. When this flag is set, the code tries to find the first non-padding token in each sentence and masks everything between that token and `prompt_length`, adding a `prompt_mask` representation to the output dictionary. Perhaps a more "elegant" solution would be to replace `prompt_length` entirely, but this could break Instructor models.
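To make the reverted idea concrete, here is a hedged sketch of the masking step described above (an illustration, not the PR's actual implementation): find where the real tokens start in each left-padded sequence and mark the following `prompt_length` tokens in a `prompt_mask`.

```python
import torch

def build_prompt_mask(attention_mask: torch.Tensor, prompt_length: int) -> torch.Tensor:
    """Illustrative only: mark prompt tokens in a left-padded batch.

    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Returns a (batch, seq_len) mask with 1 on the first `prompt_length`
    real tokens of each sequence.
    """
    batch_size, seq_len = attention_mask.shape
    # In a left-padded batch all padding sits at the front, so the first real
    # token starts right after the pad tokens.
    num_pad = seq_len - attention_mask.sum(dim=1)                       # (batch,)
    start = num_pad.unsqueeze(1)                                        # (batch, 1)
    end = start + prompt_length
    positions = torch.arange(seq_len, device=attention_mask.device)     # (seq_len,)
    prompt_mask = (positions >= start) & (positions < end)              # (batch, seq_len)
    return prompt_mask.long()

# Pooling could then exclude the prompt tokens, e.g.:
# effective_mask = attention_mask * (1 - prompt_mask)
```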