Fix: llava-next fails to consider padding_side during Training #32386
Conversation
Hey @jp1924! Thanks for opening a PR! We usually suggest using left padding in inference mode; can you give an example of when you have to use "right" padding?
IMO the error message in this case is indeed not self-explanatory, so we can try to raise a better error/warning. WDYT?
@zucchini-nlp, thank you for your response. Yes, I agree with improving the error/warning message, though I did not fully understand what you meant. Recently, many users have been setting padding_side to left not only during inference but also during training. The modeling_llava_next.py file has been implemented to allow self.padding_side to be user-defined, but it is not used anywhere. If the intention was to have padding_side as right during training and left during inference, it would be better to close this PR.
Ah, this is a trend I wasn't aware of. Out of curiosity, what is the reason behind this? If that is a commonly used thing, we can change the code as you propose and simply raise a warning that [...]
@zucchini-nlp When handling ignore_idx for prompts, using left padding makes it easier to ignore the padding and the question parts that are not part of the answer, which simplifies the preprocessing code. For example, with:

```
<pad><pad><pad>### User:
Hello
### Assistant:
World
```

we can mark everything before the answer with ignore_idx so that it is excluded from the loss calculation, and left padding allows this to be done with simple slicing (a rough sketch follows below). Another reason is consistency: it's better to keep the padding side the same rather than switching back and forth between training and inference. Repositories commonly used for training LLMs, such as TRL, often have code written with the assumption of left padding, and with the rise of instruction tuning in popular language models like Llama, Qwen, and Mistral, the use of left padding has also gained popularity.
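A rough sketch of the preprocessing described above (a hypothetical helper, not code from this PR; it assumes -100 as the ignore index and that the answer length per example is already known):

```python
import torch

IGNORE_IDX = -100  # label value that the loss calculation skips


def build_labels_left_padded(input_ids: torch.Tensor, answer_lengths: list) -> torch.Tensor:
    """With left padding, pads and the user turn sit at the front of each row and the
    answer sits at the end, so masking everything else is a single slice per example."""
    labels = input_ids.clone()
    for i, answer_len in enumerate(answer_lengths):
        labels[i, :-answer_len] = IGNORE_IDX  # pads + question are excluded from the loss
    return labels
```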
Not sure the code you provided enforces left padding, as it only masks out the parts with the "Human's turn". All the operations that ignore padding are done when calling [...]. But if there are people willing to train/tune with left padding, we can remove the hard check on [...].
Perfect, thanks for working on this! I recommend making the message more detailed and warning only once, otherwise the terminal will be filled with this warning on every forward pass.
Also, can you run `make fix-copies` so that the changes are propagated to the llava-next-video model as well? And adapt the test we have for padding-side given the changes:
transformers/tests/models/llava_next/test_modeling_llava_next.py
Lines 475 to 507 in f5f1e52
```python
@slow
@require_bitsandbytes
def test_padding_side_when_merging_inputs(self):
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "llava-hf/llava-v1.6-mistral-7b-hf",
        load_in_4bit=True,
    )

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    lowres_url = "https://4.img-dpreview.com/files/p/TS560x560~forums/56876524/03975b28741443319e9a94615e35667e"
    cats_image = Image.open(requests.get(url, stream=True).raw)
    lowres_img = Image.open(requests.get(lowres_url, stream=True).raw)

    inputs_batched = self.processor(
        [self.prompt, self.prompt], images=[lowres_img, cats_image], return_tensors="pt", padding=True
    ).to(torch_device)

    # model is in eval mode by default so we should get pad on the left side
    # we can check the first hidden-states (aka inputs embeds)
    # the first element was lo-res image and we expect the first 1414 tokens to be all pads
    output_eval = model(**inputs_batched, output_hidden_states=True)
    self.assertTrue((output_eval.hidden_states[0][0, :1414, ...] == 0).all().item())

    # otherwise padding is on the right side, so it's last 1414 tokens
    self.processor.padding_side = "right"
    inputs_batched = self.processor(
        [self.prompt, self.prompt], images=[lowres_img, cats_image], return_tensors="pt", padding=True
    ).to(torch_device)

    model.train()
    with torch.no_grad():
        output_train = model(**inputs_batched, output_hidden_states=True)
    self.assertTrue((output_train.hidden_states[0][0, -1414:, ...] == 0).all().item())
```
my traceback in #29850. Two comments:
I am not very pro of raising a [...]. For the potential [...]
Not sure we are talking about the same ValueError. I was referring to the specific error that gets raised when the inputs were prepared with a different padding from what the model expects, i.e. the error in the traceback in the original PR comment (line 626 of the code), whereas you seem to be talking about raising an error if the side is not what is 'conventional'. But potentially I misunderstood you?
@alyakin314 Got you! What you suggest is to add more details to the error message, which I didn't see, sorry. Yes, that totally works, it just needs different wording imo. We should still try to infer the padding side when not training though; that is related to the current PR, which changes the way padding behaves in training mode. This should guide users to easier debugging.
word explanation without the variable names?
That is probably true. Presumably by inferring which side has a token == [...].
I meant smth like this: "This prevents correct indexing and breaks batch generation."
Sure, I changed the wording in a PR to the branch this PR is from (that's my only way of editing it without creating a separate PR, right?). Tbh, the only practical difference of catching whether the pad tokens are on the 'wrong' side is that an exception raised at the time of that check will definitively say the padding side is what's wrong, as opposed to the current behavior, where the error my edit touches gets raised and says it's either this or that (because, as we know, it breaks there anyway). Happy to do it, but it's a few extra lines of code for the benefit of just having two separate error messages.
I agree with @alyakin314's opinion that it would be good to notify users when there is a padding_side mismatch. As @zucchini-nlp mentioned above, I think it would be better to modify modeling_llava_next.py#L523-L533 as follows to display a warning/error when a mismatch occurs (I didn't add the actual messages, they are left as placeholders):

```python
_left_padding = torch.any(attention_mask[:, 0] == 0)
_right_padding = torch.any(attention_mask[:, -1] == 0)

left_padding = self.padding_side == "left"
if batch_size > 1:
    # There may be cases where batch_size is 2 or more but there is no padding; in such cases, pass without an else.
    if _left_padding and _right_padding:
        # In the following elif statements, either _left or _right is always False.
        raise ValueError(f"both sides of attention_mask contain zeros, which is invalid. {attention_mask}")
    elif _right_padding and left_padding:
        # error or warning goes here
        # if adding a warning, it would be good to also set left_padding = False to avoid errors
        ...
    elif _left_padding and not left_padding:
        # error or warning goes here
        # if adding a warning, it would be good to also set left_padding = True to avoid errors
        ...
```

What do you think about the above modifications, @alyakin314 and @zucchini-nlp? If I have misunderstood or pointed out something completely different, please let me know. Thank you!
Thanks @jp1924! Is there a reason to base the check on the attention_mask over something like checking the input_ids for pad tokens?
The reason I used attention_mask is partly that the existing code uses it to check padding_side. I agree with the suggestion that it is more reasonable to switch to using input_ids, since attention_mask can be optional.
Unfortunately the current LLaVA implementation doesn't allow one to have no attention mask and will fail down the line in any case. IMO checking input ids is also not perfect, as it forces us to save the pad token in the model's config along with the tokenizer. In general, this padding thing is not the best solution, and we'll get rid of the huge merging block soon.
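For reference, an input_ids-based check would look roughly like this (a sketch only; it assumes the pad token id is available on the model side, which is exactly the extra coupling mentioned above):

```python
import torch


def infer_padding_side_from_input_ids(input_ids: torch.Tensor, pad_token_id: int) -> str:
    """Guess the padding side by checking which end of the rows holds pad tokens."""
    starts_with_pad = bool(torch.any(input_ids[:, 0] == pad_token_id))
    ends_with_pad = bool(torch.any(input_ids[:, -1] == pad_token_id))
    if starts_with_pad and not ends_with_pad:
        return "left"
    if ends_with_pad and not starts_with_pad:
        return "right"
    return "none"  # no padding detected, or padding on both ends
```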
LGTM, thanks for refining warning messages :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@zucchini-nlp It seems like a good idea to include a warning when there is a padding_side mismatch between the tokenizer and the model.
Yes, that is what I meant above as a possible option. We don't actually need to bloat the code with even more warnings; simply modifying the existing one should be enough. Oh, totally forgot: can you check that the test I mentioned is passing and, if not, adapt it to the new changes? (transformers/tests/models/llava_next/test_modeling_llava_next.py, lines 475 to 507 in f5f1e52)
Thanks for adding this to make sure the behaviour is expected! Reviewing as @LysandreJik is off
```python
logger.warning_once(
    "Padding side is set to 'left' but the model is in training mode. For training "
    "it is recommended to set `model.padding_side='right'` and `processor.tokenizer.padding_side='right'`. "
    "If that's intended, ignore this warning"
)
```
Let's see how we get on here and if anyone complains - emitting warnings when there's no action to take can be annoying for users
but isn't this whole field just - seeing warnings that you can't do anything about in your logs during init??
I'm not really sure what you mean by "whole field".
In general warnings should be used sparingly and there should be something the user can do to remove them -- this applies most of the time here -- so I think it's OK. I don't expect this to cause much issue, but experience has shown that people get annoyed when there are sudden, new warning messages they can't remove. Using `warning_once` should mitigate some of this.
@zucchini-nlp, there's a conflict between the test code I wrote and the test code in the main branch. What should I do? Oh, and could you check if there's any issue with the code I worked on? If there is, please let me know, and I'll fix it.
Oops... @zucchini-nlp, @amyeroberts, @alyakin314, my apologies. Embarrassingly, while fiddling with the settings to resolve the conflict, the PR got closed. I've reopened it as #32679, can you check it out? @alyakin314, if you don't mind, could you open a PR against the Fix_llava_next_bug branch again regarding the warning?
What does this PR do?
Hi, Hugging Face
I am opening a PR to address a bug discovered during the training of llava-next.
An error occurs during training when the `padding_side` of the tokenizer is set to `left`. The root cause of this issue is the line `left_padding = True if not self.training else False`. Even though `self.padding_side` is `left`, it is forcibly changed to `right`, leading to a size mismatch in `image_to_overwrite &= val` even if padding is considered. After changing the problematic line to `left_padding = self.padding_side == "left"`, the code functions as expected.
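In short, the change is the following one-line diff (both lines quoted from the explanation above; surrounding code omitted):

```diff
- left_padding = True if not self.training else False
+ left_padding = self.padding_side == "left"
```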
In my opinion, determining the `padding_side` through `self.padding_side` rather than relying on `self.training` provides greater flexibility.

Reproduction
environment
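The original reproduction script and environment details did not survive extraction. Roughly, the failure described above can be reproduced with something along these lines (the checkpoint and prompt are placeholders taken from the existing tests, and the exact traceback depends on the pre-fix code):

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # example checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left"  # the setting that triggers the bug
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://4.img-dpreview.com/files/p/TS560x560~forums/56876524/03975b28741443319e9a94615e35667e",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
prompts = ["[INST] <image>\nWhat is shown in this image? [/INST]"] * 2

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
model.train()  # training mode forced right-padding handling before this fix
outputs = model(**inputs, labels=inputs["input_ids"])  # fails in `image_to_overwrite &= val`
```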
Who can review?
@zucchini-nlp