
Fix: llava-next fails to consider padding_side during Training #32386

Closed

Conversation

@jp1924 (Contributor) commented Aug 2, 2024

What does this PR do?

Hi, Hugging Face

I am opening a PR to address a bug discovered during the training of llava-next.

An error occurs during training when the padding_side of the tokenizer is set to left.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/root/workspace/llava/bug-fix.py", line 37, in <module>
    outputs = model(**input_param)
  File "/root/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/transformers/src/transformers/models/llava_next/modeling_llava_next.py", line 807, in forward
    inputs_embeds, attention_mask, position_ids, labels, _ = self._merge_input_ids_with_image_features(
  File "/root/transformers/src/transformers/models/llava_next/modeling_llava_next.py", line 627, in _merge_input_ids_with_image_features
    raise ValueError(
ValueError: image_to_overwrite.sum()=tensor(4694) != num_image_features=4680 The input provided to the model are wrong. The number of image tokens is 4678 while the number of image given to the model is 2. This prevents correct indexing and breaks batch generation.

The root cause of this issue is the line `left_padding = True if not self.training else False`.

Even though self.padding_side is "left", the padding side is forcibly treated as "right" whenever the model is in training mode, leading to a size mismatch at `image_to_overwrite &= val` even though the inputs were padded consistently.

After changing the problematic line to `left_padding = self.padding_side == "left"`, the code functions as expected.

In my opinion, determining the padding side through self.padding_side rather than relying on self.training provides greater flexibility.
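
To make the mismatch concrete, here is a small standalone illustration (not the transformers source) of how the two rules disagree for a left-padded training batch:

import torch

# Standalone illustration: the first sample is left-padded,
# so its real tokens sit at the end of the row.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])
is_left_padded = bool(torch.any(attention_mask[:, 0] == 0))  # True

training = True
padding_side = "left"  # what the tokenizer actually used

left_padding_current = not training             # current rule: forced to False during training
left_padding_proposed = padding_side == "left"  # proposed rule: follows the configured padding side

print(is_left_padded, left_padding_current, left_padding_proposed)  # True False True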

Reproduction

import torch

from transformers import LlavaNextConfig, LlavaNextForConditionalGeneration, LlavaNextProcessor


model_name_or_path = "llava-hf/llava-v1.6-mistral-7b-hf"
config = LlavaNextConfig.from_pretrained(model_name_or_path)
config.text_config.num_hidden_layers = 1
config.vision_config.num_hidden_layers = 1
image_token_index = torch.tensor([config.image_token_index])

processor = LlavaNextProcessor.from_pretrained(model_name_or_path)
processor.tokenizer.padding_side = "left"

model = LlavaNextForConditionalGeneration(config)

image = torch.zeros(
    [10, 3, config.vision_config.image_size, config.vision_config.image_size], device=model.device, dtype=torch.float32
)

text_1 = "Today, we are thrilled to present LLaVA-NeXT"
text_2 = "with improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks."

text_1_input_ids = processor(text=text_1, return_tensors="pt")["input_ids"][0]
text_2_input_ids = processor(text=text_2, return_tensors="pt")["input_ids"][0]

text_1_input_ids = torch.concat((image_token_index, text_1_input_ids))
text_2_input_ids = torch.concat((image_token_index, text_2_input_ids))


input_param = processor.tokenizer.pad([{"input_ids": text_1_input_ids}, {"input_ids": text_2_input_ids}], return_tensors="pt")
input_param["pixel_values"] = torch.stack((image, image))
input_param["image_sizes"] = torch.tensor([[1980, 2640], [1980, 2640]])
input_param = {k: v.to(model.device) for k, v in input_param.items()}

outputs = model(**input_param)

Environment

- `transformers` version: 4.44.0.dev0
- Platform: Linux-6.8.0-36-generic-x86_64-with-glibc2.39
- Python version: 3.10.14
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- GPU type: NVIDIA A100 80GB PCIe

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@zucchini-nlp (Member) left a comment

Hey @jp1924! Thanks for opening a PR! We usually suggest using left padding in inference mode; can you give an example of when you have to use "right" padding?

IMO the error message in this case is indeed not self-explanatory, so we can try to raise a better error/warning. WDYT?

@jp1924 (Contributor, Author) commented Aug 2, 2024

@zucchini-nlp, thank you for your response.

Yes, I agree with improving the error/warning message.

I did not fully understand what you meant by "can you give an example of when you have to use 'right' padding?" Could you please explain this in more detail?


Recently, many users have been setting the padding_side to left not only during inference
but also during training when using LLMs and LMMs.
However, in the llava-next code, padding_side is automatically set to right when self.training is True,
and to left when self.training is False.
This is not an issue if tokenizer.padding_side is right during training,
but if it is left, it can cause problems like the ones I encountered.

modeling_llava_next.py defines self.padding_side so that it can be user-configured, but it is not used anywhere.
My intention is to modify the code so that the tokenizer's padding_side matches
the model's padding_side, enabling it to process images correctly.

If the intention was to have padding_side as right during training and left during inference, it would be better to close this PR.

@zucchini-nlp (Member)

Recently, many users have been setting the padding_side to left not only during inference
but also during training when using LLMs and LMMs.

Ah, this is a trend I wasn't aware of. Out of curiosity, what is the reason behind this? If that is a commonly used thing, we can change the code as you propose and simply raise a warning that padding is on the left even though the model is in training mode. If that's intended, the warning (which checks self.training) can simply be ignored.

@jp1924 (Contributor, Author) commented Aug 2, 2024

@zucchini-nlp
Using left padding during training reduces the amount of preprocessing code needed.

When handling ignore_idx for prompts, left padding makes it easier to ignore the padding and the question parts that are not part of the answer, simplifying the preprocessing code.

For example, if World is the answer that the LLM should output:

<pad><pad><pad>### User:
Hello

### Assistant:
World

We can mask the parts other than the answer with ignore_idx to exclude them from the loss calculation. Left padding allows this to be done with simple slicing, which offers several advantages (see the sketch below).
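
For illustration, a minimal sketch (a hypothetical helper, not code from this PR) of how left padding keeps the label masking to a single slice:

import torch

IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, answer_len: int) -> torch.Tensor:
    # With left padding, the answer always occupies the last `answer_len`
    # positions of each row, so padding and prompt are masked with one slice.
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    labels[:, -answer_len:] = input_ids[:, -answer_len:]
    return labels

# e.g. labels = build_labels(batch["input_ids"], answer_len=3)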

Another reason is consistency: it's better to keep the padding side the same rather than switching back and forth between padding sides for training and inference, and it aligns with model.generate.

Repositories commonly used for training LLMs, such as TRL, often have code written with the assumption of left padding.

With the rise of instruction tuning in popular language models like Llama, Qwen, and Mistral, the use of left padding has also gained popularity.

@zucchini-nlp (Member)

@jp1924

Not sure the code you provided enforces left padding, as it only masks out the parts with "Human's turn". All the operations that ignore padding are done when calling the super() class DataCollatorForLanguageModeling, which is usually padding-side agnostic.

But if there are people willing to train/tune with left padding, we can remove the hard check on padding_side. I am okay with the PR; can you please add a warning that checks self.training and warns that the user has left padding during training, which is not the conventional way? Add the same warning if not training and padding is on the right. That way we will have more transparency about what is happening, to help one debug their script :)

@zucchini-nlp (Member) left a comment

Perfect, thanks for working on this! I recommend making the message more detailed and warning only once; otherwise the terminal will be filled with this warning for each forward pass.

Also, can you run `make fix-copies` so that the changes are propagated to the llava-next-video model as well? And adapt the test we have for padding-side given the changes:

@slow
@require_bitsandbytes
def test_padding_side_when_merging_inputs(self):
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "llava-hf/llava-v1.6-mistral-7b-hf",
        load_in_4bit=True,
    )

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    lowres_url = "https://4.img-dpreview.com/files/p/TS560x560~forums/56876524/03975b28741443319e9a94615e35667e"
    cats_image = Image.open(requests.get(url, stream=True).raw)
    lowres_img = Image.open(requests.get(lowres_url, stream=True).raw)

    inputs_batched = self.processor(
        [self.prompt, self.prompt], images=[lowres_img, cats_image], return_tensors="pt", padding=True
    ).to(torch_device)

    # model is in eval mode by default so we should get pad on the left side
    # we can check the first hidden-states (aka inputs embeds)
    # the first element was lo-res image and we expect the first 1414 tokens to be all pads
    output_eval = model(**inputs_batched, output_hidden_states=True)
    self.assertTrue((output_eval.hidden_states[0][0, :1414, ...] == 0).all().item())

    # otherwise padding is on the right side, so it's last 1414 tokens
    self.processor.padding_side = "right"
    inputs_batched = self.processor(
        [self.prompt, self.prompt], images=[lowres_img, cats_image], return_tensors="pt", padding=True
    ).to(torch_device)

    model.train()
    with torch.no_grad():
        output_train = model(**inputs_batched, output_hidden_states=True)
    self.assertTrue((output_train.hidden_states[0][0, -1414:, ...] == 0).all().item())

@zucchini-nlp mentioned this pull request Aug 6, 2024
@alyakin314

My traceback is in #29850. Two comments:

  1. Technically this doesn't matter in the case of batch_size = 1, though a warning probably won't hurt anyone much.
  2. There can still be the case, by error or otherwise, where model.padding_side != processor.tokenizer.padding_side, which will often break with a traceback like mine and @jp1924's, which is not very helpful and points in a weird direction. Worth adding to the ValueError message? @jp1924, see the PR to your fork.

@zucchini-nlp (Member)

I am not very much in favor of raising a ValueError, because that would mean we limit users to using only "left" or "right" padding in train/inference. As per the previous discussion, some users want to train with "left" padding, so we'd better give them the freedom to choose the padding side.

For the potential model.padding_side != processor.tokenizer.padding_side mismatch, there is some info in the model docs, but not everyone would look there. So, we can get rid of the "training" check in `if batch_size > 1 and not self.training` and infer the padding side from the inputs in all cases. After inferring, we can raise warnings if the inferred padding doesn't match the convention that inference=left and training=right.

@alyakin314

Not sure we are talking about the same ValueError. I was referring to the specific error that gets raised if the inputs were prepared with a different padding side from what the model expects, i.e. the error in the traceback in the original PR comment (line 626 of the code), whereas you seem to be talking about raising an error if the side is not what is 'conventional'.

But potentially I misunderstood you?

@zucchini-nlp (Member)

@alyakin314 Got you! What you suggest is to add more detail to the error message, which I didn't see, sorry. Yes, that totally works, it just needs different wording imo.

We should still try to infer the padding when not training though; that is related to the current PR, which is changing the way padding behaves in training mode. This should guide users to easier debugging.

@alyakin314 commented Aug 6, 2024

different wording imo

A word explanation without the variable names?

We still should try to infer padding when not training though, that is related to the current PR which is changing the way padding behaves for training mode. This should guide users to easier debugging

That is probably true. Presumably by inferring which side has a token == self.pad_token_id? There are cases where neither side has these (e.g. all sequences are truncated), but that case wouldn't cause issues anyway, if I understand correctly. Is there any other edge case that breaks this?

@zucchini-nlp (Member)

I meant something like this:

"This prevents correct indexing and breaks batch generation."
"Make sure you have the same amount of special image tokens as there are images and that model.padding_side == processor.tokenizer.padding_side"

Is there any other edge case that breaks this?

No, afaik that is the only edge case. This misunderstanding started because we expected users to use right padding when training, which turned out not to be the case for all users.
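
For concreteness, that wording folded into the existing check might look roughly like this (a standalone sketch; in the model it would live in _merge_input_ids_with_image_features, with variable names taken from the traceback above):

import torch

def check_image_tokens(image_to_overwrite: torch.Tensor, num_image_features: int) -> None:
    # Raise a clearer error when the number of image placeholder positions
    # does not match the number of image features to merge.
    if image_to_overwrite.sum() != num_image_features:
        raise ValueError(
            f"{image_to_overwrite.sum()=} != {num_image_features=}. "
            "This prevents correct indexing and breaks batch generation. "
            "Make sure you have as many special image tokens as there are images, "
            "and that model.padding_side == processor.tokenizer.padding_side."
        )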

@alyakin314

Sure, I changed the wording in the PR to the branch this PR is from (that's my only way of editing it without creating a separate PR, right?).

Tbh, the only practical difference of catching whether the pad tokens are on the 'wrong' side is that an exception would be raised at the time of that check that definitively says the padding side is what's wrong, as opposed to the current behavior, where the error my edit touches is raised and says it's either this or that (because, as we know, it breaks there anyway). Happy to do it, but it's a few lines of extra code for the benefit of just having two separate error messages.

@jp1924 (Contributor, Author) commented Aug 7, 2024

I agree with @alyakin314's opinion that it would be good to notify users when there is a mismatch in padding_side between the model and the tokenizer.

As @zucchini-nlp mentioned above, I think it would be better to modify modeling_llava_next.py#L523-L533 as follows to display a warning/error when a mismatch occurs.

I didn't add self.training separately, as this issue seems to occur regardless of whether it's during training or inference.

    _left_padding = torch.any(attention_mask[:, 0] == 0)
    _right_padding = torch.any(attention_mask[:, -1] == 0)

    left_padding = self.padding_side == "left"
    if batch_size > 1:
        # There may be cases where batch_size is 2 or more but there is no padding; in that case, fall through without an else branch.
        if _left_padding and _right_padding:
            # In the following elif branches, either _left_padding or _right_padding is always False.
            raise ValueError(f"both sides of attention_mask have zeros, invalid. {attention_mask}")
        elif _right_padding and left_padding:
            # raise an error or emit a warning here;
            # if it is only a warning, set left_padding = False to avoid the downstream error.
            pass
        elif _left_padding and not left_padding:
            # raise an error or emit a warning here;
            # if it is only a warning, set left_padding = True to avoid the downstream error.
            pass

What do you think about the above modifications, @alyakin314 and @zucchini-nlp?

If I have misunderstood or pointed out something completely different, please let me know.

Thank you.

@alyakin314 commented Aug 7, 2024

Thanks @jp1924!
Any specific reason to use this:

    _left_padding = torch.any(attention_mask[:, 0] == 0)
    _right_padding = torch.any(attention_mask[:, -1] == 0)

over something like

    _left_padding = torch.any(input_ids[:, 0] == self.pad_token_id)
    _right_padding = torch.any(input_ids[:, -1] == self.pad_token_id)

?

Checking the mask just feels like there can be a case where it gets triggered outside of the model.padding_side == processor.tokenizer.padding_side case if someone uses something unusual attention-wise. I haven't tested either, though.

@jp1924 (Contributor, Author) commented Aug 7, 2024

@alyakin314

any specific reason to use this

The reason I used attention_mask is partly that the existing code uses it to check padding_side.
More importantly, the _merge_input_ids_with_image_features method is designed to work under the assumption that attention_mask is always provided as input. Therefore, I didn't feel the need to modify it.

checking the mask just feels like there can be a case it'll get triggered outside of the model.padding_side == processor.tokenizer.padding_side case if someone uses something unusual attention-wise. i haven't tested either though.

I agree with the suggestion that it is more reasonable to switch to using input_ids since attention_mask can be optional.
To implement this change, we would need to add code to generate attention_mask
that matches input_ids in case attention_mask is None.
We should consult with @zucchini-nlp about this.

@zucchini-nlp (Member)

I agree with the suggestion that it is more reasonable to switch to using input_ids since attention_mask can be optional.

Unfortunately the current LLaVa implementation doesn't allow one to have no attention mask and will fail in any case down the line. Imo checking input ids is also not perfect and forces us to save the pad token in the model's config, along with the tokenizer. In general, this padding thing is not the best solution, and we'll get rid of the huge merging block soon.

@zucchini-nlp (Member) left a comment

LGTM, thanks for refining warning messages :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jp1924 (Contributor, Author) commented Aug 7, 2024

@zucchini-nlp
What do you think about the idea in this comment?

It seems like a good idea to include a warning when there is a padding_side mismatch between the tokenizer and the model.

@zucchini-nlp (Member) commented Aug 7, 2024

I didn't add self.training separately, as this issue seems to occur regardless of whether it's during training or inference.

Yes, that is what I meant above as a possible option. We don't actually need to bloat the code with even more warnings; simply modify `if batch_size > 1 and not self.training:` -> `if batch_size > 1:` to infer the padding in all cases.
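
As a rough standalone sketch of that behavior (illustrative helper, not the exact diff in this PR; the real code would use transformers' logger.warning_once, replaced here with warnings.warn to keep the example self-contained):

import warnings

import torch

def infer_left_padding(attention_mask: torch.Tensor, training: bool) -> bool:
    # Infer the padding side from the attention mask for every batch,
    # then warn when it does not match the usual convention
    # (right padding for training, left padding for inference).
    left_padding = bool(torch.any(attention_mask[:, 0] == 0))
    right_padding = bool(torch.any(attention_mask[:, -1] == 0))
    if training and left_padding:
        warnings.warn("Left padding detected while the model is in training mode.")
    elif not training and right_padding:
        warnings.warn("Right padding detected while the model is in inference mode.")
    return left_padding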

Oh, totally forgot: can you check that the test I mentioned is passing and, if not, adapt it to the new changes?

@slow
@require_bitsandbytes
def test_padding_side_when_merging_inputs(self):
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "llava-hf/llava-v1.6-mistral-7b-hf",
        load_in_4bit=True,
    )

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    lowres_url = "https://4.img-dpreview.com/files/p/TS560x560~forums/56876524/03975b28741443319e9a94615e35667e"
    cats_image = Image.open(requests.get(url, stream=True).raw)
    lowres_img = Image.open(requests.get(lowres_url, stream=True).raw)

    inputs_batched = self.processor(
        [self.prompt, self.prompt], images=[lowres_img, cats_image], return_tensors="pt", padding=True
    ).to(torch_device)

    # model is in eval mode by default so we should get pad on the left side
    # we can check the first hidden-states (aka inputs embeds)
    # the first element was lo-res image and we expect the first 1414 tokens to be all pads
    output_eval = model(**inputs_batched, output_hidden_states=True)
    self.assertTrue((output_eval.hidden_states[0][0, :1414, ...] == 0).all().item())

    # otherwise padding is on the right side, so it's last 1414 tokens
    self.processor.padding_side = "right"
    inputs_batched = self.processor(
        [self.prompt, self.prompt], images=[lowres_img, cats_image], return_tensors="pt", padding=True
    ).to(torch_device)

    model.train()
    with torch.no_grad():
        output_train = model(**inputs_batched, output_hidden_states=True)
    self.assertTrue((output_train.hidden_states[0][0, -1414:, ...] == 0).all().item())

@amyeroberts (Collaborator) left a comment

Thanks for adding this to make sure the behaviour is expected! Reviewing as @LysandreJik is off

logger.warning_once(
    "Padding side is set to 'left' but the model is in training mode. For training "
    "it is recommended to set `model.padding_side='right'` and `processor.tokenizer.padding_side='right'`. "
    "If that's intended, ignore this warning"
)

Collaborator

Let's see how we get on here and if anyone complains - emitting warnings when there's no action to take can be annoying for users


But isn't this whole field just seeing warnings that you can't do anything about in your logs during init??

Collaborator

I'm not really sure what you mean by "whole field".

In general, warnings should be used sparingly and there should be something the user can do to remove them -- this applies most of the time here -- so I think it's OK. I don't expect this to cause much of an issue, but experience has shown that people get annoyed if there are sudden, new warning messages they can't remove. Using warning_once should mitigate some of this.

@jp1924 (Contributor, Author) commented Aug 13, 2024

@zucchini-nlp, there's a conflict between the test code I wrote and the test code in the main branch. What should I do?

Oh, and could you check if there's any issue with the code I worked on? If there is, please let me know, and I'll fix it.

@jp1924 (Contributor, Author) commented Aug 14, 2024

Oops... @zucchini-nlp, @amyeroberts, @alyakin314, my apologies. Embarrassingly, while fiddling with the settings to resolve the conflict, the PR got closed. I've reopened it as #32679; can you check it out?

@alyakin314 And if you don't mind, could you open a PR against the Fix_llava_next_bug branch again regarding the warning?
