System Info

transformers version: 4.45.0.dev0

Who can help?

@zu

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
There is a typo in the following lines in LlavaNextProcessor: current_width and current_height are inverted, which can cause errors due to a mismatch between the image feature size computed by the processor and the one computed by the vision branch in LlavaNextForConditionalGeneration. I encountered this issue while running the following example script.
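For illustration, the inversion presumably looks like this (my paraphrase of the suspected assignment inside _get_unpadded_features, not a verbatim copy of the source; refer to the linked lines for the exact code):

# Hypothetical sketch of the suspected swap in LlavaNextProcessor._get_unpadded_features.
current_width = patches_height * scale_height   # should be assigned to current_height
current_height = patches_width * scale_width    # should be assigned to current_width
# All downstream aspect-ratio and padding arithmetic then runs on a transposed
# patch grid, so the processor's unpadded feature count can disagree with the
# count produced by unpad_image in the vision branch.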
Here is a code snippet to reproduce the issue:
from transformers import LlavaNextProcessor
from transformers.models.llava_next.processing_llava_next import select_best_resolution
from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
import torch

POSSIBLE_RESOLUTIONS = [
    [336, 672],
    [672, 336],
    [672, 672],
    [1008, 336],
    [336, 1008],
]

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

HEIGHT = 500
WIDTH = 316
VISION_MODEL_INPUT_SIZE = 336
PATCH_SIZE = 14
PATCH_DIM = VISION_MODEL_INPUT_SIZE // PATCH_SIZE

# Reproduce pre-processing steps in the processor
height_best_resolution, width_best_resolution = select_best_resolution(
    [HEIGHT, WIDTH], POSSIBLE_RESOLUTIONS
)
scale_height, scale_width = (
    height_best_resolution // VISION_MODEL_INPUT_SIZE,
    width_best_resolution // VISION_MODEL_INPUT_SIZE,
)
patches_height = VISION_MODEL_INPUT_SIZE // PATCH_SIZE
patches_width = VISION_MODEL_INPUT_SIZE // PATCH_SIZE
unpadded_features, newline_features = processor._get_unpadded_features(
    HEIGHT, WIDTH, patches_height, patches_width, scale_height, scale_width
)
num_unpad_features_from_processor = unpadded_features

# Reproduce computation of unpadded features in the vision branch
# Equivalent to:
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L676-L684
num_patch_height, num_patch_width = get_anyres_image_grid_shape(
    (HEIGHT, WIDTH),
    POSSIBLE_RESOLUTIONS,
    VISION_MODEL_INPUT_SIZE,
)
unpad_features_from_vision = unpad_image(
    torch.randn(128, num_patch_height * PATCH_DIM, num_patch_width * PATCH_DIM),
    (HEIGHT, WIDTH),
)
num_unpad_features_from_vision = unpad_features_from_vision.shape[1] * unpad_features_from_vision.shape[2]

# Should be equal
assert num_unpad_features_from_processor == num_unpad_features_from_vision, (
    f"Not equal: From processor: {num_unpad_features_from_processor}, from vision {num_unpad_features_from_vision}"
)
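To see why the orientation matters, here is a standalone back-of-the-envelope version of the same unpadding arithmetic (my own sketch, not the library source), evaluated for the 500x316 example above, where select_best_resolution picks (672, 336), i.e. a 48x24 patch grid:

# Paraphrased anyres unpadding logic: crop whichever dimension was padded
# when the image was fitted to the patch grid.
def expected_unpadded_features(height, width, current_height, current_width):
    if width / height > current_width / current_height:
        # Image is relatively wider than the grid: height was padded, crop it.
        new_height = int(height * current_width / width)
        padding = (current_height - new_height) // 2
        current_height -= 2 * padding
    else:
        # Image is relatively taller than the grid: width was padded, crop it.
        new_width = int(width * current_height / height)
        padding = (current_width - new_width) // 2
        current_width -= 2 * padding
    return current_height * current_width

print(expected_unpadded_features(500, 316, 48, 24))  # correct orientation: 38 * 24 = 912
print(expected_unpadded_features(500, 316, 24, 48))  # height/width swapped: 24 * 16 = 384

The first call matches the 38 * 24 = 912 features the vision branch produces via unpad_image; transposing the grid lands in the other branch of the aspect-ratio check and yields a different count, which is exactly the mismatch the assertion above catches.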
Expected behavior
No assertion error.