[Model] Aya Vision #15441

Merged

merged 51 commits into vllm-project:main on Apr 1, 2025

Conversation

JenZhao
Contributor

@JenZhao JenZhao commented Mar 25, 2025

CLOSES #14216

Introduction

This PR introduces support for the Aya Vision models by CohereForAI. Aya Vision models excel in multilingual and multimodal tasks, significantly advancing performance in vision-language understanding.

Supported models: CohereForAI/aya-vision-8b, CohereForAI/aya-vision-32b

For more details on Aya Vision training, see: A Deep Dive into Aya Vision: Advancing the Frontier of Multilingual Multimodality

Example Usage

Example inference with single or multiple images:

Single image inference:

python examples/offline_inference/vision_language.py --model_type aya_vision

Multi-image inference:

python examples/offline_inference/vision_language_multi_image.py --model_type aya_vision

Serving the Aya Vision 32B model:

vllm serve CohereForAI/aya-vision-32b --disable-log-requests -tp 2 --limit-mm-per-prompt image=5
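
Once the server is up, it can be queried through vLLM's OpenAI-compatible API. Below is a minimal sketch using the openai Python client; the port, image URL, and prompt are placeholders, not part of this PR.

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="CohereForAI/aya-vision-32b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            # Up to 5 images per request, matching --limit-mm-per-prompt image=5.
            {"type": "image_url", "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)

The --limit-mm-per-prompt image=5 flag in the serve command caps each request at five images.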

JenZhao added 3 commits March 24, 2025 13:50
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 25, 2025
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Mar 27, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
@JenZhao JenZhao marked this pull request as ready for review March 27, 2025 22:01
@JenZhao JenZhao changed the title [Draft] Aya Vision [Model] Aya Vision Mar 27, 2025
@JenZhao
Contributor Author

JenZhao commented Mar 27, 2025

@saurabhdash

Ready for review. It took me a while to get the num_patches back haha.

Do you know how I could get the max_model_len for both 8B and 32B? Thank you!

@saurabhdash

@saurabhdash

I notice the CohereConfig here:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/configuration_cohere.py#L160-L167

has the following default

        max_position_embeddings=8192

In vLLM, this section is used to determine the maximum model length. Since max_position_embeddings=8192 is smaller than model_max_length=16384, it ends up defaulting to 8192.

Could you please add max_position_embeddings=8192 under text_config as well? Thank you!

It would be great if you could ensure that all configurations under text_config match the language model you are using.

Here is the Cohere config we get with the current setup:

CohereConfig {
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 5,
  "eos_token_id": 255001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 24576,
  "layer_norm_eps": 1e-05,
  "logit_scale": 0.0625,
  "max_position_embeddings": 8192,
  "model_max_length": 16384,
  "model_type": "cohere",
  "num_attention_heads": 64,
  "num_hidden_layers": 40,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "rope_scaling": null,
  "rope_theta": 4000000,
  "torch_dtype": "float16",
  "transformers_version": "4.51.0.dev0",
  "use_cache": true,
  "use_qk_norm": false,
  "vocab_size": 256000
}

Hi!
Thanks for bringing this to my notice! I'll add it.
We should move to the Cohere2 config and architecture because that's what the 8B one uses. It should also enable backward compatibility with the 32B.

Also, #15441 (comment)

JenZhao added 2 commits March 31, 2025 22:50
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
@saurabhdash

> Could you please add max_position_embeddings=8192 under text_config as well? Thank you!

Added the key: https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26
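
As a quick sanity check of the updated config, the derived text_config can be inspected locally. A minimal sketch, assuming access to the gated repo and a transformers version that includes Aya Vision support:

from transformers import AutoConfig

# Inspect the text_config that transformers derives for the model.
cfg = AutoConfig.from_pretrained("CohereForAI/aya-vision-8b")
text_cfg = cfg.text_config
print(text_cfg.model_type)
print(getattr(text_cfg, "max_position_embeddings", None))  # should now be set explicitly
print(getattr(text_cfg, "model_max_length", None))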

@JenZhao
Contributor Author

JenZhao commented Mar 31, 2025

> Added the key: https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26

Thank you! Testing it now.

@JenZhao
Contributor Author

JenZhao commented Mar 31, 2025

https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26

@saurabhdash Just to confirm: is max_position_embeddings supposed to be 8192 with the 16k context window?

https://huggingface.co/CohereForAI/aya-vision-32b/blob/main/config.json#L22 https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26

@saurabhdash

saurabhdash commented Mar 31, 2025

IIRC, back when we merged the first Command R model, we wanted to enable 128k context on vLLM while keeping it capped to 8k for HF -- hence the two different lengths. I believe model_max_length should be the one used for longer context lengths?

@JenZhao
Contributor Author

JenZhao commented Mar 31, 2025

In that case, I think this vLLM logic needs to be updated:

vllm/vllm/config.py

Lines 2693 to 2719 in f98a492

derived_max_model_len = float("inf")
possible_keys = [
# OPT
"max_position_embeddings",
# GPT-2
"n_positions",
# MPT
"max_seq_len",
# ChatGLM2
"seq_length",
# Command-R
"model_max_length",
# Whisper
"max_target_positions",
# Others
"max_sequence_length",
"max_seq_length",
"seq_len",
]
# Choose the smallest "max_length" from the possible keys.
max_len_key = None
for key in possible_keys:
max_len = getattr(hf_config, key, None)
if max_len is not None:
max_len_key = key if max_len < derived_max_model_len \
else max_len_key
derived_max_model_len = min(derived_max_model_len, max_len)

cc @ywang96 @DarkLight1337 Could this be made per-model? e.g. Cohere would use model_max_length only.
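
For illustration, a minimal sketch of how a per-model preference could sit on top of the key scan above; the PREFERRED_LEN_KEY table and the helper are hypothetical, not vLLM's actual implementation:

# Hypothetical sketch only -- not vLLM's actual implementation. It illustrates
# letting certain model types (e.g. Cohere) prefer model_max_length over the
# generic "smallest key wins" scan shown above.
POSSIBLE_KEYS = [
    "max_position_embeddings", "n_positions", "max_seq_len", "seq_length",
    "model_max_length", "max_target_positions", "max_sequence_length",
    "max_seq_length", "seq_len",
]

# Per-model preference table (assumed for illustration).
PREFERRED_LEN_KEY = {"cohere": "model_max_length", "cohere2": "model_max_length"}


def derive_max_model_len(hf_config) -> float:
    preferred = PREFERRED_LEN_KEY.get(getattr(hf_config, "model_type", ""))
    if preferred is not None:
        value = getattr(hf_config, preferred, None)
        if value is not None:
            return value
    # Fallback: the smallest value among the known length keys.
    derived = float("inf")
    for key in POSSIBLE_KEYS:
        value = getattr(hf_config, key, None)
        if value is not None:
            derived = min(derived, value)
    return derived

With the config dumped earlier in this thread, such a preference would return 16384 for model_type "cohere" instead of 8192.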

@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

All updated and tested again. The model now picks up the 16k context length.

vllm serve CohereForAI/aya-vision-32b --disable-log-requests -tp 2 --limit-mm-per-prompt image=5
python -m eval.run eval_vllm --model_name CohereForAI/aya-vision-32b \
        --url http://0.0.0.0:8000 \
        --output_dir ~/tmp \
        --eval_name "mmmu"
Waiting for VLLM server to come online at http://0.0.0.0:8000/health ...
Timeout is 120s
Waiting for server (0s) ...
Waiting for server (5s) ...
Waiting for server (10s) ...
Waiting for server (15s) ...
Waiting for server (20s) ...
Waiting for server (25s) ...
Server is up!
Loading lmms-lab/MMMU [validation]: 100%|█| 900/9
Querying model: 100%|█| 900/900 [11:04<00:00,  1.
100%|██████████████████████████████████| 900/900 [00:00<00:00, 25590.63it/s]
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.46444444444444444,
    "anywhere_in_answer_relaxed_correctness": 0.46444444444444444
}
================================================================================

@saurabhdash I wonder if we should verify on your eval as well.

Member

@ywang96 ywang96 left a comment

Thanks for the great work! I left some final nits.

    trust_remote_code=True)
messages = [[{
    'role': 'user',
    'content': f"<image>\n{question}"
Member

@JenZhao Can we actually change this to the plain prompt? We have it already in the test pipeline too.
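
For reference, one way to obtain such a plain, pre-formatted prompt string is via the model's own chat template. A sketch, assuming processor-level chat-template support for Aya Vision and access to the gated repo; the test pipeline may define its prompt differently:

from transformers import AutoProcessor

# Build the chat-formatted prompt string from the model's own template
# instead of hard-coding "<image>\n{question}".
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ],
}]
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False)
print(prompt)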

Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

updated

pytest tests/models/decoder_only/vision_language/test_models.py -k "aya_vision"
===================================== warnings summary =====================================
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case0]
  /home/jovyan/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=235778) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case1]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0566351413726807, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1347601413726807, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9785101413726807, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.0722601413726807, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.4628851413726807, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case1]
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>'
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0628242492675781, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1253242492675781, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9846992492675781, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.062824249267578, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.469074249267578, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0628327131271362, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1253327131271362, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9847077131271362, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.062832832336426, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.469082832336426, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0627280473709106, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1252280473709106, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9846030473709106, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.062727928161621, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.468977928161621, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>'
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>'
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red stop sign. It is positioned prominently in the foreground, with its bold white letters spelling out "STOP" clearly visible. The sign is mounted on a pole and stands out against the backdrop of a vibrant, bustling street scene. The street is lined with various shops, restaurants, and buildings, creating a lively urban atmosphere. The stop sign serves as a crucial traffic control device, ensuring the safety of pedestrians and vehicles at this busy intersection.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'        {6740: -0.7726932764053345, 19: -1.1476932764053345, 147117: -1.5851932764053345, 1728: -4.272693157196045, 9832: -6.835193157196045}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red octagonal stop sign with the word "STOP" written in white letters. This sign is positioned at the intersection of a street and a pedestrian area, likely indicating that vehicles must come to a complete stop before proceeding. The stop sign is surrounded by a decorative archway with Chinese characters and intricate designs, suggesting that this location might be in a culturally significant or historic area.<|END_RESPONSE|>'       {147117: Logprob(logprob=-1.299865961074829, rank=1, decoded_token='Ġoctagonal'), 6740: Logprob(logprob=-1.362365961074829, rank=2, decoded_token='Ġstop'), 19: Logprob(logprob=-1.690490961074829, rank=3, decoded_token=','), 145073: Logprob(logprob=-2.612365961074829, rank=4, decoded_token='ĠSTOP'), 1728: Logprob(logprob=-2.877990961074829, rank=5, decoded_token='Ġand')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096, 147117, 6740, 3272, 1865, 1690, 5416, 1789, 84939, 9, 7530, 1709, 8044, 19622, 21]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red octagonal stop sign with the word "STOP" written in white letters. The sign is positioned on a red and white striped pole, which is located in front of a traditional Chinese-style gate or archway. The archway features intricate designs and Chinese characters, indicating that this is likely an entrance to a cultural or historical site. The stop sign serves as a traffic control measure, ensuring that vehicles and pedestrians stop before entering the area.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'      {1896: -0.5106861591339111, 2708: -0.9481861591339111, 2332: -4.385685920715332, 255022: -10.885685920715332, 40840: -16.32318687438965}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a red and white striped pole, which is located in front of a traditional Chinese-style entrance archway. The archway features intricate designs and is adorned with red lanterns and decorative elements. The stop sign is a prominent feature in the scene, indicating a place where vehicles should come to a complete halt.<|END_RESPONSE|>'    {2708: Logprob(logprob=-0.8206650614738464, rank=1, decoded_token='ĠThis'), 1896: Logprob(logprob=-0.9925400614738464, rank=2, decoded_token='ĠThe'), 2332: Logprob(logprob=-1.8206651210784912, rank=3, decoded_token='ĠIt'), 255022: Logprob(logprob=-3.898790121078491, rank=4, decoded_token='<|END_RESPONSE|>'), 40840: Logprob(logprob=-5.617539882659912, rank=5, decoded_token='ĠBelow')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0619217157363892, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1244217157363892, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9837967157363892, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.0775465965270996, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.4681715965270996, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 4093, 46861, 1709, 1690, 6156, 1801, 14810, 19, 16509, 3531, 1690, 73464, 157637, 4093, 21, 1896, 43361, 31337, 73464, 174821, 1955, 1709, 5003, 87096, 19, 9835, 1671, 31796]
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beautiful floral displays and is celebrated in many cultures around the world.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'      {17130: -0.4074851870536804, 67981: -1.0949852466583252, 6957: -9.157485008239746, 11596: -10.469985008239746, 6844: -15.157485008239746}
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning backdrop against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>' {67981: Logprob(logprob=-0.7247865796089172, rank=1, decoded_token='Ġbackdrop'), 17130: Logprob(logprob=-0.8810365796089172, rank=2, decoded_token='Ġcontrast'), 11596: Logprob(logprob=-3.2404115200042725, rank=3, decoded_token='Ġvisual'), 6957: Logprob(logprob=-3.2404115200042725, rank=4, decoded_token='Ġdisplay'), 6844: Logprob(logprob=-4.427911758422852, rank=5, decoded_token='Ġnatural')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case1]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9063189625740051, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.3594439029693604, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.4219439029693604, rank=3, decoded_token='Ġprominent'), 42568: Logprob(logprob=-2.5469439029693604, rank=4, decoded_token='Ġstriking'), 12274: Logprob(logprob=-2.5469439029693604, rank=5, decoded_token='Ġrich')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.8973087668418884, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.366058826446533, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.428558826446533, rank=3, decoded_token='Ġprominent'), 12274: Logprob(logprob=-2.553558826446533, rank=4, decoded_token='Ġrich'), 42568: Logprob(logprob=-2.569183826446533, rank=5, decoded_token='Ġstriking')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9059003591537476, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.359025478363037, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.421525478363037, rank=3, decoded_token='Ġprominent'), 42568: Logprob(logprob=-2.546525478363037, rank=4, decoded_token='Ġstriking'), 12274: Logprob(logprob=-2.546525478363037, rank=5, decoded_token='Ġrich')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9052650332450867, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.3583900928497314, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.4208900928497314, rank=3, decoded_token='Ġprominent'), 12274: Logprob(logprob=-2.5458900928497314, rank=4, decoded_token='Ġrich'), 42568: Logprob(logprob=-2.5615150928497314, rank=5, decoded_token='Ġstriking')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant street scene in an urban area, possibly a Chinatown district. The focal point is a large, ornate red archway adorned with Chinese characters and intricate designs, framing the entrance to a bustling street. A prominent red STOP sign stands out against the backdrop, indicating a busy intersection. The archway is flanked by traditional Chinese statues, adding to the cultural ambiance. The street is lined with various shops and businesses, their signs in English and Chinese, with a prominent Optus advertisement visible. A black SUV is parked on the side of the road, and pedestrians can be seen walking along the'     {15262: -0.5839295387268066, 14159: -0.8339295387268066, 5079: -4.833929538726807, 13438: -13.146429061889648, 1728: -13.708929061889648}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a red, ornate archway adorned with Chinese characters and intricate designs, framing the entrance to what appears to be a cultural or shopping district. A prominent red STOP sign stands out against the archway, indicating a traffic intersection. The background reveals a bustling cityscape with various storefronts, including recognizable brands like Optus and Yes. The street is lined with vehicles, including a black SUV, and pedestrians can be seen walking along the sidewalk. The overall atmosphere suggests a lively, culturally rich urban environment.' {14159: Logprob(logprob=-0.7616890668869019, rank=1, decoded_token='Ġurban'), 15262: Logprob(logprob=-1.0116890668869019, rank=2, decoded_token='Ġstreet'), 5079: Logprob(logprob=-1.9960640668869019, rank=3, decoded_token='Ġcity'), 1728: Logprob(logprob=-4.246064186096191, rank=4, decoded_token='Ġand'), 13438: Logprob(logprob=-4.417939186096191, rank=5, decoded_token='Ġscene')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant street scene in an urban area, possibly a Chinatown district. The focal point is a large, ornate red gate with intricate Chinese architectural details, including a traditional roof and decorative elements. Above the gate, a prominent red STOP sign stands out against the backdrop. The gate is flanked by statues of fierce lions, adding to the cultural significance of the location. The street is bustling with activity, featuring various shops and businesses with signs in both English and Chinese. A black SUV is seen driving past, blending into the urban landscape. The scene is bathed in natural light, highlighting the colors'       {15262: -0.4520649015903473, 14159: -1.014564871788025, 5079: -6.8270649909973145, 1728: -10.389564514160156, 13438: -11.327064514160156}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, framing the entrance to what appears to be a bustling market or cultural district. Above the gate, a large red STOP sign stands out against the backdrop of colorful lanterns and decorative elements. The street is lined with various shops and businesses, their signs in both English and Chinese, indicating a multicultural environment. The presence of a black SUV parked on the side of the road adds a modern touch to the otherwise historic setting. The overall atmosphere is lively,'        {14159: Logprob(logprob=-0.7387250661849976, rank=1, decoded_token='Ġurban'), 15262: Logprob(logprob=-0.9106000661849976, rank=2, decoded_token='Ġstreet'), 5079: Logprob(logprob=-2.660600185394287, rank=3, decoded_token='Ġcity'), 1728: Logprob(logprob=-3.973100185394287, rank=4, decoded_token='Ġand'), 13438: Logprob(logprob=-4.035600185394287, rank=5, decoded_token='Ġscene')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9063189625740051, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.3594439029693604, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.4219439029693604, rank=3, decoded_token='Ġprominent'), 42568: Logprob(logprob=-2.5469439029693604, rank=4, decoded_token='Ġstriking'), 12274: Logprob(logprob=-2.5469439029693604, rank=5, decoded_token='Ġrich')}
    comparator(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================ 8 passed, 194 deselected, 22 warnings in 394.94s (0:06:34) ================

@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

python examples/offline_inference/vision_language_multi_image.py --model_type aya_vision
Processed prompts: 100%|█████████████| 1/1 [00:01<00:00,  1.26s/it, est. speed input: 2371.60 toks/s, output: 101.97 toks/s]
<|START_RESPONSE|>**Image 1:**
The first image features a vibrant green-headed mallard duck gliding gracefully on a serene blue body of water. The duck's distinctive features, including its glossy emerald head, bright yellow beak, and white neck ring, are clearly visible. The surrounding water reflects the duck's image, creating a mirror-like effect. The peaceful scene captures the beauty of nature and the elegance of this aquatic bird.

**Image 2:**
In contrast, the second image showcases a majestic lion in a vast grassland. The lion's powerful presence is emphasized by its flowing golden mane, which stands out against the backdrop
python examples/offline_inference/vision_language.py --model_type aya_vision
Processed prompts: 100%|█████████████| 4/4 [00:01<00:00,  3.39it/s, est. speed input: 4141.99 toks/s, output: 217.28 toks/s]
<|START_RESPONSE|>The image features a stunning view of cherry blossoms in full bloom, with delicate pink flowers adorning the branches of a tree. The tree's branches are intricately woven, creating a natural frame for the scene. In the background, a tall and slender tower stands out against the clear blue sky. The tower, with
<|START_RESPONSE|>The image features a stunning view of cherry blossoms in full bloom, with delicate pink flowers covering the branches of a tree. The tree's branches are intertwined, creating a natural frame for the scene. In the background, a tall white tower with a sleek design stands out against the vibrant blue sky. The tower's structure
<|START_RESPONSE|>The image features a stunning view of cherry blossoms in full bloom, with delicate pink flowers adorning the branches of a tree. The tree's branches are intertwined, creating a natural frame that leads the eye towards the iconic Tokyo Skytree in the background. The Skytree, a distinctive white structure with a lattice-like
<|START_RESPONSE|>The image captures a stunning scene of cherry blossoms in full bloom, creating a vibrant pink canopy against a clear blue sky. The blossoms are densely packed and fill the frame, with their delicate petals creating a soft, ethereal atmosphere. Interspersed among the blossoms are the slender branches of the cherry trees, their dark green foliage

@ywang96 ywang96 enabled auto-merge (squash) April 1, 2025 02:48
@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

CI needs access to the gated Aya model:

[2025-04-01T02:59:05Z] E                   Cannot access gated repo for url https://huggingface.co/CohereForAI/aya-vision-8b/resolve/main/config.json.
[2025-04-01T02:59:05Z] E                   Access to model CohereForAI/aya-vision-8b is restricted and you are not in the authorized list. Visit https://huggingface.co/CohereForAI/aya-vision-8b to ask for access.
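
For local reproduction, access to the gated repos can be provided via a Hugging Face token. A minimal sketch; the HF_TOKEN variable name is just a convention, and the CI credentials themselves are handled separately:

import os

from huggingface_hub import login

# Authenticate with a token that has been granted access to the gated
# CohereForAI/aya-vision repos before loading the model.
login(token=os.environ["HF_TOKEN"])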

@ywang96 ywang96 disabled auto-merge April 1, 2025 07:40
@ywang96 ywang96 enabled auto-merge (squash) April 1, 2025 07:41
@saurabhdash
> @saurabhdash I wonder if we should verify on your eval as well.

I can verify this today to make sure things look good! Looking at the generations, things should be okay but would be nice to confirm.

@ywang96 ywang96 merged commit 38327cf into vllm-project:main Apr 1, 2025
42 checks passed
@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

> I can verify this today to make sure things look good! Looking at the generations, things should be okay but would be nice to confirm.

Thank you! Please let me know if you notice any discrepancies or regressions in the metrics.

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Labels
documentation (Improvements or additions to documentation)
frontend
multi-modality (Related to multi-modality (#4194))
ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New Model]: aya 32b vision support
4 participants