[Model] Aya Vision #15441

Merged

merged 51 commits into vllm-project:main on Apr 1, 2025

Conversation

JenZhao
Contributor

@JenZhao JenZhao commented Mar 25, 2025

CLOSES #14216

Introduction

This PR introduces support for the Aya Vision models by CohereForAI. Aya Vision models excel in multilingual and multimodal tasks, significantly advancing performance in vision-language understanding.

Supported models: CohereForAI/aya-vision-8b, CohereForAI/aya-vision-32b

For more details on Aya Vision training, see: A Deep Dive into Aya Vision: Advancing the Frontier of Multilingual Multimodality

Example Usage

Example inference with single or multiple images:

Single image inference:

python examples/offline_inference/vision_language.py --model_type aya_vision

Multi-image inference:

python examples/offline_inference/vision_language_multi_image.py --model_type aya_vision

Serving the Aya Vision 32B model:

vllm serve CohereForAI/aya-vision-32b --disable-log-requests -tp 2 --limit-mm-per-prompt image=5
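
Once the server is up, it can be queried through vLLM's OpenAI-compatible API. Below is a minimal sketch using the openai Python client; the port, image URL, and prompt are placeholders, not part of this PR.

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="CohereForAI/aya-vision-32b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            # Up to 5 images per request, matching --limit-mm-per-prompt image=5.
            {"type": "image_url", "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)

The --limit-mm-per-prompt image=5 flag in the serve command caps each request at five images.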

JenZhao added 3 commits March 24, 2025 13:50
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 25, 2025
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Mar 27, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
@JenZhao JenZhao marked this pull request as ready for review March 27, 2025 22:01
@JenZhao JenZhao changed the title [Draft] Aya Vision [Model] Aya Vision Mar 27, 2025
@JenZhao
Contributor Author

JenZhao commented Mar 27, 2025

@saurabhdash

Ready for review. It took me a while to get the num_patches back haha.

Do you know how I could get the max_model_len for both 8B and 32B? Thank you!

@saurabhdash

@saurabhdash

I notice the CohereConfig here:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/configuration_cohere.py#L160-L167

has the following default

        max_position_embeddings=8192

In vLLM, this section is used to determine the maximum model length. Since max_position_embeddings=8192 is smaller than model_max_length=16384, it ends up defaulting to 8192.

Could you please add max_position_embeddings=8192 under text_config as well? Thank you!

It would be great if you could ensure that all configurations under text_config match the language model you are using.

Here is the Cohere config we get with the current setup:

CohereConfig {
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 5,
  "eos_token_id": 255001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 24576,
  "layer_norm_eps": 1e-05,
  "logit_scale": 0.0625,
  "max_position_embeddings": 8192,
  "model_max_length": 16384,
  "model_type": "cohere",
  "num_attention_heads": 64,
  "num_hidden_layers": 40,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "rope_scaling": null,
  "rope_theta": 4000000,
  "torch_dtype": "float16",
  "transformers_version": "4.51.0.dev0",
  "use_cache": true,
  "use_qk_norm": false,
  "vocab_size": 256000
}

Hi!
Thanks for bringing this to my notice! I'll add it.
We should move to the Cohere2 config and architecture because that's what the 8B one uses. It should also enable backward compatibility with the 32B.

Also, #15441 (comment)

JenZhao added 2 commits March 31, 2025 22:50
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
@saurabhdash

> Could you please add max_position_embeddings=8192 under text_config as well? Thank you!

Added the key: https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26
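
As a quick sanity check of the updated config, the derived text_config can be inspected locally. A minimal sketch, assuming access to the gated repo and a transformers version that includes Aya Vision support:

from transformers import AutoConfig

# Inspect the text_config that transformers derives for the model.
cfg = AutoConfig.from_pretrained("CohereForAI/aya-vision-8b")
text_cfg = cfg.text_config
print(text_cfg.model_type)
print(getattr(text_cfg, "max_position_embeddings", None))  # should now be set explicitly
print(getattr(text_cfg, "model_max_length", None))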

@JenZhao
Contributor Author

JenZhao commented Mar 31, 2025

> Added the key: https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26

Thank you! Testing it now.

@JenZhao
Contributor Author

JenZhao commented Mar 31, 2025

https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26

@saurabhdash Just to confirm: is max_position_embeddings supposed to be 8192 with the 16k context window?

https://huggingface.co/CohereForAI/aya-vision-32b/blob/main/config.json#L22 https://huggingface.co/CohereForAI/aya-vision-8b/blob/main/config.json#L26

@saurabhdash

saurabhdash commented Mar 31, 2025

IIRC, back when we merged the first Command R model, we wanted to enable 128k context on vLLM while keeping it capped to 8k for HF -- hence the two different lengths. I believe model_max_length should be the one used for longer context lengths?

@JenZhao
Contributor Author

JenZhao commented Mar 31, 2025

In that case, I think this vLLM logic needs to be updated:

vllm/vllm/config.py

Lines 2693 to 2719 in f98a492

derived_max_model_len = float("inf")
possible_keys = [
# OPT
"max_position_embeddings",
# GPT-2
"n_positions",
# MPT
"max_seq_len",
# ChatGLM2
"seq_length",
# Command-R
"model_max_length",
# Whisper
"max_target_positions",
# Others
"max_sequence_length",
"max_seq_length",
"seq_len",
]
# Choose the smallest "max_length" from the possible keys.
max_len_key = None
for key in possible_keys:
max_len = getattr(hf_config, key, None)
if max_len is not None:
max_len_key = key if max_len < derived_max_model_len \
else max_len_key
derived_max_model_len = min(derived_max_model_len, max_len)

cc @ywang96 @DarkLight1337 Could this be made per-model? e.g. Cohere would use model_max_length only.
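
For illustration, a minimal sketch of how a per-model preference could sit on top of the key scan above; the PREFERRED_LEN_KEY table and the helper are hypothetical, not vLLM's actual implementation:

# Hypothetical sketch only -- not vLLM's actual implementation. It illustrates
# letting certain model types (e.g. Cohere) prefer model_max_length over the
# generic "smallest key wins" scan shown above.
POSSIBLE_KEYS = [
    "max_position_embeddings", "n_positions", "max_seq_len", "seq_length",
    "model_max_length", "max_target_positions", "max_sequence_length",
    "max_seq_length", "seq_len",
]

# Per-model preference table (assumed for illustration).
PREFERRED_LEN_KEY = {"cohere": "model_max_length", "cohere2": "model_max_length"}


def derive_max_model_len(hf_config) -> float:
    preferred = PREFERRED_LEN_KEY.get(getattr(hf_config, "model_type", ""))
    if preferred is not None:
        value = getattr(hf_config, preferred, None)
        if value is not None:
            return value
    # Fallback: the smallest value among the known length keys.
    derived = float("inf")
    for key in POSSIBLE_KEYS:
        value = getattr(hf_config, key, None)
        if value is not None:
            derived = min(derived, value)
    return derived

With the config dumped earlier in this thread, such a preference would return 16384 for model_type "cohere" instead of 8192.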

@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

All updated and tested again. The model now picks up the 16k context length.

vllm serve CohereForAI/aya-vision-32b --disable-log-requests -tp 2 --limit-mm-per-prompt image=5
python -m eval.run eval_vllm --model_name CohereForAI/aya-vision-32b \
        --url http://0.0.0.0:8000 \
        --output_dir ~/tmp \
        --eval_name "mmmu"
Waiting for VLLM server to come online at http://0.0.0.0:8000/health ...
Timeout is 120s
Waiting for server (0s) ...
Waiting for server (5s) ...
Waiting for server (10s) ...
Waiting for server (15s) ...
Waiting for server (20s) ...
Waiting for server (25s) ...
Server is up!
Loading lmms-lab/MMMU [validation]: 100%|█| 900/9
Querying model: 100%|█| 900/900 [11:04<00:00,  1.
100%|██████████████████████████████████| 900/900 [00:00<00:00, 25590.63it/s]
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.46444444444444444,
    "anywhere_in_answer_relaxed_correctness": 0.46444444444444444
}
================================================================================

@saurabhdash I wonder if we should verify on your eval as well.

Member

@ywang96 ywang96 left a comment

Thanks for the great work! I left some final nits.

    trust_remote_code=True)
messages = [[{
    'role': 'user',
    'content': f"<image>\n{question}"
Member

@JenZhao Can we actually change this to the plain prompt? We have it already in the test pipeline too.
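
For reference, one way to obtain such a plain, pre-formatted prompt string is via the model's own chat template. A sketch, assuming processor-level chat-template support for Aya Vision and access to the gated repo; the test pipeline may define its prompt differently:

from transformers import AutoProcessor

# Build the chat-formatted prompt string from the model's own template
# instead of hard-coding "<image>\n{question}".
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ],
}]
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False)
print(prompt)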

Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

updated

pytest tests/models/decoder_only/vision_language/test_models.py -k "aya_vision"
===================================== warnings summary =====================================
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case0]
  /home/jovyan/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=235778) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case1]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0566351413726807, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1347601413726807, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9785101413726807, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.0722601413726807, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.4628851413726807, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case1]
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>'
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0628242492675781, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1253242492675781, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9846992492675781, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.062824249267578, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.469074249267578, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0628327131271362, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1253327131271362, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9847077131271362, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.062832832336426, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.469082832336426, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0627280473709106, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1252280473709106, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9846030473709106, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.062727928161621, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.468977928161621, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>'
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case2]
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>'
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red stop sign. It is positioned prominently in the foreground, with its bold white letters spelling out "STOP" clearly visible. The sign is mounted on a pole and stands out against the backdrop of a vibrant, bustling street scene. The street is lined with various shops, restaurants, and buildings, creating a lively urban atmosphere. The stop sign serves as a crucial traffic control device, ensuring the safety of pedestrians and vehicles at this busy intersection.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'        {6740: -0.7726932764053345, 19: -1.1476932764053345, 147117: -1.5851932764053345, 1728: -4.272693157196045, 9832: -6.835193157196045}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red octagonal stop sign with the word "STOP" written in white letters. This sign is positioned at the intersection of a street and a pedestrian area, likely indicating that vehicles must come to a complete stop before proceeding. The stop sign is surrounded by a decorative archway with Chinese characters and intricate designs, suggesting that this location might be in a culturally significant or historic area.<|END_RESPONSE|>'       {147117: Logprob(logprob=-1.299865961074829, rank=1, decoded_token='Ġoctagonal'), 6740: Logprob(logprob=-1.362365961074829, rank=2, decoded_token='Ġstop'), 19: Logprob(logprob=-1.690490961074829, rank=3, decoded_token=','), 145073: Logprob(logprob=-2.612365961074829, rank=4, decoded_token='ĠSTOP'), 1728: Logprob(logprob=-2.877990961074829, rank=5, decoded_token='Ġand')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096, 147117, 6740, 3272, 1865, 1690, 5416, 1789, 84939, 9, 7530, 1709, 8044, 19622, 21]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red octagonal stop sign with the word "STOP" written in white letters. The sign is positioned on a red and white striped pole, which is located in front of a traditional Chinese-style gate or archway. The archway features intricate designs and Chinese characters, indicating that this is likely an entrance to a cultural or historical site. The stop sign serves as a traffic control measure, ensuring that vehicles and pedestrians stop before entering the area.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'      {1896: -0.5106861591339111, 2708: -0.9481861591339111, 2332: -4.385685920715332, 255022: -10.885685920715332, 40840: -16.32318687438965}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a red and white striped pole, which is located in front of a traditional Chinese-style entrance archway. The archway features intricate designs and is adorned with red lanterns and decorative elements. The stop sign is a prominent feature in the scene, indicating a place where vehicles should come to a complete halt.<|END_RESPONSE|>'    {2708: Logprob(logprob=-0.8206650614738464, rank=1, decoded_token='ĠThis'), 1896: Logprob(logprob=-0.9925400614738464, rank=2, decoded_token='ĠThe'), 2332: Logprob(logprob=-1.8206651210784912, rank=3, decoded_token='ĠIt'), 255022: Logprob(logprob=-3.898790121078491, rank=4, decoded_token='<|END_RESPONSE|>'), 40840: Logprob(logprob=-5.617539882659912, rank=5, decoded_token='ĠBelow')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2162, 5055, 1709, 1690, 7524, 1719, 1690, 6156, 1801, 1671, 6096]
  hf:   '<|START_RESPONSE|>The content in the center of the image is a red, octagonal stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is clearly visible and serves as a traffic control device to ensure the safety of pedestrians and vehicles.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>' {19: -0.2919231355190277, 6740: -1.47942316532135, 147117: -4.2919230461120605, 9832: -4.4794230461120605, 1728: -8.604423522949219}
  vllm: '<|START_RESPONSE|>The content in the center of the image is a red stop sign with the word "STOP" written in white letters. This sign is positioned on a post at the intersection of a street and a pedestrian path. The stop sign is a standard traffic control device used to indicate that drivers must come to a complete stop before proceeding. It is an essential element for ensuring road safety and preventing accidents.<|END_RESPONSE|>'    {6740: Logprob(logprob=-1.0619217157363892, rank=1, decoded_token='Ġstop'), 147117: Logprob(logprob=-1.1244217157363892, rank=2, decoded_token='Ġoctagonal'), 19: Logprob(logprob=-1.9837967157363892, rank=3, decoded_token=','), 9832: Logprob(logprob=-3.0775465965270996, rank=4, decoded_token='ĠChinese'), 145073: Logprob(logprob=-3.4681715965270996, rank=5, decoded_token='ĠSTOP')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2162, 4093, 46861, 1709, 1690, 6156, 1801, 14810, 19, 16509, 3531, 1690, 73464, 157637, 4093, 21, 1896, 43361, 31337, 73464, 174821, 1955, 1709, 5003, 87096, 19, 9835, 1671, 31796]
  hf:   '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning contrast against the clear blue sky. This time of year is known for its beautiful floral displays and is celebrated in many cultures around the world.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>'      {17130: -0.4074851870536804, 67981: -1.0949852466583252, 6957: -9.157485008239746, 11596: -10.469985008239746, 6844: -15.157485008239746}
  vllm: '<|START_RESPONSE|>The season depicted in the image is spring, specifically during the cherry blossom season. The vibrant pink cherry blossoms are in full bloom, creating a stunning backdrop against the clear blue sky. This time of year is known for its beauty and is celebrated in many cultures, especially in Japan, where cherry blossoms hold significant cultural and symbolic importance.<|END_RESPONSE|>' {67981: Logprob(logprob=-0.7247865796089172, rank=1, decoded_token='Ġbackdrop'), 17130: Logprob(logprob=-0.8810365796089172, rank=2, decoded_token='Ġcontrast'), 11596: Logprob(logprob=-3.2404115200042725, rank=3, decoded_token='Ġvisual'), 6957: Logprob(logprob=-3.2404115200042725, rank=4, decoded_token='Ġdisplay'), 6844: Logprob(logprob=-4.427911758422852, rank=5, decoded_token='Ġnatural')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case1]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9063189625740051, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.3594439029693604, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.4219439029693604, rank=3, decoded_token='Ġprominent'), 42568: Logprob(logprob=-2.5469439029693604, rank=4, decoded_token='Ġstriking'), 12274: Logprob(logprob=-2.5469439029693604, rank=5, decoded_token='Ġrich')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.8973087668418884, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.366058826446533, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.428558826446533, rank=3, decoded_token='Ġprominent'), 12274: Logprob(logprob=-2.553558826446533, rank=4, decoded_token='Ġrich'), 42568: Logprob(logprob=-2.569183826446533, rank=5, decoded_token='Ġstriking')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9059003591537476, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.359025478363037, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.421525478363037, rank=3, decoded_token='Ġprominent'), 42568: Logprob(logprob=-2.546525478363037, rank=4, decoded_token='Ġstriking'), 12274: Logprob(logprob=-2.546525478363037, rank=5, decoded_token='Ġrich')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case2]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9052650332450867, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.3583900928497314, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.4208900928497314, rank=3, decoded_token='Ġprominent'), 12274: Logprob(logprob=-2.5458900928497314, rank=4, decoded_token='Ġrich'), 42568: Logprob(logprob=-2.5615150928497314, rank=5, decoded_token='Ġstriking')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test0:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant street scene in an urban area, possibly a Chinatown district. The focal point is a large, ornate red archway adorned with Chinese characters and intricate designs, framing the entrance to a bustling street. A prominent red STOP sign stands out against the backdrop, indicating a busy intersection. The archway is flanked by traditional Chinese statues, adding to the cultural ambiance. The street is lined with various shops and businesses, their signs in English and Chinese, with a prominent Optus advertisement visible. A black SUV is parked on the side of the road, and pedestrians can be seen walking along the'     {15262: -0.5839295387268066, 14159: -0.8339295387268066, 5079: -4.833929538726807, 13438: -13.146429061889648, 1728: -13.708929061889648}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a red, ornate archway adorned with Chinese characters and intricate designs, framing the entrance to what appears to be a cultural or shopping district. A prominent red STOP sign stands out against the archway, indicating a traffic intersection. The background reveals a bustling cityscape with various storefronts, including recognizable brands like Optus and Yes. The street is lined with vehicles, including a black SUV, and pedestrians can be seen walking along the sidewalk. The overall atmosphere suggests a lively, culturally rich urban environment.' {14159: Logprob(logprob=-0.7616890668869019, rank=1, decoded_token='Ġurban'), 15262: Logprob(logprob=-1.0116890668869019, rank=2, decoded_token='Ġstreet'), 5079: Logprob(logprob=-1.9960640668869019, rank=3, decoded_token='Ġcity'), 1728: Logprob(logprob=-4.246064186096191, rank=4, decoded_token='Ġand'), 13438: Logprob(logprob=-4.417939186096191, rank=5, decoded_token='Ġscene')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test1:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant street scene in an urban area, possibly a Chinatown district. The focal point is a large, ornate red gate with intricate Chinese architectural details, including a traditional roof and decorative elements. Above the gate, a prominent red STOP sign stands out against the backdrop. The gate is flanked by statues of fierce lions, adding to the cultural significance of the location. The street is bustling with activity, featuring various shops and businesses with signs in both English and Chinese. A black SUV is seen driving past, blending into the urban landscape. The scene is bathed in natural light, highlighting the colors'       {15262: -0.4520649015903473, 14159: -1.014564871788025, 5079: -6.8270649909973145, 1728: -10.389564514160156, 13438: -11.327064514160156}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, framing the entrance to what appears to be a bustling market or cultural district. Above the gate, a large red STOP sign stands out against the backdrop of colorful lanterns and decorative elements. The street is lined with various shops and businesses, their signs in both English and Chinese, indicating a multicultural environment. The presence of a black SUV parked on the side of the road adds a modern touch to the otherwise historic setting. The overall atmosphere is lively,'        {14159: Logprob(logprob=-0.7387250661849976, rank=1, decoded_token='Ġurban'), 15262: Logprob(logprob=-0.9106000661849976, rank=2, decoded_token='Ġstreet'), 5079: Logprob(logprob=-2.660600185394287, rank=3, decoded_token='Ġcity'), 1728: Logprob(logprob=-3.973100185394287, rank=4, decoded_token='Ġand'), 13438: Logprob(logprob=-4.035600185394287, rank=5, decoded_token='Ġscene')}
    comparator(

tests/models/decoder_only/vision_language/test_models.py::test_multi_image_models[aya_vision-test_case3]
  /home/jovyan/vllm/tests/models/decoder_only/vision_language/vlm_utils/core.py:144: UserWarning: Test2:
  Matched tokens:       [255021, 2093, 9046, 228, 24, 13638, 206, 4184, 6156, 62411, 1671, 43361, 14159, 13438, 1865, 1671]
  hf:   '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a prominent red and gold Chinese-style gate at the center. The gate features intricate designs and is adorned with a large red stop sign on a white background, indicating a pedestrian crossing or a designated area for foot traffic. Above the gate, there are traditional Chinese characters and a golden lion statue, symbolizing good luck and protection. The surrounding area is bustling with activity, including various shops and restaurants with colorful signs in English and Chinese. The street is lined with trees and lanterns, creating a festive atmosphere. A black SUV is seen driving on the road, adding to the'    {19186: -0.6753861904144287, 9779: -1.5503861904144287, 12274: -1.9878861904144287, 37941: -2.1753861904144287, 42568: -4.112886428833008}
  vllm: '<|START_RESPONSE|>**Image 1:**\nThis image captures a vibrant urban scene with a traditional Chinese architectural element at its center. The focal point is a striking red gate adorned with intricate designs and Chinese characters, marking the entrance to a cultural or historical district. Above the gate, a large red stop sign stands out against the backdrop of the city. The surrounding area is bustling with activity, featuring various shops and restaurants with colorful signs, including recognizable brands like "Yes" and "Optus." The street is lined with tall buildings, some with traditional Chinese elements, while others are modern and sleek. Pedestrians can be seen walking along the sidewalks'  {9779: Logprob(logprob=-0.9063189625740051, rank=1, decoded_token='Ġtraditional'), 37941: Logprob(logprob=-2.3594439029693604, rank=2, decoded_token='Ġdistinctive'), 19186: Logprob(logprob=-2.4219439029693604, rank=3, decoded_token='Ġprominent'), 42568: Logprob(logprob=-2.5469439029693604, rank=4, decoded_token='Ġstriking'), 12274: Logprob(logprob=-2.5469439029693604, rank=5, decoded_token='Ġrich')}
    comparator(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================ 8 passed, 194 deselected, 22 warnings in 394.94s (0:06:34) ================

@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

python examples/offline_inference/vision_language_multi_image.py --model_type aya_vision
Processed prompts: 100%|█████████████| 1/1 [00:01<00:00,  1.26s/it, est. speed input: 2371.60 toks/s, output: 101.97 toks/s]
<|START_RESPONSE|>**Image 1:**
The first image features a vibrant green-headed mallard duck gliding gracefully on a serene blue body of water. The duck's distinctive features, including its glossy emerald head, bright yellow beak, and white neck ring, are clearly visible. The surrounding water reflects the duck's image, creating a mirror-like effect. The peaceful scene captures the beauty of nature and the elegance of this aquatic bird.

**Image 2:**
In contrast, the second image showcases a majestic lion in a vast grassland. The lion's powerful presence is emphasized by its flowing golden mane, which stands out against the backdrop
python examples/offline_inference/vision_language.py --model_type aya_vision
Processed prompts: 100%|█████████████| 4/4 [00:01<00:00,  3.39it/s, est. speed input: 4141.99 toks/s, output: 217.28 toks/s]
<|START_RESPONSE|>The image features a stunning view of cherry blossoms in full bloom, with delicate pink flowers adorning the branches of a tree. The tree's branches are intricately woven, creating a natural frame for the scene. In the background, a tall and slender tower stands out against the clear blue sky. The tower, with
<|START_RESPONSE|>The image features a stunning view of cherry blossoms in full bloom, with delicate pink flowers covering the branches of a tree. The tree's branches are intertwined, creating a natural frame for the scene. In the background, a tall white tower with a sleek design stands out against the vibrant blue sky. The tower's structure
<|START_RESPONSE|>The image features a stunning view of cherry blossoms in full bloom, with delicate pink flowers adorning the branches of a tree. The tree's branches are intertwined, creating a natural frame that leads the eye towards the iconic Tokyo Skytree in the background. The Skytree, a distinctive white structure with a lattice-like
<|START_RESPONSE|>The image captures a stunning scene of cherry blossoms in full bloom, creating a vibrant pink canopy against a clear blue sky. The blossoms are densely packed and fill the frame, with their delicate petals creating a soft, ethereal atmosphere. Interspersed among the blossoms are the slender branches of the cherry trees, their dark green foliage

@ywang96 ywang96 enabled auto-merge (squash) April 1, 2025 02:48
@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

CI needs access to the gated Aya model:

[2025-04-01T02:59:05Z] E                   Cannot access gated repo for url https://huggingface.co/CohereForAI/aya-vision-8b/resolve/main/config.json.
[2025-04-01T02:59:05Z] E                   Access to model CohereForAI/aya-vision-8b is restricted and you are not in the authorized list. Visit https://huggingface.co/CohereForAI/aya-vision-8b to ask for access.
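
For local reproduction, access to the gated repos can be provided via a Hugging Face token. A minimal sketch; the HF_TOKEN variable name is just a convention, and the CI credentials themselves are handled separately:

import os

from huggingface_hub import login

# Authenticate with a token that has been granted access to the gated
# CohereForAI/aya-vision repos before loading the model.
login(token=os.environ["HF_TOKEN"])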

@ywang96 ywang96 disabled auto-merge April 1, 2025 07:40
@ywang96 ywang96 enabled auto-merge (squash) April 1, 2025 07:41
@saurabhdash
> @saurabhdash I wonder if we should verify on your eval as well.

I can verify this today to make sure things look good! Looking at the generations, things should be okay but would be nice to confirm.

@ywang96 ywang96 merged commit 38327cf into vllm-project:main Apr 1, 2025
42 checks passed
@JenZhao
Contributor Author

JenZhao commented Apr 1, 2025

> I can verify this today to make sure things look good! Looking at the generations, things should be okay but would be nice to confirm.

Thank you! Please let me know if you notice any discrepancies or regressions in the metrics.

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Labels
documentation (Improvements or additions to documentation)
frontend
multi-modality (Related to multi-modality (#4194))
ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New Model]: aya 32b vision support
4 participants