Mllama ignores input image when deployed in triton #692

Open
@mutkach

Description

System Info

CPU: x86_64
Memory: 128G
GPU: H100 80G
Docker: tritonserver:24.12-trtllm-python-py3
CUDA: 12.6
Driver: 535.216.01
TensorRT: 10.7.0
TensorRT-LLM: v0.16.0

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce:
Follow the Mllama build and deployment scripts from multimodal.md, except:

  • use Visual-Instruct-11B instead of Visual-11B
  • set max_encoder_input_len to 6404 for Visual-Instruct-11B, as indicated by the TensorRT-LLM guide
  • set batch size to 1 for testing purposes
  • check out the v0.16.0 tag for TensorRT-LLM (there are discrepancies when converting the checkpoint otherwise)
  • set cross_kv_cache_fraction to 0.5 in config.pbtxt (Triton won't start otherwise); see the snippet after this list
  • start Triton manually with the command below
  • load the ensemble model (the e2e setup would not work otherwise)
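
For reference, the cross_kv_cache_fraction entry in the tensorrt_llm model's config.pbtxt looks roughly like this (a minimal sketch following the backend's usual parameters-block layout; only the value 0.5 is from my setup):

parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}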

The Triton command is:

tritonserver --model-repository=multimodal_ifb \
             --model-control-mode=explicit \
             --log-verbose=3 \
             --load-model=tensorrt_llm \
             --load-model=multimodal_encoders \
             --load-model=ensemble \
             --load-model=tensorrt_llm_bls \
             --cuda-memory-pool-byte-size=0:300000000

Expected behavior

When tested with:

python3 tensorrt_llm/examples/multimodal/run.py --visual_engine_dir /tmp/mllama/trt_engines/encoder/ \
                                   --visual_engine_name visual_encoder.engine \
                                   --llm_engine_dir /tmp/mllama/trt_engines/decoder/ \
                                   --hf_model_dir Llama-3.2-11B-Vision-Instruct/ \
                                   --image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
                                   --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
                                   --max_new_tokens 50 \
                                   --batch_size 1

output:

", it would be:.\\nA rabbit in a coat.\\nA charming and dapper fellow.\\nHe's a stylish chap indeed. <OCR/> ርርርርርር

Works as expected.

Actual behavior

When run with:

python3 tools/multimodal/client.py --model_type mllama \
        --text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
        --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
        --top-p 0.7 --temperature 0.9 --top-k 40 \
        --request-output-len 20

the result is:

[beam 0 ]:
<|image|><|begin_of_text|>If I had to write a haiku for this one, it would be:
“Golden sunsets fade
Gone, yet memories remain
Summer's

A different image or different runtime parameters similarly leave the completion unchanged: the image content is ignored (different image -> same output). A sketch of that check follows.
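
A minimal way to reproduce the check, using only the client.py flags from the command above (the second URL is a placeholder; substitute any other image):

for IMG in \
    https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
    https://example.com/any_other_image.jpg; do
  # Same prompt, different image; both runs return the same completion.
  python3 tools/multimodal/client.py --model_type mllama \
      --text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
      --image "$IMG" --request-output-len 20
done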

Additional notes

I double-checked the tokenization output, verified that the image inputs are sent correctly (image_bytes), and confirmed that encoder_input_features and cross_attention_masks (the multimodal_encoder outputs) are in the same ballpark as when run with tensorrt_llm/examples/multimodal/run.py, though by no means identical.

encoder_input_features in Triton:

tensor([[[  8.1875,  12.3750,  -4.5938,  ..., -12.1875,  -4.4062,   5.1250],
         [ -1.0625,  13.5000,   7.4375,  ...,  -2.3125,  -3.0625, -13.2500],
         [-12.5000,   7.0625,   8.5625,  ...,   3.1875,  -0.1836,  -8.4375],
         ...,
         [ -3.8906,  -2.5625,  -6.0938,  ...,  -2.2812,  -8.1875,  -3.0312],
         [  2.7031,   7.0938,  -7.6875,  ...,  -8.5625,  -4.4062, -22.2500],
         [  4.2500,   1.2734,   1.5156,  ...,  -1.8359,  -2.5312,   1.5625]]],
       device='cuda:0', dtype=torch.bfloat16)

in the tensorrt_llm runner:

tensor([[  8.1250,  12.3750,  -4.6875,  ..., -12.1875,  -4.3438,   5.2500],
        [ -1.1328,  13.3125,   7.4062,  ...,  -2.6250,  -2.9531, -13.1875],
        [-12.3750,   6.9688,   8.5625,  ...,   2.9688,  -0.2139,  -8.6250],
        ...,
        [ -5.4375,  -2.8125,  -6.9375,  ...,  -3.4375,  -7.8125,  -3.7969],
        [  1.1641,   6.9062,  -3.5000,  ...,  -3.0625,  -2.9688, -27.2500],
        [  4.6562,   1.3906,   1.6953,  ...,  -1.6484,  -2.9375,   1.3281]],
       device='cuda:0', dtype=torch.bfloat16)
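
To quantify the gap between the two dumps, a minimal sketch (the file names are hypothetical; it assumes each tensor was saved with torch.save at the two call sites):

import torch

# Hypothetical dump files: one encoder_input_features tensor saved from the
# Triton multimodal_encoders model, one from examples/multimodal/run.py.
triton = torch.load("triton_encoder_input_features.pt").float().squeeze(0)
runner = torch.load("runner_encoder_input_features.pt").float()

# Elementwise gap and overall direction agreement between the two paths.
print("max |diff|:", (triton - runner).abs().max().item())
print("cosine sim:", torch.nn.functional.cosine_similarity(
    triton.flatten(), runner.flatten(), dim=0).item())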

If that difference is not expected, should I look into it? Aside from that, the BLS setup is also not working. The LLM itself seems to work fine and gives correct responses.
