Description
Steps to reproduce
- Run Llama 3.2:

```yaml
type: service
name: llama32
image: vllm/vllm-openai:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
  - MAX_MODEL_LEN=4096
  - MAX_NUM_SEQS=8
commands:
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --max-num-seqs $MAX_NUM_SEQS
    --enforce-eager
    --disable-log-requests
    --limit-mm-per-prompt "image=1"
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
model: meta-llama/Llama-3.2-11B-Vision-Instruct
resources:
  gpu: 40GB..48GB
```
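The service can then be deployed with the dstack CLI. A minimal sketch, assuming the configuration above is saved as `llama32.dstack.yml` (the file name is illustrative):

```shell
# Deploy the service defined above; the file name is an assumption for illustration
dstack apply -f llama32.dstack.yml
```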
- Access via the model endpoint:

```shell
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer token' \
  --data '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe the image."},
          {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}}
        ]
      }
    ],
    "max_tokens": 2048
  }'
```
This request does not work and returns an error.
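For comparison, a text-only request to the same model endpoint can be used to confirm that only the multimodal payload is affected. A minimal sketch, assuming the same endpoint and token as above:

```shell
# Same model endpoint, but with a plain-text "content" field instead of the
# multimodal content array (illustrative; the prompt text is arbitrary)
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer token' \
  --data '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {"role": "user", "content": "Describe a bento box."}
    ],
    "max_tokens": 256
  }'
```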
- Access the service endpoint:

```shell
curl http://127.0.0.1:3000/proxy/services/main/llama32/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer token' \
  --data '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe the image."},
          {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}}
        ]
      }
    ],
    "max_tokens": 2048
  }'
```
It works.
The proxy and gateway should support vision requests too, in addition to normal text-only requests.