Your current environment
The output of python collect_env.py
🐛 Describe the bug
We have observed crashes and bad results when using the mistralai/Mistral-Small-3.1-24B-Instruct-2503 model, due to a bug in how `--max-model-len` is handled.
For this model, params.json does not contain any setting for the maximum sequence length, so the default value of 128000 is used (REF). However, the HF config in config.json has "max_position_embeddings": 131072.
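For reference, a quick way to see the mismatch (the params.json key name checked here is an assumption on our part; the point is only that no sequence-length field is present):

```python
# Sketch: compare the two context-length sources shipped with this model.
# Requires huggingface_hub and access to the repo; file names are the ones in
# the mistralai/Mistral-Small-3.1-24B-Instruct-2503 repo.
import json
from huggingface_hub import hf_hub_download

repo = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

# HF-style config declares the full 131072-token context.
with open(hf_hub_download(repo, "config.json")) as f:
    print(json.load(f)["max_position_embeddings"])  # 131072

# The mistral-native config has no sequence-length entry at all, so vLLM
# falls back to its 128000 default when --config-format mistral is used.
with open(hf_hub_download(repo, "params.json")) as f:
    print("max_seq_len" in json.load(f))  # expected: False (key name assumed)
```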
So we run the OpenAI server with:
VLLM_USE_V1=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tokenizer-mode mistral --config-format mistral --load-format mistral --max-num-seqs 4 --max-model-len 131072
The server then boots with the expected warning (though the message is a little misleading, since we aren't using config.json and 128000 is vLLM's default value, not something taken from the config):
WARNING 05-06 22:11:51 [config.py:3178] User-specified max_model_len (131072) is greater than the derived max_model_len (max_position_embeddings=128000 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
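For context, here is a rough sketch of what we understand this check to be doing; this is our approximation of the behaviour behind the warning, not the actual code in vllm/config.py:

```python
# Approximation of the max_model_len validation that produces the warning above.
import os

def verify_max_model_len(user_max_model_len: int, derived_max_model_len: int) -> int:
    if user_max_model_len <= derived_max_model_len:
        return user_max_model_len
    msg = (f"User-specified max_model_len ({user_max_model_len}) is greater "
           f"than the derived max_model_len ({derived_max_model_len}).")
    if os.getenv("VLLM_ALLOW_LONG_MAX_MODEL_LEN") == "1":
        # With the override set, only a warning is emitted and the user value
        # is kept. For the mistral config format the derived value is the
        # 128000 fallback, not the model's real 131072 limit, hence this report.
        print("WARNING:", msg)
        return user_max_model_len
    raise ValueError(msg)

verify_max_model_len(131072, 128000)  # 131072 > 128000 default -> warning only
```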
If we send a request with a context length > 128000 and < 131072 tokens, e.g.
curl -s -X POST -H "Content-Type: application/json" "http://localhost:8000/v1/completions" --data-binary @- << _EOF
{
"model": "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
"prompt": "$(seq -s ' ' 1 23500)",
"max_generated_tokens": 8
}
_EOF
then we get a crash with a bunch of CUDA errors like:
/home/vllm/.cache/vllm/torch_compile_cache/b3db66fc05/rank_0_0/inductor_cache/pi/cpiayc7qwveqsg53w53ihzygo5qggdq4bzprcwvrqyu5eljbgt6z.py:37: unknown: block: [21452,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 128000` failed.
/home/vllm/.cache/vllm/torch_compile_cache/b3db66fc05/rank_0_0/inductor_cache/pi/cpiayc7qwveqsg53w53ihzygo5qggdq4bzprcwvrqyu5eljbgt6z.py:37: unknown: block: [21452,0,0], thread: [54,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 128000` failed.
/home/vllm/.cache/vllm/torch_compile_cache/b3db66fc05/rank_0_0/inductor_cache/pi/cpiayc7qwveqsg53w53ihzygo5qggdq4bzprcwvrqyu5eljbgt6z.py:37: unknown: block: [21452,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 128000` failed.
...
ERROR 05-06 22:15:06 [core.py:402] File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1218, in execute_model
ERROR 05-06 22:15:06 [core.py:402] valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-06 22:15:06 [core.py:402] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-06 22:15:06 [core.py:402] RuntimeError: CUDA error: device-side assert triggered
(Note that similar errors were reported in #17348, but that issue turned out to be caused by a bad GPU.)
If we run the server with `--enforce-eager`, the out-of-bounds error from the torch-compiled code is avoided and the request succeeds, but the result is gibberish, e.g. "text":"2 7687 7 2139 AL27".
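For convenience, a roughly equivalent reproduction with the openai Python client (assumes the default port from the serve command above; the api_key value is just a placeholder):

```python
# Reproduces the curl request above against the vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

prompt = " ".join(str(i) for i in range(1, 23501))  # same as: seq -s ' ' 1 23500

completion = client.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    prompt=prompt,
    max_tokens=8,
)
# Without --enforce-eager the server crashes with the device-side assert;
# with --enforce-eager this returns, but the text is gibberish.
print(completion.choices[0].text)
```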
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.