Your current environment
The output of python collect_env.py
🐛 Describe the bug
We have observed crashes and bad results when using the mistralai/Mistral-Small-3.1-24B-Instruct-2503 model, due to a bug in how `--max-model-len` is handled.
For this model, params.json does not contain any setting for the maximum sequence length, so the default value of 128000 is used (REF). However, the HF config in config.json has "max_position_embeddings": 131072.
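For reference, a quick way to see the mismatch (the params.json key name checked here is an assumption on our part; the point is only that no sequence-length field is present):

```python
# Sketch: compare the two context-length sources shipped with this model.
# Requires huggingface_hub and access to the repo; file names are the ones in
# the mistralai/Mistral-Small-3.1-24B-Instruct-2503 repo.
import json
from huggingface_hub import hf_hub_download

repo = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

# HF-style config declares the full 131072-token context.
with open(hf_hub_download(repo, "config.json")) as f:
    print(json.load(f)["max_position_embeddings"])  # 131072

# The mistral-native config has no sequence-length entry at all, so vLLM
# falls back to its 128000 default when --config-format mistral is used.
with open(hf_hub_download(repo, "params.json")) as f:
    print("max_seq_len" in json.load(f))  # expected: False (key name assumed)
```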
So we run the OpenAI server with:
VLLM_USE_V1=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tokenizer-mode mistral --config-format mistral --load-format mistral --max-num-seqs 4 --max-model-len 131072
The server then boots with the expected warning (though the message is a little misleading, since we aren't using config.json and 128000 is vLLM's default value, not something taken from the config):
WARNING 05-06 22:11:51 [config.py:3178] User-specified max_model_len (131072) is greater than the derived max_model_len (max_position_embeddings=128000 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
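For context, here is a rough sketch of what we understand this check to be doing; this is our approximation of the behaviour behind the warning, not the actual code in vllm/config.py:

```python
# Approximation of the max_model_len validation that produces the warning above.
import os

def verify_max_model_len(user_max_model_len: int, derived_max_model_len: int) -> int:
    if user_max_model_len <= derived_max_model_len:
        return user_max_model_len
    msg = (f"User-specified max_model_len ({user_max_model_len}) is greater "
           f"than the derived max_model_len ({derived_max_model_len}).")
    if os.getenv("VLLM_ALLOW_LONG_MAX_MODEL_LEN") == "1":
        # With the override set, only a warning is emitted and the user value
        # is kept. For the mistral config format the derived value is the
        # 128000 fallback, not the model's real 131072 limit, hence this report.
        print("WARNING:", msg)
        return user_max_model_len
    raise ValueError(msg)

verify_max_model_len(131072, 128000)  # 131072 > 128000 default -> warning only
```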
If we send a request with a context length > 128000 and < 131072 tokens, e.g.
curl -s -X POST -H "Content-Type: application/json" "http://localhost:8000/v1/completions" --data-binary @- << _EOF
{
"model": "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
"prompt": "$(seq -s ' ' 1 23500)",
"max_generated_tokens": 8
}
_EOF
then we get a crash with a bunch of CUDA errors like:
/home/vllm/.cache/vllm/torch_compile_cache/b3db66fc05/rank_0_0/inductor_cache/pi/cpiayc7qwveqsg53w53ihzygo5qggdq4bzprcwvrqyu5eljbgt6z.py:37: unknown: block: [21452,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 128000` failed.
/home/vllm/.cache/vllm/torch_compile_cache/b3db66fc05/rank_0_0/inductor_cache/pi/cpiayc7qwveqsg53w53ihzygo5qggdq4bzprcwvrqyu5eljbgt6z.py:37: unknown: block: [21452,0,0], thread: [54,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 128000` failed.
/home/vllm/.cache/vllm/torch_compile_cache/b3db66fc05/rank_0_0/inductor_cache/pi/cpiayc7qwveqsg53w53ihzygo5qggdq4bzprcwvrqyu5eljbgt6z.py:37: unknown: block: [21452,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 128000` failed.
...
ERROR 05-06 22:15:06 [core.py:402] File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1218, in execute_model
ERROR 05-06 22:15:06 [core.py:402] valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-06 22:15:06 [core.py:402] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-06 22:15:06 [core.py:402] RuntimeError: CUDA error: device-side assert triggered
(Note that similar errors were reported in #17348, but that issue turned out to be caused by a bad GPU.)
If we run the server with `--enforce-eager`, the out-of-bounds error from the torch-compiled code is avoided and the request succeeds, but the result is gibberish, e.g. "text":"2 7687 7 2139 AL27".
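For convenience, a roughly equivalent reproduction with the openai Python client (assumes the default port from the serve command above; the api_key value is just a placeholder):

```python
# Reproduces the curl request above against the vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

prompt = " ".join(str(i) for i in range(1, 23501))  # same as: seq -s ' ' 1 23500

completion = client.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    prompt=prompt,
    max_tokens=8,
)
# Without --enforce-eager the server crashes with the device-side assert;
# with --enforce-eager this returns, but the text is gibberish.
print(completion.choices[0].text)
```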
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.