[Bug]: CUDA error: an illegal memory access was encountered

### Your current environment

Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered

### 🐛 Describe the bug

I am running Llama-3.3-70B-Instruct-FP8-Dynamic with vllm-0.6.6, and the server crashes at random. Has anyone encountered the same problem? Here is the error log:


(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] 
INFO 04-08 01:00:08 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250408-010008.pkl...
WARNING 04-08 01:00:08 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 04-08 01:00:08 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 04-08 01:00:08 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 04-08 01:00:08 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING 04-08 01:00:08 model_runner_base.py:143] 
ERROR 04-08 01:00:08 engine.py:135] RuntimeError('Error in model execution: Error Internal')
ERROR 04-08 01:00:08 engine.py:135] Traceback (most recent call last):
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 04-08 01:00:08 engine.py:135]     return func(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 2028, in execute_model
ERROR 04-08 01:00:08 engine.py:135]     hidden_or_intermediate_states = model_executable(
ERROR 04-08 01:00:08 engine.py:135]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 578, in forward
ERROR 04-08 01:00:08 engine.py:135]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 04-08 01:00:08 engine.py:135]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
ERROR 04-08 01:00:08 engine.py:135]     return self.forward(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 366, in forward
ERROR 04-08 01:00:08 engine.py:135]     hidden_states, residual = layer(positions, hidden_states,
ERROR 04-08 01:00:08 engine.py:135]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 279, in forward
ERROR 04-08 01:00:08 engine.py:135]     hidden_states = self.self_attn(positions=positions,
ERROR 04-08 01:00:08 engine.py:135]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 204, in forward
ERROR 04-08 01:00:08 engine.py:135]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 04-08 01:00:08 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 376, in forward
ERROR 04-08 01:00:08 engine.py:135]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-08 01:00:08 engine.py:135]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 511, in apply
ERROR 04-08 01:00:08 engine.py:135]     return scheme.apply_weights(layer, x, bias=bias)
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 135, in apply_weights
ERROR 04-08 01:00:08 engine.py:135]     return apply_fp8_linear(
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 120, in apply_fp8_linear
ERROR 04-08 01:00:08 engine.py:135]     output = ops.cutlass_scaled_mm(qinput,
ERROR 04-08 01:00:08 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 493, in cutlass_scaled_mm
ERROR 04-08 01:00:08 engine.py:135]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 04-08 01:00:08 engine.py:135]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
ERROR 04-08 01:00:08 engine.py:135]     return self._op(*args, **(kwargs or {}))
ERROR 04-08 01:00:08 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] RuntimeError: Error Internal
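
The log warns that CUDA errors can be reported asynchronously, so the `cutlass_scaled_mm` frame above may not be the real failure site, and it suggests `CUDA_LAUNCH_BLOCKING=1`. For anyone trying to reproduce this, here is a minimal sketch of a debug relaunch with that variable set. It assumes the server is started with the `vllm serve` CLI; `build_debug_launch` is a hypothetical helper name, and the model argument is a placeholder for whatever path or ID you actually serve.

```python
import os
import shlex


def build_debug_launch(model, extra_args=()):
    """Return (env, argv) for relaunching the server with synchronous CUDA launches.

    `model` and `extra_args` are placeholders (assumptions); pass whatever
    you normally give to the server.
    """
    env = dict(os.environ)
    # Force kernel launches to run synchronously so the Python stack trace
    # points at the call that actually triggered the illegal memory access.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    argv = ["vllm", "serve", model, *extra_args]
    return env, argv


if __name__ == "__main__":
    env, argv = build_debug_launch("Llama-3.3-70B-Instruct-FP8-Dynamic")
    # Print the equivalent shell command; to actually launch it, pass
    # argv and env to subprocess.run(argv, env=env).
    print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(argv))
```

Synchronous launches slow the server down considerably, so this is only meant for a reproduction run, not for production traffic.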

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: CUDA error: an illegal memory access was encountered #16398