[Bug]: CUDA error: an illegal memory access was encountered #16398

@jifa513

Your current environment

Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered

🐛 Describe the bug

I am running Llama-3.3-70B-Instruct-FP8-Dynamic with vLLM 0.6.6, and the server crashes at random. Has anyone encountered the same problem? Here is the error log:

(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=210) WARNING 04-08 01:00:08 model_runner_base.py:143]
INFO 04-08 01:00:08 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250408-010008.pkl...
WARNING 04-08 01:00:08 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 04-08 01:00:08 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 04-08 01:00:08 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 04-08 01:00:08 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING 04-08 01:00:08 model_runner_base.py:143]
ERROR 04-08 01:00:08 engine.py:135] RuntimeError('Error in model execution: Error Internal')
ERROR 04-08 01:00:08 engine.py:135] Traceback (most recent call last):
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 04-08 01:00:08 engine.py:135] return func(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 2028, in execute_model
ERROR 04-08 01:00:08 engine.py:135] hidden_or_intermediate_states = model_executable(
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135] return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 578, in forward
ERROR 04-08 01:00:08 engine.py:135] model_output = self.model(input_ids, positions, kv_caches,
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
ERROR 04-08 01:00:08 engine.py:135] return self.forward(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 366, in forward
ERROR 04-08 01:00:08 engine.py:135] hidden_states, residual = layer(positions, hidden_states,
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135] return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 279, in forward
ERROR 04-08 01:00:08 engine.py:135] hidden_states = self.self_attn(positions=positions,
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135] return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 204, in forward
ERROR 04-08 01:00:08 engine.py:135] qkv, _ = self.qkv_proj(hidden_states)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-08 01:00:08 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-08 01:00:08 engine.py:135] return forward_call(*args, **kwargs)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 376, in forward
ERROR 04-08 01:00:08 engine.py:135] output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 511, in apply
ERROR 04-08 01:00:08 engine.py:135] return scheme.apply_weights(layer, x, bias=bias)
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 135, in apply_weights
ERROR 04-08 01:00:08 engine.py:135] return apply_fp8_linear(
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 120, in apply_fp8_linear
ERROR 04-08 01:00:08 engine.py:135] output = ops.cutlass_scaled_mm(qinput,
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 493, in cutlass_scaled_mm
ERROR 04-08 01:00:08 engine.py:135] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
ERROR 04-08 01:00:08 engine.py:135] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
ERROR 04-08 01:00:08 engine.py:135] return self._op(*args, **(kwargs or {}))
ERROR 04-08 01:00:08 engine.py:135] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-08 01:00:08 engine.py:135] RuntimeError: Error Internal
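
The log itself warns that the stack trace may be inaccurate because CUDA errors are reported asynchronously, and it suggests CUDA_LAUNCH_BLOCKING=1. Below is a minimal sketch of how the server could be relaunched for debugging. It is not a confirmed fix. The model name is copied from the report as a placeholder, the helper names `build_debug_env` and `build_debug_command` are hypothetical, and adding `--enforce-eager` (which disables CUDA graph capture) is an assumption about what helps isolate the fault. Any other flags from the original launch command (not shown in this report) would need to be appended.

```python
import os
import shlex

# Assumption: the model name is copied from the report; replace it with the
# actual path or repository id used in your deployment.
MODEL = "Llama-3.3-70B-Instruct-FP8-Dynamic"


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Each kernel launch now blocks until it completes, so the Python traceback
    # points at the kernel that actually faulted (at a significant speed cost).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_debug_command(model=MODEL):
    """Build a `vllm serve` command line with CUDA graph capture disabled."""
    # --enforce-eager keeps the failing call visible as an ordinary
    # eager-mode kernel launch instead of a replayed CUDA graph.
    return ["vllm", "serve", model, "--enforce-eager"]


# Show what would be run; nothing is launched here.
print("CUDA_LAUNCH_BLOCKING =", build_debug_env({})["CUDA_LAUNCH_BLOCKING"])
print(shlex.join(build_debug_command()))
```

To actually start the server this way, the two results can be passed to `subprocess.run(build_debug_command(), env=build_debug_env())`. If the crash then reproduces, the resulting traceback should be more trustworthy than the one above.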

Labels: bug (Something isn't working), stale (Over 90 days of inactivity)