
Conversation


@sts07142 sts07142 commented Sep 8, 2025

Purpose

This PR fixes a RuntimeError that occurs when an engine worker process dies unexpectedly.
When the worker dies, the MPClientEngineMonitor thread calls the BackgroundResources finalizer for cleanup. Because that thread has no active event loop, the call to self.output_socket._get_loop() raised a RuntimeError, crashing the main server process.

This PR handles the exception by wrapping the loop retrieval in a try-except-else block.
If getting the loop fails, it now performs a best-effort synchronous cleanup of sockets and logs a warning, preventing the crash and allowing for a controlled shutdown.
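The pattern described above can be sketched as follows. This is a minimal, hypothetical illustration, not vLLM's actual BackgroundResources code; the class and attribute names here are assumptions, and the real finalizer reaches the loop via self.output_socket._get_loop() rather than asyncio.get_running_loop():

```python
import asyncio


class EngineResources:
    """Sketch of a finalizer that tolerates being called from a thread
    without an event loop. Names are illustrative, not vLLM's code."""

    def __init__(self, sockets):
        self.sockets = sockets  # any objects exposing close()

    def __call__(self):
        try:
            # In a thread with no running event loop (e.g. a monitor
            # thread), this raises RuntimeError instead of returning.
            loop = asyncio.get_running_loop()
        except RuntimeError:
            # Best-effort synchronous fallback: without a loop, async
            # tasks cannot be cancelled, so close the sockets directly
            # (the real fix also logs a warning here).
            for sock in self.sockets:
                sock.close()
        else:
            # Normal path: hand cleanup to the running event loop.
            loop.call_soon_threadsafe(self._close_all)

    def _close_all(self):
        for sock in self.sockets:
            sock.close()
```

The try-except-else split keeps the fallback narrow: only a failed loop lookup takes the synchronous path, while errors during the actual cleanup still propagate normally.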

Resolves #24230, #24305

Test Plan

  1. Run any model
  • In my case: Llama-3.1-8B-Instruct
uv run vllm serve meta-llama/Llama-3.1-8B-Instruct --tool-call-parser llama3_json --chat-template custom/tool_chat_template_llama3.1_json.jinja --enable-auto-tool-choice
  2. Kill the EngineCore_0 process
INFO 09-08 16:19:21 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=769888) INFO 09-08 16:19:23 [api_server.py:1805] vLLM API server version 0.10.1.1
(APIServer pid=769888) INFO 09-08 16:19:23 [utils.py:326] non-default args: {'model_tag': 'meta-llama/Llama-3.1-8B-Instruct', 'chat_template': 'custom/tool_chat_template_llama3.1_json.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'llama3_json', 'model': 'meta-llama/Llama-3.1-8B-Instruct'}
(APIServer pid=769888) INFO 09-08 16:19:28 [__init__.py:711] Resolved architecture: LlamaForCausalLM
(APIServer pid=769888) INFO 09-08 16:19:28 [__init__.py:1750] Using max model len 131072
(APIServer pid=769888) INFO 09-08 16:19:28 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 09-08 16:19:31 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=770256) INFO 09-08 16:19:33 [core.py:636] Waiting for init message from front-end.
... 
kill -9 770256
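Instead of copying the PID out of the logs, the kill step can be scripted. This is a hedged sketch: the "EngineCore_0" command-line pattern is an assumption about how the worker process appears in the process table, so verify it first with `ps aux | grep EngineCore`:

```shell
# Locate the EngineCore_0 worker by command line and SIGKILL it to
# simulate an abrupt worker death. Excludes this shell's own PID in
# case the pattern matches the invoking command line.
pid=$(pgrep -f "EngineCore_0" | grep -vx "$$" | head -n 1)
if [ -n "$pid" ]; then
    echo "killing EngineCore_0 (pid $pid)"
    kill -9 "$pid"
else
    echo "no EngineCore_0 process found"
fi
```

SIGKILL (rather than SIGTERM) matters here: it gives the worker no chance to clean up, which is exactly the failure mode the monitor thread has to handle.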

Test Result

As-Is

  • The API server process would crash with an unhandled exception.
  • Furthermore, the process would hang after printing the error and would not terminate completely without a manual interruption (e.g., Ctrl+C).
(APIServer pid=769888) ERROR 09-08 16:21:47 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
(APIServer pid=769888) /usr/lib/python3.12/weakref.py:590: RuntimeWarning: No running event loop. zmq.asyncio should be used from within an asyncio loop.
(APIServer pid=769888)   return info.func(*info.args, **(info.kwargs or {}))
(APIServer pid=769888) Exception in thread MPClientEngineMonitor:
(APIServer pid=769888) Traceback (most recent call last):
(APIServer pid=769888)   File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
(APIServer pid=769888)     self.run()
(APIServer pid=769888)   File "/usr/lib/python3.12/threading.py", line 1010, in run
(APIServer pid=769888)     self._target(*self._args, **self._kwargs)
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 565, in monitor_engine_cores
(APIServer pid=769888)     _self.shutdown()
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 517, in shutdown
(APIServer pid=769888)     self._finalizer()
(APIServer pid=769888)   File "/usr/lib/python3.12/weakref.py", line 590, in __call__
(APIServer pid=769888)     return info.func(*info.args, **(info.kwargs or {}))
(APIServer pid=769888)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 350, in __call__
(APIServer pid=769888)     loop = self.output_socket._get_loop()
(APIServer pid=769888)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/zmq/_future.py", line 59, in _get_loop
(APIServer pid=769888)     current_loop = self._default_loop()
(APIServer pid=769888)                    ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/zmq/asyncio.py", line 116, in _default_loop
(APIServer pid=769888)     return asyncio.get_event_loop()
(APIServer pid=769888)            ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/usr/lib/python3.12/asyncio/events.py", line 702, in get_event_loop
(APIServer pid=769888)     raise RuntimeError('There is no current event loop in thread %r.'
(APIServer pid=769888) RuntimeError: There is no current event loop in thread 'MPClientEngineMonitor'.
^C(APIServer pid=769888) INFO 09-08 16:22:04 [launcher.py:101] Shutting down FastAPI HTTP server.
(APIServer pid=769888) INFO:     Shutting down
(APIServer pid=769888) INFO:     Waiting for application shutdown.
(APIServer pid=769888) INFO:     Application shutdown complete.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

To-Be

  • It now correctly handles the exception, logs the expected warning, and initiates a graceful shutdown, as shown in the logs:
(APIServer pid=773614) ERROR 09-08 16:24:02 [core_client.py:586] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
(APIServer pid=773614) /usr/lib/python3.12/weakref.py:590: RuntimeWarning: No running event loop. zmq.asyncio should be used from within an asyncio loop.
(APIServer pid=773614)   return info.func(*info.args, **(info.kwargs or {}))
(APIServer pid=773614) WARNING 09-08 16:24:02 [core_client.py:362] Could not get event loop for async cleanup. Tasks may not be cancelled, sockets will be closed.
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]     outputs = await engine_core.get_output_async()
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 867, in get_output_async
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]     raise self._format_exception(outputs) from None
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=773614) INFO:     Shutting down
(APIServer pid=773614) INFO:     Waiting for application shutdown.
(APIServer pid=773614) INFO:     Application shutdown complete.
(APIServer pid=773614) INFO:     Finished server process [773614]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a RuntimeError that occurs when an engine worker process dies unexpectedly. The root cause, a cleanup finalizer being called from a thread without an active event loop, is correctly identified. The solution of wrapping the event loop retrieval in a try...except block is robust and appropriate. In the absence of an event loop, the new logic performs a best-effort synchronous cleanup of sockets, preventing the server crash and allowing for a graceful shutdown. The changes are well-implemented and the accompanying description and test results clearly demonstrate the fix. I have no further recommendations as the code looks good.


github-actions bot commented Sep 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Signed-off-by: injaeryou <sts07142@naver.com>
Signed-off-by: Injae Ryou <sts07142@naver.com>
@sts07142 sts07142 force-pushed the fix/runtime-error-in-mpclient-engine-monitor branch from 3496407 to d87ee38 on September 8, 2025 at 10:34
@njhill njhill self-assigned this Sep 8, 2025
Member

njhill commented Sep 9, 2025

Thanks for this @sts07142! I have opened a PR with a slightly different fix but included you as co-author: #24540

Author

sts07142 commented Sep 9, 2025

> Thanks for this @sts07142! I have opened a PR with a slightly different fix but included you as co-author: #24540

#24540 is cleaner and works better than mine.
Thank you for adding me as a co-author!

@sts07142 sts07142 closed this Sep 10, 2025

Successfully merging this pull request may close these issues.

[Bug]: RuntimeError: There is no current event loop in thread 'MPClientEngineMonitor'.
