
Conversation


@sts07142 sts07142 commented Sep 8, 2025

Purpose

This PR fixes a RuntimeError that occurs when an engine worker process dies unexpectedly.
When the worker dies, the MPClientEngineMonitor thread calls the BackgroundResources finalizer for cleanup. Because that thread has no active event loop, the call to self.output_socket._get_loop() raised a RuntimeError, crashing the main server process.

This PR handles the exception by wrapping the loop retrieval in a try-except-else block.
If getting the loop fails, it now performs a best-effort synchronous cleanup of sockets and logs a warning, preventing the crash and allowing for a controlled shutdown.
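The pattern described above can be sketched as follows. This is a minimal, hypothetical illustration, not vLLM's actual BackgroundResources code; the class and attribute names here are assumptions, and the real finalizer reaches the loop via self.output_socket._get_loop() rather than asyncio.get_running_loop():

```python
import asyncio


class EngineResources:
    """Sketch of a finalizer that tolerates being called from a thread
    without an event loop. Names are illustrative, not vLLM's code."""

    def __init__(self, sockets):
        self.sockets = sockets  # any objects exposing close()

    def __call__(self):
        try:
            # In a thread with no running event loop (e.g. a monitor
            # thread), this raises RuntimeError instead of returning.
            loop = asyncio.get_running_loop()
        except RuntimeError:
            # Best-effort synchronous fallback: without a loop, async
            # tasks cannot be cancelled, so close the sockets directly
            # (the real fix also logs a warning here).
            for sock in self.sockets:
                sock.close()
        else:
            # Normal path: hand cleanup to the running event loop.
            loop.call_soon_threadsafe(self._close_all)

    def _close_all(self):
        for sock in self.sockets:
            sock.close()
```

The try-except-else split keeps the fallback narrow: only a failed loop lookup takes the synchronous path, while errors during the actual cleanup still propagate normally.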

Resolves #24230, #24305

Test Plan

  1. Run any model
  • In my case: Llama-3.1-8B-Instruct
uv run vllm serve meta-llama/Llama-3.1-8B-Instruct --tool-call-parser llama3_json --chat-template custom/tool_chat_template_llama3.1_json.jinja --enable-auto-tool-choice
  2. Kill the EngineCore_0 process
INFO 09-08 16:19:21 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=769888) INFO 09-08 16:19:23 [api_server.py:1805] vLLM API server version 0.10.1.1
(APIServer pid=769888) INFO 09-08 16:19:23 [utils.py:326] non-default args: {'model_tag': 'meta-llama/Llama-3.1-8B-Instruct', 'chat_template': 'custom/tool_chat_template_llama3.1_json.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'llama3_json', 'model': 'meta-llama/Llama-3.1-8B-Instruct'}
(APIServer pid=769888) INFO 09-08 16:19:28 [__init__.py:711] Resolved architecture: LlamaForCausalLM
(APIServer pid=769888) INFO 09-08 16:19:28 [__init__.py:1750] Using max model len 131072
(APIServer pid=769888) INFO 09-08 16:19:28 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 09-08 16:19:31 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=770256) INFO 09-08 16:19:33 [core.py:636] Waiting for init message from front-end.
... 
kill -9 770256
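Instead of copying the PID out of the logs, the kill step can be scripted. This is a hedged sketch: the "EngineCore_0" command-line pattern is an assumption about how the worker process appears in the process table, so verify it first with `ps aux | grep EngineCore`:

```shell
# Locate the EngineCore_0 worker by command line and SIGKILL it to
# simulate an abrupt worker death. Excludes this shell's own PID in
# case the pattern matches the invoking command line.
pid=$(pgrep -f "EngineCore_0" | grep -vx "$$" | head -n 1)
if [ -n "$pid" ]; then
    echo "killing EngineCore_0 (pid $pid)"
    kill -9 "$pid"
else
    echo "no EngineCore_0 process found"
fi
```

SIGKILL (rather than SIGTERM) matters here: it gives the worker no chance to clean up, which is exactly the failure mode the monitor thread has to handle.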

Test Result

As-Is

  • The API server process would crash with an unhandled exception.
  • Furthermore, the process would hang after printing the error and would not terminate completely without a manual interruption (e.g., Ctrl+C).
(APIServer pid=769888) ERROR 09-08 16:21:47 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
(APIServer pid=769888) /usr/lib/python3.12/weakref.py:590: RuntimeWarning: No running event loop. zmq.asyncio should be used from within an asyncio loop.
(APIServer pid=769888)   return info.func(*info.args, **(info.kwargs or {}))
(APIServer pid=769888) Exception in thread MPClientEngineMonitor:
(APIServer pid=769888) Traceback (most recent call last):
(APIServer pid=769888)   File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
(APIServer pid=769888)     self.run()
(APIServer pid=769888)   File "/usr/lib/python3.12/threading.py", line 1010, in run
(APIServer pid=769888)     self._target(*self._args, **self._kwargs)
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 565, in monitor_engine_cores
(APIServer pid=769888)     _self.shutdown()
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 517, in shutdown
(APIServer pid=769888)     self._finalizer()
(APIServer pid=769888)   File "/usr/lib/python3.12/weakref.py", line 590, in __call__
(APIServer pid=769888)     return info.func(*info.args, **(info.kwargs or {}))
(APIServer pid=769888)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 350, in __call__
(APIServer pid=769888)     loop = self.output_socket._get_loop()
(APIServer pid=769888)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/zmq/_future.py", line 59, in _get_loop
(APIServer pid=769888)     current_loop = self._default_loop()
(APIServer pid=769888)                    ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/zmq/asyncio.py", line 116, in _default_loop
(APIServer pid=769888)     return asyncio.get_event_loop()
(APIServer pid=769888)            ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=769888)   File "/usr/lib/python3.12/asyncio/events.py", line 702, in get_event_loop
(APIServer pid=769888)     raise RuntimeError('There is no current event loop in thread %r.'
(APIServer pid=769888) RuntimeError: There is no current event loop in thread 'MPClientEngineMonitor'.
^C(APIServer pid=769888) INFO 09-08 16:22:04 [launcher.py:101] Shutting down FastAPI HTTP server.
(APIServer pid=769888) INFO:     Shutting down
(APIServer pid=769888) INFO:     Waiting for application shutdown.
(APIServer pid=769888) INFO:     Application shutdown complete.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

To-Be

  • It now correctly handles the exception, logs the expected warning, and initiates a graceful shutdown, as shown in the logs:
(APIServer pid=773614) ERROR 09-08 16:24:02 [core_client.py:586] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
(APIServer pid=773614) /usr/lib/python3.12/weakref.py:590: RuntimeWarning: No running event loop. zmq.asyncio should be used from within an asyncio loop.
(APIServer pid=773614)   return info.func(*info.args, **(info.kwargs or {}))
(APIServer pid=773614) WARNING 09-08 16:24:02 [core_client.py:362] Could not get event loop for async cleanup. Tasks may not be cancelled, sockets will be closed.
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]     outputs = await engine_core.get_output_async()
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]   File "/home/name/vllm_serve/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 867, in get_output_async
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430]     raise self._format_exception(outputs) from None
(APIServer pid=773614) ERROR 09-08 16:24:02 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=773614) INFO:     Shutting down
(APIServer pid=773614) INFO:     Waiting for application shutdown.
(APIServer pid=773614) INFO:     Application shutdown complete.
(APIServer pid=773614) INFO:     Finished server process [773614]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a RuntimeError that occurs when an engine worker process dies unexpectedly. The root cause, a cleanup finalizer being called from a thread without an active event loop, is correctly identified. The solution of wrapping the event loop retrieval in a try...except block is robust and appropriate. In the absence of an event loop, the new logic performs a best-effort synchronous cleanup of sockets, preventing the server crash and allowing for a graceful shutdown. The changes are well-implemented and the accompanying description and test results clearly demonstrate the fix. I have no further recommendations as the code looks good.


github-actions bot commented Sep 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Signed-off-by: injaeryou <sts07142@naver.com>
Signed-off-by: Injae Ryou <sts07142@naver.com>
@sts07142 sts07142 force-pushed the fix/runtime-error-in-mpclient-engine-monitor branch from 3496407 to d87ee38 on September 8, 2025 at 10:34
@njhill njhill self-assigned this Sep 8, 2025
Member

njhill commented Sep 9, 2025

Thanks for this @sts07142! I have opened a PR with a slightly different fix but included you as co-author: #24540

Author

sts07142 commented Sep 9, 2025

> Thanks for this @sts07142! I have opened a PR with a slightly different fix but included you as co-author: #24540

#24540 is cleaner and works better than mine.
Thank you for adding me as a co-author!

@sts07142 sts07142 closed this Sep 10, 2025

Successfully merging this pull request may close these issues.

[Bug]: RuntimeError: There is no current event loop in thread 'MPClientEngineMonitor'.
