Skip to content

Conversation

@Isotr0py
Copy link
Member

@Isotr0py Isotr0py commented Aug 12, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

FIX #22831
FIX #22839

Purpose

Previous Description

Seems these lines caused the hanging issue:

cancel_task_threadsafe(self.output_queue_task)
cancel_task_threadsafe(self.stats_update_task)

Tried to reproduce this issue locally, but found it hangs at gc for ~5min between tests instead of deadlock:

Current thread 0x0000792424c13740 (most recent call first):
  Garbage-collecting
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/zmq/sugar/context.py", line 264 in term
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/zmq/sugar/context.py", line 322 in destroy
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/zmq/sugar/context.py", line 140 in __del__
  File "/kaggle/working/vllm/vllm/distributed/parallel_state.py", line 1277 in cleanup_dist_env_and_memory
  File "/kaggle/working/vllm/tests/conftest.py", line 199 in cleanup_fixture
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/_pytest/fixtures.py", line 907 in _teardown_yield_fixture

Test Plan

pytest -s -v tests/v1/test_async_llm_dp.py

Test Result

Test should be no longer hanging now.

(Optional) Documentation Update

Isotr0py and others added 2 commits August 12, 2025 14:24
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Aug 12, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a hanging issue in distributed tests by modifying how asynchronous tasks are cancelled during cleanup. The change in vllm/utils/__init__.py attempts to make task cancellation more robust. However, the current implementation accesses asyncio.Task properties from a non-event-loop thread, which is not thread-safe and can lead to race conditions. I've provided a suggestion to perform the check and cancellation in a thread-safe manner. The changes in the test file appear to be a workaround to force garbage collection, which may not be necessary with a proper fix but is otherwise harmless.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py changed the title [CI/Build][Bugfix] Investigate asyncllm hanging issue on distributed tests [CI/Build][Bugfix] Fix AsyncLLM hanging issue due to unclosed ZMQ socket Aug 13, 2025
@Isotr0py Isotr0py requested a review from DarkLight1337 August 13, 2025 13:17
@DarkLight1337
Copy link
Member

cc @njhill

@robertgshaw2-redhat
Copy link
Collaborator

CONFIRM THIS FIXES:

  • Distributed Tests (2 GPUs)
  • Distributed Tests (4 GPUs)

@mgoin
Copy link
Member

mgoin commented Aug 13, 2025

Seems like "Distributed Tests (2 GPUs)" still fails due to flaky test, maybe HF download issue? https://buildkite.com/vllm/fastcheck/builds/35353/steps/canvas?jid=0198a372-6145-425d-a505-4cb075134efc#0198a372-6145-425d-a505-4cb075134efc/7-9305

[2025-08-13T15:03:09Z] FAILED v1/shutdown/test_delete.py::test_llm_delete[False-True-2-meta-llama/Llama-3.2-1B] - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_0': 1}

@njhill
Copy link
Member

njhill commented Aug 14, 2025

I've opened #22877 to hopefully fix this in a more robust way. We shouldn't need to keep references to these other sockets because they will be closed within the async task when it's cancelled.

@Isotr0py Isotr0py closed this Aug 14, 2025
@Isotr0py Isotr0py deleted the dp-deadlock branch August 14, 2025 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI Failure][NIGHTLY FIRE DRILL]: Distributed Tests (2 GPUS) [CI Failure][NIGHTLY FIRE DRILL]: Distributed Tests (4GPUs)

5 participants