-
-
Notifications
You must be signed in to change notification settings - Fork 11.3k
[CI/Build][Bugfix] Fix AsyncLLM hanging issue due to unclosed ZMQ socket #22744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a hanging issue in distributed tests by modifying how asynchronous tasks are cancelled during cleanup. The change in vllm/utils/__init__.py attempts to make task cancellation more robust. However, the current implementation accesses asyncio.Task properties from a non-event-loop thread, which is not thread-safe and can lead to race conditions. I've provided a suggestion to perform the check and cancellation in a thread-safe manner. The changes in the test file appear to be a workaround to force garbage collection, which may not be necessary with a proper fix but is otherwise harmless.
|
cc @njhill |
|
CONFIRM THIS FIXES:
|
|
Seems like "Distributed Tests (2 GPUs)" still fails due to flaky test, maybe HF download issue? https://buildkite.com/vllm/fastcheck/builds/35353/steps/canvas?jid=0198a372-6145-425d-a505-4cb075134efc#0198a372-6145-425d-a505-4cb075134efc/7-9305 |
|
I've opened #22877 to hopefully fix this in a more robust way. We shouldn't need to keep references to these other sockets because they will be closed within the async task when it's cancelled. |
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.FIX #22831
FIX #22839
Purpose
Previous Description
Seems these lines caused the hanging issue:
vllm/vllm/v1/engine/core_client.py
Lines 346 to 347 in e5d3d63
Tried to reproduce this issue locally, but found it hangs at gc for ~5min between tests instead of deadlock:
stats_update_task, task's unclosed ZMQ socket can cause a hanging.vllm/vllm/v1/engine/core_client.py
Lines 971 to 977 in 98deac3
stats_update_taskto fix this hanging issue.Test Plan
Test Result
Test should be no longer hanging now.
(Optional) Documentation Update