
Conversation

@youkaichao (Member)

fixes #6700

This is caused by incorrect usage of zmq, or by a bug in zmq, reported at zeromq/libzmq#4713.

By using an XPUB socket, we can make sure all subscribers have already subscribed before we publish (broadcast).

Tested locally: previously it hung about once in 20 runs.

Now it runs without any problem across 1000 runs.
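
For anyone unfamiliar with the XPUB handshake, here is a minimal sketch of the idea in pyzmq. The addresses, topic, and single-process layout are illustrative only, not vLLM's actual code:

import zmq

ctx = zmq.Context()

# Publisher side: XPUB behaves like PUB, but it also delivers subscription
# messages (b"\x01" + topic), so the publisher can tell when subscribers
# are ready before it broadcasts anything.
pub = ctx.socket(zmq.XPUB)
pub.setsockopt(zmq.XPUB_VERBOSE, 1)  # also report duplicate subscriptions
pub.bind("tcp://127.0.0.1:5555")     # illustrative address

# Subscriber side (normally in another process): a plain SUB socket is
# enough; XSUB is only needed when connecting to many publishers.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5555")
sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to everything

# Block until the subscription message arrives, so the first broadcast
# is not dropped because the subscriber was not yet registered.
msg = pub.recv()
assert msg[0] == 1                   # b"\x01" prefix marks a subscription
pub.send(b"hello")
print(sub.recv())                    # b"hello"

With plain PUB/SUB there is no such handshake, so a message published before the subscription is fully established is silently dropped, which matches the kind of race this PR works around.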

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

raise ValueError("Invalid HTTP URL: A valid HTTP URL "
                 "must have scheme 'http' or 'https'.")

def _headers(self, **extras: str) -> Mapping[str, str]:
Member Author


This is a lint error I'm fixing along the way.


@davidthomas426 left a comment


LGTM, as long as XSUB isn't needed to make this semantically correct. I haven't dug into XPUB/XSUB quite enough to know that for sure myself yet.

It's nice to be able to fix an issue AND simplify this code a lot :)

import torch.distributed as dist
from torch.distributed import ProcessGroup
-from zmq import PUB, REP, REQ, SUB, SUBSCRIBE, Context  # type: ignore
+from zmq import SUB, SUBSCRIBE, XPUB, XPUB_VERBOSE, Context  # type: ignore


Do we need XSUB, too?

Member Author


No, you can check https://netmq.readthedocs.io/en/latest/xpub-xsub/.

XPUB connects to SUB sockets.

XSUB is used to connect to many PUB sockets, which is not our use case.
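
For contrast, a rough sketch of where XSUB would come in: an XSUB/XPUB proxy that fans in many publishers and fans out to many subscribers. The addresses are illustrative, and this is not a pattern this PR needs:

import zmq

ctx = zmq.Context()

# XSUB faces the many publishers (their PUB sockets connect here).
xsub = ctx.socket(zmq.XSUB)
xsub.bind("tcp://127.0.0.1:5559")

# XPUB faces the many subscribers (their SUB sockets connect here).
xpub = ctx.socket(zmq.XPUB)
xpub.bind("tcp://127.0.0.1:5560")

# zmq.proxy forwards messages and subscription frames between the two,
# so subscriptions propagate upstream to all publishers.
zmq.proxy(xsub, xpub)

Since this PR has a single publisher broadcasting to its subscribers, a lone XPUB socket talking directly to SUB sockets is sufficient.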

@youkaichao youkaichao added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 24, 2024
@youkaichao youkaichao merged commit 740374d into vllm-project:main Jul 25, 2024
@youkaichao youkaichao deleted the fix_zmq_hang branch July 25, 2024 00:37
russellb pushed a commit to russellb/vllm that referenced this pull request Sep 18, 2024
n1hility added a commit to opendatahub-io/vllm that referenced this pull request Oct 2, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed)


Development

Successfully merging this pull request may close these issues.

[Bug]: vLLM 0.5.3 is getting stuck at LLAMA 3.1 405B FP8 model loading

2 participants