Skip to content

feat: add data parallel rank to KVEventBatch #18925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jun 4, 2025

Conversation

PeaBrane
Copy link
Contributor

@PeaBrane PeaBrane commented May 29, 2025

Overview:

  1. Add data_parallel_rank to KvEventBatch, so it can be published along with the block storage / removal events
  2. Perform a port offset based on the dp rank for EventPublisher so there is no port conflict
  3. Allow the mock subscriber to subscribe to multiple ports for testing
  4. Added a test for testing the behavior of the publisher under async multi-proc, dp=2 and tp=2

Resolves #18924

Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PeaBrane looks good to me, just a few minor comments.

@PeaBrane PeaBrane force-pushed the rupei/kv-events-dp branch from b85d1a8 to 4752d38 Compare May 31, 2025 03:57
@mergify mergify bot added the documentation Improvements or additions to documentation label May 31, 2025
@PeaBrane
Copy link
Contributor Author

@njhill thanks for such a careful review! I accepted all your suggestions / edits. Changes should scheduler.py should be minimal now, and most of the changes moved into kv_events.py

vllm/config.py Outdated
Comment on lines 3661 to 3663
data_parallel_rank: int = 0
"""The DP rank where the event is emitted from
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't expect this to be set explicitly right? It should always be set to the value of data_parallel_rank from parallel_config. It's here just to pass in to the publisher factory rather than something to be provided as part of the "external" KVEventsConfig like the other values.

That should be reflected in the comment at least, and perhaps could rename it to _data_parallel_rank or even pass to the factory as a second arg rather than being part of this config class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is a similar pattern to the Metrics stuff where the only argument to the stat logger factory is a config and a dp rank.

StatLoggerFactory = Callable[[VllmConfig, int], "StatLoggerBase"]

I think it would make the most sense to remove from the config and add to the create() function

@classmethod
    def create(cls, config: Optional[KVEventsConfig], dp_rank: int) -> EventPublisher:

The only small problem here is with the StatLogger the user owns the Factory method, while here it is written in vLLM, so we have to decide if we want to always pass the dp_rank to the Publisher or how exactly to manage that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make it an optional kwarg, lemme try that

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PeaBrane @alec-flowers.

Just one small thing to fix I think.

Comment on lines 382 to 395
# Create a modified config with port offsetting
modified_config = copy.deepcopy(kv_events_config)

# Apply port offsetting to the endpoint
original_endpoint = modified_config.endpoint
modified_config.endpoint = _offset_endpoint_port(original_endpoint,
data_parallel_rank)
modified_config.data_parallel_rank = data_parallel_rank

# Apply port offsetting to the replay_endpoint if it exists
if modified_config.replay_endpoint:
original_replay_endpoint = modified_config.replay_endpoint
modified_config.replay_endpoint = _offset_endpoint_port(
original_replay_endpoint, data_parallel_rank)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be fine as a follow-on if we want to get this merged but I'm thinking it may be better to also move the port offsetting inside the publisher implementation since it's kind of ZMQ-specific and tied to the DP rank that we're now passing. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it shouldn't be outside. It would break any other publisher.

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 3, 2025
PeaBrane added 4 commits June 2, 2025 18:23
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
PeaBrane added 6 commits June 2, 2025 18:23
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@alec-flowers alec-flowers force-pushed the rupei/kv-events-dp branch 2 times, most recently from 9f03056 to ed77172 Compare June 3, 2025 01:25
PeaBrane and others added 7 commits June 2, 2025 18:34
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>

Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>

Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>

Signed-off-by: alec-flowers <aflowers@nvidia.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: alec-flowers <aflowers@nvidia.com>
@alec-flowers alec-flowers force-pushed the rupei/kv-events-dp branch 2 times, most recently from 6ae7ef2 to 8bf0885 Compare June 3, 2025 01:43
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
@mergify mergify bot added the ci/build label Jun 3, 2025
@njhill njhill enabled auto-merge (squash) June 3, 2025 18:05
@njhill
Copy link
Member

njhill commented Jun 3, 2025

Not sure why there's a bunch of apparently unrelated test failures now :-(

@simon-mo simon-mo merged commit b712be9 into vllm-project:main Jun 4, 2025
84 of 91 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/build documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed v1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Include data parallel rank in Kv Events
4 participants