feat: add data parallel rank to KVEventBatch #18925
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only a small, essential subset of CI tests would run. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Force-pushed from 78ac551 to aedd6a3
Thanks @PeaBrane, looks good to me, just a few minor comments.
Force-pushed from b85d1a8 to 4752d38
@njhill thanks for such a careful review! I accepted all your suggestions / edits. Changes should
vllm/config.py
Outdated
```python
data_parallel_rank: int = 0
"""The DP rank where the event is emitted from."""
```
I guess we don't expect this to be set explicitly, right? It should always be set to the value of `data_parallel_rank` from `parallel_config`. It's here just to pass in to the publisher factory, rather than something to be provided as part of the "external" `KVEventsConfig` like the other values.
That should be reflected in the comment at least, and perhaps we could rename it to `_data_parallel_rank`, or even pass it to the factory as a second arg rather than making it part of this config class?
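For illustration, a minimal sketch of the first option (keep the field but make it clearly internal), using simplified stand-in classes; `prepare_events_config` and both dataclasses below are hypothetical, not vLLM's actual definitions:

```python
from dataclasses import dataclass, field


@dataclass
class ParallelConfig:
    """Simplified stand-in for the parallel config."""
    data_parallel_rank: int = 0


@dataclass
class KVEventsConfig:
    """Simplified stand-in for the user-facing events config."""
    publisher: str = "zmq"
    endpoint: str = "tcp://*:5557"
    # Internal-only field (note the leading underscore): never set by
    # users, always overwritten from parallel_config before the
    # publisher factory runs.
    _data_parallel_rank: int = field(default=0, repr=False)


def prepare_events_config(events: KVEventsConfig,
                          parallel: ParallelConfig) -> KVEventsConfig:
    # Copy the DP rank from the parallel config; it is not part of the
    # "external" KVEventsConfig surface.
    events._data_parallel_rank = parallel.data_parallel_rank
    return events
```

The second option, passing the rank to the factory instead of keeping it on the config, is what the discussion below converges on.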
I agree, this is a similar pattern to the Metrics stuff, where the only arguments to the stat logger factory are a config and a DP rank:

```python
StatLoggerFactory = Callable[[VllmConfig, int], "StatLoggerBase"]
```

I think it would make the most sense to remove it from the config and add it to the `create()` function:

```python
@classmethod
def create(cls, config: Optional[KVEventsConfig], dp_rank: int) -> EventPublisher:
```

The only small problem here is that with the StatLogger the user owns the factory method, while here it is written in vLLM, so we have to decide whether we want to always pass the dp_rank to the Publisher, or how exactly to manage that.
can make it an optional kwarg, lemme try that
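A minimal sketch of that optional-kwarg shape, assuming a factory-style `create()`; the class names and signature here are illustrative rather than vLLM's actual `EventPublisherFactory` API:

```python
from typing import Any, Optional


class EventPublisher:
    """Illustrative stand-in for a concrete publisher implementation."""

    def __init__(self, config: Any, data_parallel_rank: int) -> None:
        self.config = config
        self.data_parallel_rank = data_parallel_rank


class EventPublisherFactory:
    @classmethod
    def create(cls, config: Optional[Any],
               data_parallel_rank: int = 0) -> Optional[EventPublisher]:
        # Optional kwarg with a default of 0, so existing call sites that
        # only pass a config (e.g. single-rank runs) keep working.
        if config is None:
            return None
        return EventPublisher(config, data_parallel_rank)
```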
Thanks @PeaBrane @alec-flowers.
Just one small thing to fix I think.
Force-pushed from 72a5e0c to c21d367
vllm/distributed/kv_events.py
Outdated
```python
# Create a modified config with port offsetting
modified_config = copy.deepcopy(kv_events_config)

# Apply port offsetting to the endpoint
original_endpoint = modified_config.endpoint
modified_config.endpoint = _offset_endpoint_port(original_endpoint,
                                                 data_parallel_rank)
modified_config.data_parallel_rank = data_parallel_rank

# Apply port offsetting to the replay_endpoint if it exists
if modified_config.replay_endpoint:
    original_replay_endpoint = modified_config.replay_endpoint
    modified_config.replay_endpoint = _offset_endpoint_port(
        original_replay_endpoint, data_parallel_rank)
```
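The `_offset_endpoint_port` helper is referenced but not shown in the snippet; below is a sketch of what such a helper could do for TCP endpoints (the body is an assumption, only the name and call shape come from the diff):

```python
def _offset_endpoint_port(endpoint: str, data_parallel_rank: int) -> str:
    """Offset the port of a TCP endpoint by the DP rank.

    For example, "tcp://*:5557" with rank 2 becomes "tcp://*:5559".
    Endpoints without a TCP port (e.g. inproc://) are returned unchanged
    in this sketch.
    """
    prefix = "tcp://"
    if not endpoint.startswith(prefix) or ":" not in endpoint[len(prefix):]:
        return endpoint
    host, port = endpoint[len(prefix):].rsplit(":", 1)
    return f"{prefix}{host}:{int(port) + data_parallel_rank}"
```

With `dp=2`, rank 0 would publish on the configured port and rank 1 on the next port up, which is what avoids the port conflict mentioned in the overview.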
I think this would be fine as a follow-on if we want to get this merged, but I'm thinking it may be better to also move the port offsetting inside the publisher implementation, since it's kind of ZMQ-specific and tied to the DP rank that we're now passing. WDYT?
Yes, it shouldn't be outside. It would break any other publisher.
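A rough sketch of the follow-on being suggested, where the ZMQ publisher applies the offset itself (reusing the `_offset_endpoint_port` sketch above); the class name and constructor here are assumptions, not vLLM's actual publisher:

```python
from typing import Optional


class ZmqEventPublisherSketch:
    """Illustrative only: the port offset lives inside the ZMQ-specific
    publisher, so non-ZMQ publishers never see endpoint rewriting."""

    def __init__(self, endpoint: str, replay_endpoint: Optional[str],
                 data_parallel_rank: int) -> None:
        self.data_parallel_rank = data_parallel_rank
        # Offset the ports here rather than in shared factory code.
        self.endpoint = _offset_endpoint_port(endpoint, data_parallel_rank)
        self.replay_endpoint = (
            _offset_endpoint_port(replay_endpoint, data_parallel_rank)
            if replay_endpoint else None)
```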
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Force-pushed from 9f03056 to ed77172
Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Yan Ru Pei <yanrpei@gmail.com> Signed-off-by: alec-flowers <aflowers@nvidia.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: alec-flowers <aflowers@nvidia.com>
Force-pushed from 6ae7ef2 to 8bf0885
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Force-pushed from 8bf0885 to 9c40bc9
Thanks @PeaBrane @alec-flowers
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Not sure why there's a bunch of apparently unrelated test failures now :-(
Overview:
- Added `data_parallel_rank` to `KVEventBatch`, so it can be published along with the block storage / removal events
- Per-rank port offsetting for each `EventPublisher`, so there is no port conflict
- Tested with `dp=2` and `tp=2`

Resolves #18924
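For context, a simplified sketch of what the new field on the event batch looks like; field names besides `data_parallel_rank` are illustrative and this is not vLLM's exact definition:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class KVEventBatchSketch:
    """Simplified stand-in for KVEventBatch with the new field."""
    ts: float = field(default_factory=time.time)
    events: list[Any] = field(default_factory=list)
    # New in this PR: the DP rank the events were emitted from, so a
    # downstream consumer can tell which data-parallel replica's KV cache
    # the block storage / removal events describe.
    data_parallel_rank: Optional[int] = None
```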