
[core][distributed] support variable length object in shm broadcast #5768


Closed
wants to merge 6 commits into from

Conversation

youkaichao
Member

The shm transport introduced in #5399 reserves a fixed chunk size (1 MB) for each object, which is not flexible and wastes shared memory space.

This PR adds variable-length support. The initialization arguments are now the max total bytes and the max number of objects (essentially the queue size). The enqueue operation blocks only when either limit is reached.

In most cases it should not block: we only have a few broadcasts in vLLM, and the workers read them before the next broadcast, so the queue size will not grow dramatically.

Variable-length support is more complicated than I thought; the high-watermark and low-watermark logic is quite difficult to get right. I added a stress test that sends over 400 MB of data in 10K messages, which should give us enough confidence that this implementation is correct.
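To illustrate the scheme described above, here is a minimal single-process sketch of a ring buffer that packs variable-length pickled objects and refuses to enqueue only when either the total byte budget or the object-count budget is exhausted. The class name `VarLenRingBuffer` and its methods are hypothetical, not the PR's actual classes; the real implementation sits on top of shared memory with per-reader synchronization, which is omitted here.

```python
# Minimal sketch: a variable-length ring buffer with a byte budget and an
# object-count budget. `head` is the low watermark (oldest live byte) and
# `tail` is the high watermark (next byte to write); both wrap around.
import pickle
from collections import deque


class VarLenRingBuffer:
    def __init__(self, max_total_bytes: int, max_num_objects: int):
        self.buf = bytearray(max_total_bytes)  # stand-in for shared memory
        self.capacity = max_total_bytes
        self.max_num_objects = max_num_objects
        self.head = 0          # low watermark: next byte to be freed
        self.tail = 0          # high watermark: next byte to be written
        self.used = 0          # bytes currently occupied
        self.queue = deque()   # (offset, length) of pending messages

    def _write(self, data: bytes) -> int:
        """Copy `data` at the tail, wrapping around the end of the buffer."""
        offset = self.tail
        first = min(len(data), self.capacity - offset)
        self.buf[offset:offset + first] = data[:first]
        self.buf[0:len(data) - first] = data[first:]
        self.tail = (offset + len(data)) % self.capacity
        self.used += len(data)
        return offset

    def enqueue(self, obj) -> bool:
        """Try to append one object; return False if the caller must wait,
        i.e. either the byte budget or the object budget is full."""
        data = pickle.dumps(obj)
        if len(self.queue) >= self.max_num_objects:
            return False
        if self.used + len(data) > self.capacity:
            return False
        self.queue.append((self._write(data), len(data)))
        return True

    def dequeue(self):
        """Pop the oldest object and advance the low watermark."""
        offset, length = self.queue.popleft()
        first = min(length, self.capacity - offset)
        data = bytes(self.buf[offset:offset + first]) + bytes(self.buf[:length - first])
        self.head = (offset + length) % self.capacity
        self.used -= length
        return pickle.loads(data)


if __name__ == "__main__":
    ring = VarLenRingBuffer(max_total_bytes=1 << 20, max_num_objects=8)
    assert ring.enqueue({"shape": (4, 128), "dtype": "float16"})
    assert ring.dequeue() == {"shape": (4, 128), "dtype": "float16"}
```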

@njhill
Member

njhill commented Jun 23, 2024

reserves a fixed chunk size (1 MB) for each object, which is not flexible and wastes shared memory space.

@youkaichao 1MB doesn't sound like much? Could we avoid the additional complexity until it becomes clear that it's needed?

@youkaichao
Member Author

@njhill I agree this is not urgent now. But the point is not resource waste; the most important thing is how large an object we can broadcast. Currently we only broadcast shape/dtype information, but in the future we'd like to broadcast the whole input and separate the driver & TP 0 worker. At that point the objects we broadcast will be large (containing input IDs, block tables, etc.), and it is difficult to guess the upper bound.

@Yard1
Collaborator

Yard1 commented Jun 24, 2024

I also think this optimization is premature. RAM is cheap; we can easily set this buffer to 10 or even 100 MB. This seems like a lot of complexity for little gain. We should revisit this only if it becomes an issue in the future.

If we cannot guess the upper bound, I think simply reinitializing the buffer when a message that's too big is encountered, with e.g. the desired size + 50% (up to some upper limit), would be enough.
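For illustration, a minimal sketch of the grow-on-demand policy suggested here: keep a fixed-size buffer and, only when an incoming message does not fit, reallocate it at 1.5x the required size up to a hard cap. The names `GrowableBuffer` and `reallocate` are hypothetical, not vLLM APIs; the 10 MB initial size and 100 MB cap just echo the figures mentioned above.

```python
# Sketch of "reinitialize the buffer when a message is too big":
# grow to 1.5x the needed size, but never beyond a hard upper limit.
MAX_BUFFER_BYTES = 100 * 1024 * 1024  # upper limit (e.g. 100 MB)


class GrowableBuffer:
    def __init__(self, initial_bytes: int = 10 * 1024 * 1024):
        self.buf = bytearray(initial_bytes)

    def reallocate(self, needed: int) -> None:
        """Grow the buffer to needed * 1.5, capped at MAX_BUFFER_BYTES."""
        new_size = min(int(needed * 1.5), MAX_BUFFER_BYTES)
        if needed > new_size:
            raise ValueError(f"message of {needed} bytes exceeds the hard cap")
        self.buf = bytearray(new_size)

    def write(self, data: bytes) -> None:
        if len(data) > len(self.buf):
            self.reallocate(len(data))
        self.buf[:len(data)] = data


buf = GrowableBuffer()
buf.write(b"x" * (20 * 1024 * 1024))   # does not fit in 10 MB, grows to ~30 MB
assert len(buf.buf) == 30 * 1024 * 1024
```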

@youkaichao
Member Author

Thanks for your opinions ❤️ I'm also afraid this might introduce more bugs, so I'm fine with the criticism!

We might need this when we use it for all the batched RPC calls proposed in this RFC, especially with async scheduling and many queued RPC calls. But we don't need it at present; let's revisit this later.
