Make ray optional for single-node deployment #2898
Conversation
This would be a great feature, as I am unable to run vLLM + Mixtral on Triton Inference Server because the Ray workers run into OOM issues.
@akrish2011 Ray should not impact memory usage - I'd wager something is misconfigured in your case.
@Yard1 I had to override Triton's config so that the GPUs are made available to the Ray workers. I made Triton Inference Server run on CPU so that the machine's GPUs could be consumed by the Ray workers. This didn't happen when I was running my application on a single GPU on Triton: no Ray Serve was being used, and vLLM was able to utilize the GPU exposed by Triton Inference Server. It is good practice to remove dependencies like Ray Serve when running LLMs in production; if vLLM depends on Ray Serve just for communication or broadcasting messages, that could become a problem as it gets more widely used.
Just to clarify, Ray Serve is not used in vLLM - Ray Core (the low level API) is. |
When a single GPU is used, Ray is also not used (it's only used when TP > 1, IIUC). @akrish2011 do you mind giving me a little more detail? Besides this PR, I'd like to understand what config needed to be overridden.
I can confirm that Ray is not used when TP=1; I've verified this in profiling work I've done on vLLM.
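As a concrete illustration (standard vLLM usage, not code from this PR; the model name is just a placeholder), a single-GPU run with tensor_parallel_size=1 has only one worker, so no distributed backend is involved at runtime:

```python
# Illustrative only: with tensor_parallel_size=1 there is a single worker,
# so no Ray cluster or Ray tasks are needed at runtime.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
params = SamplingParams(max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```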
It would be nice to move task submission behind a simple abstraction instead of the current if-ray/else branches, but I'm thinking of doing that as a follow-on since it may make the deltas here harder to review.
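For context, here is a minimal sketch of what such an abstraction could look like. The class and method names are hypothetical and do not match vLLM's actual implementation; the point is that task submission goes through one interface and the Ray-vs-multiprocessing decision is made once at construction time rather than at every call site.

```python
# Hypothetical sketch of an executor abstraction; names are illustrative.
from abc import ABC, abstractmethod
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable, List


class WorkerExecutor(ABC):
    """Runs a callable on every worker and gathers the results."""

    @abstractmethod
    def run_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        ...


class MultiprocessingExecutor(WorkerExecutor):
    """Single-node backend built on the standard library only."""

    def __init__(self, num_workers: int) -> None:
        self._num_workers = num_workers
        self._pool = ProcessPoolExecutor(max_workers=num_workers)

    def run_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        futures = [self._pool.submit(fn, *args) for _ in range(self._num_workers)]
        return [f.result() for f in futures]


class RayExecutor(WorkerExecutor):
    """Backend that dispatches the same calls as Ray tasks."""

    def __init__(self, num_workers: int) -> None:
        import ray  # imported lazily so Ray stays an optional dependency
        ray.init(ignore_reinit_error=True)
        self._ray = ray
        self._num_workers = num_workers

    def run_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        remote_fn = self._ray.remote(fn)
        refs = [remote_fn.remote(*args) for _ in range(self._num_workers)]
        return self._ray.get(refs)
```

Callers would then hold a single WorkerExecutor and never branch on whether Ray is installed.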
@zhuohan123 @WoosukKwon @simon-mo WDYT about getting this one in? It has been working well for us in internal deployments.
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Sahil Suneja <suneja@us.ibm.com>
@zhuohan123 I have replaced this with #3466 based on your new abstraction, PTAL!
Ray is a powerful platform for general-purpose distributed computing, but potentially overkill for the specific requirements of real-time synchronized inferencing between GPUs on a single node.
We would prefer to have a "lightweight" option without the Ray dependency for non-Ray cluster environments. This also helps with production security compliance.
With the changes in this PR, Ray will continue to be used by default for parallel workers if it is installed; otherwise vanilla Python multiprocessing is used. This can also be overridden with --no-worker-use-ray. Worker processes are shut down when the LLMEngine is garbage collected (see the sketch below).
This PR was co-authored by @sahilsuneja1.
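To make the described behavior concrete, here is a rough sketch of the selection and cleanup logic. All names here (ray_is_available, LocalWorkerGroup, Engine, worker_main) are hypothetical stand-ins, not the PR's actual code: prefer Ray when it is importable and not explicitly disabled, otherwise fall back to plain Python multiprocessing, and tie worker shutdown to garbage collection of the engine object.

```python
# Hypothetical sketch of the fallback behavior; not vLLM's actual code.
import importlib.util
import multiprocessing as mp
import weakref
from typing import Callable, Optional


def ray_is_available() -> bool:
    return importlib.util.find_spec("ray") is not None


class LocalWorkerGroup:
    """Plain-multiprocessing stand-in for Ray workers on a single node."""

    def __init__(self, world_size: int, worker_main: Callable[[int, int], None]) -> None:
        ctx = mp.get_context("spawn")
        self.procs = [
            ctx.Process(target=worker_main, args=(rank, world_size), daemon=True)
            for rank in range(world_size)
        ]
        for proc in self.procs:
            proc.start()

    def shutdown(self) -> None:
        for proc in self.procs:
            proc.terminate()
            proc.join()


class Engine:
    """Toy engine showing the selection logic; not vLLM's LLMEngine."""

    def __init__(self, world_size: int, worker_main: Callable[[int, int], None],
                 worker_use_ray: Optional[bool] = None) -> None:
        # None means "auto": use Ray only if it is installed
        # (a flag like --no-worker-use-ray would map to worker_use_ray=False).
        use_ray = ray_is_available() if worker_use_ray is None else worker_use_ray
        if use_ray:
            import ray
            ray.init(ignore_reinit_error=True)
            self.worker_group = None  # Ray actors would be created here instead
        else:
            self.worker_group = LocalWorkerGroup(world_size, worker_main)
            # Shut the workers down when this engine is garbage collected,
            # mirroring the "shut down on LLMEngine GC" behavior described above.
            weakref.finalize(self, self.worker_group.shutdown)
```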