Make ray optional for single-node deployment #2898

Closed
wants to merge 6 commits from the ray-optional branch

Conversation
Conversation

njhill (Member) commented Feb 17, 2024

Ray is a powerful platform for general-purpose distributed computing, but it is potentially overkill for the specific requirements of real-time, synchronized inference between GPUs on a single node.

We would prefer to have a "lightweight" option without the ray dependency for non-ray cluster environments. This also helps with production security compliance.

With the changes in this PR, Ray will continue to be used by default for parallel workers if it is installed; otherwise vanilla Python multiprocessing is used. This can also be overridden with --no-worker-use-ray.
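
As a rough sketch of the selection behavior described above (the helper below is illustrative, not the PR's actual code):

```python
# Hypothetical sketch of the backend selection described above; the helper and
# its argument handling are illustrative, not the PR's actual implementation.
import importlib.util
from typing import Optional

def choose_worker_backend(worker_use_ray: Optional[bool] = None) -> str:
    """Return 'ray' or 'multiprocessing' for driving parallel workers."""
    if worker_use_ray is None:
        # No explicit --worker-use-ray / --no-worker-use-ray flag given:
        # default to Ray only when it is importable.
        worker_use_ray = importlib.util.find_spec("ray") is not None
    return "ray" if worker_use_ray else "multiprocessing"
```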

Worker processes are shut down when the LLMEngine is garbage collected.
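
A minimal sketch of tying worker shutdown to engine garbage collection (class and function names here are hypothetical, not the PR's code):

```python
# Hypothetical sketch: multiprocessing workers are cleaned up when the owning
# engine object is garbage collected, via weakref.finalize.
import multiprocessing as mp
import weakref

def _worker_loop(conn):
    # Placeholder worker: handle requests from the engine until told to stop.
    while True:
        if conn.recv() is None:
            break

def _shutdown(procs):
    for p in procs:
        if p.is_alive():
            p.terminate()
            p.join()

class EngineWithWorkers:
    def __init__(self, num_workers: int):
        self._conns, self._procs = [], []
        for _ in range(num_workers):
            parent, child = mp.Pipe()
            proc = mp.Process(target=_worker_loop, args=(child,), daemon=True)
            proc.start()
            self._conns.append(parent)
            self._procs.append(proc)
        # Runs when this engine is garbage collected (or at interpreter exit),
        # without keeping the engine alive the way atexit + a bound method would.
        weakref.finalize(self, _shutdown, self._procs)
```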

This PR was co-authored by @sahilsuneja1.

akrish2011 commented
This would be a great feature, as I am unable to run vLLM + Mixtral on Triton Inference Server because the Ray workers run into OOM issues.

Yard1 (Collaborator) commented Feb 19, 2024

@akrish2011 Ray should not impact the memory usage - I'd wager something is misconfigured in your case

akrish2011 commented
@Yard1 I had to override Triton's config so that the GPU is made available to the Ray workers: I ran Triton Inference Server as CPU-only so that the machine's GPUs could be consumed by the Ray workers. This wasn't necessary when I ran my application on a single GPU with Triton, since no Ray Serve was involved and vLLM could use the GPU exposed by Triton Inference Server directly. It is good practice to remove dependencies like Ray Serve when running LLMs in production; if vLLM depends on Ray Serve just for communication or broadcasting messages, that dependency could become a problem as vLLM sees wider use.

Yard1 (Collaborator) commented Feb 20, 2024

Just to clarify, Ray Serve is not used in vLLM - Ray Core (the low level API) is.
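
For readers unfamiliar with the distinction, here is a minimal illustration of the Ray Core actor API (the `Worker` class below is a made-up example, not vLLM's worker code):

```python
# Minimal Ray Core (not Ray Serve) illustration: remote actors plus ray.get(),
# which is the level of API the maintainers say vLLM uses.
import ray

ray.init()

@ray.remote
class Worker:
    def ping(self, x):
        return x * 2

workers = [Worker.remote() for _ in range(2)]
print(ray.get([w.ping.remote(21) for w in workers]))  # [42, 42]
```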

rkooo567 (Collaborator) commented Feb 20, 2024

When a single GPU is used, Ray is also not used (it's only used when TP > 1, IIUC).

@akrish2011 do you mind giving me a little more detail on this?

> I had to override Triton's config so that the GPU is made available to the Ray workers.

Aside from this PR, I'd like to understand which config needed to be overridden.

lroberts7 commented
> When a single GPU is used, Ray is also not used (it's only used when TP > 1, IIUC).

I can confirm that Ray is not used when TP=1; I've verified this in profiling work I've done on vLLM.
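
A hypothetical usage example of the behavior described here (the model name is just a placeholder):

```python
# Illustrative only: with tensor_parallel_size=1 there are no parallel workers,
# so neither Ray nor multiprocessing is involved; the choice introduced by this
# PR only matters when tensor_parallel_size > 1.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)   # no Ray
# llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2) # parallel workers
```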

njhill (Member, Author) commented Feb 26, 2024

It would be nice to move the task submission behind a simple abstraction instead of the current if-ray/else branches, but I'm thinking of doing that as a follow-on since it may make the deltas here harder to review.
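
A minimal sketch of what such an abstraction could look like (the names are hypothetical, not the follow-up's actual interface):

```python
# Hypothetical executor abstraction: the engine calls one method and never
# branches on whether workers live in Ray actors or local processes.
from abc import ABC, abstractmethod
from typing import Any, Callable, List

class WorkerExecutor(ABC):
    @abstractmethod
    def execute_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        """Run fn on every worker and gather the results."""

class LocalExecutor(WorkerExecutor):
    """Trivial backend for the single-worker (TP=1) case."""
    def execute_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        return [fn(*args)]

# A RayExecutor or MultiprocessingExecutor would implement the same method by
# dispatching to ray.remote actors or to worker processes over pipes, keeping
# the "if ray / else" branching out of the engine itself.
```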

@njhill njhill force-pushed the ray-optional branch 2 times, most recently from 4728bcd to 60722ba Compare March 4, 2024 17:09
njhill and others added 5 commits March 4, 2024 15:25
njhill (Member, Author) commented Mar 5, 2024

@zhuohan123 @WoosukKwon @simon-mo WDYT about getting this one in? It has been working well for us in internal deployments.

joerunde pushed several commits to IBM/vllm that referenced this pull request on Mar 11 and Mar 12, 2024 (Co-authored-by: Sahil Suneja <suneja@us.ibm.com>; Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>)
njhill (Member, Author) commented Mar 18, 2024

@zhuohan123 I have replaced this with #3466 based on your new abstraction, PTAL!

@njhill njhill deleted the ray-optional branch May 15, 2024 22:46