Make ray optional for single-node deployment #2898
Conversation
This would be a great feature, as I am unable to run vLLM + Mixtral on Triton Inference Server because the Ray workers run into OOM issues.
@akrish2011 Ray should not impact memory usage - I'd wager something is misconfigured in your case.
@Yard1 I had to override Triton's config so that the GPUs are made available to the Ray workers. I made Triton Inference Server run on CPU so that the machine's GPUs could be consumed by the Ray workers. This didn't happen when I was running my application on a single GPU on Triton: no Ray Serve was being used, and vLLM was able to utilize the GPU exposed by Triton Inference Server. It is good practice to remove dependencies like Ray Serve when running LLMs in production; if vLLM depends on Ray Serve just for communication or broadcasting messages, that could become a problem as it gets more widely used.
Just to clarify, Ray Serve is not used in vLLM - Ray Core (the low level API) is. |
When a single GPU is used, Ray is also not used (it's only used when TP > 1, IIUC). @akrish2011 do you mind giving me a little more detail? Besides this PR, I'd like to understand what config needed to be overridden.
I can confirm that Ray is not used when TP=1; I've verified this in profiling work I've done on vLLM.
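As a concrete illustration (standard vLLM usage, not code from this PR; the model name is just a placeholder), a single-GPU run with tensor_parallel_size=1 has only one worker, so no distributed backend is involved at runtime:

```python
# Illustrative only: with tensor_parallel_size=1 there is a single worker,
# so no Ray cluster or Ray tasks are needed at runtime.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
params = SamplingParams(max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```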
It would be nice to move task submission behind a simple abstraction instead of the current if-ray/else branches, but I'm thinking of doing that as a follow-on since it may make the deltas here harder to review.
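For context, here is a minimal sketch of what such an abstraction could look like. The class and method names are hypothetical and do not match vLLM's actual implementation; the point is that task submission goes through one interface and the Ray-vs-multiprocessing decision is made once at construction time rather than at every call site.

```python
# Hypothetical sketch of an executor abstraction; names are illustrative.
from abc import ABC, abstractmethod
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable, List


class WorkerExecutor(ABC):
    """Runs a callable on every worker and gathers the results."""

    @abstractmethod
    def run_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        ...


class MultiprocessingExecutor(WorkerExecutor):
    """Single-node backend built on the standard library only."""

    def __init__(self, num_workers: int) -> None:
        self._num_workers = num_workers
        self._pool = ProcessPoolExecutor(max_workers=num_workers)

    def run_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        futures = [self._pool.submit(fn, *args) for _ in range(self._num_workers)]
        return [f.result() for f in futures]


class RayExecutor(WorkerExecutor):
    """Backend that dispatches the same calls as Ray tasks."""

    def __init__(self, num_workers: int) -> None:
        import ray  # imported lazily so Ray stays an optional dependency
        ray.init(ignore_reinit_error=True)
        self._ray = ray
        self._num_workers = num_workers

    def run_on_workers(self, fn: Callable[..., Any], *args: Any) -> List[Any]:
        remote_fn = self._ray.remote(fn)
        refs = [remote_fn.remote(*args) for _ in range(self._num_workers)]
        return self._ray.get(refs)
```

Callers would then hold a single WorkerExecutor and never branch on whether Ray is installed.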
@zhuohan123 @WoosukKwon @simon-mo WDYT about getting this one in? It has been working well for us in internal deployments.
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Sahil Suneja <suneja@us.ibm.com>
@zhuohan123 I have replaced this with #3466 based on your new abstraction, PTAL!
Ray is a powerful platform for general-purpose distributed computing, but potentially overkill for the specific requirements of real-time synchronized inferencing between GPUs on a single node.
We would prefer to have a "lightweight" option without the Ray dependency for non-Ray cluster environments. This also helps with production security compliance.
With the changes in this PR, Ray will continue to be used by default for parallel workers if it is installed; otherwise vanilla Python multiprocessing is used. This can also be overridden with --no-worker-use-ray. Worker processes are shut down when the LLMEngine is garbage collected (see the sketch below).
This PR was co-authored by @sahilsuneja1.
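To make the described behavior concrete, here is a rough sketch of the selection and cleanup logic. All names here (ray_is_available, LocalWorkerGroup, Engine, worker_main) are hypothetical stand-ins, not the PR's actual code: prefer Ray when it is importable and not explicitly disabled, otherwise fall back to plain Python multiprocessing, and tie worker shutdown to garbage collection of the engine object.

```python
# Hypothetical sketch of the fallback behavior; not vLLM's actual code.
import importlib.util
import multiprocessing as mp
import weakref
from typing import Callable, Optional


def ray_is_available() -> bool:
    return importlib.util.find_spec("ray") is not None


class LocalWorkerGroup:
    """Plain-multiprocessing stand-in for Ray workers on a single node."""

    def __init__(self, world_size: int, worker_main: Callable[[int, int], None]) -> None:
        ctx = mp.get_context("spawn")
        self.procs = [
            ctx.Process(target=worker_main, args=(rank, world_size), daemon=True)
            for rank in range(world_size)
        ]
        for proc in self.procs:
            proc.start()

    def shutdown(self) -> None:
        for proc in self.procs:
            proc.terminate()
            proc.join()


class Engine:
    """Toy engine showing the selection logic; not vLLM's LLMEngine."""

    def __init__(self, world_size: int, worker_main: Callable[[int, int], None],
                 worker_use_ray: Optional[bool] = None) -> None:
        # None means "auto": use Ray only if it is installed
        # (a flag like --no-worker-use-ray would map to worker_use_ray=False).
        use_ray = ray_is_available() if worker_use_ray is None else worker_use_ray
        if use_ray:
            import ray
            ray.init(ignore_reinit_error=True)
            self.worker_group = None  # Ray actors would be created here instead
        else:
            self.worker_group = LocalWorkerGroup(world_size, worker_main)
            # Shut the workers down when this engine is garbage collected,
            # mirroring the "shut down on LLMEngine GC" behavior described above.
            weakref.finalize(self, self.worker_group.shutdown)
```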