
Remove Ray for the dependency #208

Closed
lanking520 opened this issue Jun 22, 2023 · 12 comments

@lanking520

Using Ray here is considered overkill. You can easily create a multi-process distributed environment using torch.distributed or an MPI launcher. Internally, you can leverage NCCL or the MPI communication protocol for inter-process communication.

@michaelfeil
Contributor

I would disagree. Ray has extensive serving support and is extending towards LLMs and streaming.
Also, there are tools like https://ray-project.github.io/kuberay/components/operator/

Perhaps isolate Ray or make it optional.

@zhuohan123
Member

zhuohan123 commented Jun 22, 2023

You can run vLLM without Ray on a single GPU.

For distributed settings, vLLM has a centralized scheduler and memory manager, whose control messages need to be broadcast to all model workers. We find Ray the easiest library to do this with. If we used MPI, we would need to implement this centralized logic in an SPMD program, which is not a very natural choice.

We are open to an elegant solution that doesn't use Ray. If you are interested, please feel free to share your idea here and contribute to the repo!
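
For illustration, here is a minimal, hypothetical sketch (not vLLM's actual worker code) of the pattern described above: a central driver pushing Python control messages to workers implemented as Ray actors. The `Worker` class and message fields are made up for the example.

```
# Hypothetical sketch: a centralized driver sending Python "control messages"
# to workers via Ray actor calls; Ray serializes the dict transparently.
import ray

ray.init()

@ray.remote  # a real setup would request GPUs, e.g. @ray.remote(num_gpus=1)
class Worker:
    def execute_step(self, control_msg: dict) -> str:
        # control_msg is an ordinary Python dict delivered to this worker process.
        return f"ran step for {len(control_msg['seq_ids'])} sequences"

workers = [Worker.remote() for _ in range(2)]
msg = {"seq_ids": [0, 1, 2], "blocks_to_swap_in": {}, "blocks_to_swap_out": {}}
print(ray.get([w.execute_step.remote(msg) for w in workers]))
```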

@lanking520
Author

@michaelfeil I believe what you said has nothing to do with how we do tensor-parallel communication. Ray is a good package for starting a cross-node (cross-machine) cluster with shared memory space, but it is not ideal for a standalone machine with multiple GPUs. Most of the current standard tensor-parallel libraries either use MPI, torch.distributed, or simply launch multiple processes to start inference. For example:

  • DeepSpeed launcher: just spins up multiple processes with a socket listener; internally it uses NCCL for communication
  • FasterTransformer: uses MPI as the external launcher, with NCCL inside for communication
  • PiPPy: uses torch.distributed as the launcher (pure NCCL environment)

Having a strong tie to a distributed framework is generally not a good idea; it introduces too much overhead and too many unused dependencies.

@michaelfeil
Contributor

@lanking520 I missed that point and misread you a bit. Thanks for the additional explanation.

@WoosukKwon
Collaborator

@lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass some metadata ("control messages") from the scheduler to the workers. The metadata are basically Python objects like lists and dictionaries stored in CPU memory. We found it convenient to use Ray for sending this metadata to the workers. To my understanding, it's not easy to use MPI or torch.dist to send these Python objects to workers, is it?

@lanking520
Author

lanking520 commented Jun 22, 2023

@WoosukKwon MPI should be able to handle that easily if you want; the mpi4py Python package allows you to send serialized objects. torch.dist can only pass tensors, which makes it a bit harder to do.

https://mpi4py.readthedocs.io/en/stable/overview.html#communicating-python-objects-and-array-data

For torch.dist, if you serialize the object to bytes and pass it through as a tensor, this will also work, but the CPU-to-GPU and GPU-to-CPU copies are somewhat expensive.

Given that in your case you want the message to stay on the CPU, just using the MPI interface should work.

You can use MPI to spin up the distributed environment while enabling torch.dist("nccl") for communication, so you will have GPU-to-GPU and CPU-to-CPU communication at the same time.
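
As a rough illustration of the mpi4py route described above, here is a minimal sketch (the message fields and script name are hypothetical): the lower-case `bcast` pickles arbitrary Python objects, while tensor traffic could still go over NCCL via torch.distributed.

```
# Hypothetical sketch: broadcast a Python dict ("control message") over MPI.
# Run with e.g.: mpirun -np 2 python control_msg_bcast.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Only the scheduler (rank 0) builds the control message.
control_msg = {"seq_ids": [0, 1, 2], "blocks_to_swap_in": {}} if rank == 0 else None

# Lower-case bcast pickles arbitrary Python objects; upper-case Bcast is for buffers.
control_msg = comm.bcast(control_msg, root=0)
print(f"rank {rank} received {control_msg}")
```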

@lanking520
Author

If it is purely cross-process CPU communication, you can also use shared memory with access control:

https://docs.python.org/3/library/multiprocessing.shared_memory.html

You can launch the instance by simply spinning up multiple processes, or by using MPI or torch.dist; all should work. Even launching with Ray would work in this case... but it is a bit harder to manage the memory, since there is the potential for illegal read/write access.
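
A minimal sketch of the shared-memory idea, with hypothetical message contents; a real setup would also need to coordinate the payload size and block lifetime between processes, which is the maintenance burden mentioned above.

```
# Hypothetical sketch: pass a pickled control message through a named SharedMemory block.
import pickle
from multiprocessing import shared_memory

payload = pickle.dumps({"seq_ids": [0, 1], "blocks_to_copy": {}})

# Writer side: create a named block and copy the serialized message into it.
shm = shared_memory.SharedMemory(create=True, size=len(payload), name="ctrl_msg")
shm.buf[: len(payload)] = payload

# Reader side (normally another process): attach by name and deserialize.
reader = shared_memory.SharedMemory(name="ctrl_msg")
received = pickle.loads(bytes(reader.buf[: len(payload)]))
print(received)

reader.close()
shm.close()
shm.unlink()
```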

@tchaton

tchaton commented Jun 22, 2023

> @lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass some metadata ("control messages") from the scheduler to the workers. The metadata are basically Python objects like lists and dictionaries stored in CPU memory. We found it convenient to use Ray for sending this metadata to the workers. To my understanding, it's not easy to use MPI or torch.dist to send these Python objects to workers, is it?

PyTorch distributed supports broadcasting pickled Python objects like lists, etc.; Ray isn't required for that.
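
For example, a minimal sketch of broadcasting a Python object with torch.distributed over the CPU-side gloo backend (the message contents and script name are made up):

```
# Hypothetical sketch: broadcast a Python dict with torch.distributed (gloo backend).
# Launch with e.g.: torchrun --nproc_per_node=2 control_msg_broadcast.py
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

# Rank 0 fills the list; other ranks pass a placeholder of the same length.
objects = [{"seq_ids": [0, 1, 2], "blocks_to_swap_in": {}}] if rank == 0 else [None]

dist.broadcast_object_list(objects, src=0)
print(f"rank {rank} got {objects[0]}")

dist.destroy_process_group()
```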

@WoosukKwon
Collaborator

Thanks for the advice @lanking520! We will take that into account. Currently, we are focusing on fixing bugs & adding requested models. After these are addressed, we will look into the alternatives you suggested, and see whether they will improve the UX in vLLM.

WoosukKwon added the enhancement (New feature or request) label on Jun 24, 2023
@soloice

soloice commented Nov 28, 2023

> @lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass some metadata ("control messages") from the scheduler to the workers. The metadata are basically Python objects like lists and dictionaries stored in CPU memory. We found it convenient to use Ray for sending this metadata to the workers. To my understanding, it's not easy to use MPI or torch.dist to send these Python objects to workers, is it?

I guess you can do this with torch.distributed.broadcast_object_list or torch.distributed.gather_object?

@njhill
Member

njhill commented Feb 17, 2024

We've now opened a PR to add support for this #3466

hmellor added the feature request label and removed the enhancement (New feature or request) label on Mar 15, 2024
@hmellor
Collaborator

hmellor commented May 18, 2024

Closed by #4539

hmellor closed this as completed on May 18, 2024
jikunshang pushed a commit to jikunshang/vllm that referenced this issue Sep 6, 2024
…llm-project#208)

Addressing issues from HabanaAI#207
Now, filtering behavior is more verbose, handling common errors and displaying the number of buckets omitted due to the token budget (at debug log level, the omitted buckets are printed):

```
INFO 08-27 20:57:27 profiler.py:62] Profiler enabled for: vllm-instance-1ab4f6c4d726480d8825044cf74e9af1
WARNING 08-27 20:57:27 utils.py:566] Pin memory is not supported on HPU.
INFO 08-27 20:57:27 selector.py:85] Using HabanaAttention backend.
INFO 08-27 20:57:27 habana_model_runner.py:563] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 08-27 20:57:27 habana_model_runner.py:576] Generated 23 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
INFO 08-27 20:57:27 habana_model_runner.py:581] Omitted 33 prompt buckets due to exceeded token budget (max_num_batched_tokens=2048)
INFO 08-27 20:57:27 habana_model_runner.py:589] Decode bucket config (min, step, max_warmup) bs:[1, 128, 256], seq:[128, 128, 2048]
INFO 08-27 20:57:27 habana_model_runner.py:600] Generated 31 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
INFO 08-27 20:57:27 habana_model_runner.py:605] Omitted 113 decode buckets due to exceeded token budget (max_num_batched_tokens=2048)
```

Legacy mode was also added, which throws a nasty error message whenever the token budget is set too low, but then skips filtering and works as it did previously (run with ``VLLM_DECODE_BS_BUCKET_MIN=128 VLLM_DECODE_SEQ_BUCKET_MIN=1024 python vllm_test.py --max-num-batched-tokens=2048``):

```
INFO 08-27 21:01:02 profiler.py:62] Profiler enabled for: vllm-instance-51f60d3978d347e992436f1dc0aa4702
WARNING 08-27 21:01:02 utils.py:566] Pin memory is not supported on HPU.
INFO 08-27 21:01:02 selector.py:85] Using HabanaAttention backend.
INFO 08-27 21:01:02 habana_model_runner.py:563] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 08-27 21:01:02 habana_model_runner.py:576] Generated 23 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
INFO 08-27 21:01:02 habana_model_runner.py:581] Omitted 33 prompt buckets due to exceeded token budget (max_num_batched_tokens=2048)
INFO 08-27 21:01:02 habana_model_runner.py:589] Decode bucket config (min, step, max_warmup) bs:[128, 128, 256], seq:[1024, 128, 2048]
ERROR 08-27 21:01:02 habana_model_runner.py:128] The current bucketing configuration (min, step, max_warmup): bs:[128, 128, 256], seq:[1024, 128, 2048] cannot be used with specified max_num_batched_tokens (2048), as the smallest bucket (16384) would exceed token budget. Please increase max_num_batched_tokens or decrease bucket minimum Ignoring max_num_batched_tokens at risk of out-of-memory errors.
INFO 08-27 21:01:02 habana_model_runner.py:600] Generated 32 decode buckets: [(128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152), (128, 1280), (128, 1408), (128, 1536), (128, 1664), (128, 1792), (128, 1920), (128, 2048), (256, 128), (256, 256), (256, 384), (256, 512), (256, 640), (256, 768), (256, 896), (256, 1024), (256, 1152), (256, 1280), (256, 1408), (256, 1536), (256, 1664), (256, 1792), (256, 1920), (256, 2048)]
INFO 08-27 21:01:02 habana_model_runner.py:605] Omitted 0 decode buckets due to exceeded token budget (max_num_batched_tokens=2048)
```
mht-sharma pushed a commit to mht-sharma/vllm that referenced this issue Oct 30, 2024
* Enable RPD for single/multi gpu

Co-authored-by: AdrianAbeyta <adrian.abeyta@amd.com>

* Add rpd build instructions to Dockerfile.rocm

* Handle env path

* Fix code errors

* Move RPD based profiling over to profiling folder

* use envs vs os.getenv

---------

Co-authored-by: AdrianAbeyta <adrian.abeyta@amd.com>