Remove Ray for the dependency #208
Comments
I would disagree. Ray has extensive serving support and extends towards LLMs and streaming. Perhaps isolate Ray or make it optional.
You can run vLLM without Ray on a single GPU. For distributed settings, vLLM has a centralized scheduler and memory manager, whose control messages need to be broadcast to all model workers. We find Ray the easiest library for doing so. If we used MPI, we would need to implement this centralized logic in an SPMD program, which is not a very natural choice. We are open to an elegant solution that does not use Ray. If you are interested, please feel free to share the idea here and contribute to the repo!
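For readers unfamiliar with the pattern being described, here is a minimal sketch of how a central driver can "broadcast" CPU-side Python metadata to workers via Ray actors. The `Worker` class, `execute_step` method, and metadata fields are illustrative placeholders, not vLLM's actual classes.

```python
# Sketch only: a central driver sends Python control messages to Ray actor
# workers; Ray handles serialization of the CPU-side objects transparently.
import ray

ray.init()

@ray.remote
class Worker:
    def execute_step(self, metadata):
        # metadata arrives as an ordinary Python dict, deserialized by Ray.
        return f"ran a step over {len(metadata['seq_lens'])} sequences"

# In a real engine each worker would also hold a GPU model shard; omitted here.
workers = [Worker.remote() for _ in range(2)]
metadata = {"seq_lens": [16, 32], "block_tables": {0: [1, 2, 3]}}

# "Broadcast" the control message by calling every worker with the same object.
results = ray.get([w.execute_step.remote(metadata) for w in workers])
print(results)
```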
@michaelfeil I believe what you said has nothing to do with how we do tensor parallel communication. Ray is a good package for starting a cross-node (machine) cluster with a shared object space in memory, but it is not ideal for a standalone machine with multiple GPUs. Most of the current standard tensor parallel libraries use either MPI, torch.distributed, or simply launch multiple processes to start inference. For example
Having a strong tie to a distributed framework is generally not a good idea. It introduces too much overhead and too many unused dependencies.
@lanking520 I missed that point and misunderstood you a bit. Thanks for the additional explanation.
@lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass some metadata ("control messages") from the scheduler to the workers. The metadata are basically Python objects like lists and dictionaries stored in CPU memory. We found it convenient to use Ray for sending this metadata to the workers. To my understanding, it's not easy to use MPI or torch.distributed to send these Python objects to workers, is it?
@WoosukKwon MPI should be able to do that easily if you want; see https://mpi4py.readthedocs.io/en/stable/overview.html#communicating-python-objects-and-array-data. With torch.distributed, if you serialize the object to bytes and pass it through a tensor, that will also work, but the CPU-to-GPU and GPU-to-CPU copies are fairly expensive. Given that you want the message to stay on the CPU, just using the MPI interface for that access should work. You can use MPI to spin up the distributed environment while enabling torch.distributed with the "nccl" backend for communication, so you have GPU-to-GPU and CPU-to-CPU communication at the same time.
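A minimal mpi4py sketch of this suggestion, assuming a launch via mpirun; the metadata fields are made up for illustration:

```python
# Broadcast a CPU-side Python object (the "control message") from rank 0 to
# all ranks, while tensor traffic could go over a separate NCCL group.
# Run with e.g.: mpirun -np 2 python bcast_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

metadata = None
if rank == 0:
    # Any picklable Python object works; these fields are illustrative.
    metadata = {"seq_lens": [16, 32], "block_tables": {0: [1, 2, 3]}}

# The lowercase bcast pickles/unpickles generic Python objects.
metadata = comm.bcast(metadata, root=0)
print(f"rank {rank} received {metadata}")
```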
If it is purely cross-process CPU communication, you can also use shared-memory access via https://docs.python.org/3/library/multiprocessing.shared_memory.html. You can launch the instances by simply spinning up multiple processes, or with MPI, or with torch.distributed; all should work. Even launching with Ray will work in this case. But it is a bit harder to maintain the memory, since there is potential for illegal read/write access.
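A rough sketch of the shared-memory idea. Synchronization and lifetime management are deliberately omitted, which is exactly the maintenance burden mentioned above; all names are illustrative.

```python
# Serialize the control message once and let a worker process read it from a
# SharedMemory block. No locking is shown; real code needs careful cleanup.
import pickle
from multiprocessing import Process, shared_memory

def worker(shm_name: str, size: int):
    shm = shared_memory.SharedMemory(name=shm_name)
    metadata = pickle.loads(bytes(shm.buf[:size]))
    print("worker got", metadata)
    shm.close()

if __name__ == "__main__":
    payload = pickle.dumps({"seq_lens": [16, 32]})
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload

    p = Process(target=worker, args=(shm.name, len(payload)))
    p.start()
    p.join()

    shm.close()
    shm.unlink()
```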
PyTorch distributed supports broadcasting pickled Python objects like lists, etc. Ray isn't required for that.
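A small sketch of that route using torch.distributed.broadcast_object_list over a CPU-only Gloo group, assuming a torchrun launch; the payload is illustrative:

```python
# Broadcast pickled Python objects over a Gloo (CPU) process group, no Ray
# and no CPU<->GPU copies involved.
# Run with e.g.: torchrun --nproc_per_node=2 broadcast_demo.py
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # CPU-side control-plane group
rank = dist.get_rank()

# broadcast_object_list pickles arbitrary Python objects under the hood.
objs = [{"seq_lens": [16, 32], "step": 7}] if rank == 0 else [None]
dist.broadcast_object_list(objs, src=0)
print(f"rank {rank} received {objs[0]}")

dist.destroy_process_group()
```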
Thanks for the advice @lanking520! We will take that into account. Currently, we are focusing on fixing bugs & adding requested models. After these are addressed, we will look into the alternatives you suggested, and see whether they will improve the UX in vLLM. |
I guess you can do this with |
We've now opened a PR to add support for this #3466 |
Closed by #4539 |
Using Ray here is considered to be overkill. You can easily create a multi-process distributed environment using torch.distributed or an MPI launch. Internally, you can leverage the NCCL or MPI communication protocols for inter-process communication.
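For completeness, a hedged sketch of the setup the issue proposes: torchrun starts the processes, an NCCL group carries GPU tensor traffic, and a Gloo group carries CPU-side control messages. This is only an illustration of the proposal, not vLLM's implementation.

```python
# Ray-free launch sketch: NCCL for GPU<->GPU collectives, Gloo for CPU<->CPU
# control messages, both inside the same torchrun-launched processes.
# Run with e.g.: torchrun --nproc_per_node=2 no_ray_demo.py (needs 2 GPUs)
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl")       # GPU data-plane group
cpu_group = dist.new_group(backend="gloo")    # CPU control-plane group
rank = dist.get_rank()

# GPU data path: all-reduce a tensor over NCCL.
t = torch.ones(4, device="cuda") * rank
dist.all_reduce(t)

# CPU control path: broadcast a Python object over Gloo.
msg = [{"step": 1}] if rank == 0 else [None]
dist.broadcast_object_list(msg, src=0, group=cpu_group)
print(f"rank {rank}: tensor={t.tolist()}, msg={msg[0]}")

dist.destroy_process_group()
```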