[Core] Multiprocessing executor for single-node multi-GPU deployment #3466
Conversation
Branch force-pushed from d55f701 to a1e1ede.
@zhuohan123 @simon-mo this one should be ready to go! 🙏
Branch force-pushed from f6f72a3 to 8a60fb8.
@@ -322,19 +213,17 @@ def _run_workers(
            method)(*driver_args, **driver_kwargs)

    # Get the results of the ray workers.
    if self.workers:
This `if` was removed because `workers` should always be non-empty here (i.e. if TP > 1).
I believe this `if` is indeed redundant. However, `workers` can indeed be empty, since the ray GPU executor can be run with TP=1.
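A schematic illustration of the point (not the actual executor code): when the worker list is empty, gathering worker results is simply a no-op, so no guard is needed.

```python
# Schematic only: the ray executor with TP=1 launches no extra workers.
workers = []                                   # empty worker list

# Dispatch step: a comprehension over [] simply produces [].
worker_outputs = [w for w in workers]

# Gather step: combining with the driver's result needs no special case.
driver_output = "driver result"                # placeholder value
all_outputs = [driver_output] + worker_outputs
print(all_outputs)                             # ['driver result']
```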
Branch force-pushed from 7232261 to 1169213.
tests/engine/test_local_worker.py (Outdated)
    return prompt_token_ids


@pytest.mark.skip("Requires multiple GPUs")
Our CI can test 2 GPUs; just set `num_gpus: 2` in https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml
@simon-mo I've actually removed the new test now and just parameterized the existing distributed test to include ray and non-ray.
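A hedged sketch of what such a parameterization could look like (the model choice, the test body, and the assumption that the engine exposes a `worker_use_ray` argument are illustrative, not the actual test):

```python
import pytest

@pytest.mark.parametrize("worker_use_ray", [True, False])
def test_tp_generation(worker_use_ray):
    # Requires 2 GPUs; exercises both the ray and multiprocessing backends.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="facebook/opt-125m",       # small model, illustrative choice
        tensor_parallel_size=2,
        worker_use_ray=worker_use_ray,   # assumes the engine exposes this kwarg
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=16))
    assert outputs and outputs[0].outputs[0].text
```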
Can we test some of the functionality from this file without LLM?
OK, these tests are now added in `local_worker_tests.py`.
vllm/executor/executor_base.py (Outdated)
deferring the review of the interface change to @zhuohan123
Branch force-pushed from efd61b2 to fb15723.
Thanks for the review @simon-mo! I've addressed all of your comments.
@zhuohan123 @simon-mo @WoosukKwon I just tried some cursory performance comparisons; I wasn't expecting the difference to be so significant. Surprisingly, Ray doesn't appear to give any latency benefit over a single GPU for the config I tried: 80GB A100s with llama-2-7b and the OpenAI completion API, a single request with 5 input tokens and 2000 generated tokens. I repeated each test request multiple times and the results were very consistent.
@njhill Wow! 30% is quite a bit (albeit serving llama-7b over 2 A100-80G probably doesn't really make sense in practice). I will do some testing on this branch in parallel with the serving benchmark and report back as well.
Thanks @ywang96, yes I'm sure this will be smaller in relative terms for larger models. But not bad given that performance improvement was not the purpose of this PR.
@njhill I did some preliminary testing on H100 TP2.
Server launch command:
Benchmark command:
With Ray workers:
This PR:
Thanks @ywang96, that's great! 5-6% lower TPOT is still nice to have! I am doing some spot tests on TP=4 llama-70b too.
For llama-2-70b with a single request (5 input / 1000 output tokens), the times I got are 32.3 before and 30.8 after, i.e. a 4-5% speedup.
I will test on A100-80G with Mixtral TP4 and TP8 just to see if 4-5% is likely the average speedup we get in general.
I see. If the mp-based approach is faster, it makes sense to change the default, I think. OOC, what's the delta between this and the ray default (for a single-node case) other than performance? I assume it supports the logging prefix, so maybe just the debugger and the ray dashboard?
@@ -0,0 +1,242 @@
import asyncio
QQ: this file is probably very specific to the mp-based executor. Should we create executor/util and move it there?
@rkooo567 yes, maybe the naming is bad, but that's what it's meant for; it's equivalent to `ray_utils.py`, which is in the same place.
+1. Let's probably move this file to `executor/`? And also move `ray_utils` to `executor/`?
I tested baichuan 13B TP=2 with/without cudagraph on A100; the performance of this PR and the main branch is the same. Did you test 13B models?
@rkooo567 yes I think that's all correct.
My suggestions:
Sorry for the delay Nick! Please find my comments below. I think in general this is a good feature that we should have. However, I don't quite understand what's going on in `vllm/engine/local_worker_utils.py`. Let's chat offline about this.
setup.py (Outdated)
def get_ray_requirement() -> Optional[Dict[str, List[str]]]:
    if _is_neuron():
        return None
    return {"ray": ["ray >= 2.9"]}
We have different `requirements.txt` files for different backends. Is this function necessary?
This is to make ray an optional extra, independent of the backend. Ray is currently supported for some backends and not others, and where it is supported the version requirement can differ, hence the conditionals here.
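For context, a rough sketch of how such an optional extra can plug into setup() (illustrative only; the real setup.py does more than this):

```python
from typing import Dict, List, Optional
from setuptools import setup

def _is_neuron() -> bool:
    # Placeholder for the real backend detection.
    return False

def get_ray_requirement() -> Optional[Dict[str, List[str]]]:
    if _is_neuron():
        return None
    return {"ray": ["ray >= 2.9"]}

setup(
    name="example-package",                      # illustrative name
    version="0.0.0",
    extras_require=get_ray_requirement() or {},  # enables `pip install example-package[ray]`
)
```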
tests/engine/test_local_worker.py (Outdated)
    return prompt_token_ids


@pytest.mark.skip("Requires multiple GPUs")
You can add a condition like

    @pytest.mark.skipif(torch.cuda.device_count() < 2,
                        reason="Need at least 2 GPUs to run the test.")

and put this test in the distributed tests?
@zhuohan123 I think somehow you weren't reviewing the latest version of the PR, but the last time this branch was updated was a week ago.
These tests no longer require GPUs; they test the multiprocessing mechanics in isolation. The existing distributed test is additionally parameterized to test both ray and non-ray.
if self.worker_use_ray is None:
    ray_found = importlib.util.find_spec("ray") is not None
    self.worker_use_ray = ray_found and self.world_size > 1
Can we check whether ray is successfully imported in `vllm/engine/ray_utils.py`?
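A minimal sketch of that suggestion, assuming ray_utils records whether the import succeeded (the `ray_is_available` name is an assumption, not the actual API):

```python
# In vllm/engine/ray_utils.py (sketch): record whether the import succeeded.
try:
    import ray
    _RAY_AVAILABLE = True
except ImportError:
    ray = None
    _RAY_AVAILABLE = False

def ray_is_available() -> bool:
    return _RAY_AVAILABLE

# The config could then use this instead of importlib.util.find_spec:
# self.worker_use_ray = ray_is_available() and self.world_size > 1
```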
# Use dedicated multiprocess context for workers.
# Both spawn and fork work
mp_method = os.getenv("MULTIPROC_METHOD", "fork")
Can we always use `spawn` here? I don't think there will be cases when `fork` will be better.
Sure. I think fork might be faster; I can test to see if it makes a non-negligible difference.
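For reference, a minimal sketch of a dedicated multiprocessing context driven by the MULTIPROC_METHOD switch quoted above (the worker function here is a placeholder):

```python
import multiprocessing
import os

def _worker(rank: int) -> None:
    print(f"worker {rank} running in pid {os.getpid()}")

if __name__ == "__main__":
    # "fork" is the current default in the diff; "spawn" also works.
    mp_method = os.getenv("MULTIPROC_METHOD", "fork")
    ctx = multiprocessing.get_context(mp_method)

    procs = [ctx.Process(target=_worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```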
        'automatically set when using more than 1 GPU')
    parser.add_argument(
        '--worker-use-ray',
        action=argparse.BooleanOptionalAction,
Why is this change needed?
@zhuohan123 it allows a boolean arg to have a default of `None`, so that we can differentiate between not set and explicitly set to false.
It looks like this may have been introduced in Python 3.9 though, so I guess it may need to be changed anyhow.
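A small self-contained sketch of the tri-state behaviour described above (requires Python 3.9+ for `argparse.BooleanOptionalAction`):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--worker-use-ray",
    action=argparse.BooleanOptionalAction,
    default=None,  # None = "not set", distinct from an explicit True/False
)

print(parser.parse_args([]).worker_use_ray)                       # None (unset)
print(parser.parse_args(["--worker-use-ray"]).worker_use_ray)     # True
print(parser.parse_args(["--no-worker-use-ray"]).worker_use_ray)  # False
```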
@@ -322,19 +213,17 @@ def _run_workers(
            method)(*driver_args, **driver_kwargs)

    # Get the results of the ray workers.
    if self.workers:
This refactor looks really good! Thanks for the work!
        self.result_handler.close()


class LocalWorkerVllm(mp.Process):
Why do we choose to inherit from `mp.Process` instead of just creating an `mp.Process` instance in the class? I'm not very familiar with this, but is it a standard practice?
I actually didn't write the first iteration of this PR... I think it's fairly standard, but it's probably more common/clearer to wrap an instance instead; I'll change this.
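For illustration, a hedged sketch of the composition alternative being discussed (class and method names here are made up for this example, not the PR's actual code):

```python
import multiprocessing as mp
from functools import partial

def _worker_loop(task_queue: "mp.Queue") -> None:
    # Runs in the child process: execute callables until a None sentinel.
    while True:
        task = task_queue.get()
        if task is None:
            break
        task()

class LocalWorker:
    """Wraps an mp.Process instance rather than subclassing mp.Process."""

    def __init__(self, ctx=mp):
        self.task_queue = ctx.Queue()
        self.process = ctx.Process(target=_worker_loop,
                                   args=(self.task_queue,),
                                   daemon=True)

    def start(self) -> None:
        self.process.start()

    def submit(self, fn) -> None:
        self.task_queue.put(fn)   # fn must be picklable

    def shutdown(self) -> None:
        self.task_queue.put(None)
        self.process.join()

if __name__ == "__main__":
    w = LocalWorker()
    w.start()
    w.submit(partial(print, "hello from worker"))
    w.shutdown()
```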
        _add_prefix(sys.stdout, process_name, pid)
        _add_prefix(sys.stderr, process_name, pid)

        del self.tasks  # Not used in forked process
Why do we need to explicitly delete this field here? Can we just not include this field in `__init__`?
`__init__` runs in the main/parent process (since that's where this class gets constructed), and the object state is copied to the forked/spawned process when `start()` is called. This task map is only used in the copy of this object in the main process; it's deleted here to make that clearer.
This will probably need to change though if we change to not subclass `multiprocessing.Process`.
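A tiny, hypothetical illustration of those copy semantics (not the PR's code): the child process gets its own copy of the state set in `__init__` when `start()` is called, so deleting the field in `run()` affects only the child.

```python
import multiprocessing as mp

class Demo(mp.Process):
    def __init__(self):
        super().__init__()
        self.tasks = {"pending": 0}   # used only by the parent process

    def run(self) -> None:
        # Executes in the child; this deletes only the child's copy.
        del self.tasks

if __name__ == "__main__":
    p = Demo()
    p.start()
    p.join()
    print(p.tasks)   # still {'pending': 0} in the parent
```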
Just a note on this: I use Ray to do multi-node batch inference with vLLM (on an 8x8 A10 setup). With models that fit in a single GPU it worked perfectly, but trying to initialize tensor-parallel models with Ray from within a Ray instance doesn't work. I think this solution is the only way to do multi-node batch inference with Ray orchestrating the nodes and multiprocessing handling tensor-parallel 70b inside each worker node. Thank you for this PR! Hope it gets merged soon.
My team at work has been looking for a way to do efficient autoregressive generation during LLM fine-tuning. We'd like to tap into the efficiency of vLLM, but so far haven't been able to run torch FSDP alongside vLLM on the same set of GPUs. The changes proposed in this pull request have resolved our issue. Thanks again Nick for the great work, and I'd love to see this pull request merged very soon.
@nivibilla @jacobthebanana I had some interruptions in the last few days but will make sure this lands this week (not this PR itself, but the ones that replaced it).
This was broken into smaller PRs, which have now all been merged; see #4539.
Ray is a powerful platform for general-purpose distributed computing, but it is potentially overkill for the specific requirements of real-time synchronized inference between GPUs on a single node.
We would prefer to have a "lightweight" option without the ray dependency for non-ray cluster environments. This also helps with production security compliance.
With the changes in this PR, ray will continue to be used for parallel workers if it's installed; otherwise, vanilla Python multiprocessing is used. This can also be overridden with `--no-worker-use-ray`. Worker processes are shut down when the LLMEngine is garbage collected.
This PR was co-authored by @sahilsuneja1.
This reworks the original PR #2898 to plug into the new distributed executor abstraction.
I've introduced a `MultiGPUExecutor` abstract superclass shared between the ray and vanilla multiprocessing implementations.
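A rough sketch of that structure (only `MultiGPUExecutor` is named above; the subclass and method names below are illustrative assumptions, not the actual vLLM code):

```python
from abc import ABC, abstractmethod
from typing import Any, List

class MultiGPUExecutor(ABC):
    """Shared logic for driving tensor-parallel workers on a single node."""

    @abstractmethod
    def _run_workers(self, method: str, *args: Any, **kwargs: Any) -> List[Any]:
        """Invoke `method` on the driver and all workers; return their results."""

class RayGPUExecutor(MultiGPUExecutor):
    def _run_workers(self, method: str, *args: Any, **kwargs: Any) -> List[Any]:
        raise NotImplementedError("dispatch via ray remote calls")

class MultiprocessingGPUExecutor(MultiGPUExecutor):
    def _run_workers(self, method: str, *args: Any, **kwargs: Any) -> List[Any]:
        raise NotImplementedError("dispatch via local worker processes")
```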