Fix num_gpus when TP > 1 #1852
Conversation
@Yard1 Should we also fix this: vllm/vllm/engine/async_llm_engine.py, line 304 in f07c1ce?
Yes, that line in AsyncEngine should be set in the same way @WoosukKwon
I see. It seems tricky then...
Perhaps make it a method of …
@Yard1 Can we somehow keep …
Have an independent function taking in the parallel config, then?
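A minimal sketch of such a free function, assuming `vllm.config.ParallelConfig`; the function name and the `fractional_gpu` parameter are illustrative, not the verbatim patch:

```python
from vllm.config import ParallelConfig


def ray_num_gpus(parallel_config: ParallelConfig,
                 fractional_gpu: float = 1.0) -> float:
    """Per-worker GPU reservation to hand to Ray.

    With tensor parallelism each worker must own a whole device so that
    NCCL sees distinct GPUs; a single worker may safely reserve a fraction.
    """
    if parallel_config.tensor_parallel_size > 1:
        return 1.0
    return fractional_gpu
```

Both `LLMEngine` and `AsyncLLMEngine` could then call this one helper, keeping the two code paths from drifting apart.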
In my opinion it's a regression compared to #1821. After further investigation, it seems that Ray makes CUDA fail with tensor parallelism if the total GPU reservation is less than 1 (which indeed makes no sense). Should we not raise an exception in that case? And let …
So it helps the user set a correct configuration and also lets them use fractions of a GPU with tensor parallelism.
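Along those lines, a hypothetical early check; the function name and error message are mine, not from the PR:

```python
def check_gpu_reservation(tensor_parallel_size: int, num_gpus: float) -> None:
    """Fail fast on a config Ray would accept but NCCL cannot run."""
    if tensor_parallel_size > 1 and num_gpus < 1:
        raise ValueError(
            f"Got num_gpus={num_gpus} with tensor_parallel_size="
            f"{tensor_parallel_size}; fractional GPU reservations can make "
            "Ray co-locate tensor-parallel workers on one device, which "
            "breaks NCCL. Reserve a whole GPU per worker, or use fractional "
            "GPUs only with tensor_parallel_size == 1.")
```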
So the problem here is that NCCL requires separate devices to operate, but Ray has no insight into that and will try to pack the placement group into as few GPUs as possible. If the number of GPUs Ray chooses is smaller than the tensor parallelism factor, NCCL will not work. There is no easy way to prevent that aside from just not using fractional GPUs if the tensor parallelism factor is greater than 1.
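To make the packing behavior concrete, here is a small sketch against Ray's placement group API (bundle sizes are illustrative; needs a Ray cluster with GPUs):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Two tensor-parallel workers, each reserving half a GPU. Ray is free to
# pack both 0.5-GPU bundles onto the SAME physical device, so the two
# workers can end up sharing one GPU and NCCL initialization fails.
pg_fractional = placement_group([{"GPU": 0.5}, {"GPU": 0.5}], strategy="PACK")

# Reserving a whole GPU per bundle forces distinct devices, which is why
# the fix pins num_gpus to 1 whenever tensor_parallel_size > 1.
pg_whole = placement_group([{"GPU": 1}, {"GPU": 1}], strategy="PACK")
ray.get(pg_whole.ready())
```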
@Yard1 I've fixed …
Fixes #1851