[SYCL] fix error when set main gpu to non-zero #5901
Conversation
How can we ensure that the main_gpu is always the most powerful one?
    #ifdef GGML_USE_SYCL
        if (split_mode == LLAMA_SPLIT_MODE_NONE) {
            ggml_backend_sycl_set_single_device(main_gpu);
            // SYCL uses a device index (0, 1, 2, ...) instead of a device id.
            main_gpu = ggml_backend_sycl_get_device_index(main_gpu);
        }
    #endif
Backends should not require calling backend-specific functions for normal usage. Is this ggml_backend_sycl_set_single_device function really necessary?
I guess the author wants to distinguish between iGPU and dGPU and offload to the dGPU when no device is specified, so llama.cpp would need to query the SYCL backend for the device list and pick the most powerful one. Do you have any suggestions?
Could the dGPU always be mapped to the device index zero? That way, it would be used by default in single GPU mode.
> Could the dGPU always be mapped to the device index zero? That way, it would be used by default in single GPU mode.
Yes, that should be the default. But we encountered #5513, where the user reported the dGPU mapped to index 3.
Ideally, the device indices that the SYCL backend uses for its buffers and backends would be automatically ordered by the most powerful GPUs available in the system, such that the lowest indices are the most powerful GPUs. If this is not possible or desirable, it would still be ok to add a function that returns the list of available GPUs ordered by the most powerful first. Then, in llama.cpp, we can use that list as a lookup table to convert the device indices used in llama.cpp to the device indices passed to the SYCL backend. This translation between llama.cpp device indices and SYCL backend device indices could be implemented in the llama_default_buffer_type_offload function for the buffers, and during backend instance creation in llama_new_context_with_model.
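A minimal sketch of that lookup-table translation, assuming a hypothetical helper ggml_backend_sycl_get_device_list() that returns SYCL device indices ordered most powerful first (the helper and the stubbed ordering below are illustrative, not the actual API):

    #include <vector>
    #include <cstdio>

    // Assumed helper, not a real ggml API: returns SYCL device indices
    // ordered with the most powerful GPU first. A real implementation would
    // query the SYCL runtime; this stub mirrors the case from #5513, where
    // the dGPU showed up at SYCL index 3.
    static std::vector<int> ggml_backend_sycl_get_device_list() {
        return {3, 0, 1, 2};
    }

    // Lookup table: llama.cpp device index -> SYCL backend device index.
    static const std::vector<int> device_lut = ggml_backend_sycl_get_device_list();

    static int llama_to_sycl_device(int llama_device) {
        return device_lut.at(llama_device);
    }

    int main() {
        // llama.cpp device 0 maps to the most powerful GPU, SYCL index 3.
        std::printf("llama device 0 -> SYCL device %d\n", llama_to_sycl_device(0));
    }

Under this scheme, llama_default_buffer_type_offload and llama_new_context_with_model would call llama_to_sycl_device() on every device index they pass to the SYCL backend, regardless of split mode.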
The SYCL backend will create the GPU list with the most powerful GPUs at initialization.
When llama.cpp provides the split-mode and main-gpu parameters, the SYCL backend will update the GPU list (see the sketch below):
- If split-mode is none, the GPU list is replaced by a new list that includes only the main-gpu device index.
- If split-mode is layer or row, the GPU list is not changed.
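A sketch of that update logic; sycl_gpu_list stands in for the backend's internal device list, and the names here are illustrative rather than the actual SYCL backend code:

    #include <vector>

    enum split_mode_t { SPLIT_MODE_NONE, SPLIT_MODE_LAYER, SPLIT_MODE_ROW };

    // Internal device list, most powerful GPUs first after initialization.
    static std::vector<int> sycl_gpu_list = {3, 0, 1, 2};

    static void sycl_update_gpu_list(split_mode_t split_mode, int main_gpu) {
        if (split_mode == SPLIT_MODE_NONE) {
            sycl_gpu_list = {main_gpu};  // keep only the main-gpu device
        }
        // layer/row: the list is left unchanged
    }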
A better method would be to ask the user to provide a GPU list to llama.cpp as a parameter. ggml would then create the GPU pool from that parameter, which avoids the mistakes that automatic detection can make.
It would also allow supporting more features, like mixing GPUs, or mixing dGPU and iGPU.
If the GPU list includes only one GPU, the split mode is effectively none.
So the parameters would change:
from main-gpu + split-mode
to gpu-list + split-mode
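Under this proposal, an invocation might look like the following (the --gpu-list flag is hypothetical and does not exist in llama.cpp; it is shown only to illustrate the idea):

    ./main -m model.gguf --gpu-list 0,2 -sm layer   # pool devices 0 and 2
    ./main -m model.gguf --gpu-list 1               # a single entry implies split mode none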
I still do not understand why it is necessary to add the function ggml_backend_sycl_set_single_device. I could understand using a function such as ggml_backend_sycl_get_device_index to translate device indices from llama.cpp to the device indices used by SYCL, but that should be done always, regardless of the split mode, in llama_default_buffer_type_offload and llama_new_context_with_model, as I mentioned earlier.
I am also concerned that since this function seems to change the global state of the SYCL backend, it will prevent using multiple devices simultaneously by loading a different model on each device with a different llama_model instance, and doing inference on each device in a different thread simultaneously.
In the future, we will also support using different backends simultaneously, so that, for example, a system with an NVIDIA device and an Intel device will be able to use the CUDA and SYCL backends at the same time with split mode layer. Adding these backend-specific functions will complicate the implementation of this functionality.
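For reference, a sketch of the usage pattern that a backend-global device setting would break, written against the llama.h API (model paths and device indices are placeholders):

    #include "llama.h"
    #include <thread>

    int main() {
        llama_backend_init();

        auto load_on = [](const char * path, int device) {
            llama_model_params mparams = llama_model_default_params();
            mparams.split_mode = LLAMA_SPLIT_MODE_NONE;
            mparams.main_gpu   = device;
            return llama_load_model_from_file(path, mparams);
        };

        // If the backend keeps the selected device as global state, the second
        // load clobbers the device selection made for the first model.
        llama_model * m0 = load_on("model-a.gguf", 0);
        llama_model * m1 = load_on("model-b.gguf", 1);

        std::thread t0([&] { /* create a context on m0 and run inference */ });
        std::thread t1([&] { /* create a context on m1 and run inference */ });
        t0.join();
        t1.join();

        llama_free_model(m0);
        llama_free_model(m1);
        llama_backend_free();
    }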
When using multiple devices with split mode layer or row, it is possible to exclude some devices by using the -ts parameter to set the split of a device to zero. For example, with -ts 1,0,1 only devices 0 and 2 will be used.
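For instance, a run that splits by layer across devices 0 and 2 while skipping device 1 might look like this (model path and layer count are placeholders):

    ./main -m model.gguf -ngl 99 -sm layer -ts 1,0,1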
The SYCL backend has two methods to get the GPU info:
- automatically detect the most powerful GPUs. This is the default behavior in most cases, including the unit tests.
- follow the main-gpu parameter coming from llama.cpp. ggml_backend_sycl_set_single_device is used only in this case.
It only impacts the GPU list in the SYCL backend as global state. This action happens before llama_default_buffer_type_offload, so it won't impact the next model load process.
* fix error when set main gpu to non-zero
* fix delete condition
…" (ggerganov#5918) This reverts commit ceca1ae.
Fix the error when setting the main GPU to a non-zero index.
Add support for single-GPU mode.