[SYCL] fix error when set main gpu to non-zero #5901
Backends should not require calling backend-specific functions for normal usage. Is this ggml_backend_sycl_set_single_device function really necessary?
I guess the author wants to distinguish the iGPU and dGPU and offload to the dGPU if no device is specified, so llama.cpp needs to query the SYCL backend for the device list and return the most powerful one. Do you have any suggestions?
Could the dGPU always be mapped to device index zero? That way, it would be used by default in single-GPU mode.
Yes, that should be the default. But we encountered #5513, where the user reported the dGPU mapped to index 3.
Ideally, the device indices that the SYCL backend uses for its buffers and backends would be automatically ordered by the most powerful GPUs available in the system, such that the lowest indices are the most powerful GPUs. If this is not possible or desirable, it would still be ok to add a function that returns the list of available GPUs ordered by the most powerful first. Then, in llama.cpp, we can use that list as a lookup table to convert the device indices used in llama.cpp to the device indices passed to the SYCL backend. This translation between llama.cpp device indices and SYCL backend device indices could be implemented in the llama_default_buffer_type_offload function for the buffers, and during the backend instance creation in llama_new_context_with_model.
The SYCL backend will create the GPU list with the most powerful GPUs first at initialization. When llama.cpp provides the split-mode and main-gpu parameters, the SYCL backend will update the GPU list.
A better method is to ask the user to provide a GPU list to llama.cpp as a parameter. ggml would just create the GPU pool from that parameter, avoiding the mistakes that automatic detection can make.
It also allows supporting more features, like mixing GPUs, or mixing a dGPU and an iGPU.
If the GPU list includes only one GPU, the split mode is effectively none.
So the parameters would change from main-gpu + split-mode to gpu-list + split-mode.
I still do not understand why it is necessary to add the function ggml_backend_sycl_set_single_device. I could understand using a function such as ggml_backend_sycl_get_device_index to translate device indices from llama.cpp to the device indices used by SYCL, but that should be done always, regardless of the split mode, in llama_default_buffer_type_offload and llama_new_context_with_model, as I mentioned earlier.
I am also concerned that since this function seems to change the global state of the SYCL backend, it will prevent using multiple devices simultaneously by loading a different model on each device with a different llama_model instance and doing inference on each device in a different thread at the same time.
In the future, we will also support using different backends simultaneously, so that, for example, on a system with an NVIDIA device and an Intel device, we will be able to use the CUDA and SYCL backends at the same time with split mode layer. Adding these backend-specific functions will complicate the implementation of this functionality.
When using multiple devices with split mode layer or row, it is possible to exclude some devices by using the -ts parameter to set the split of a device to zero. For example, with -ts 1,0,1 only devices 0 and 2 will be used.
The SYCL backend has two methods to get the GPU info; ggml_backend_sycl_set_single_device is used in this case only. It only impacts the GPU list in the SYCL backend as global state.
This action happens before llama_default_buffer_type_offload, so it won't impact the next model load process.