
[SYCL] fix error when set main gpu to non-zero #5901

Merged
merged 2 commits into ggerganov:master on Mar 7, 2024

Conversation

NeoZhangJianyu
Collaborator

Fix the error when the main GPU is set to a non-zero index.
Support setting single-GPU mode.

@NeoZhangJianyu changed the title from "fix error when set main gpu to non-zero" to "[SYCL] fix error when set main gpu to non-zero" on Mar 6, 2024
Collaborator

@airMeng airMeng left a comment


How can we ensure that main_gpu is always the most powerful one?

Comment on lines +3753 to +3759
#ifdef GGML_USE_SYCL
    if (split_mode == LLAMA_SPLIT_MODE_NONE) {
        ggml_backend_sycl_set_single_device(main_gpu);
        // SYCL uses device index (0, 1, 2) instead of device id.
        main_gpu = ggml_backend_sycl_get_device_index(main_gpu);
    }
#endif
Collaborator


Backends should not require calling backend-specific functions for normal usage. Is this ggml_backend_sycl_set_single_device function really necessary?

Collaborator


I guess the author wants to distinguish the iGPU and dGPU and offload to the dGPU if no device is specified, so the backend layer needs to query the SYCL backend for the device list and return the most powerful one. Do you have any suggestion?

Collaborator


Could the dGPU always be mapped to the device index zero? That way, it would be used by default in single GPU mode.

Collaborator


> Could the dGPU always be mapped to the device index zero? That way, it would be used by default in single GPU mode.

Yes, that should be the default. But we encountered #5513, where the user reported the dGPU mapped to index 3.

Collaborator


Ideally, the device indices that the SYCL backend uses for its buffers and backends would be automatically ordered by the most powerful GPUs available in the system, such that the lowest indices are the most powerful GPUs. If this is not possible or desirable, it would still be ok to add a function that returns the list of available GPUs ordered by the most powerful first. Then, in llama.cpp we can use that list as a lookup table to convert the device indices used in llama.cpp to the device indices passed to the SYCL backend. This translation between llama.cpp device indices and SYCL backend device indices could be implemented in the llama_default_buffer_type_offload function for the buffers, and during the backend instance creation in llama_new_context_with_model.
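A minimal sketch of that lookup-table translation, assuming the SYCL backend reports its device indices ordered most powerful first (the table contents and helper name below are illustrative, not existing llama.cpp code):

#include <vector>

// Hypothetical lookup table: position = llama.cpp device index,
// value = SYCL backend device index. It would be filled once at startup from
// a list the SYCL backend provides, ordered most powerful first, e.g.
// {3, 0, 1, 2} on a system where the dGPU shows up as SYCL device 3.
static std::vector<int> g_sycl_device_lut = {3, 0, 1, 2};

// Translate a llama.cpp device index into the SYCL device index. This would
// be applied in llama_default_buffer_type_offload for buffers and during
// backend creation in llama_new_context_with_model.
static int llama_to_sycl_device(int llama_device) {
    return g_sycl_device_lut.at(llama_device);
}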

Collaborator Author


The SYCL backend creates a GPU list of the most powerful GPUs during initialization.
When llama.cpp provides the split-mode and main-gpu parameters, the SYCL backend updates the GPU list:

  1. If split-mode is none, the GPU list is replaced with a new list that contains only the main-gpu device index (see the sketch below).
  2. If split-mode is layer or row, the GPU list is not changed.
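
A simplified sketch of that update, assuming the detected list holds SYCL device indices ordered most powerful first (illustrative only, not the actual ggml-sycl implementation):

#include <vector>

// Returns the device list the backend would use for the given split mode.
std::vector<int> update_device_list(const std::vector<int> & detected,
                                    int split_mode, int main_gpu) {
    const int SPLIT_MODE_NONE = 0; // mirrors LLAMA_SPLIT_MODE_NONE
    if (split_mode == SPLIT_MODE_NONE) {
        return { main_gpu }; // single-GPU mode: keep only the main-gpu index
    }
    return detected;         // layer/row split: the detected list is unchanged
}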

Collaborator Author


A better method is to ask the user to provide a GPU list to llama.cpp as a parameter. ggml would then create the GPU pool from that parameter, avoiding mistakes from automatic detection.
It would also allow supporting more features, like mixing GPUs, or mixing dGPU & iGPU.

If the GPU list includes only one GPU, the split mode is effectively none.

So the parameters would change:
from main-gpu + split-mode
to gpu-list + split-mode
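
For illustration only (a gpu-list flag is a proposal here, not an existing llama.cpp parameter), parsing such a list could look like this:

#include <sstream>
#include <string>
#include <vector>

// Parse a comma-separated device list such as "0,2" into {0, 2}.
// A single entry would effectively mean split mode none.
static std::vector<int> parse_gpu_list(const std::string & arg) {
    std::vector<int> devices;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        devices.push_back(std::stoi(item));
    }
    return devices;
}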

Collaborator


I still do not understand why it is necessary to add the function ggml_backend_sycl_set_single_device. I could understand using a function such as ggml_backend_sycl_get_device_index to translate device indices from llama.cpp to the device indices used by SYCL, but that should be done always, regardless of the split mode, in llama_default_buffer_type_offload and llama_new_context_with_model as I mentioned earlier.

I am also concerned that since this function seems to change the global state of the SYCL backend, it will prevent using multiple devices simultaneously by loading a different model on each device with a different llama_model instance, and doing inference on each device in a different thread simultaneously.

In the future, we will also support using different backends simultaneously, so that on a system with an NVIDIA device and an Intel device, for example, the CUDA and SYCL backends can be used at the same time with split mode layer. Adding these backend-specific functions will complicate the implementation of this functionality.

Collaborator


When using multiple devices with split mode layer or row, it is possible to exclude some devices by using the -ts parameter to set the split of a device to zero. For example, with -ts 1,0,1 only devices 0 and 2 will be used.
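
For example, a run that skips device 1 could look like this (the model path and layer count are placeholders):

./main -m models/model.gguf -ngl 99 -sm layer -ts 1,0,1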

Collaborator Author


The SYCL backend has two methods to get the GPU info:

  1. Automatically detect the most powerful GPUs. This is the default behavior in most cases, including the unit tests.
  2. Use the main-gpu parameter coming from llama.cpp. ggml_backend_sycl_set_single_device is used in this case only.

It only affects the GPU list in the SYCL backend as global state.
This happens before llama_default_buffer_type_offload, so it won't impact the subsequent model load process.

ggml-sycl.cpp: review comment resolved (outdated)
@NeoZhangJianyu NeoZhangJianyu merged commit ceca1ae into ggerganov:master Mar 7, 2024
60 checks passed
NeoZhangJianyu added a commit that referenced this pull request Mar 7, 2024
@NeoZhangJianyu
Collaborator Author

@slaren @airMeng
I made a mistake by merging this PR before the review was finished.
I have created the revert PR: #5918.
I will update the code according to your comments for a better implementation.

Thank you!

slaren pushed a commit that referenced this pull request Mar 7, 2024
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
* fix error when set main gpu to non-zero

* fix delete condition
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
NeoZhangJianyu added a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
4 participants