
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD #5240

Merged
8 commits merged from gg/remove-max-devices into master on Jan 31, 2024

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Jan 31, 2024

Instead of the defines, use the functions:

  • llama_max_devices()
  • llama_supports_gpu_offload()
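
A minimal sketch of the resulting downstream usage, assuming the new llama.h declarations added by this PR (size_t llama_max_devices(void) and bool llama_supports_gpu_offload(void)):

```cpp
// Minimal sketch: querying the library at run time instead of relying on the
// former LLAMA_MAX_DEVICES / LLAMA_SUPPORTS_GPU_OFFLOAD preprocessor defines.
#include "llama.h"
#include <cstdio>

int main() {
    if (llama_supports_gpu_offload()) {
        std::printf("GPU offload supported, up to %zu device(s)\n", llama_max_devices());
    } else {
        std::printf("CPU-only build, no GPU offload\n");
    }
    return 0;
}
```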

@ggerganov ggerganov requested a review from slaren January 31, 2024 13:52
@slaren
Collaborator

slaren commented Jan 31, 2024

server.cpp also needs to be updated.

ggerganov and others added 2 commits January 31, 2024 16:13
Co-authored-by: slaren <slarengh@gmail.com>
@slaren
Collaborator

slaren commented Jan 31, 2024

I assume this is to be able to tell this value at run time, which is useful for example with dynamic linking. In this case, shouldn't LLAMA_SUPPORTS_GPU_OFFLOAD also be a function?
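
To illustrate the dynamic-linking case mentioned above: because these are now exported functions rather than compile-time defines, a host application can resolve them from a loaded libllama at run time. A hedged sketch, assuming a POSIX dlopen environment and a shared library named libllama.so:

```cpp
// Sketch only: resolving llama_max_devices() from a dynamically loaded libllama.
// The library name "libllama.so" is an assumption; adjust for your platform/build.
#include <dlfcn.h>
#include <cstdio>
#include <cstddef>

int main() {
    void * handle = dlopen("libllama.so", RTLD_LAZY);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    using max_devices_fn = size_t (*)(void);
    auto max_devices = reinterpret_cast<max_devices_fn>(dlsym(handle, "llama_max_devices"));
    if (!max_devices) {
        std::fprintf(stderr, "llama_max_devices not found: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    std::printf("llama_max_devices() = %zu\n", max_devices());
    dlclose(handle);
    return 0;
}
```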

@ggerganov
Owner Author

I assume this is to be able to tell this value at run time, which is useful for example with dynamic linking.

Yes, my main motivation is to reduce GPU-related conditionals in the header files. Nothing specific in mind; it just seems better this way.

@slaren
Collaborator

slaren commented Jan 31, 2024

There is still one LLAMA_SUPPORTS_GPU_OFFLOAD in common/train.cpp

@slaren
Collaborator

slaren commented Jan 31, 2024

If there are any downstream applications that depend on these defines to enable GPU offloading, they may break after this change, so it may be a good idea to put up a notice.

@ggerganov ggerganov changed the title llama : remove LLAMA_MAX_DEVICES from llama.h llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD Jan 31, 2024
@ggerganov ggerganov merged commit 5cb04db into master Jan 31, 2024
53 of 59 checks passed
@ggerganov ggerganov deleted the gg/remove-max-devices branch January 31, 2024 15:30
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
…ganov#5240)

* llama : remove LLAMA_MAX_DEVICES from llama.h

ggml-ci

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* server : remove LLAMA_MAX_DEVICES

ggml-ci

* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD

ggml-ci

* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD

* readme : add deprecation notice

* readme : change deprecation notice to "remove" and fix url

* llama : remove gpu includes from llama.h

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
@irthomasthomas

Hi, in the latest llama-cpp-python release (0.2.39) I get this error: ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1. Is that related to this change?

@hamza233

hamza233 commented Feb 7, 2024

Hi, in the latest llama-cpp-python release (0.2.39) I get this error: ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1. Is that related to this change?

Getting the same error.

@slaren
Collaborator

slaren commented Feb 8, 2024

It is probably related to this change. LLAMA_MAX_DEVICES has been replaced with llama_max_devices().

@cebtenzzre
Collaborator

It is probably related to this change. LLAMA_MAX_DEVICES has been replaced with llama_max_devices().

llama-cpp-python is calling llama_max_devices(), though; after all, it is written in Python and gets this value at runtime. I tested it myself with a CUBLAS build and got llama_cpp.LLAMA_MAX_DEVICES = 16.

@slaren
Collaborator

slaren commented Feb 8, 2024

It's weird that the error says LLAMA_MAX_DEVICES=1 instead of 0 or undefined. Maybe they were using a CPU or Metal build, and this is not really a bug.

@hamza233

hamza233 commented Feb 8, 2024

It's weird that the error says LLAMA_MAX_DEVICES=1 instead of 0 or undefined. Maybe they were using a CPU or Metal build, and this is not really a bug.

torch.cuda.device_count() returns 8.
I pass n_gpu_layers=-1 and I see llm_load_tensors: offloaded 81/81 layers to GPU in the verbose output. Inference also runs fine but is very slow, and I don't see any GPU being used with nvidia-smi.

I get ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1 when I pass tensor_split=[1] + [10]*7 in order to use all 8 GPUs.
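
For reference, a hedged C++ sketch of the constraint behind this error at the llama.h level: the tensor_split array passed in llama_model_params may not have more entries than llama_max_devices(), which reports 1 for CPU/Metal builds and 16 for CUDA builds, per the discussion above:

```cpp
// Sketch of the check behind the error: tensor_split length vs llama_max_devices().
// Uses the llama.h C API; the commented-out model load call is illustrative only.
#include "llama.h"
#include <cstdio>
#include <vector>

int main() {
    const size_t max_dev = llama_max_devices(); // e.g. 1 for CPU/Metal builds, 16 for CUDA builds
    const std::vector<float> split = {1, 10, 10, 10, 10, 10, 10, 10}; // intended for 8 GPUs

    if (split.size() > max_dev) {
        std::fprintf(stderr, "tensor_split has %zu entries but this build supports %zu device(s)\n",
                     split.size(), max_dev);
        return 1;
    }

    llama_model_params params = llama_model_default_params();
    params.tensor_split = split.data();
    params.n_gpu_layers = 999; // offload everything that fits
    // llama_model * model = llama_load_model_from_file("model.gguf", params);
    return 0;
}
```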

@slaren
Collaborator

slaren commented Feb 8, 2024

The current version of llama.cpp will always print the offloaded x/x layers to GPU message when using -ngl, even in CPU-only builds. To be sure that the GPU is actually being used, look at the buffer sizes; they should say something like CUDA0 instead of CPU.

@irthomasthomas

The only way I can get it to launch is to roll back llama-cpp-python to v0.2.37.

@thiner

thiner commented Feb 20, 2024

@slaren @cebtenzzre I have the same issue with the latest llama-cpp-python. Is there any workaround to bypass the LLAMA_MAX_DEVICES=0 issue?

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024