
Vulkan: Don't default to CPU device (like llvmpipe) #14099


Merged: 1 commit merged into master on Jun 10, 2025

Conversation

0cc4m (Collaborator) commented Jun 10, 2025

This should fix containers/ramalama#1479

llvmpipe can still be used by setting GGML_VK_VISIBLE_DEVICES to override the automatic device selection. This may now be required to allow the GitHub CI test-backend-ops job to run for Vulkan.
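
For a rough idea of what such an override involves, here is a minimal sketch (hypothetical, not the actual ggml-vulkan code; the helper name parse_visible_devices is made up) of reading the variable as a comma-separated list of device indices:

```cpp
// Hypothetical sketch, not the real ggml-vulkan implementation:
// read GGML_VK_VISIBLE_DEVICES as a comma-separated list of device
// indices. An unset variable means "use automatic device selection".
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

static std::vector<size_t> parse_visible_devices() {
    std::vector<size_t> indices;
    const char * env = std::getenv("GGML_VK_VISIBLE_DEVICES");
    if (env == nullptr) {
        return indices; // unset: caller falls back to automatic selection
    }
    std::stringstream ss(env);
    std::string token;
    while (std::getline(ss, token, ',')) {
        indices.push_back(std::stoul(token)); // e.g. "0" or "0,1"
    }
    return indices;
}
```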

Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 10, 2025
ericcurtin (Collaborator)

This is exactly what we need! Defaulting to llvmpipe was silly (we would even warn in the logs that this is probably not what you want to do).

As an aside, will this work with Vulkan? Auto-setting ngl for Vulkan could be kinda neat:

https://github.com/ggml-org/llama.cpp/pull/14067/files

```cpp
// If only CPU devices are available, return without devices.
if (vk_instance.device_indices.empty()) {
    for (size_t i = 0; i < devices.size(); i++) {
        if (devices[i].getProperties().deviceType != vk::PhysicalDeviceType::eCpu) {
```
Collaborator

It's possible we want to consider other device types here too, like:

```cpp
if (devices[i].getProperties().deviceType != vk::PhysicalDeviceType::eCpu &&
    devices[i].getProperties().deviceType != vk::PhysicalDeviceType::eIntegratedGpu)
```

Most integrated GPUs are slower than CPU inference, and especially when an integrated GPU has < 1 GB of VRAM it gets very questionable.

But that could be a discussion for another PR...

Collaborator

Actually, looking at the various types it can be, I'd flip it:

```cpp
if (devices[i].getProperties().deviceType == vk::PhysicalDeviceType::eDiscreteGpu)
```

Collaborator Author

That is true, but there are also a lot of iGPUs that run better with Vulkan than the CPU. It is not straightforward to decide here; we might need a black- or whitelist.
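
To make the black-/whitelist idea concrete, here is a hedged sketch (illustrative only; this is not code from the PR, and the device names in the allowlist are placeholders) of admitting discrete GPUs by default while letting known-good iGPUs through:

```cpp
// Hypothetical sketch of the allowlist idea discussed above: accept
// discrete GPUs by default, let hand-picked iGPUs through, and skip
// everything else (including eCpu devices such as llvmpipe).
#include <string>
#include <unordered_set>
#include <vulkan/vulkan.hpp>

static bool vk_device_usable(const vk::PhysicalDevice & device) {
    // Placeholder entries; a real list would come from benchmarking.
    static const std::unordered_set<std::string> igpu_allowlist = {
        "AMD Radeon 780M",
        "Intel(R) Arc(tm) Graphics",
    };
    const vk::PhysicalDeviceProperties props = device.getProperties();
    switch (props.deviceType) {
        case vk::PhysicalDeviceType::eDiscreteGpu:
            return true;
        case vk::PhysicalDeviceType::eIntegratedGpu:
            return igpu_allowlist.count(props.deviceName.data()) > 0;
        default:
            return false;
    }
}
```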

Collaborator

Considering we can always override with GGML_VK_VISIBLE_DEVICES, only eDiscreteGpu would get my vote

Collaborator

Curious what's your take, @a-ghorbani, from the Android perspective.

Collaborator

> Most integrated GPUs are slower than CPU inference, and especially when an integrated GPU has < 1 GB of VRAM it gets very questionable.

> Considering we can always override with GGML_VK_VISIBLE_DEVICES, only eDiscreteGpu would get my vote

In the vast majority of cases an integrated GPU with Vulkan and -ngl 0 is going to perform better than the CPU in prompt processing. Also, pretty much all computer iGPUs that support Vulkan (so anything newer than Intel Skylake or the AMD GCN2 APUs) should be able to access several GB of memory without problems. I'll admit I'm not sure about phones, though.

If you look at the chart, a lot of the newer integrated chips run very well even with the model fully offloaded. Also, Intel, AMD, and Nvidia are beginning to follow Apple in making fast iGPUs with more memory bandwidth.

There is one case where the CPU might win at prompt processing, though: when you have one of those new 16-core AMD Zen 5 CPUs with the little 2 CU iGPU.

ericcurtin (Collaborator)

Looks like we hit a flake

ericcurtin merged commit 97340b4 into master on Jun 10, 2025 (43 of 47 checks passed).
ericcurtin deleted the 0cc4m/vulkan-disable-cpu-device branch on June 10, 2025 at 12:01.
jeffbolznv (Collaborator)

Looks like this did indeed disable our CI coverage?

```
2025-06-10T11:49:02.1603371Z 28: Test command: /home/runner/work/llama.cpp/llama.cpp/build/bin/test-backend-ops
2025-06-10T11:49:02.1603888Z 28: Working Directory: .
2025-06-10T11:49:02.1604101Z 28: Test timeout computed to be: 3600
2025-06-10T11:49:02.1899893Z 28: ggml_vulkan: No devices found.
2025-06-10T11:49:02.1919320Z 28: Testing 1 devices
2025-06-10T11:49:02.1919657Z 28:
2025-06-10T11:49:02.1919954Z 28: Backend 1/1: CPU
2025-06-10T11:49:02.1920324Z 28:   Skipping CPU backend
2025-06-10T11:49:02.1920643Z 28: 1/1 backends passed
2025-06-10T11:49:02.1921080Z 28: OK
2025-06-10T11:49:02.1951995Z 28/33 Test #28: test-backend-ops ..................   Passed    0.03 sec
```

IMO this needs to be fixed or reverted ASAP.

0cc4m (Collaborator, Author) commented Jun 10, 2025

I didn't expect it to be merged this quickly; maybe I should have set it to draft. But you basically only need to set GGML_VK_VISIBLE_DEVICES=0 to override this for the CI, and I assume that's not hard to do.
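
For illustration only (the actual fix in #14106 forces device 0 via the CI workflow rather than in code): the override just has to be in the environment before the Vulkan backend enumerates devices, e.g.:

```cpp
// Hypothetical illustration: force device 0 before any ggml Vulkan
// initialization runs. setenv is POSIX; on the GitHub CI runners the
// variable is set in the workflow environment instead.
#include <cstdlib>

int main() {
    setenv("GGML_VK_VISIBLE_DEVICES", "0", /*overwrite=*/1);
    // ... initialize the backend / run test-backend-ops-style tests here ...
    return 0;
}
```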

jeffbolznv (Collaborator)

OK, I've made an attempt at #14106 (though I'm not an expert on GitHub workflows).

xcvbnmp commented Jun 10, 2025

#14099 What is this?

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 10, 2025
* origin/master:
llama : support GEGLU for jina-bert-v2 (ggml-org#14090)
vulkan: force device 0 in CI (ggml-org#14106)
Fixed spec timings to: accepted/tested instead of accepted/drafted (ggml-org#14104)
sync : ggml
ggml : fix weak alias win32 (whisper/0)
Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)
rpc : nicer error messages for RPC server crash (ggml-org#14076)
sync : ggml
Add in-build ggml::ggml ALIAS library (ggml/1260)
metal : use less stack memory in FA kernel (ggml-org#14088)
kv-cache : fix shift and defrag logic (ggml-org#14081)
llama : allow building all tests on windows when not using shared libs (ggml-org#13980)
Labels: Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
Successfully merging this pull request may close these issues.

Performance regression between 0.7.4 and 0.9.0
5 participants