add GGML_USE_NUMA_MIGRATE feature to optimize cross NUMA op computation #13649
base: master
Conversation
load model error: "not enough space in the buffer"
The issue is likely caused by the GGML_PAD page-size alignment, which makes the overall size exceed the originally allocated size in ggml_backend_cpu_buffer_type_alloc_buffer().
model: Qwen3-30B-A3B-Q4_K_M
OK, it seems the issue is also platform-related; could you please retry the latest commit? I added (number of tensors) * page_size to the allocation size, which should address the issue here. It should not have a big impact on memory consumption: for example, Qwen3-235B-A22B Q8_0 has 1131 tensors, so the increased memory would be 1131 * 4 KiB ≈ 4.4 MiB.
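For reference, a minimal sketch of the sizing arithmetic described above (assuming 4 KiB pages; GGML_PAD is reproduced from ggml.h, the rest is illustrative and not the PR's actual code):

```cpp
// Padding every tensor to a page boundary can add at most (page_size - 1)
// bytes per tensor, so reserving n_tensors * page_size extra always suffices.
#include <cstddef>
#include <cstdio>

#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))  // as defined in ggml.h

int main() {
    const size_t page_size = 4096;   // 4 KiB pages assumed
    const size_t n_tensors = 1131;   // Qwen3-235B-A22B Q8_0, from the comment above

    // worst-case extra space needed when each tensor is padded to a page boundary
    const size_t extra = n_tensors * page_size;
    printf("extra reserve: %zu bytes (~%.1f MiB)\n", extra, extra / (1024.0 * 1024.0));

    // example of the per-tensor padding itself (tensor size is arbitrary)
    const size_t tensor_size = 1000003;
    const size_t padded      = GGML_PAD(tensor_size, page_size);
    printf("padded %zu -> %zu\n", tensor_size, padded);
    return 0;
}
```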
The good news is that it seems to have passed the buffer allocation. What's the backtrace for the segfault?
debug log info
Do you have a core dump generated? If so, you can open it with gdb to see the backtrace.
It would be better if you could make a debug build and then attach gdb to it.
BTW, what's the running command of your test?
llama-bench -m Qwen3-30B-A3B-Q4_K_M.gguf -t 50 -p 512 -n 128 --numa migrate --verbose
The cgraph when running llama-bench was null; I've fixed it in the latest commit, could you please give it a try?
OK, so we can see an obvious performance uplift in the pp512 test, though there's no improvement for tg128 (I guess cross-NUMA traffic isn't the bottleneck in that case).
…rier as the barrier only happens in few cores
I saw there were 2 runs with the feature: the first at 29.51 t/s and the second at 34.63 t/s, while the base data is stable at 29.72 t/s. Could you rerun pp512 with the feature to see if it's stable at 34 t/s? If that's the case, then we'd observe a 17% uplift. Anyway, x86 has better cross-NUMA traffic latency than aarch64.
Hi @Warshmao, will you do more testing with pp512? Or do you have an aarch64 system to verify it on?
As for aarch64, there's a 16.8% uplift for pp512 and a 25.9% uplift for tg128 with 64 threads as the baseline (the platform has 64 cores per NUMA node, and aarch64 gets its best baseline performance with 64 threads); comparing against 128 threads as the baseline instead, that's a 27.9% and 140% uplift for pp512 and tg128 respectively:
So the feature improves performance significantly on aarch64 platforms.
This PR adds the GGML_USE_NUMA_MIGRATE feature to optimize cross-NUMA op computation, since cross-NUMA memory access becomes the bottleneck when spawning threads across NUMA nodes.
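The general idea can be sketched as follows. This is only an illustrative sketch assuming Linux with libnuma (link with -lnuma), not the PR's actual implementation; the function name and the contiguous-slice policy are assumptions:

```cpp
// Migrate the physical pages of a weight buffer so that each NUMA node holds
// the contiguous slice its local threads will read, instead of leaving all
// pages on the node that first touched them.
#include <numa.h>       // numa_available, numa_num_configured_nodes
#include <numaif.h>     // move_pages, MPOL_MF_MOVE
#include <unistd.h>     // sysconf
#include <cstdint>
#include <cstdio>
#include <vector>

static void migrate_buffer_across_nodes(void * data, size_t size) {
    if (numa_available() < 0) {
        return;                                 // kernel/libnuma not available
    }
    const size_t page_size = (size_t) sysconf(_SC_PAGESIZE);
    const int    n_nodes   = numa_num_configured_nodes();
    if (n_nodes < 2 || size == 0) {
        return;                                 // nothing to spread
    }

    // align the start down to a page boundary (move_pages expects page addresses)
    uint8_t * base    = (uint8_t *) ((uintptr_t) data & ~(uintptr_t) (page_size - 1));
    size_t    n_pages = (size + (size_t) ((uint8_t *) data - base) + page_size - 1) / page_size;

    std::vector<void *> pages (n_pages);
    std::vector<int>    nodes (n_pages);
    std::vector<int>    status(n_pages);

    for (size_t i = 0; i < n_pages; ++i) {
        pages[i] = base + i * page_size;
        // contiguous split: first 1/n of the pages to node 0, next 1/n to node 1, ...
        nodes[i] = (int) (i * (size_t) n_nodes / n_pages);
    }

    // move_pages(2) migrates the physical pages; status[i] reports the resulting
    // node (or a negative errno) for each page
    long ret = move_pages(0, n_pages, pages.data(), nodes.data(), status.data(), MPOL_MF_MOVE);
    if (ret < 0) {
        perror("move_pages");
    }
}
```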
Test Results
With the feature enabled, tested on Neoverse N2 with multiple NUMA nodes (64 cores per NUMA node), there are performance improvements when running a Llama 3 model.
Base data
running command:
results:
Results with the feature
running command:
results:
For example, there's a 22% improvement in S_TG t/s with B=1 and a slight drop in S_PP t/s (1.8%), for an overall 20.7% improvement in S t/s.
Perplexity was also tested; there's no regression with the feature:
Enablement
The feature is disabled by default; here is the guidance for enabling it:
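The detailed enablement steps are not included in this extract. As a hedged illustration only: the feature sits behind the GGML_USE_NUMA_MIGRATE compile-time macro named in the PR title, and the conversation above shows it being selected at runtime with `--numa migrate`; the gating pattern below is an assumption based on how other GGML_USE_* features are guarded, not taken from the PR's sources:

```cpp
// Illustrative sketch of the assumed compile-time gating (not the PR's code).
#include <cstdio>

int main() {
#ifdef GGML_USE_NUMA_MIGRATE
    // code under this guard is only compiled when the feature is enabled at
    // build time; at runtime it is additionally selected with `--numa migrate`
    printf("NUMA migrate support compiled in\n");
#else
    printf("NUMA migrate support not compiled in (default)\n");
#endif
    return 0;
}
```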