Replies: 4 comments 9 replies
-
Hey, thanks for sharing this implementation. I'd been wondering if it was possible to do this. I updated my discussion on NUMA performance to link to this, and hopefully I can try it out at some point. Cheers! EDIT: Fixed link copy-paste error, thx!
-
This looks really interesting!
-
Can you give any detail on your measurement method? I assume this is a very small context. Regardless, I tried out the mirror.
2S Epyc 9115, each with 12x (i.e. 24x total) DDR5-4800, 1.5TB total
It's unfortunate that the full context won't fit with Q8 on a 24x 64GB machine though, only about 1/4 of that I think.

DeepSeek-R1 Q8
DeepSeek-R1 Q4
A few notes on hugepages etc. - my first time using them.
Probably a naive thought, but if one was going to spend 2x RAM on perf, wouldn't there be other things to explore, such as storing materialized transposed matrices so dot products would be faster with respect to cache-line hits and prefetch? 🤔
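Something like this is what I have in mind (a generic sketch, nothing from this patch, just to illustrate the cache-line argument): with a materialized transpose, the inner dot product walks both operands contiguously instead of striding across rows.

```cpp
// Generic illustration: dotting against a column of row-major B strides
// through memory (one useful element per cache line fetched), while a
// materialized transpose makes the same access contiguous and prefetch-friendly.
#include <cstddef>

// A is m x k, B is k x n, both row-major. C[i][j] = dot(row i of A, column j of B).
float dot_col_strided(const float *B, std::size_t k, std::size_t n,
                      const float *a_row, std::size_t j) {
    float acc = 0.0f;
    for (std::size_t t = 0; t < k; ++t)
        acc += a_row[t] * B[t * n + j];   // stride of n floats between accesses
    return acc;
}

// Bt is the materialized transpose (n x k, row-major): column j of B is now a contiguous row.
float dot_col_contiguous(const float *Bt, std::size_t k,
                         const float *a_row, std::size_t j) {
    float acc = 0.0f;
    const float *b_row = Bt + j * k;
    for (std::size_t t = 0; t < k; ++t)
        acc += a_row[t] * b_row[t];       // both operands walk contiguously
    return acc;
}
```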
-
Thanks for the updates @usrlocalben, inspired me to give it a try too on a dual socket Intel Xeon 6980P rig with 1.5TB RAM configured in BIOS to

First off, it was a pain to compile on Intel given the backend AMX code did not contain the updated dual-tensor stuff, so I just ripped out the AMX code for now for this test. Eventually got it to actually run, and it does seem to use the Explicit Huge Pages which I allocated. It takes a looong time to start up haha...

Is this something I'm supposed to do, or is it already in the codebase? I ask because it seems like all 64 threads running were coming from the first NUMA node and not distributed between the two (I would expect 32 threads per NUMA node). Anyway, here are my logs if anyone else wants to have a stab at it. The idea is promising, as ktransformers has shown some success, and doing per-NUMA-node "data parallel" allocation seems to be useful in other inference engines as well. So unless I figure out how to spread the allocated threads between the two NUMA nodes, it's not worth benchmarking yet imo. Finally, like @usrlocalben says:
The features already available in the ik_llama.cpp fork, like MLA and custom tensor GPU offload, make it difficult to go back to mainline llama.cpp for these big MoEs like R1 and V3-0324 until those experimental branches get merged in.

Testing Logs and Notes

Enable Explicit Hugepages
Compile
Try it out

# Try a small enough model to fit into explicit huge pages
$ ./build/bin/llama-server \
--model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
--alias unsloth/DeepSeek-R1-UD_Q2_K_XL \
--ctx-size 8192 \
--parallel 1 \
--threads 64 \
--host 127.0.0.1 \
--port 8080
load_tensors: loading model tensors, this can take a while... (mmap = true)
numa_set_preferred(0)
failed to open hugepage fd /dev/hugepages/llama-node0-0: 13 Permission denied
llama_model_load: error loading model: failed to open hugepage fd: Permission denied
llama_model_load_from_file_impl: failed to load model
## Now try again as root
$ sudo ./build/bin/llama-server \
--model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
--alias unsloth/DeepSeek-R1-UD_Q2_K_XL \
--ctx-size 8192 \
--parallel 1 \
--threads 64 \
--host 127.0.0.1 \
--port 8080
load_tensors: loading model tensors, this can take a while... (mmap = true)
numa_set_preferred(0)
mmap(/dev/hugepages/llama-node0-0) desire=0x200000000000 size=1073741824 result=0x200000000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-1) desire=0x200040000000 size=1073741824 result=0x200040000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-2) desire=0x200080000000 size=1073741824 result=0x200080000000 is_new_mem[0]=yes
...
mmap(/dev/hugepages/llama-node0-44) desire=0x200b00000000 size=1073741824 result=0x200b00000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-45) desire=0x200b40000000 size=1073741824 result=0x200b40000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-46) desire=0x200b80000000 size=1073741824 result=0x200b80000000 is_new_mem[0]=yes
numa_set_preferred(1)
mmap(/dev/hugepages/llama-node1-0) desire=0x400000000000 size=1073741824 result=0x400000000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-1) desire=0x400040000000 size=1073741824 result=0x400040000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-2) desire=0x400080000000 size=1073741824 result=0x400080000000 is_new_mem[1]=yes
...
mmap(/dev/hugepages/llama-node1-44) desire=0x400b00000000 size=1073741824 result=0x400b00000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-45) desire=0x400b40000000 size=1073741824 result=0x400b40000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-46) desire=0x400b80000000 size=1073741824 result=0x400b80000000 is_new_mem[1]=yes
begin to copy from disk to mem ...
# i see about ~1.2GB/s read from NVMe drive during this time...
begin to copy from numa0 to numa1 ...
numa_set_preferred(0)
mmap(/dev/hugepages/llama-node0-47) desire=0x200bc0000000 size=1073741824 result=0x200bc0000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-48) desire=0x200c00000000 size=1073741824 result=0x200c00000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-49) desire=0x200c40000000 size=1073741824 result=0x200c40000000 is_new_mem[0]=yes
...
mmap(/dev/hugepages/llama-node0-91) desire=0x2016c0000000 size=1073741824 result=0x2016c0000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-92) desire=0x201700000000 size=1073741824 result=0x201700000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-93) desire=0x201740000000 size=1073741824 result=0x201740000000 is_new_mem[0]=yes
numa_set_preferred(1)
mmap(/dev/hugepages/llama-node1-47) desire=0x400bc0000000 size=1073741824 result=0x400bc0000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-48) desire=0x400c00000000 size=1073741824 result=0x400c00000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-49) desire=0x400c40000000 size=1073741824 result=0x400c40000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-91) desire=0x4016c0000000 size=1073741824 result=0x4016c0000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-92) desire=0x401700000000 size=1073741824 result=0x401700000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-93) desire=0x401740000000 size=1073741824 result=0x401740000000 is_new_mem[1]=yes
begin to copy from disk to mem ...
# again i see about ~1.3GB/s read from NVMe drive during this time...
begin to copy from numa0 to numa1 ...
numa_set_preferred(0)
mmap(/dev/hugepages/llama-node0-94) desire=0x201780000000 size=1073741824 result=0x201780000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-95) desire=0x2017c0000000 size=1073741824 result=0x2017c0000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-96) desire=0x201800000000 size=1073741824 result=0x201800000000 is_new_mem[0]=yes
...
mmap(/dev/hugepages/llama-node0-138) desire=0x202280000000 size=1073741824 result=0x202280000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-139) desire=0x2022c0000000 size=1073741824 result=0x2022c0000000 is_new_mem[0]=yes
mmap(/dev/hugepages/llama-node0-140) desire=0x202300000000 size=1073741824 result=0x202300000000 is_new_mem[0]=yes
numa_set_preferred(1)
mmap(/dev/hugepages/llama-node1-94) desire=0x401780000000 size=1073741824 result=0x401780000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-95) desire=0x4017c0000000 size=1073741824 result=0x4017c0000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-96) desire=0x401800000000 size=1073741824 result=0x401800000000 is_new_mem[1]=yes
...
mmap(/dev/hugepages/llama-node1-138) desire=0x402280000000 size=1073741824 result=0x402280000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-139) desire=0x4022c0000000 size=1073741824 result=0x4022c0000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-140) desire=0x402300000000 size=1073741824 result=0x402300000000 is_new_mem[1]=yes
begin to copy from disk to mem ...
...
# blah blah it keeps going slowly counting up
mmap(/dev/hugepages/llama-node1-211) desire=0x4034c0000000 size=1073741824 result=0x4034c0000000 is_new_mem[1]=yes
mmap(/dev/hugepages/llama-node1-212) desire=0x403500000000 size=1073741824 result=0x403500000000 is_new_mem[1]=yes
begin to copy from disk to mem ...
begin to copy from numa0 to numa1 ...
load_tensors: CPU_Mapped model buffer size = 47485.39 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 47681.52 MiB
load_tensors: CPU_Mapped model buffer size = 25569.12 MiB
...................................................................................................warning: munmap failed: Invalid argument
warning: munmap failed: Invalid argument
warning: munmap failed: Invalid argument
warning: munmap failed: Invalid argument
warning: munmap failed: Invalid argument
.
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
init: CPU KV buffer size = 39040.00 MiB
llama_context: KV self size = 39040.00 MiB, K (f16): 23424.00 MiB, V (f16): 15616.00 MiB
llama_context: CPU compute buffer size = 2218.01 MiB
llama_context: graph nodes = 5025
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
thread_id = 00, node = 0, cpuid = 00
thread_id = 43, node = 0, cpuid = 43
thread_id = 48, node = 0, cpuid = 48
thread_id = 42, node = 0, cpuid = 42
thread_id = 30, node = 0, cpuid = 30
thread_id = 19, node = 0, cpuid = 19
thread_id = 37, node = 0, cpuid = 37
thread_id = 33, node = 0, cpuid = 33
thread_id = 32, node = 0, cpuid = 32
thread_id = 26, node = 0, cpuid = 26
thread_id = 40, node = 0, cpuid = 40
thread_id = 38, node = 0, cpuid = 38
thread_id = 36, node = 0, cpuid = 36
thread_id = 39, node = 0, cpuid = 39
thread_id = 25, node = 0, cpuid = 25
thread_id = 41, node = 0, cpuid = 41
thread_id = 31, node = 0, cpuid = 31
thread_id = 13, node = 0, cpuid = 13
thread_id = 05, node = 0, cpuid = 05
thread_id = 14, node = 0, cpuid = 14
thread_id = 28, node = 0, cpuid = 28
thread_id = 07, node = 0, cpuid = 07
thread_id = 29, node = 0, cpuid = 29
thread_id = 18, node = 0, cpuid = 18
thread_id = 08, node = 0, cpuid = 08
thread_id = 22, node = 0, cpuid = 22
thread_id = 60, node = 0, cpuid = 60
thread_id = 06, node = 0, cpuid = 06
thread_id = 17, node = 0, cpuid = 17
thread_id = 12, node = 0, cpuid = 12
thread_id = 51, node = 0, cpuid = 51
thread_id = 54, node = 0, cpuid = 54
thread_id = 63, node = 0, cpuid = 63
thread_id = 61, node = 0, cpuid = 61
thread_id = 56, node = 0, cpuid = 56
thread_id = 62, node = 0, cpuid = 62
thread_id = 55, node = 0, cpuid = 55
thread_id = 57, node = 0, cpuid = 57
thread_id = 59, node = 0, cpuid = 59
thread_id = 27, node = 0, cpuid = 27
thread_id = 52, node = 0, cpuid = 52
thread_id = 21, node = 0, cpuid = 21
thread_id = 10, node = 0, cpuid = 10
thread_id = 09, node = 0, cpuid = 09
thread_id = 16, node = 0, cpuid = 16
thread_id = 53, node = 0, cpuid = 53
thread_id = 01, node = 0, cpuid = 01
thread_id = 02, node = 0, cpuid = 02
thread_id = 11, node = 0, cpuid = 11
thread_id = 15, node = 0, cpuid = 15
thread_id = 49, node = 0, cpuid = 49
thread_id = 03, node = 0, cpuid = 03
thread_id = 23, node = 0, cpuid = 23
thread_id = 45, node = 0, cpuid = 45
thread_id = 34, node = 0, cpuid = 34
thread_id = 46, node = 0, cpuid = 46
thread_id = 50, node = 0, cpuid = 50
thread_id = 24, node = 0, cpuid = 24
thread_id = 47, node = 0, cpuid = 47
thread_id = 58, node = 0, cpuid = 58
thread_id = 44, node = 0, cpuid = 44
thread_id = 20, node = 0, cpuid = 20
thread_id = 04, node = 0, cpuid = 04
thread_id = 35, node = 0, cpuid = 35
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 8192
main: model loaded
main: chat template, chat_template:
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
# ask it "Count from 1 to 10 in French" via my custom dchat.py client app
srv log_server_r: request: GET /v1/models 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 13, n_tokens = 13
slot release: id 0 | task 0 | stop processing: n_past = 676, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 819.57 ms / 13 tokens ( 63.04 ms per token, 15.86 tokens per second)
eval time = 98072.72 ms / 664 tokens ( 147.70 ms per token, 6.77 tokens per second)
total time = 98892.29 ms / 677 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
$ grep Huge /proc/meminfo
AnonHugePages: 40169472 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 400000
HugePages_Free: 181888
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 819200000 kB
$ sudo numastat -m -p $(pidof llama-server)
Per-node process memory usage (in MBs) for PID 2785429 (llama-server)
Node 0 Node 1 Total
--------------- --------------- ---------------
Huge 218112.00 218112.00 436224.00
Heap 40.89 0.00 40.89
Stack 0.05 0.00 0.05
Private 39107.00 7.65 39114.65
---------------- --------------- --------------- ---------------
Total 257259.94 218119.65 475379.59
Per-node system memory usage (in MBs):
Token Unaccepted not in hash table.
Token Unaccepted not in hash table.
Node 0 Node 1 Total
--------------- --------------- ---------------
MemTotal 771710.76 773987.20 1545697.96
MemFree 325179.17 151853.02 477032.19
MemUsed 446531.59 622134.18 1068665.77
SwapCached 0.40 1.37 1.77
Active 39294.56 210.40 39504.96
Inactive 977.02 216733.16 217710.18
Active(anon) 39219.49 76.32 39295.80
Inactive(anon) 1.08 15.60 16.68
Active(file) 75.07 134.09 209.16
Inactive(file) 975.95 216717.56 217693.50
Unevictable 29.80 5.69 35.49
Mlocked 21.01 5.69 26.70
Dirty 0.02 0.00 0.02
Writeback 0.00 0.00 0.00
FilePages 1063.30 216866.52 217929.82
Mapped 53.62 60.35 113.98
AnonPages 39235.67 82.77 39318.44
Shmem 8.97 7.82 16.79
KernelStack 46.81 37.95 84.77
PageTables 83.63 3.31 86.95
SecPageTables 0.00 0.00 0.00
NFS_Unstable 0.00 0.00 0.00
Bounce 0.00 0.00 0.00
WritebackTmp 0.00 0.00 0.00
Slab 2690.84 1905.19 4596.03
SReclaimable 665.92 184.16 850.08
SUnreclaim 2024.92 1721.03 3745.95
AnonHugePages 39188.00 40.00 39228.00
ShmemHugePages 0.00 0.00 0.00
ShmemPmdMapped 0.00 0.00 0.00
FileHugePages 0.00 0.00 0.00
FilePmdMapped 0.00 0.00 0.00
HugePages_Total 400000.00 400000.00 800000.00
HugePages_Free 181888.00 181888.00 363776.00
HugePages_Surp 0.00 0.00 0.00
KReclaimable 665.92 184.16 850.08

Observations

It seems odd that the 64 threads it chose are all in a single NUMA node instead of being distributed equally across both? This system is set
-
TLDR: Replicate the model on each NUMA node. On my platform, pure CPU inference of QwQ-32B FP16 improved from ~6.6 token/s to ~10.7 token/s, and DeepSeek R1 671B Q8 from ~7.2 token/s to ~9.7 token/s. You can find the modified llama.cpp version here.

On a dual socket system, cross-NUMA access is extremely slow.
The most memory-bandwidth-consuming part of LLM inference is reading the model weights themselves. To maximize utilization of the multi-CPU platform's total memory bandwidth:

- Replicate one copy of the model per available NUMA node.
- Use the local copy when doing calculations (i.e., always work with the current NUMA node's own replica).

This would theoretically achieve double the bandwidth of a single NUMA node.
Model addresses are primarily stored in `tensor->data`. To allow access to each NUMA node's data, we change this to `tensor->__data[2]` (assuming two NUMA nodes) instead of a single pointer. When setting values for `tensor->data`, assign each NUMA node its respective memory region address.

How do we know the per-NUMA memory locations? Leverage Linux mmap's ability to specify the virtual address: during the mapping phase, allocate specific regions for the model copies across NUMA nodes. Example:

- NUMA node 0 uses a base address starting at 0x200000000000
- NUMA node 1: 0x400000000000

The assignment logic would be: if a given data pointer falls within the `[0x2... ~ 0x4...]` range, then `__data[0]` retains the original value, while `__data[1] = original_data + (offset between the two NUMA nodes' bases)`, as sketched below.
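In simplified form, the allocation/assignment step looks roughly like this (illustrative sketch only, with names and error handling simplified rather than the actual code; the `numa_set_preferred` and `/dev/hugepages/llama-nodeN-i` pattern matches the loader logs above):

```cpp
// Sketch: reserve one hugetlbfs-backed 1 GiB chunk per NUMA node at a fixed
// virtual base, so a weight's node-1 address is node-0 address + constant delta.
#include <fcntl.h>
#include <numa.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

static constexpr uintptr_t NODE_BASE[2] = { 0x200000000000ull, 0x400000000000ull };
static constexpr size_t    CHUNK        = 1ull << 30;   // 1 GiB per hugepage-backed file

// Map chunk `idx` of the mirror for `node` at its fixed virtual address.
void *map_mirror_chunk(int node, int idx) {
    char path[64];
    std::snprintf(path, sizeof(path), "/dev/hugepages/llama-node%d-%d", node, idx);
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return nullptr;                          // e.g. "Permission denied" when not root
    if (ftruncate(fd, CHUNK) != 0) { close(fd); return nullptr; }
    numa_set_preferred(node);                            // prefer faulting pages onto this node
    void *want = (void *)(NODE_BASE[node] + (uintptr_t)idx * CHUNK);
    void *got  = mmap(want, CHUNK, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return got == MAP_FAILED ? nullptr : got;
}

// Given a tensor's node-0 pointer, derive its node-1 mirror address.
void set_mirrored_data(void *__data[2], void *node0_ptr) {
    __data[0] = node0_ptr;
    __data[1] = (char *)node0_ptr + (NODE_BASE[1] - NODE_BASE[0]);
}
```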
When accessing `tensor->data` at runtime, the thread's current NUMA node ID is retrieved from its TLS storage. We should also bind threads to specific cores/NUMA nodes.

To implement this:

- Modify all instances of `tensor->data` accesses in the codebase.
- Create helper functions (a hedged sketch follows below).

This requires changing about 700 lines of code.
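The helper functions plus the TLS lookup are conceptually along these lines (again an illustrative sketch with simplified names such as `tensor_data`, not the real implementation):

```cpp
// Sketch: each worker thread is pinned to a core and caches its NUMA node id
// in TLS; tensor data accesses go through a helper that picks the local mirror.
#include <numa.h>
#include <pthread.h>
#include <sched.h>

struct ggml_tensor_mirrored {      // stand-in for the modified ggml_tensor
    void *__data[2];               // one pointer per NUMA node
    // ... other fields unchanged ...
};

static thread_local int tls_numa_node = 0;

// Pin the calling thread to `cpu` and remember which node that CPU lives on.
void bind_worker_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    tls_numa_node = numa_node_of_cpu(cpu);
}

// Replacement for raw `tensor->data` reads: return the local node's copy.
static inline void *tensor_data(const ggml_tensor_mirrored *t) {
    return t->__data[tls_numa_node];
}

// Replacement for raw `tensor->data = ptr` writes: update both mirrors
// (the node-1 address is derived from the fixed per-node bases, as above).
static inline void tensor_set_data(ggml_tensor_mirrored *t, void *node0_ptr, void *node1_ptr) {
    t->__data[0] = node0_ptr;
    t->__data[1] = node1_ptr;
}
```

Every read of `t->data` then becomes a call like `tensor_data(t)`, which is presumably where most of those ~700 changed lines come from.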
Testing Platform: 9275F × 2 + DDR5-6000 MT/s × (2 × 12 channels)
Model Used: QwQ-32B FP16
Codebase Commit: 1e2f78a00450593e2dfa458796fcdd9987300dfc

Test Scenarios:

Scenario 1 - Single NUMA mode: BIOS configures all memory into a single unified node, with data spread between both physical nodes
- QwQ-32B FP16: generate = 1085, speed = 6.66 token/s, power = 798 W
- DeepSeek R1 671B Q8: generate = 1022, speed = 7.19 token/s, power = 753 W

Scenario 2 - Two NUMA nodes with `numactl --interleave=0,1`
- QwQ-32B: generate = 1399, speed = 6.82 token/s, power = 806 W
- DeepSeek R1 671B Q8: generate = 1056, speed = 7.23 token/s, power = 728 W

Scenario 3 - New GGML_NUMA_MIRROR scheme proposed above
- QwQ-32B: generate = 1344, speed = 10.80 token/s, power = 884 W
- DeepSeek R1 671B Q8: generate = 1084, speed = 9.67 token/s, power = 762 W
Here's the code for you to try out: vproxy-tools/llama.cpp

The `tensor->data` modifications are stored here.