Dual Epyc Genoa/Turin token generation performance bottleneck #11733
-
Here's a continuation of my investigation regarding the performance bottleneck on dual CPU systems.

Tested Platforms

Platform P0
This is my Epyc Genoa workstation. I used it with the NUMA NPS2 BIOS setting to emulate a dual CPU system with a very fast interconnect between CPUs.

Platform P1
Thanks to u/TastesLikeOwlbear for providing access to the system.

Platform P2
Thanks to u/SuperSecureHuman for providing access to the system.

Software

I used the following software:
Test Methodology
Assuming perfect scaling with the number of CPUs, a dual CPU llama-bench run should report twice the performance of a single CPU run, so 200% is the theoretical maximum here.

Test Results

Memory Bandwidth

First let's see how much of the theoretical maximum memory bandwidth we can use. For this purpose I measured the read bandwidth with the likwid-bench load kernel.
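For readers unfamiliar with the tool: the load kernel simply streams through a working set much larger than the caches and reports the achieved read bandwidth. A naive sketch of such a measurement is shown below; this is an illustration of the idea, not likwid-bench, and a simple loop like this will usually report lower numbers than a tuned kernel:

```cpp
// Naive multi-threaded read-bandwidth sketch (illustrative only, not likwid-bench).
// Each thread sums a private slice of a large array; bandwidth = bytes read / elapsed time.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const size_t n_bytes   = 4ull * 1024 * 1024 * 1024;        // 4 GiB working set, larger than L3
    const int    n_threads = std::thread::hardware_concurrency();

    std::vector<double> data(n_bytes / sizeof(double), 1.0);
    std::vector<double> sums(n_threads, 0.0);
    const size_t chunk = data.size() / n_threads;

    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const double * p = data.data() + (size_t) t * chunk;
            double s = 0.0;
            for (size_t i = 0; i < chunk; ++i) s += p[i];      // pure streaming reads
            sums[t] = s;
        });
    }
    for (auto & w : workers) w.join();
    const double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    const double bytes_read = (double) n_threads * chunk * sizeof(double);
    printf("checksum %.1f, read bandwidth: %.1f GB/s\n",
           std::accumulate(sums.begin(), sums.end(), 0.0), bytes_read / secs / 1e9);
}
```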
Turin clearly wins over Genoa in terms of the ability to use the available memory bandwidth.

LLM Inference - dense models

For LLM inference tests of dense models I used a classic Llama-3.1 70B with f16 weights.
We can observe pretty good performance scaling figures. It's definitely worth using a dual CPU platform for large dense models.

LLM inference - MoE models

For LLM inference tests of MoE models I used Mixtral 8x22B v0.1 with Q8_0 weights.
We can observe that prompt processing performance scales very well on dual-CPU systems, but for some reason the token generation performance exhibits only moderate scaling.

LLM inference - DeepSeek V3

For LLM inference tests of models based on the DeepSeek V3 architecture I used DeepSeek R1 with Q4_K_S and Q8_0 (where possible) weights.
Prompt processing shows moderate scaling on the dual Genoa system and good scaling on the dual Turin system, but the token generation performance scales very badly on both - there are barely any gains compared to using only a single socket.

Possible Causes

I thought about the possible causes for this, and my current working hypothesis is that the observed differences in scaling are caused by the sizes of the multiplied matrices. If we look at the FFN matrices of the tested models, they have the following sizes:
Also note that I tested Llama-3.1 70B in f16, Mixtral in Q8_0 and DeepSeek R1 in Q4_K_S quantization. So there are 448MB of matrix data in Llama-3.1, 96MB in Mixtral and only around 7MB in DeepSeek. Moreover, DeepSeek R1 in Q8_0 scaled better than in Q4_K_S. The smaller the multiplied matrix, the higher (relatively) the synchronization and communication overhead resulting from dual CPU usage. Note that the problem does not manifest on a single-socket system with the NPS2 NUMA setting.

Next Steps

To verify my hypothesis it would be enough to check how inference of a very small dense model like Llama-3.2 1B scales on a dual CPU system. Its FFN matrix size is very similar to the size of the matrix for a single expert in DeepSeek R1. If I'm right, the scaling will be horrible. I posted the steps here in case anyone wants to try.
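As a sanity check on the figures quoted above, here is a rough per-matrix size calculation. The FFN shapes (8192×28672 for Llama-3.1 70B, 6144×16384 for Mixtral 8x22B, 7168×2048 per DeepSeek expert) and the approximate bytes-per-weight values (2 for f16, ~1 for Q8_0, ~0.5 for Q4_K_S) are my own assumptions, not taken from the post:

```cpp
// Back-of-the-envelope FFN matrix sizes; shapes and bytes-per-weight are approximations.
#include <cstdio>

int main() {
    struct { const char * model; long long rows, cols; double bytes_per_weight; } m[] = {
        { "Llama-3.1 70B (f16)",  8192, 28672, 2.0 },  // -> ~448 MiB per FFN matrix
        { "Mixtral 8x22B (Q8_0)", 6144, 16384, 1.0 },  // -> ~96 MiB per FFN matrix
        { "DeepSeek R1 (Q4_K_S)", 7168,  2048, 0.5 },  // -> ~7 MiB per expert FFN matrix
    };
    for (const auto & e : m) {
        const double mib = e.rows * e.cols * e.bytes_per_weight / (1024.0 * 1024.0);
        printf("%-24s %5lld x %5lld -> %7.1f MiB\n", e.model, e.rows, e.cols, mib);
    }
}
```

The results reproduce the 448MB / 96MB / ~7MB figures, so the ratio between the per-matrix working sets of these models is roughly 64 : 14 : 1.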
-
What happens if you allow the use of sgemm for token generation?

llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp, lines 2373 to 2374 in d04e716

=> if (n<1)
-
Is it ultimately calling this function: llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp, line 3970 in 0d55958?

If so, then it's almost certainly this loop:

```cpp
// compute each matrix multiplication in sequence
for (int cur_a = 0; cur_a < n_as; ++cur_a) {
    const int64_t cne1 = matrix_row_counts[cur_a];

    if (cne1 == 0) {
        continue;
    }

    auto src0_cur = (const char *) src0->data + cur_a*nb02;

    //const int64_t nr0 = ne01; // src0 rows
    const int64_t nr1 = cne1; // src1 rows

    int64_t src0_cur_start = (ith * ne01) / nth;
    int64_t src0_cur_end   = ((ith + 1) * ne01) / nth;

    src0_cur_start = (src0_cur_start % NB_COLS) ? src0_cur_start + NB_COLS - (src0_cur_start % NB_COLS) : src0_cur_start;
    src0_cur_end   = (src0_cur_end   % NB_COLS) ? src0_cur_end   + NB_COLS - (src0_cur_end   % NB_COLS) : src0_cur_end;

    if (src0_cur_start >= src0_cur_end) return;

    for (int ir1 = 0; ir1 < nr1; ir1++) {
        struct mmid_row_mapping row_mapping = MMID_MATRIX_ROW(cur_a, ir1);
        const int id = row_mapping.i1; // selected expert index

        const int64_t i11 = id % ne11;
        const int64_t i12 = row_mapping.i2; // row index in src1

        const int64_t i1 = id;  // selected expert index
        const int64_t i2 = i12; // row

        auto src1_col = (const char *) wdata + (i11 * nbw1 + i12 * nbw2);

        gemv<BLOC_TYPE, INTER_SIZE, NB_COLS>(
            ne00, (float *)((char *) dst->data + (i1 * nb1 + i2 * nb2)) + src0_cur_start,
            ne01, src0_cur + src0_cur_start * nb01,
            src1_col, 1, src0_cur_end - src0_cur_start);
    }
}
```

Each 7168×2048 matrix is only 28MB in 16-bit float format (compared to 192MB for your …).

Something like this:

```cpp
#pragma omp parallel for schedule(dynamic, 1)
for (int cur_a = 0; cur_a < n_as; ++cur_a) {
```

and writing into a temporary instead of … It needs to use "dynamic" as the loop has the early exit … I've no idea how …

EDIT: Actually, I just saw you replied to my post above and it may be a different function - I'll leave this here anyway and go and have a look at that :)
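To make the proposed restructuring concrete, here is a rough standalone sketch of parallelising over the experts with OpenMP instead of splitting each expert's rows across all threads. The helper name gemv_expert, the data layout, and the expert-usage check are made up for illustration; this is not the ggml code:

```cpp
// Sketch: each OpenMP task handles one whole expert, writing into that expert's
// own output slice. Compile with -fopenmp. Illustrative only.
#include <cstdio>
#include <vector>

// trivial stand-in for a per-expert GEMV: dst[r] = sum_c W[r][c] * x[c]
static void gemv_expert(const float * W, const float * x, float * dst, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c) acc += W[r * cols + c] * x[c];
        dst[r] = acc;
    }
}

int main() {
    const int n_as = 8, rows = 2048, cols = 7168;              // DeepSeek-like expert shape
    std::vector<float> W((size_t) n_as * rows * cols, 0.01f);  // all expert weights
    std::vector<float> x(cols, 1.0f);
    std::vector<float> out((size_t) n_as * rows, 0.0f);        // per-expert output temporaries

    // "dynamic" scheduling because some experts are skipped (no rows routed to them)
    #pragma omp parallel for schedule(dynamic, 1)
    for (int cur_a = 0; cur_a < n_as; ++cur_a) {
        const bool expert_used = (cur_a % 2 == 0);             // stand-in for matrix_row_counts[cur_a] != 0
        if (!expert_used) continue;
        gemv_expert(&W[(size_t) cur_a * rows * cols], x.data(), &out[(size_t) cur_a * rows], rows, cols);
    }

    printf("out[0] = %f\n", out[0]);
    return 0;
}
```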
-
```c
    ggml_barrier(params->threadpool);

    for (int cur_a = 0; cur_a < n_as; ++cur_a) {
        const int64_t cne1 = matrix_row_counts[cur_a];

        if (cne1 == 0) {
            continue;
        }

        const char * src0_cur = (const char *) src0->data + cur_a * nb02;
        const void * wdata    = (src1->type == vec_dot_type) ? src1->data : params->wdata;
        const size_t row_size = ggml_row_size(vec_dot_type, ne10);

        const int64_t nr0 = ne01;
        const int64_t nr1 = cne1;

        int chunk_size = 16;
        if (nr0 == 1 || nr1 == 1) {
            chunk_size = 64;
        }

#if defined(__aarch64__)
        // disable for ARM
        const bool disable_chunking = true;
#else
        // disable for NUMA
        const bool disable_chunking = ggml_is_numa();
#endif // defined(__aarch64__)

        int64_t nchunk0 = (nr0 + chunk_size - 1) / chunk_size;
        int64_t nchunk1 = (nr1 + chunk_size - 1) / chunk_size;

        if (nchunk0 * nchunk1 < nth * 4 || disable_chunking) {
            nchunk0 = nr0 > nr1 ? nth : 1;
            nchunk1 = nr0 > nr1 ? 1 : nth;
        }

        const int64_t dr0 = (nr0 + nchunk0 - 1) / nchunk0;
        const int64_t dr1 = (nr1 + nchunk1 - 1) / nchunk1;

        int current_chunk = ith;

        atomic_int * current_chunk_ctr = (atomic_int *)(atomic_current_chunk + cur_a);

        while (current_chunk < nchunk0 * nchunk1) {
            const int64_t ith0 = current_chunk % nchunk0;
            const int64_t ith1 = current_chunk / nchunk0;

            const int64_t ir0_start = dr0 * ith0;
            const int64_t ir0_end   = MIN(ir0_start + dr0, nr0);

            const int64_t ir1_start = dr1 * ith1;
            const int64_t ir1_end   = MIN(ir1_start + dr1, nr1);

            ggml_compute_forward_mul_mat_id_one_chunk(
                dst, src0, src1, ids, cur_a,
                ir0_start, ir0_end, ir1_start, ir1_end,
                src0_cur, matrix_rows, row_size, src1_cont, wdata
            );

            if (nth >= nchunk0 * nchunk1) {
                break;
            }

            current_chunk = atomic_fetch_add_explicit(current_chunk_ctr, 1, memory_order_relaxed);
        }
    }
}
```

I can see the …

```c
static void ggml_compute_forward_mul_mat_id_one_chunk(
    struct ggml_tensor * dst,
    const struct ggml_tensor * src0,
    const struct ggml_tensor * src1,
    const struct ggml_tensor * ids,
    const int64_t cur_a,
    const int64_t ir0_start,
    const int64_t ir0_end,
    const int64_t ir1_start,
    const int64_t ir1_end,
    const char * src0_cur,
    const struct mmid_row_mapping * matrix_rows,
    const size_t row_size,
    const bool src1_cont,
    const void * wdata) {

    GGML_TENSOR_BINARY_OP_LOCALS

    const enum ggml_type type = src0->type;

    ggml_vec_dot_t const vec_dot      = type_traits_cpu[type].vec_dot;
    enum ggml_type const vec_dot_type = type_traits_cpu[type].vec_dot_type;

    const int64_t blck_0 = 16;
    const int64_t blck_1 = 16;

    float tmp[16];

    for (int64_t iir1 = ir1_start; iir1 < ir1_end; iir1 += blck_1) {
        for (int64_t iir0 = ir0_start; iir0 < ir0_end; iir0 += blck_0) {
            for (int64_t ir1 = iir1; ir1 < iir1 + blck_1 && ir1 < ir1_end; ++ir1) {
                const int64_t _i12 = ir1; // logical row index for this expert

                struct mmid_row_mapping row_mapping = MMID_MATRIX_ROW(cur_a, _i12);
                const int id = row_mapping.i1; // selected expert index

                const int64_t i11 = id % ne11;
                const int64_t i12 = row_mapping.i2; // row index in src1

                const int64_t i1 = id;  // selected expert index
                const int64_t i2 = i12; // row

                // desc: when src1 is not a contiguous memory block we have to calculate the offset using the strides
                //       if it is, then we have either copied the data to params->wdata and made it contiguous or we are using
                //       the original src1 data pointer, so we should index using the indices directly
                // TODO: this is a bit of a hack, we should probably have a better way to handle this
                const char * src1_col = (const char *) wdata +
                    (src1_cont || src1->type != vec_dot_type
                        ? (i11      + i12*ne11)*row_size
                        : (i11*nb11 + i12*nb12));

                float * dst_col = (float *) ((char *) dst->data + (i1*nb1 + i2*nb2));

                for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir0_end; ++ir0) {
                    vec_dot(ne00, &tmp[ir0 - iir0], 0, src0_cur + ir0*nb01, 0, src1_col, 0, 1);
                }

                memcpy(&dst_col[iir0], tmp, (MIN(iir0 + blck_0, ir0_end) - iir0)*sizeof(float));
            }
        }
    }
}
```

Surely it's not just in …?
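For reference, here is a stripped-down illustration of the chunked work distribution used in the snippet above: a shared atomic counter hands out 2-D (row, token) chunk indices to threads. It's a standalone toy with invented shapes, not the ggml code:

```cpp
// Toy version of the chunk-stealing loop: threads grab chunk indices from a shared
// atomic counter and map them to (ir0, ir1) ranges. Build with -pthread.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int64_t nr0 = 2048, nr1 = 8;        // e.g. expert rows x routed tokens
    const int64_t chunk_size = 64;
    const int     nth = 8;

    const int64_t nchunk0 = (nr0 + chunk_size - 1) / chunk_size;
    const int64_t nchunk1 = (nr1 + chunk_size - 1) / chunk_size;
    const int64_t dr0     = (nr0 + nchunk0 - 1) / nchunk0;
    const int64_t dr1     = (nr1 + nchunk1 - 1) / nchunk1;

    std::atomic<int64_t> current_chunk{nth};  // first nth chunks are pre-assigned, one per thread
    std::vector<int64_t> work_done(nth, 0);

    std::vector<std::thread> workers;
    for (int ith = 0; ith < nth; ++ith) {
        workers.emplace_back([&, ith] {
            int64_t chunk = ith;
            while (chunk < nchunk0 * nchunk1) {
                const int64_t ith0 = chunk % nchunk0;
                const int64_t ith1 = chunk / nchunk0;

                const int64_t ir0_start = dr0 * ith0;
                const int64_t ir0_end   = std::min(ir0_start + dr0, nr0);
                const int64_t ir1_start = dr1 * ith1;
                const int64_t ir1_end   = std::min(ir1_start + dr1, nr1);

                // a real implementation would run the matmul for this chunk here
                work_done[ith] += (ir0_end - ir0_start) * (ir1_end - ir1_start);

                chunk = current_chunk.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto & w : workers) w.join();

    for (int ith = 0; ith < nth; ++ith) {
        printf("thread %d processed %lld (row, token) pairs\n", ith, (long long) work_done[ith]);
    }
}
```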
-
I created a NUMA-aware matrix-vector multiplication benchmark (modified from the existing old one); it's in the numa-matmul-bench branch of my llama.cpp fork: https://github.com/fairydreaming/llama.cpp/tree/numa-matmul-bench

The benchmark needs libnuma to compile. Now I need a brave volunteer to try it on a dual-CPU machine. For example, I ran it using the FFN tensor dimensions from Llama-3.1 70B on my workstation (an emulated dual CPU system via the BIOS NUMA NPS2 setting) and got the following:

Using a single CPU:
Using two CPUs:
As you can see, there is almost perfect performance scaling, since 1062.94 / 565.63 = 187.9%. So with two CPUs it works almost twice as fast as with one.

Commands to try for smaller matrices (sized like the DeepSeek R1 experts):

To run on a single CPU:
To run on two CPUs:
Parameter -t is the number of threads, -i is the number of benchmark iterations, and -l is the number of benchmark computation graph "layers". Each layer is a single matrix-vector multiplication and a tensor addition. The number of layers is tuned to make sure that the weights of the multiplied matrices won't be cached in L3. You can also try swapping the x and y values in both commands to see if it affects the performance. On my machine I get average results of 870.30 and 535.76, so the scaling ratio is 162.4% - a bit worse than for the large matrix. But note that this is an "emulated" dual CPU machine; a real one will perform worse due to limited interconnect bandwidth and increased latency (I wonder how much worse).
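The actual benchmark code lives in the branch linked above; as a rough sketch of the general idea behind NUMA-aware placement (each node computes only against weights allocated in its own local memory), assuming libnuma and invented shapes:

```cpp
// Minimal sketch of a NUMA-aware matrix-vector multiply: each node owns a slice of
// the weight rows, allocated on that node. Illustrative only; link with -lnuma -pthread.
#include <numa.h>
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    const int    n_nodes = numa_num_configured_nodes();
    const size_t rows = 2048, cols = 7168;                    // DeepSeek-style expert shape
    const size_t rows_per_node = (rows + n_nodes - 1) / n_nodes;

    std::vector<float>   x(cols, 1.0f);
    std::vector<float>   y(rows, 0.0f);
    std::vector<float *> w_slices(n_nodes);

    // allocate each node's slice of the weight matrix in that node's local memory
    for (int node = 0; node < n_nodes; ++node) {
        w_slices[node] = (float *) numa_alloc_onnode(rows_per_node * cols * sizeof(float), node);
        memset(w_slices[node], 0, rows_per_node * cols * sizeof(float));
    }

    // one worker per node, pinned to that node, computing only its local rows
    std::vector<std::thread> workers;
    for (int node = 0; node < n_nodes; ++node) {
        workers.emplace_back([&, node] {
            numa_run_on_node(node);                           // keep the computation next to the memory
            const size_t r0 = node * rows_per_node;
            const size_t r1 = std::min(rows, r0 + rows_per_node);
            for (size_t r = r0; r < r1; ++r) {
                const float * wrow = w_slices[node] + (r - r0) * cols;
                float acc = 0.0f;
                for (size_t c = 0; c < cols; ++c) acc += wrow[c] * x[c];
                y[r] = acc;
            }
        });
    }
    for (auto & t : workers) t.join();

    for (int node = 0; node < n_nodes; ++node) {
        numa_free(w_slices[node], rows_per_node * cols * sizeof(float));
    }
    printf("y[0] = %f\n", y[0]);
}
```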
-
Part 1 - The Problem
I have temporary access to a dual CPU Epyc Turin system. I did some initial performance tests with llama.cpp running on a single CPU:
For CPU 0:
For CPU 1:
Unfortunately, when I run llama.cpp on both CPUs at once with --numa distribute, the prompt processing performance doubles, while the token generation performance stays at the same level as with a single CPU (actually it's even a bit worse):

Part 2 - The Workaround
I did some more tests and found something weird. If I run llama-bench with both prompt processing and token generation tests with --numa distribute on a dual-CPU system, the result is:

but when I dropped the caches and ran ONLY the generation test, it magically became faster:
So my current hypothesis is that the placement of tensors in memory resulting from the prompt processing is for some reason sub-optimal for the token generation. This is definitely something to investigate further.
But loading the model during generation instead of prompt processing can be a viable workaround for the problem. I mean, if running a generation benchmark results in optimal placement of tensors in memory, then just run it first and you are done. The generation performance stays high after this, even when running the combined benchmark:
I described this workaround in #11744 so that people can try it.
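One plausible mechanism behind this placement effect is Linux's default local ("first touch") allocation: pages end up on the NUMA node of whichever thread faults them in first, so the thread layout of the first pass over the weights determines where they live afterwards. A small sketch (assuming libnuma; this is illustrative, not llama.cpp code) of how one can check where pages actually landed:

```cpp
// First-touch demonstration: touch a buffer from a thread bound to node 0 and then
// query which node each page ended up on. Link with -lnuma.
#include <numa.h>
#include <numaif.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    const size_t page    = sysconf(_SC_PAGESIZE);
    const size_t n_pages = 64;
    char * buf = (char *) aligned_alloc(page, n_pages * page);

    numa_run_on_node(0);                 // bind the touching thread to node 0
    memset(buf, 1, n_pages * page);      // first touch: pages should land on node 0

    std::vector<void *> pages(n_pages);
    std::vector<int>    status(n_pages);
    for (size_t i = 0; i < n_pages; ++i) pages[i] = buf + i * page;

    // with nodes == NULL, move_pages only reports the node of each page
    if (move_pages(0, n_pages, pages.data(), NULL, status.data(), 0) == 0) {
        int on_node0 = 0;
        for (size_t i = 0; i < n_pages; ++i) on_node0 += (status[i] == 0);
        printf("%d of %zu pages are on node 0\n", on_node0, n_pages);
    }
    free(buf);
}
```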
Part 3 - The Cause
I did some more investigation into what causes this, and it seems to be related to GGML_USE_LLAMAFILE and llamafile_sgemm() calls. If the model weights are loaded with these calls, the token generation performance is reduced. Example:

When I disable GGML_USE_LLAMAFILE the token generation rate is not reduced (but prompt processing is much slower; llamafile_sgemm() gives it a huge performance boost):

When I use the trick described in #11744 (with GGML_USE_LLAMAFILE enabled), it's possible to keep both the prompt processing and token generation rates fast:
Regarding the exact cause, it's a large number of remote NUMA node memory accesses during token generation if the model weights were loaded with llamafile_sgemm() calls. Measured with numatop during "slow" generation:

while during "fast" generation we have:
It seems that llamafile_sgemm() places the model weights in disk cache memory in such a way that a large number of remote NUMA node memory accesses is needed when using the weights during token generation.

Part 4 - The Solution
The simplest solution for this problem would be to warm up the model with token generation instead of prompt processing, so that llamafile_sgemm() calls are not used to load the model weights. I tested it by commenting out the EOS token in the creation of the warm-up batch (so that there's only a single token in this batch) and it seems to work. I verified it by running llama-cli and then llama-bench to measure the token generation rate. With a single token in the warm-up batch I have:

When there are two tokens (BOS and EOS) I have:
The disadvantage of this simple workaround is that disabling the warm-up with a command line option would still cause a reduction in token generation performance.
A proper fix for this problem would be a NUMA-aware matrix multiplication implementation which:
Another possible solution is to implement Megatron-LM-style tensor parallelism. In this case each NUMA node would use only its associated part of the model weights and would keep them in local memory; see the sketch below.
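To illustrate the tensor-parallel idea (this is a toy sketch, not an implementation proposal for llama.cpp): split the FFN up-projection by output columns and the down-projection by input rows, so each worker only ever reads its own shard of the weights, and sum the small partial output vectors at the end (the "all-reduce"). Names, shapes, and the two plain threads standing in for NUMA nodes are all invented:

```cpp
// Megatron-LM-style split of a 2-layer FFN across two workers (stand-ins for NUMA nodes).
// Only the d_model-sized partial outputs would cross the interconnect. Build with -pthread.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const size_t d_model = 1024, d_ff = 4096, n_parts = 2;
    const size_t d_ff_part = d_ff / n_parts;

    // each "node" owns one shard of the weights (filled with constants here)
    std::vector<std::vector<float>> W_up  (n_parts, std::vector<float>(d_ff_part * d_model, 0.001f));
    std::vector<std::vector<float>> W_down(n_parts, std::vector<float>(d_model * d_ff_part, 0.001f));

    std::vector<float> x(d_model, 1.0f);
    std::vector<std::vector<float>> partial(n_parts, std::vector<float>(d_model, 0.0f));

    std::vector<std::thread> workers;
    for (size_t p = 0; p < n_parts; ++p) {
        workers.emplace_back([&, p] {
            // up-projection: this shard computes only its d_ff_part hidden activations
            std::vector<float> h(d_ff_part, 0.0f);
            for (size_t r = 0; r < d_ff_part; ++r)
                for (size_t c = 0; c < d_model; ++c)
                    h[r] += W_up[p][r * d_model + c] * x[c];

            // down-projection: this shard's hidden rows produce a partial output vector
            for (size_t r = 0; r < d_model; ++r)
                for (size_t c = 0; c < d_ff_part; ++c)
                    partial[p][r] += W_down[p][r * d_ff_part + c] * h[c];
        });
    }
    for (auto & w : workers) w.join();

    // "all-reduce": sum the partial outputs from both shards
    std::vector<float> y(d_model, 0.0f);
    for (size_t p = 0; p < n_parts; ++p)
        for (size_t r = 0; r < d_model; ++r)
            y[r] += partial[p][r];

    printf("y[0] = %f\n", y[0]);
}
```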