vishalc-ibm

Page cache pages are retained in the memory of the node where
llama-bench last ran on multi-node systems, incurring a cross-NUMA
memory access penalty for subsequent runs of llama-bench bound to a
different node. This commit introduces a best-effort mbind() call to
move those pages to the target node where llama-bench is executing,
ensuring optimal NUMA locality. Additionally, the necessary NUMA
headers are included and the build is updated to link against the
NUMA library.

Experiments:
1. Run llama-bench on node 1  (base)
2. Run llama-bench on node 0  (regression observed)
3. Run patched llama-bench on node 0 (throughput same as base)

`+ /usr/bin/time -p numactl -N 1 -m 1 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24`
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.39 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.49 ± 0.03 |

build: 35782ae (5014)
real 687.60
user 15653.73
sys 42.67

`+ /usr/bin/time -p numactl -N 0 -m 0 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24`
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          4.60 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          4.67 ± 0.03 |

build: 35782ae (5014)
real 805.99
user 18187.26
sys 48.93

`+ /usr/bin/time -p numactl -N 0 -m 0 $patched-llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24`
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.35 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.46 ± 0.02 |

build: 35782ae (5014)
real 696.12
user 15735.41
sys 44.08

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
vishalc-ibm force-pushed the move-page-cache-to-local branch from dc04685 to 97c0138 on June 9, 2025 at 08:29.
@vishalc-ibm (Author)

Hi @ggerganov, can you approve the workflows?

@vishalc-ibm (Author)

The following test is failing:

`28:   CPY(type_src=f32,type_dst=q5_1,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): [CPY] NMSE = 0.000001863 > 0.000001000 FAIL`

This test checks a CPY operation involving conversion from f32 to q5_1 with a specific permutation. The Normalized Mean Squared Error (NMSE) exceeds the allowed threshold of 0.000001, resulting in a failure.

The changes introduced in this PR relate to moving stale page cache pages to the NUMA node where the benchmark most recently ran.

At this point, it's unclear how or why this change would affect the precision of this operation. If anyone has insights into how memory locality or page migration might influence quantization accuracy or the copy path, please share your thoughts.

Labels: build (Compilation issues)
