
Move page cache via mbind to prevent cross-NUMA access #13731


Open
wants to merge 1 commit into master

Conversation

vishalc-ibm

Page cache pages are retained in the memory of a node after running llama-bench bound to that node on multi-node systems, incurring a cross-NUMA memory access penalty for subsequent runs of llama-bench bound to a different node. This commit introduces an mbind call, made on a best-effort basis, to move the pages to the target node where llama-bench is executed, ensuring optimal NUMA locality. Additionally, the necessary NUMA headers are included and the build is updated to link against the NUMA library.
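
For context, a minimal sketch of the idea (the function name and error handling here are illustrative assumptions, not the exact code in this commit): after the model file is mmap'd, `mbind` with `MPOL_BIND` and `MPOL_MF_MOVE` asks the kernel to migrate already-resident page cache pages to the local node.

```c
// Minimal sketch: best-effort migration of an mmap'd region (e.g. the
// model file) to the NUMA node of the CPU we are currently running on.
#define _GNU_SOURCE
#include <numa.h>    // numa_available, numa_node_of_cpu, numa_allocate_nodemask
#include <numaif.h>  // mbind, MPOL_BIND, MPOL_MF_MOVE
#include <sched.h>   // sched_getcpu
#include <stdio.h>

static void try_move_pages_to_local_node(void *addr, size_t len) {
    if (numa_available() < 0) {
        return;  // kernel without NUMA support: nothing to do
    }
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0) {
        return;
    }
    // Nodemask containing only the node we are running on.
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, node);
    // MPOL_MF_MOVE asks the kernel to migrate pages already resident
    // (e.g. page cache left over from a run on another node). Pages
    // shared with other processes are skipped unless MPOL_MF_MOVE_ALL
    // (which needs CAP_SYS_NICE) is used, hence "best effort".
    if (mbind(addr, len, MPOL_BIND, mask->maskp, mask->size + 1, MPOL_MF_MOVE) != 0) {
        perror("warning: mbind failed");  // non-fatal: accesses may stay remote
    }
    numa_free_nodemask(mask);
}
```

Code like this must be linked with `-lnuma`, which is what the build change mentioned above refers to.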

Experiments:

  1. Run llama-bench on node 1 (base)
  2. Run llama-bench on node 0 (regression observed)
  3. Run patched llama-bench on node 0 (throughput same as base)

`+ /usr/bin/time -p numactl -N 1 -m 1 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24`

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.39 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.49 ± 0.03 |

build: 35782ae (5014)
real 687.60
user 15653.73
sys 42.67

`+ /usr/bin/time -p numactl -N 0 -m 0 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24`

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          4.60 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          4.67 ± 0.03 |

build: 35782ae (5014)
real 805.99
user 18187.26
sys 48.93

`+ /usr/bin/time -p numactl -N 0 -m 0 $patched-llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24`

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.35 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.46 ± 0.02 |

build: 35782ae (5014)
real 696.12
user 15735.41
sys 44.08

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>

Labels: build (Compilation issues)