Description
When running the latest git pull of llama.cpp on a dual-socket EPYC Genoa system, set_numa_thread_affinity() sets pthread affinity in a linear fashion (i = 0; i < node->n_cpus; ++i).
However, the NUMA nodes on this system have interleaved CPU IDs:
```
found 2 numa nodes, 128 CPUs
CPUs on node 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
CPUs on node 1: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
```
As a result, many threads are pinned to CPUs on the remote node and do not access local memory, making generation very slow.
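For illustration, here is a minimal sketch of the linear pinning pattern described above (this is not the actual ggml/llama.cpp source, just a simplified reconstruction): the loop index i within the node is used directly as the OS CPU ID, which on this topology means indices 32..63 on node 0 actually land on node 1's CPUs.

```c
// Illustrative sketch only -- not the real set_numa_thread_affinity().
// Pinning a worker to CPU ID == its linear index within the node assumes
// each node owns a contiguous CPU range starting at 0, which is not true here.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void pin_thread_linear(int i) {   // i in [0, node->n_cpus)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(i, &set);                    // wrong on this box: CPUs 32..63 belong to node 1
    int rv = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rv != 0) {
        fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n", strerror(rv));
    }
}
```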
I confirmed this by taking the CPUs of one NUMA node offline**, after which llama.cpp continuously warns:
```
warning: pthread_setaffinity_np() failed: Invalid argument
```
I think the g_state.numa structure needs to be modified to encode the per-node CPU lists from /sys/devices/system/node/ and use them to build a CPU mask when calling pthread_setaffinity_np().
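A minimal sketch of what I mean, assuming we read the kernel's node<N>/cpulist from sysfs; the function names here (parse_cpulist, numa_node_cpu_mask, set_numa_thread_affinity_by_mask) are made up for illustration, not existing ggml symbols:

```c
// Sketch: build a per-node CPU mask from /sys/devices/system/node/nodeN/cpulist
// and apply it with pthread_setaffinity_np().
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Parse a cpulist string such as "0-31,64-95" into a cpu_set_t.
static void parse_cpulist(const char * s, cpu_set_t * set) {
    CPU_ZERO(set);
    while (*s && *s != '\n') {
        char * end;
        long first = strtol(s, &end, 10);
        long last  = first;
        if (*end == '-') {
            last = strtol(end + 1, &end, 10);
        }
        for (long c = first; c <= last; ++c) {
            CPU_SET((int) c, set);
        }
        s = (*end == ',') ? end + 1 : end;
    }
}

// Read /sys/devices/system/node/node<N>/cpulist and fill the mask.
static int numa_node_cpu_mask(unsigned node, cpu_set_t * set) {
    char path[128];
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%u/cpulist", node);
    FILE * f = fopen(path, "r");
    if (!f) return -1;
    char buf[4096];
    if (!fgets(buf, sizeof(buf), f)) { fclose(f); return -1; }
    fclose(f);
    parse_cpulist(buf, set);
    return 0;
}

// Pin the calling thread to the CPUs that actually belong to `node`.
static void set_numa_thread_affinity_by_mask(unsigned node) {
    cpu_set_t set;
    if (numa_node_cpu_mask(node, &set) != 0) return;
    int rv = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rv != 0) {
        fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n", strerror(rv));
    }
}
```

Using the kernel's own cpulist keeps the mask correct regardless of how the firmware interleaves CPU IDs across nodes.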
** `echo 0 > /sys/devices/system/cpu/cpu$1/online`