Description
Platform: macOS / Apple M2 Pro
Currently the thread finalisation / synchronisation logic in ggml_graph_compute_thread relies on sched_yield() to spin idly while waiting for other threads to complete:
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L16045
// wait for other threads to finish
const int last = node_n;
do {
    sched_yield();
    node_n = atomic_load(&state->shared->node_n);
} while (node_n == last);
The problem is that this causes an enormous amount of overhead: every yield is a context switch into the kernel, with no guarantee of when the thread will be scheduled back for execution.
When profiling on an M2 Pro:
./build-llvm16-native/bin/main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 512 -p "Explain a CPU microarchitecture basics." -s 3215465
main: build = 775 (d7d2e6a)
main: seed = 3215465
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 5439.94 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0
(....)
llama_print_timings: load time = 164.84 ms
llama_print_timings: sample time = 281.18 ms / 414 runs ( 0.68 ms per token, 1472.36 tokens per second)
llama_print_timings: prompt eval time = 397.91 ms / 11 tokens ( 36.17 ms per token, 27.64 tokens per second)
llama_print_timings: eval time = 21528.10 ms / 413 runs ( 52.13 ms per token, 19.18 tokens per second)
llama_print_timings: total time = 22244.37 ms
We see that, in terms of time spent, a huge amount of time is being wasted in swtch_pri, the system routine behind thread yielding, which then switches into kernel mode.
56.69 s 100.0% 0 s main (84121)
36.71 s 64.7% 36.71 s ggml_vec_dot_q4_0_q8_0
15.95 s 28.1% 15.95 s swtch_pri
1.82 s 3.2% 1.82 s ggml_compute_forward_mul_mat
980.30 ms 1.7% 980.30 ms ggml_vec_dot_q6_K_q8_K
445.90 ms 0.7% 445.90 ms ggml_compute_forward_dup
119.30 ms 0.2% 119.30 ms ggml_graph_compute_thread
Simply commenting out sched_yield():
// wait for other threads to finish
const int last = node_n;
do {
    //sched_yield();
    node_n = atomic_load(&state->shared->node_n);
} while (node_n == last);
Results in a massive performance boost:
llama_print_timings: load time = 129.83 ms
llama_print_timings: sample time = 291.24 ms / 414 runs ( 0.70 ms per token, 1421.53 tokens per second)
llama_print_timings: prompt eval time = 334.44 ms / 11 tokens ( 30.40 ms per token, 32.89 tokens per second)
llama_print_timings: eval time = 14967.86 ms / 413 runs ( 36.24 ms per token, 27.59 tokens per second)
llama_print_timings: total time = 15630.33 ms
The new profile looks like this:
41.25 s 100.0% 0 s main (84137)
33.05 s 80.1% 33.05 s ggml_vec_dot_q4_0_q8_0
4.86 s 11.7% 4.86 s ggml_graph_compute_thread
1.34 s 3.2% 1.34 s ggml_compute_forward_mul_mat
930.00 ms 2.2% 930.00 ms ggml_vec_dot_q6_K_q8_K
490.70 ms 1.1% 490.70 ms ggml_compute_forward_dup
83.40 ms 0.2% 83.40 ms llama_sample_repetition_penalty
In the above interactive Terminal command-line example we see a 43% performance boost. The percentage will differ depending on where you invoke this from due to macOS thread QoS (over ssh it is around +25%).
This seems like a gigantic performance issue on operating systems that do actually yield back to the scheduler here.
My solution of just commenting out the yield and busy-looping probably isn't the best for power efficiency; a tight atomic spin-wait across the threads (with a CPU relax hint) would likely be better. It's up to you what to do here, but anything is better than sched_yield().
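For illustration, here is a minimal sketch of what such a user-space spin-wait with a relax hint could look like (assuming C11 atomics and Clang/GCC inline assembly; cpu_relax and wait_for_node_change are hypothetical helper names, not existing ggml functions):

#include <stdatomic.h>

// Hypothetical sketch, not the actual ggml code: spin in user space and give
// the core a pipeline/power hint instead of trapping into the kernel.
static inline void cpu_relax(void) {
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");   // "yield" is another valid hint on ARM
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#endif
}

// wait for other threads to finish (replacement for the sched_yield() loop)
static void wait_for_node_change(atomic_int * shared_node_n, int last) {
    while (atomic_load_explicit(shared_node_n, memory_order_acquire) == last) {
        cpu_relax();
    }
}

This stays entirely in user space, so there is no context switch, while still avoiding hammering the core as hard as a bare load loop.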
Another issue is thread creation: since the threads get created and joined for each layer/operation, the cumulative overhead is extremely large:
1.02 min 100.0% 0 s main (82676)
36.31 s 59.4% 36.31 s ggml_vec_dot_q4_0_q8_0
16.74 s 27.4% 16.74 s swtch_pri
4.51 s 7.3% 4.51 s thread_start
1.44 s 2.3% 1.44 s ggml_compute_forward_mul_mat
805.01 ms 1.3% 805.01 ms ggml_vec_dot_q6_K_q8_K
390.01 ms 0.6% 390.01 ms ggml_compute_forward_dup
130.01 ms 0.2% 130.01 ms ggml_graph_compute_thread
It's likely we could see another 7% performance boost here by switching the work dispatch architecture to a thread pool.
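As a rough illustration of the pattern (not ggml's actual dispatch code; worker_pool, pool_run_once, etc. are hypothetical names), a persistent pool could look something like this:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_WORKERS 8

// Hypothetical persistent worker pool: threads are created once and woken per
// graph evaluation instead of being spawned and joined every time.
typedef struct {
    pthread_t    threads[N_WORKERS];
    atomic_int   generation;              // bumped to signal "new work available"
    atomic_int   n_done;                  // workers finished with the current generation
    atomic_bool  stop;
    void       (*work_fn)(int worker_id); // the per-thread graph work
} worker_pool;

static worker_pool g_pool;

static void * worker_main(void * arg) {
    const int id = (int)(intptr_t) arg;
    int gen = 0;
    while (!atomic_load(&g_pool.stop)) {
        const int new_gen = atomic_load_explicit(&g_pool.generation, memory_order_acquire);
        if (new_gen == gen) {
            continue; // in a real implementation: cpu_relax() here
        }
        gen = new_gen;
        g_pool.work_fn(id);
        atomic_fetch_add_explicit(&g_pool.n_done, 1, memory_order_release);
    }
    return NULL;
}

static void pool_start(void (*fn)(int)) {
    g_pool.work_fn = fn;
    atomic_init(&g_pool.generation, 0);
    atomic_init(&g_pool.n_done, 0);
    atomic_init(&g_pool.stop, false);
    for (int i = 0; i < N_WORKERS; ++i) {
        pthread_create(&g_pool.threads[i], NULL, worker_main, (void *)(intptr_t) i);
    }
}

static void pool_run_once(void) {
    atomic_store(&g_pool.n_done, 0);
    atomic_fetch_add_explicit(&g_pool.generation, 1, memory_order_release);
    while (atomic_load_explicit(&g_pool.n_done, memory_order_acquire) < N_WORKERS) {
        // the calling thread could also take a share of the work here
    }
}

static void pool_stop(void) {
    atomic_store(&g_pool.stop, true);
    for (int i = 0; i < N_WORKERS; ++i) {
        pthread_join(g_pool.threads[i], NULL);
    }
}

The key point is that pthread_create / pthread_join are paid once at startup and shutdown rather than on every graph evaluation; the spin-wait hint from the sketch above would also apply to the waiting loops here.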
Thank you for the work on the otherwise great implementation!