ggml : spread compute across threads in chunks #1507
base: master
Conversation
I am guessing that this would be useful for Intel CPUs with E-cores, and maybe also for AMD X3D CPUs in which not all cores have the same amount of cache.
I don't see a significant improvement using this. Maybe a tiny bit faster for 12-16 threads, but hard to say.
On modern Intel CPUs it should be beneficial to assign shorter chunks to the smaller cores and bigger chunks to the performance cores, paired with affinity management. I looked into that for a while but got distracted (by GPU optimizations) before I was finished; my first findings were quite interesting. I can't say with 100% certainty as I didn't finish debugging, but it looked like as soon as the atom cores are used, they lag behind with their results, causing the performance cores to stay idle. I think there is a lot of optimization potential there, if you look at how little is gained when running with 32 threads (on a 32-thread CPU) compared to just 4 or 5 threads. Most of the performance sits idle.
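As a rough illustration of that direction, here is a minimal sketch of pinning a worker thread to a specific core and weighting its chunk size, assuming Linux and a known core topology; the helper names and the 2:1 weight are made up for illustration, not anything ggml does:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// pin a worker thread to one logical CPU so its chunk-size weight
// actually matches the core it runs on
static void pin_thread(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(cpu_set_t), &set);
}

// give performance cores larger chunks than efficiency cores so that
// both kinds of core finish their work at roughly the same time
// (the 2:1 ratio is a placeholder; real code would measure it)
static int chunk_rows(int base_rows, int is_p_core) {
    return is_p_core ? 2*base_rows : base_rows;
}
```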
@ggerganov I've just spent 2 hours testing this commit on a 13900K, which has 8 performance and 16 atom cores, on Windows 11. The most important findings were performance improvements and stability improvements. This is a great improvement that got much too little attention.
I guess we can extend this approach. The proposed method in this PR might be less efficient with NUMA or in some other cases that we haven't thought about, so there should be a way to fall back to the original method.
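For reference, the "original method" is the contiguous one-slice-per-thread split. A hypothetical sketch of keeping it available behind a flag (the helper name and flag are illustrative; this switch is not part of the PR):

```c
// original method: split the rows into nth contiguous slices,
// thread ith always takes slice ith -- used when chunking is disabled
static void get_row_range(int ith, int nth, int nrows, int *ir0, int *ir1) {
    const int dr = (nrows + nth - 1)/nth;       // rows per thread, rounded up
    *ir0 = dr*ith;
    *ir1 = *ir0 + dr < nrows ? *ir0 + dr : nrows;
}
```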
ggml.c has changed a lot since this pull request was made; would it be possible to rework this chunked computation and add it to the current ggml.c, to help people who have efficiency cores alongside performance cores?
This is an alternative way of distributing the work across workers. Not sure yet if it is more efficient.
The idea is for each thread to process a small chunk of data (e.g. ~ `nrows / (8*nth)` rows) and then pick another chunk. We make sure each thread processes at least 8 chunks of data.
Hoping to have a more even distribution of work when using more threads, and to work around threads having to wait on a single thread if it gets delayed.
Currently used in `ggml_compute_forward_mul_mat_q_f32()` and `ggml_compute_forward_rms_norm_f32()`, but can be easily applied to all multi-threaded ops.
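A minimal sketch of the scheme, assuming a shared atomic counter per op; the names below are illustrative, not the actual ggml internals:

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_row;  // shared by all workers for this op
    int        nrows;     // total rows to process
    int        nth;       // number of worker threads
} chunk_state;

static void compute_rows(int ir0, int ir1) {
    // process rows [ir0, ir1) ...
}

static void worker(chunk_state *st) {
    // chunk of ~ nrows/(8*nth) rows -> ~8*nth chunks in total,
    // so on average each thread picks up at least 8 of them
    const int chunk = (st->nrows + 8*st->nth - 1)/(8*st->nth);

    for (;;) {
        // atomically claim the next unprocessed chunk
        const int ir0 = atomic_fetch_add(&st->next_row, chunk);
        if (ir0 >= st->nrows) {
            break;  // all rows claimed
        }
        const int ir1 = ir0 + chunk < st->nrows ? ir0 + chunk : st->nrows;
        compute_rows(ir0, ir1);
    }
}
```

A thread that gets delayed simply claims fewer chunks while the others keep pulling work from the counter, which is what gives the more even distribution described above.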