
ggml : spread compute across threads in chunks #1507

Open · wants to merge 2 commits into master
Conversation

ggerganov (Owner)

This is an alternative way of distributing the work across workers. Not sure yet if it is more efficient.

The idea is for each thread to process a small chunk of data (e.g. ~ nrows / (8*nth)) and then pick another chunk. We make sure each thread processes at least 8 chunks of data.

The hope is to get a more even distribution of work when using more threads, and to work around all threads having to wait for a single thread if it gets delayed.

Currently this is used in ggml_compute_forward_mul_mat_q_f32() and ggml_compute_forward_rms_norm_f32(), but it can easily be applied to all multi-threaded ops.
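For illustration, here is a minimal sketch of one way such chunked distribution could work, using a shared atomic counter so threads claim chunks dynamically. This is not the exact code in this PR (the PR could just as well interleave chunks statically per thread), and process_rows stands in for the per-op kernel:

```c
// Illustrative sketch only, not the code from this PR.
// Rows are split into chunks of roughly nrows/(8*nth) rows and each worker
// repeatedly claims the next chunk from a shared atomic counter, so a delayed
// thread only holds back its own current chunk.
#include <stdatomic.h>

typedef struct {
    atomic_int next_chunk; // index of the next chunk to hand out
} shared_state;

// placeholder for the actual per-op kernel (e.g. the mul_mat row loop)
static void process_rows(int ir0, int ir1) { (void) ir0; (void) ir1; }

static void worker(shared_state * shared, int nrows, int nth) {
    int chunk_size = nrows/(8*nth); // aim for at least 8 chunks per thread
    if (chunk_size < 1) chunk_size = 1;
    const int nchunks = (nrows + chunk_size - 1)/chunk_size;

    while (1) {
        const int chunk = atomic_fetch_add(&shared->next_chunk, 1);
        if (chunk >= nchunks) {
            break;
        }
        const int ir0 = chunk*chunk_size;
        const int ir1 = ir0 + chunk_size < nrows ? ir0 + chunk_size : nrows;
        process_rows(ir0, ir1);
    }
}
```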

slaren (Collaborator) commented May 17, 2023

I am guessing that this would be useful for Intel CPUs with E cores, and maybe also for AMD X3D CPUs in which not all cores have the same amount of cache.

ggerganov added the threading (Parallel processing and thread management) label on May 17, 2023
ggerganov added the demo (Demonstrate some concept or idea, not intended to be merged) label on May 20, 2023
ggerganov (Owner, Author)

I don't see a significant improvement using this. Maybe a tiny bit faster for 12-16 threads, but it is hard to say.

cmp-nct (Contributor) commented May 21, 2023

On modern Intel CPUs it should be beneficial to assign shorter chunks to the smaller cores and bigger chunks to the performance cores, paired with affinity management.

I looked into that for a while but got distracted (by GPU optimizations) before I finished; my first findings were quite interesting.
Performance gains were possible, especially when controlling the affinity of each thread.

I can't say for 100% certain, as I didn't finish debugging, but it looked to me like as soon as the Atom (E) cores are used they lag behind with their results, causing the performance cores to sit idle.
Currently I hardcode 8 P-cores, use no additional threads, and do not assign any work to the Atom cores.

I think there is a lot of optimization potential there if you look at how small the gains are when running with 32 threads (on a 32-thread CPU) compared to just 4 or 5 threads. Most of the performance sits idle.
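As an aside, a minimal sketch of what pinning a worker thread to one of the P-cores could look like on Windows (not cmp-nct's actual code; the logical-processor layout assumed here is the typical Raptor Lake enumeration and should be verified on the target machine):

```c
// Hypothetical sketch of pinning the calling thread to one P-core on Windows.
// Assumes the common Raptor Lake layout where the 8 P-cores expose logical
// processors 0..15 (two SMT siblings per core) and the E-cores follow; check
// the actual topology with GetLogicalProcessorInformationEx before relying on it.
#include <windows.h>

static void pin_to_pcore(int pcore) { // pcore in [0, 8)
    const DWORD_PTR mask = (DWORD_PTR)1 << (2*pcore); // first SMT sibling of that P-core
    SetThreadAffinityMask(GetCurrentThread(), mask);
}
```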

cmp-nct (Contributor) commented Jun 20, 2023

@ggerganov I've just spent 2 hours testing this commit on a 13900k, which has 8 performance and 16 Atom (E) cores, on Windows 11.
This commit brings a 9% speed improvement in most cases, 3% at worst.
I ran the tests with Falcon on CPU only, but the results should carry over to llama very closely.

Most important findings:

  1. Some models consistently run inference 9% faster, some 3% faster. Not a single case was worse than that.
  2. This commit makes the -t thread count much less sensitive on 12th- and 13th-gen CPUs.

Performance improvements:
I ran around 50 tests in total; the results were consistently reproducible.
The large 40B 5_k went from 501ms/token inference to 460ms/token, and the 7B went from 75ms/token to 63ms/token.
I repeated all tests several times, alternating between both binaries.

Stability improvements:
Without this commit I see a slowdown of up to 50% when using a high thread count (16, 31, 24). With this commit the slowdown is at most 10%, and it is much more consistent.
Best speed is with 7 threads on normal ggml, and 7 or 8 threads with this commit.

This is a great improvement that got much too little attention.

ggerganov (Owner, Author)

I guess we can extend ggml to be able to choose the work chunk distribution method, either at compile time or via a context parameter. We can factor the range selection out of the ggml forward implementations to make the implementation more concise and extensible in the future.

The proposed method in this PR might be less efficient with NUMA or in some other cases that we haven't thought about, so there should be a way to fall back to the original method.
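As a rough illustration of what that could look like (all names here are invented, not an existing ggml API), the method could be a compile-time default that a context parameter overrides:

```c
// Hypothetical sketch only; none of these names exist in ggml.
enum ggml_work_dist {
    GGML_WORK_DIST_SLICES, // original: one contiguous slice per thread
    GGML_WORK_DIST_CHUNKS, // this PR: many small chunks per thread
};

// compile-time default, overridable per context
#ifndef GGML_DEFAULT_WORK_DIST
#define GGML_DEFAULT_WORK_DIST GGML_WORK_DIST_CHUNKS
#endif

struct ggml_context_params_ext {
    // ... existing context parameters ...
    enum ggml_work_dist work_dist; // fall back to GGML_WORK_DIST_SLICES e.g. on NUMA systems
};
```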

YellowRoseCx (Contributor)

> This is an alternative way of distributing the work across workers. Not sure yet if it is more efficient.
>
> The idea is for each thread to process a small chunk of data (e.g. ~ nrows / (8*nth)) and then pick another chunk. We make sure each thread processes at least 8 chunks of data.
>
> The hope is to get a more even distribution of work when using more threads, and to work around all threads having to wait for a single thread if it gets delayed.
>
> Currently this is used in ggml_compute_forward_mul_mat_q_f32() and ggml_compute_forward_rms_norm_f32(), but it can easily be applied to all multi-threaded ops.

ggml.c has changed a lot since this pull request was made; would it be possible to rework this chunked computation and add it to the current ggml.c to help people who have efficiency cores alongside performance cores?
