ggml: parallelize dequantization or fp format conversion when using blas #5045
Force-pushed from 8643989 to 2c0ed7a.
This is good as it is, but in the future I think we could remove the init and finalize tasks, and instead have a barrier synchronization function that waits until all the threads working on an op reach the same point. Then the ops could have as many phases as they need, and it might simplify the implementation a bit.
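To make the barrier idea concrete, here is a minimal sketch of what such a function could look like. This is not the ggml API: `op_barrier`, `op_barrier_wait`, `worker`, and `run_two_phase_op` are hypothetical names, and the "dequantization" is just a scale multiply on `int8_t` values. It assumes C11 atomics and POSIX threads. Phase 1 converts the input in parallel, every thread then waits at the barrier, and phase 2 reads values that other threads wrote.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define N_ELEMS     1024
#define MAX_THREADS 16

// Hypothetical barrier (not part of ggml): a counter plus a generation number.
typedef struct {
    atomic_int n_arrived;   // threads that have reached the current barrier
    atomic_int generation;  // bumped each time the barrier opens
    int        n_threads;
} op_barrier;

static void op_barrier_wait(op_barrier * b) {
    const int gen = atomic_load(&b->generation);
    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        // last thread to arrive: reset the counter, then release the others
        atomic_store(&b->n_arrived, 0);
        atomic_fetch_add(&b->generation, 1);
    } else {
        // assumption: yield while waiting; a pure spin is the other option
        while (atomic_load(&b->generation) == gen) {
            sched_yield();
        }
    }
}

typedef struct {
    int            ith, nth;  // thread index and thread count
    op_barrier   * barrier;
    const int8_t * q;         // stand-in for quantized input
    float          scale;
    float        * f;         // shared buffer filled in phase 1
    float          partial;   // per-thread result of phase 2
} worker_args;

static void * worker(void * p) {
    worker_args * a = p;

    // phase 1: each thread converts an interleaved slice of the input
    for (int i = a->ith; i < N_ELEMS; i += a->nth) {
        a->f[i] = a->scale * (float) a->q[i];
    }

    // wait until every thread has finished converting
    op_barrier_wait(a->barrier);

    // phase 2: each thread sums a contiguous chunk, which includes
    // elements converted by other threads
    const int per = (N_ELEMS + a->nth - 1) / a->nth;
    const int beg = a->ith * per;
    const int end = beg + per > N_ELEMS ? N_ELEMS : beg + per;
    float sum = 0.0f;
    for (int i = beg; i < end; i++) {
        sum += a->f[i];
    }
    a->partial = sum;
    return NULL;
}

// Runs both phases on n_threads threads and returns the total sum.
float run_two_phase_op(int n_threads) {
    assert(n_threads >= 1 && n_threads <= MAX_THREADS);

    int8_t q[N_ELEMS];
    float  f[N_ELEMS];
    for (int i = 0; i < N_ELEMS; i++) {
        q[i] = (int8_t) (i % 7);
    }

    op_barrier barrier;
    atomic_init(&barrier.n_arrived, 0);
    atomic_init(&barrier.generation, 0);
    barrier.n_threads = n_threads;

    pthread_t   threads[MAX_THREADS];
    worker_args args[MAX_THREADS];
    for (int t = 0; t < n_threads; t++) {
        args[t] = (worker_args) { t, n_threads, &barrier, q, 0.5f, f, 0.0f };
        pthread_create(&threads[t], NULL, worker, &args[t]);
    }

    float total = 0.0f;
    for (int t = 0; t < n_threads; t++) {
        pthread_join(threads[t], NULL);
        total += args[t].partial;
    }
    return total;
}
```

With a barrier like this an op can call `op_barrier_wait` between as many phases as it needs, instead of being split into separate init, compute, and finalize tasks. Whether waiting threads should yield or spin is a separate question, and the answer may differ by platform and BLAS backend.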
* update outdated comment * fix coding style
I'm not convinced about the sched_yield changes - on macOS I do see performance improvements with CPU + Accelerate, but on my Linux machine with OpenBLAS, whisper.cpp is slower with this change.
I guess we will merge it and look into a more general solution of removing the tasks and synchronizing on barriers
On Linux with OpenBLAS running with all available threads by default, a spin loop that is not managed by the BLAS library can compete with it for cores, which can significantly degrade performance. The number of threads used by OpenBLAS can be set via utility functions in the OpenBLAS extension, but an equivalent function is not always available in other BLAS providers. I guess the problem can be mitigated with an environment variable, but it still needs something better here.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make GGML_TASK_INIT phase can be run in multithread
* multithreaded dequantize in mul_mat when using blas library
* minor fixes
* update outdated comment
* fix coding style
* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This PR seems to have caused a large performance regression for at least one user: #6417
resolve #4988