Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggml: parallelize dequantization or fp format conversion when using blas #5045

Merged
merged 4 commits into from
Jan 22, 2024

Conversation

ReinForce-II
Copy link
Contributor

resolve #4988

  • make GGML_TASK_INIT phase can be run in multithread
  • move dequantization in mul_mat to GGML_TASK_INIT phase when using BLAS lib and enable multithreading
  • update work size to fit fully dequantized weights

@ReinForce-II ReinForce-II marked this pull request as ready for review January 19, 2024 22:34
@ReinForce-II ReinForce-II force-pushed the blas-parallelize-to-float-2 branch from 8643989 to 2c0ed7a Compare January 20, 2024 14:01
@slaren
Copy link
Collaborator

slaren commented Jan 20, 2024

This is good as it is, but in the future I think we could remove the init and finalize tasks, and instead have a barrier synchronization function that waits until all the threads working on an op reach the same point. Then the ops could have as many phases as they need, and it might simplify the implementation a bit.

ggml.c Outdated Show resolved Hide resolved
ggml.c Outdated Show resolved Hide resolved
* update outdated comment
* fix coding style
Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced about the sched_yield changes - on macOS I do see performance improvements with CPU + Accelerate, but on my Linux machine with OpenBLAS, whisper.cpp is slower with this change.

I guess we will merge it and look into a more general solution of removing the tasks and synchronizing on barriers

@ReinForce-II
Copy link
Contributor Author

On Linux with OpenBLAS, when it's running with all available threads by default, a spin not managed in blas lib can compete with it, which can lead to significant downgraded performance. Num threads used by openblas can be set via utility functions in openblas extension, but an equivlant function is not always available in general blas providers. I guess the case can be relieved by environment variable, but it still need something better here.

ggml.c Outdated Show resolved Hide resolved
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ggerganov ggerganov merged commit 780e24a into ggerganov:master Jan 22, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@JohannesGaessler
Copy link
Collaborator

This PR seems to have caused a large performance regression for at least one user: #6417

@ReinForce-II ReinForce-II deleted the blas-parallelize-to-float-2 branch May 23, 2024 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallelize dequantization or fp format conversion when using blas
5 participants