Parallelize dequantization or fp format conversion when using blas #4988

Closed
ReinForce-II opened this issue Jan 16, 2024 · 5 comments · Fixed by #5045

@ReinForce-II
Contributor

In ggml_compute_forward_mul_mat in ggml.c, GEMM parallelism is managed by the BLAS library. However, this disables multithreading for dequantizing the weights, which can become a bottleneck.
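
For reference, a minimal sketch of the idea (an illustration under assumed names, not the actual patch; `to_float` stands in for a type-specific routine such as ggml's `dequantize_row_q4_K`): each thread converts a contiguous slice of src0's rows into the shared wdata buffer before the BLAS call.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch: split dequantization of an ne01 x ne00 matrix across
// nth threads; thread ith converts a contiguous slice of rows into wdata.
typedef void (*to_float_t)(const void * x, float * y, int k);

static void dequantize_rows_chunk(const char * src0, size_t row_size,
                                  float * wdata, int64_t ne00, int64_t ne01,
                                  to_float_t to_float, int ith, int nth) {
    const int64_t dr  = (ne01 + nth - 1)/nth;               // rows per thread
    const int64_t ir0 = dr*ith;                             // this thread's first row
    const int64_t ir1 = ir0 + dr < ne01 ? ir0 + dr : ne01;  // one past its last row

    for (int64_t ir = ir0; ir < ir1; ++ir) {
        to_float(src0 + ir*row_size, wdata + ir*ne00, (int) ne00);
    }
}
```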

I have made some ugly modifications to compare performance.

› MKL_NUM_THREADS=8 ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          8 | pp 512     |     55.24 ± 3.16 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          8 | tg 128     |     21.12 ± 1.89 |

build: 862f5e4 (1886)

› MKL_NUM_THREADS=8 ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          8 | pp 512     |     79.34 ± 5.23 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          8 | tg 128     |     21.26 ± 1.83 |

build: cb828c8 (modified)

ReinForce-II added the enhancement label on Jan 16, 2024
ReinForce-II changed the title from "Parallelize dequantization or fp format convertion when using blas" to "Parallelize dequantization or fp format conversion when using blas" on Jan 16, 2024
@ggerganov
Owner

Thanks for looking into this!

The global atomic flag is indeed ugly, so we need to figure out something better. The limitation that GGML_TASK_INIT is called only for one of the threads is making this more difficult than it should be, so I wonder if we should try to improve this part first - i.e. be able to have multi-threaded initialization for ops.

I think wsize also has to be updated because wdata will now store all dequantized matrices, while on master it stores only one at a time.
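
For illustration, the accounting could change roughly as follows (a sketch using ggml's neXY dimension naming; not the actual diff):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the work-buffer size for the float-converted src0.
// master: wdata holds one dequantized ne01 x ne00 matrix at a time.
static size_t wsize_master(int64_t ne00, int64_t ne01) {
    return (size_t)(ne00*ne01)*sizeof(float);
}

// multi-threaded init: wdata must hold all ne02*ne03 matrices at once.
static size_t wsize_mt_init(int64_t ne00, int64_t ne01,
                            int64_t ne02, int64_t ne03) {
    return (size_t)(ne00*ne01*ne02*ne03)*sizeof(float);
}
```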

@ReinForce-II
Contributor Author

ReinForce-II commented Jan 18, 2024

The corresponding enhancements with multi-threaded initialization are implemented in this branch.
I wonder if these changes are acceptable.

*Master*

› MKL_NUM_THREADS=12 taskset -c 0-15 ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 1,2,4,8,16 -n 0

| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          1 | pp 512     |     62.39 ± 4.33 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          2 | pp 512     |     76.81 ± 6.11 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          4 | pp 512     |     86.29 ± 7.66 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          8 | pp 512     |     86.08 ± 7.75 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |         16 | pp 512     |     86.12 ± 7.74 |

build: 57e2a7a (1917)

*Modified*

› MKL_NUM_THREADS=12 taskset -c 0-15 ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 1,2,4,8,16 -n 0

| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          1 | pp 512     |     61.13 ± 5.42 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          2 | pp 512     |     91.53 ± 0.38 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          4 | pp 512     |    117.08 ± 1.12 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |          8 | pp 512     |   111.21 ± 12.40 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | BLAS       |         16 | pp 512     |   110.89 ± 12.32 |

build: 8643989 (1919)

Besides, there may be something to improve here, since a higher number of threads is useful now.

@ReinForce-II
Contributor Author

> Besides, there may be something to improve here, since a higher number of threads is useful now.

The number of threads has to be configured carefully, such that <blas_threads> + <llama.cpp_threads> <= physical threads, to avoid a performance penalty. Is there a better approach here?
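
For example, on the 16-thread machine above, one split that satisfies the constraint (an example configuration, not a tested recommendation) would be 12 BLAS threads plus 4 ggml threads:

› MKL_NUM_THREADS=12 taskset -c 0-15 ./llama-bench -m Llama-2-7b-chat-q4km.gguf -t 4 -n 0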

@ggerganov
Owner

@ReinForce-II I think you are on the right track. Would you like to open a Pull Request so we can review the changes in detail and suggest improvements if necessary?

@slaren I wonder if we can now utilize ggml-backend somehow to create a "BLAS" backend that "offloads" just the matrix multiplications in their own splits. This would resolve the thread-concurrency problem between ggml threads and BLAS threads. Have you thought about this kind of approach? (this is low-prio)
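
For a rough picture (a hypothetical sketch; the function name and signature are assumptions, not the actual ggml-backend interface): such a backend would claim only sufficiently large matrix multiplications, so the scheduler would carve those nodes into BLAS splits and keep everything else on the CPU backend.

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical op filter for a "BLAS" backend: claim only batched matrix
// multiplications, so the scheduler offloads those nodes in their own
// splits and leaves all other ops on the CPU backend.
static bool blas_backend_supports_op(const struct ggml_tensor * op) {
    return op->op == GGML_OP_MUL_MAT
        && op->src[1]->ne[1] >= 32;  // heuristic: skip single-token mat-vecs
}
```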

@slaren
Collaborator

slaren commented Jan 19, 2024

Yes, this is something that I would like to do. It should only require minor changes to ggml_backend_sched to support this use case.
