move BLAS to a separate backend #6210
Conversation
Will just need to adapt
@mofosyne I appreciate that you are trying to help, but please don't do that on my PRs. I very often have not pushed local changes and I prefer to deal with the merge conflicts myself.
Force-pushed from 2b5c73d to ca91205
@ggerganov I am thinking about how Accelerate should interact with the BLAS backend. I think this would make sense:
Conversely:
Currently
Yes, that makes sense. With only
On M2 Ultra there is a similar effect with `LLAMA_NO_LLAMAFILE=1 LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master sl/blas-backend -m models/tinyllama-1b/ggml-model-q4_0.gguf -m models/tinyllama-1b/ggml-model-q8_0.gguf -m models/tinyllama-1b/ggml-model-f16.gguf -m models/tinyllama-1b/ggml-model-f32.gguf -p 32,64,128,256,512 -n 0 -t 4,8,16`
I realized that there is an issue that causes the
This should be good now. I have updated the PR description with more details about the changes included here.
Note: the BLAS backend should not be used alongside GPU backends, as it will prevent offloading of large batches with partial offloading
On macOS with Metal enabled, when I build with `LLAMA_BLAS=OFF` and run with partial offloading (`-ngl 28`), the non-offloaded layers run on the CPU backend:
...
node # 32 ( ADD): l_out-0 ( 8M) [ CPU ]: ffn_out-0 ( 8M) [ CPU ] ffn_inp-0 ( 8M) [ CPU ]
node # 33 ( RMS_NORM): norm-1 ( 8M) [ CPU ]: l_out-0 ( 8M) [ CPU ]
node # 34 ( MUL): attn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [ CPU ] blk.1.attn_norm.weig ( 16K) [ CPU ]
node # 35 ( MUL_MAT): Qcur-1 ( 8M) [ CPU ]: blk.1.attn_q.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
node # 37 ( ROPE): Qcur-1 ( 8M) [ CPU ]: Qcur-1 (reshaped) ( 8M) [ CPU ] inp_pos ( 2K) [ CPU ]
node # 38 ( MUL_MAT): Kcur-1 ( 8M) [ CPU ]: blk.1.attn_k.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
node # 40 ( ROPE): Kcur-1 ( 8M) [ CPU ]: Kcur-1 (reshaped) ( 8M) [ CPU ] inp_pos ( 2K) [ CPU ]
node # 41 ( MUL_MAT): Vcur-1 ( 8M) [ CPU ]: blk.1.attn_v.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
node # 43 ( CPY): k_cache_view-1 (copy ( 4M) [ CPU ]: Kcur-1 ( 8M) [ CPU ] k_cache_view-1 ( 4M) [ CPU ]
node # 46 ( CPY): v_cache_view-1 (copy ( 4M) [ CPU ]: Vcur-1 (transposed) ( 8M) [ CPU ] v_cache_view-1 ( 4M) [ CPU ]
node # 50 ( MUL_MAT): kq-1 ( 32M) [ CPU ]: k-1 ( 4M) [ CPU ] q-1 ( 8M) [ CPU ]
node # 51 ( SOFT_MAX): kq_soft_max_ext-1 ( 32M) [ CPU ]: kq-1 ( 32M) [ CPU ] KQ_mask ( 1M) [ CPU ]
node # 52 ( MUL_MAT): kqv-1 ( 8M) [ CPU ]: v-1 ( 4M) [ CPU ] kq_soft_max_ext-1 ( 32M) [ CPU ]
node # 54 ( CONT): kqv_merged_cont-1 ( 8M) [ CPU ]: kqv_merged-1 ( 8M) [ CPU ]
node # 55 ( MUL_MAT): kqv_out-1 ( 8M) [ CPU ]: blk.1.attn_output.we ( 17M) [ CPU ] kqv_merged_cont-1 ( 8M) [ CPU ]
node # 56 ( ADD): ffn_inp-1 ( 8M) [ CPU ]: kqv_out-1 ( 8M) [ CPU ] l_out-0 ( 8M) [ CPU ]
node # 57 ( RMS_NORM): norm-1 ( 8M) [ CPU ]: ffn_inp-1 ( 8M) [ CPU ]
node # 58 ( MUL): ffn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [ CPU ] blk.1.ffn_norm.weigh ( 16K) [ CPU ]
node # 59 ( MUL_MAT): ffn_gate-1 ( 21M) [ CPU ]: blk.1.ffn_gate.weigh ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
node # 60 ( UNARY): ffn_silu-1 ( 21M) [ CPU ]: ffn_gate-1 ( 21M) [ CPU ]
node # 61 ( MUL_MAT): ffn_up-1 ( 21M) [ CPU ]: blk.1.ffn_up.weight ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
node # 62 ( MUL): ffn_gate_par-1 ( 21M) [ CPU ]: ffn_silu-1 ( 21M) [ CPU ] ffn_up-1 ( 21M) [ CPU ]
node # 63 ( MUL_MAT): ffn_out-1 ( 8M) [ CPU ]: blk.1.ffn_down.weigh ( 45M) [ CPU ] ffn_gate_par-1 ( 21M) [ CPU ]
node # 64 ( ADD): l_out-1 ( 8M) [ CPU ]: ffn_out-1 ( 8M) [ CPU ] ffn_inp-1 ( 8M) [ CPU ]
node # 65 ( RMS_NORM): norm-2 ( 8M) [ CPU ]: l_out-1 ( 8M) [ CPU ]
...
With `LLAMA_BLAS=ON` it uses the BLAS backend for the matrix multiplications:
...
## SPLIT #16: Metal # 1 inputs: [ffn_out-0 ( 8M)]
node # 32 ( ADD): l_out-0 ( 8M) [Metal ]: Metal#ffn_out-0#0 ( 8M) [ NULL ] ffn_inp-0 ( 8M) [Metal ]
node # 33 ( RMS_NORM): norm-1 ( 8M) [Metal ]: l_out-0 ( 8M) [Metal ]
## SPLIT #17: CPU # 0 inputs:
node # 34 ( MUL): attn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [Metal ] blk.1.attn_norm.weig ( 16K) [ CPU ]
## SPLIT #18: BLAS # 0 inputs:
node # 35 ( MUL_MAT): Qcur-1 ( 8M) [ BLAS ]: blk.1.attn_q.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
## SPLIT #19: Metal # 1 inputs: [Qcur-1 (reshaped) ( 8M)]
node # 37 ( ROPE): Qcur-1 ( 8M) [Metal ]: Metal#Qcur-1 (reshap ( 8M) [ NULL ] Metal#inp_pos#0 ( 2K) [ NULL ]
## SPLIT #20: BLAS # 0 inputs:
node # 38 ( MUL_MAT): Kcur-1 ( 8M) [ BLAS ]: blk.1.attn_k.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
## SPLIT #21: Metal # 1 inputs: [Kcur-1 (reshaped) ( 8M)]
node # 40 ( ROPE): Kcur-1 ( 8M) [Metal ]: Metal#Kcur-1 (reshap ( 8M) [ NULL ] Metal#inp_pos#0 ( 2K) [ NULL ]
## SPLIT #22: BLAS # 0 inputs:
node # 41 ( MUL_MAT): Vcur-1 ( 8M) [ BLAS ]: blk.1.attn_v.weight ( 17M) [ CPU ] attn_norm-1 ( 8M) [ CPU ]
## SPLIT #23: CPU # 0 inputs:
node # 43 ( CPY): k_cache_view-1 (copy ( 4M) [ CPU ]: Kcur-1 ( 8M) [Metal ] k_cache_view-1 ( 4M) [ CPU ]
node # 46 ( CPY): v_cache_view-1 (copy ( 4M) [ CPU ]: Vcur-1 (transposed) ( 8M) [ BLAS ] v_cache_view-1 ( 4M) [ CPU ]
## SPLIT #24: Metal # 2 inputs: [k-1 ( 4M)] [v-1 ( 4M)]
node # 50 ( MUL_MAT): kq-1 ( 32M) [Metal ]: Metal#k-1#0 ( 4M) [ NULL ] q-1 ( 8M) [Metal ]
node # 51 ( SOFT_MAX): kq_soft_max_ext-1 ( 32M) [Metal ]: kq-1 ( 32M) [Metal ] Metal#KQ_mask#0 ( 1M) [ NULL ]
node # 52 ( MUL_MAT): kqv-1 ( 8M) [Metal ]: Metal#v-1#0 ( 4M) [ NULL ] kq_soft_max_ext-1 ( 32M) [Metal ]
node # 54 ( CONT): kqv_merged_cont-1 ( 8M) [Metal ]: kqv_merged-1 ( 8M) [Metal ]
## SPLIT #25: BLAS # 0 inputs:
node # 55 ( MUL_MAT): kqv_out-1 ( 8M) [ BLAS ]: blk.1.attn_output.we ( 17M) [ CPU ] kqv_merged_cont-1 ( 8M) [Metal ]
## SPLIT #26: Metal # 1 inputs: [kqv_out-1 ( 8M)]
node # 56 ( ADD): ffn_inp-1 ( 8M) [Metal ]: Metal#kqv_out-1#0 ( 8M) [ NULL ] l_out-0 ( 8M) [Metal ]
node # 57 ( RMS_NORM): norm-1 ( 8M) [Metal ]: ffn_inp-1 ( 8M) [Metal ]
## SPLIT #27: CPU # 0 inputs:
node # 58 ( MUL): ffn_norm-1 ( 8M) [ CPU ]: norm-1 ( 8M) [Metal ] blk.1.ffn_norm.weigh ( 16K) [ CPU ]
## SPLIT #28: BLAS # 0 inputs:
node # 59 ( MUL_MAT): ffn_gate-1 ( 21M) [ BLAS ]: blk.1.ffn_gate.weigh ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
## SPLIT #29: Metal # 1 inputs: [ffn_gate-1 ( 21M)]
node # 60 ( UNARY): ffn_silu-1 ( 21M) [Metal ]: Metal#ffn_gate-1#0 ( 21M) [ NULL ]
## SPLIT #30: BLAS # 0 inputs:
node # 61 ( MUL_MAT): ffn_up-1 ( 21M) [ BLAS ]: blk.1.ffn_up.weight ( 45M) [ CPU ] ffn_norm-1 ( 8M) [ CPU ]
## SPLIT #31: Metal # 1 inputs: [ffn_up-1 ( 21M)]
node # 62 ( MUL): ffn_gate_par-1 ( 21M) [Metal ]: ffn_silu-1 ( 21M) [Metal ] Metal#ffn_up-1#0 ( 21M) [ NULL ]
## SPLIT #32: BLAS # 0 inputs:
node # 63 ( MUL_MAT): ffn_out-1 ( 8M) [ BLAS ]: blk.1.ffn_down.weigh ( 45M) [ CPU ] ffn_gate_par-1 ( 21M) [Metal ]
## SPLIT #33: Metal # 1 inputs: [ffn_out-1 ( 8M)]
node # 64 ( ADD): l_out-1 ( 8M) [Metal ]: Metal#ffn_out-1#0 ( 8M) [ NULL ] ffn_inp-1 ( 8M) [Metal ]
node # 65 ( RMS_NORM): norm-2 ( 8M) [Metal ]: l_out-1 ( 8M) [Metal ]
...
Is this the expectation? It seems like using BLAS together with GPU offloading leads to an improvement in this case, or did I misunderstand this comment?
Specifically, this applies to backends that implement the
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Metal should not be used for the operations in between the BLAS ops in non-offloaded layers, though; I will try to fix that.
Please consider my standalone PR for the purpose of mixed inference between CPU & GPU / CPU & NPU when a backend's `ggml_backend_xx_buffer_is_host` returns true.
@zhouwg I already considered it and rejected it. Spamming more about it is not going to help your cause.
@zhouwg Please focus on your PR and respect the comments and suggestions that have already been provided. Consider this a final warning before I have to block you.
Thanks for your reminder. I see.
In that same example, if I allow the Metal backend to offload `GGML_OP_MUL` ops with this change:

diff --git a/ggml-metal.m b/ggml-metal.m
index 7786acd6..665eae15 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -3178,6 +3178,12 @@ GGML_CALL static bool ggml_backend_metal_supports_buft(ggml_backend_t backend, g
UNUSED(backend);
}
+GGML_CALL static bool ggml_backend_metal_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
+ return (op->op == GGML_OP_MUL);
+
+ GGML_UNUSED(backend);
+}
+
static struct ggml_backend_i ggml_backend_metal_i = {
/* .get_name = */ ggml_backend_metal_name,
/* .free = */ ggml_backend_metal_free,
@@ -3193,7 +3199,7 @@ static struct ggml_backend_i ggml_backend_metal_i = {
/* .graph_compute = */ ggml_backend_metal_graph_compute,
/* .supports_op = */ ggml_backend_metal_supports_op,
/* .supports_buft = */ ggml_backend_metal_supports_buft,
- /* .offload_op = */ NULL,
+ /* .offload_op = */ ggml_backend_metal_offload_op,
/* .event_new = */ NULL,
/* .event_free = */ NULL,
/* .event_record = */ NULL,

I get the following schedule:
How does the logic decide to also offload nodes
In the first pass, ops with weights are assigned the backend of the weight.
This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.
Moves BLAS support from `ggml.c` to a separate backend, and adds the necessary changes to ggml-backend to support backends that only implement matrix multiplication.

- `ggml_backend_sched` now checks the `supports_op` function of the backend
- Renames the `supports_backend` backend function to `supports_buft`
- `ggml_backend_buft_is_host` can be returned from `supports_buft`
- `ggml_backend_sched` will avoid copies between backends when the backend supports the buffer type
- The `GGML_SCHED_DEBUG` environment variable can be used to view the graph splits. This is useful to see what operations are being run on each backend
- The number of threads is set with `-t` or `-tb`
- The BLAS backend is enabled with `LLAMA_BLAS` when using cmake, or, when using make, with `LLAMA_OPENBLAS`, `LLAMA_OPENBLAS64` or `LLAMA_BLIS`
- Removes BLAS support from `ggml.c`. Applications that want to support BLAS will need to use the BLAS backend with `ggml_backend_sched` alongside the CPU or other backends to provide support for other operations