
ggml-backend: backend-agnostic tensor parallelism #13776

Draft · wants to merge 65 commits into master

Conversation

JohannesGaessler (Collaborator)
I'm currently working on support for backend-agnostic tensor parallelism. I've progressed to the point where I have a working prototype (though it only works for 2 GPUs and has bad performance). I'm making this PR in order to get early feedback on the way I would implement it; input from @slaren in particular would be appreciated. Specifically, I would:

  1. Add a backend-agnostic interface for split buffers to ggml-backend.cpp, e.g. to check whether a buffer is split, which backends are associated with it if so, and to retrieve the effective tensor for a given backend. I think this can be done without any backend-specific code. The input would be multiple backend buffers; when allocating a tensor on the split buffer, this would be translated to allocating slices of the tensor on the underlying backend buffers. A sketch of such an interface follows this list.
  2. Refactor the code for ggml_backend_sched to revolve more around splits instead of the nodes from the original graph. Without tensor parallelism there is effectively no change because the splits contain all nodes from the original graph in sequential order, so iterating over splits should produce the same results as iterating over nodes. A sketch of the split-wise execution loop follows this list.
  3. When using tensor parallelism, split the graph at additional points and duplicate the splits in such a way that some operations can run in parallel across multiple backends. The existing code for pipeline parallelism can be re-used to handle the scheduling, data transfer, and synchronization. To combine the results from multiple backends, the current solution is to copy the partial results from the other backends and then use GGML_CONCAT to combine them into a tensor that contains the correct data. For this I extended the functionality of ggml_backend_sched_split::inputs: tensors with GGML_OP_NONE use the existing code to retrieve data from other backends; tensors with other ops are executed prior to the actual nodes from the split. A sketch of the combination step follows this list.
  4. Extend the logic for split tensors to cover not just splits along dimension 1 but splits along the other dimensions as well as mirrored data. Not just weights but also nodes will be splittable. Define a function similar to the _supports_op functions that determines the split state of a tensor after some op, given the states of the inputs (a sketch follows this list). If an op cannot be meaningfully executed in parallel, synchronize the nodes as a fallback. This should ensure that correct results can always be produced, though with bad performance if the correct transformation logic is not defined. For the attention I think the graph should be split along dimension 2; for the FFN part I think it should be dimension 1 -> dimension 0 -> mirrored. In total there would need to be 4 synchronizations per layer.
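For point 1, here is a minimal sketch of what the split-buffer queries could look like. All of these names (ggml_backend_split_buffer_context, ggml_backend_buffer_is_split, ggml_backend_split_buffer_get_tensor) are hypothetical and not part of the current ggml-backend API:

```cpp
#include "ggml-backend.h"

#include <unordered_map>
#include <vector>

// Hypothetical sketch only: possible internal state and queries for a
// backend-agnostic split buffer in ggml-backend.cpp.
struct ggml_backend_split_buffer_context {
    // one underlying buffer per participating backend
    std::vector<ggml_backend_buffer_t> buffers;
    // per logical tensor, the slice allocated on each underlying buffer
    std::unordered_map<const ggml_tensor *, std::vector<ggml_tensor *>> slices;
};

// whether the buffer distributes its tensors across multiple backends
static bool ggml_backend_buffer_is_split(const ggml_backend_split_buffer_context * ctx) {
    return ctx->buffers.size() > 1;
}

// the effective tensor (slice) of a logical tensor for backend i
static ggml_tensor * ggml_backend_split_buffer_get_tensor(
        const ggml_backend_split_buffer_context * ctx, const ggml_tensor * tensor, int i) {
    return ctx->slices.at(tensor)[i];
}
```

For point 2, a sketch of what iterating over splits instead of nodes could look like, loosely following the existing ggml_backend_sched internals (the exact fields are simplified stand-ins):

```cpp
// Sketch only: executing the graph split by split. Without tensor
// parallelism the splits cover all graph nodes in sequential order, so this
// loop is equivalent to iterating over the nodes directly.
static void sched_compute_over_splits(ggml_backend_sched_t sched) {
    for (int i = 0; i < sched->n_splits; i++) {
        ggml_backend_sched_split * split = &sched->splits[i];
        ggml_backend_t backend = sched->backends[split->backend_id];
        // (input copies and synchronization omitted; handled by the
        // existing pipeline-parallelism code)
        ggml_backend_graph_compute_async(backend, &split->graph);
    }
}
```

For point 3, the combination step at the graph level. The helper combine_partials and the choice of split dimension are illustrative only, but ggml_concat is the real ggml op behind GGML_CONCAT:

```cpp
#include "ggml.h"

// Sketch: combining two partial results into the full tensor. partial1 is
// assumed to have already been copied to the backend that holds partial0;
// ggml_concat then assembles the full result along the split dimension.
static struct ggml_tensor * combine_partials(
        struct ggml_context * ctx,
        struct ggml_tensor  * partial0,   // slice computed locally
        struct ggml_tensor  * partial1,   // slice copied from the other backend
        int                   split_dim) { // dimension the tensor was split along
    return ggml_concat(ctx, partial0, partial1, split_dim);
}
```

For point 4, a sketch of the propagation function, analogous to the _supports_op functions. The enum, the names, and the MUL_MAT rule shown are all hypothetical:

```cpp
#include "ggml.h"

// Hypothetical sketch of the split-state propagation described in point 4.
enum tp_state {
    TP_STATE_MIRRORED, // every backend holds a full copy of the data
    TP_STATE_SPLIT_0,  // split along dimension 0
    TP_STATE_SPLIT_1,  // split along dimension 1
    TP_STATE_SPLIT_2,  // split along dimension 2
    TP_STATE_SYNC,     // cannot run in parallel -> synchronize as fallback
};

// Given an op and the split states of its inputs, return the split state of
// the result, analogous to the existing _supports_op functions.
static enum tp_state tp_propagate(enum ggml_op op, enum tp_state src0, enum tp_state src1) {
    switch (op) {
        case GGML_OP_MUL_MAT:
            // e.g. weights split along dim 1 x mirrored activations
            // -> result split along dim 0 (illustrative rule)
            if (src0 == TP_STATE_SPLIT_1 && src1 == TP_STATE_MIRRORED) {
                return TP_STATE_SPLIT_0;
            }
            return TP_STATE_SYNC;
        default:
            // no transformation logic defined: fall back to a
            // synchronization so that results are always correct
            return TP_STATE_SYNC;
    }
}
```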

Going forward, since ggml_backend_sched is a critical component, I would first make a separate PR to refactor it slightly so that it's easier to assert that no changes are being made for use without tensor parallelism. The approach I have in this PR is to first split the graph and create a vector of sequential splits, splits_no_tp, in which splits that need tensor parallelism are marked. Then, in a second pass, a vector splits_tp is created in which the tensor-parallel splits are duplicated; a sketch of this pass follows below. Only after this are inputs assigned. Finally, the vector splits_tp is copied to ggml_backend_sched::splits. So in effect I have split the 5th pass over the graph nodes into 2 passes, with the duplication of the tensor-parallel splits in between. I used vectors because they made the implementation the easiest, but it should be possible to do the same thing with one more allocation like ggml_backend_sched::splits that grows dynamically when needed. I assume the reason a vector is not used in the current code for ggml_backend_sched::splits is to ensure that the memory is never reallocated when repeatedly changing the number of splits.
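A sketch of that second pass; the sched_split type and its needs_tp field are simplified stand-ins for the real ggml_backend_sched internals:

```cpp
#include <vector>

// Sketch only: duplicating the marked splits so that their node ranges can
// be executed in parallel across n_tp_backends backends.
struct sched_split {
    int  backend_id; // backend the split is assigned to
    int  i_start;    // first graph node covered by the split
    int  i_end;      // one past the last graph node
    bool needs_tp;   // marked during the first pass
};

static std::vector<sched_split> duplicate_tp_splits(
        const std::vector<sched_split> & splits_no_tp, int n_tp_backends) {
    std::vector<sched_split> splits_tp;
    for (const sched_split & split : splits_no_tp) {
        if (!split.needs_tp) {
            splits_tp.push_back(split); // sequential split, kept as-is
            continue;
        }
        // one copy of the split per participating backend
        for (int b = 0; b < n_tp_backends; b++) {
            sched_split copy = split;
            copy.backend_id = b;
            splits_tp.push_back(copy);
        }
    }
    return splits_tp; // inputs are assigned only after this pass
}
```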

For the main PR, the goal would be to get an implementation that is at least as fast as the current CUDA code for --split-mode row but that does not need code specific to the CUDA backend. That would then make it possible to remove ggml_cuda_op_mul_mat without loss of functionality.

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 25, 2025