ggml-backend: backend-agnostic tensor parallelism #13776
Draft: JohannesGaessler wants to merge 65 commits into ggml-org:master from JohannesGaessler:backend-tensor-parallel
+447 −150
Conversation
Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs)
I'm currently working on support for backend-agnostic tensor parallelism. I've progressed to the point where I have a working prototype (that only works for 2 GPUs and has bad performance). I'm making this PR in order to get early feedback regarding the way I would implement it; input from @slaren in particular would be appreciated. Specifically I would:

- Extend `ggml-backend.cpp` with functions to e.g. check whether a buffer is split, which backends are associated with it if it is, and to retrieve the effective tensor for a given backend (see the interface sketch after this list). I think this can be done without any backend-specific code. The input would be multiple backend buffers; when allocating a tensor on the split buffer this would be translated to allocating slices of the tensor on the underlying backend buffers.
- Refactor `ggml_backend_sched` to revolve more around splits instead of the nodes from the original graph. Without tensor parallelism there will be effectively no change because the splits just contain all nodes from the original graph in sequential order, so the same results should be achieved by iterating over splits vs. iterating over nodes.
- Use `GGML_CONCAT` to combine the per-backend slices of a split tensor into a tensor that contains the correct data (see the concatenation sketch after this list). For this I extended the functionality of `ggml_backend_sched_split::inputs`. Tensors with `GGML_OP_NONE` use the existing code to retrieve data from other backends; tensors with other ops are executed prior to the actual nodes from the split.
- Add logic analogous to the backend `_supports_op` functions to determine the state of split tensors after some op given the states of the inputs (see the state-propagation sketch after this list). If an op cannot be meaningfully executed in parallel, synchronize the nodes as a fallback. This should ensure that correct results can always be produced, but with bad performance if the correct transformation logic is not defined. For the attention I think the graph should be split by dimension 2, for the FFN part I think it should be dimension 1 -> dimension 0 -> mirrored. In total there would need to be 4 synchronizations per layer.
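To make the first point more concrete, here is a minimal sketch of what such an interface could look like. The function names and signatures below are hypothetical illustrations, not the actual API proposed in this PR:

```cpp
#include "ggml-backend.h"

// Hypothetical interface additions (names and signatures are illustrative only).

// Check whether a buffer distributes its tensors across multiple backend buffers.
bool ggml_backend_buffer_is_split(ggml_backend_buffer_t buffer);

// Number of underlying backend buffers of a split buffer, and access to each one.
int                   ggml_backend_split_buffer_n_buffers (ggml_backend_buffer_t buffer);
ggml_backend_buffer_t ggml_backend_split_buffer_get_buffer(ggml_backend_buffer_t buffer, int i);

// For a tensor allocated in a split buffer, retrieve the "effective tensor",
// i.e. the slice that physically resides in the i-th underlying backend buffer.
struct ggml_tensor * ggml_backend_split_buffer_get_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, int i);
```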
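For the `GGML_CONCAT` point, a minimal sketch of how the per-backend slices of a split tensor could be gathered into one full tensor. The helper below is hypothetical and only illustrates the graph-level operation, not the scheduler bookkeeping around `ggml_backend_sched_split::inputs`:

```cpp
#include "ggml.h"

// Hypothetical helper: build a graph node that concatenates the per-backend
// slices of a split tensor along the dimension it was split in. In the scheduler
// this node would be a split input with an op other than GGML_OP_NONE and would
// therefore be evaluated before the actual nodes of the split.
static struct ggml_tensor * gather_split_tensor(
        struct ggml_context * ctx, struct ggml_tensor ** slices, int n_slices, int split_dim) {
    struct ggml_tensor * full = slices[0];
    for (int i = 1; i < n_slices; i++) {
        full = ggml_concat(ctx, full, slices[i], split_dim); // pairwise GGML_CONCAT
    }
    return full;
}
```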
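And for the last point, a rough sketch of what per-tensor state propagation with a synchronization fallback could look like. The enum and the single `GGML_OP_MUL_MAT` rule are assumptions for illustration, not the rules actually implemented here:

```cpp
#include "ggml.h"

// Hypothetical split states of a tensor across the participating backends.
enum tensor_tp_state {
    TP_STATE_MIRRORED, // every backend holds the full tensor
    TP_STATE_SPLIT_0,  // split along dimension 0
    TP_STATE_SPLIT_1,  // split along dimension 1
    TP_STATE_SPLIT_2,  // split along dimension 2
    TP_STATE_SYNC,     // no parallel rule applies -> synchronize (gather) as a fallback
};

// Determine the state of a node's output given the states of its inputs.
// Ops without a defined rule fall back to synchronization so that results
// stay correct even when no transformation logic exists yet.
static enum tensor_tp_state tp_state_after_op(
        const struct ggml_tensor * node, enum tensor_tp_state src0, enum tensor_tp_state src1) {
    switch (node->op) {
        case GGML_OP_MUL_MAT:
            // Example rule: a weight split along dimension 1 multiplied with a
            // mirrored activation yields an output split along dimension 0.
            if (src0 == TP_STATE_SPLIT_1 && src1 == TP_STATE_MIRRORED) {
                return TP_STATE_SPLIT_0;
            }
            return TP_STATE_SYNC;
        default:
            return TP_STATE_SYNC;
    }
}
```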
Going forward, since `ggml_backend_sched` is a critical component I would first make a separate PR to refactor it slightly so that it's easier to assert that no changes are being made for use without tensor parallelism. The approach I have in this PR is to first split the graph and to create a vector of sequential splits `splits_no_tp` where splits that need tensor parallelism are marked. Then in a second pass a vector `splits_tp` is created where tensor parallel splits are duplicated. Only after this are inputs assigned. Finally, the vector `splits_tp` is copied to `ggml_backend_sched::splits`. So in effect I have split the 5th pass over the graph nodes into 2 passes where I can duplicate the tensor parallel splits in between (see the sketch below). I used vectors because it made the implementation the easiest, but it should be possible to do the same thing with one more allocation like `ggml_backend_sched::splits` that grows dynamically when needed. I assume the reason a vector is not used in the current code for `ggml_backend_sched::splits` is to assert that the memory is never reallocated when repeatedly changing the number of splits.
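As a rough illustration of the two-pass approach described above (the struct and function below are simplified stand-ins, not the actual scheduler code):

```cpp
#include <vector>

// Simplified stand-in for a scheduler split; the real data lives in ggml_backend_sched.
struct sched_split_info {
    int  i_node_start; // first node of the split in the original graph
    int  i_node_end;   // one past the last node of the split
    int  backend_id;   // backend the split is assigned to
    bool needs_tp;     // marked in the first pass if the split should run tensor-parallel
};

// Second pass: duplicate the splits that were marked for tensor parallelism,
// once per participating backend. Inputs are only assigned after this pass,
// and the result is then copied to ggml_backend_sched::splits.
static std::vector<sched_split_info> duplicate_tp_splits(
        const std::vector<sched_split_info> & splits_no_tp, int n_tp_backends) {
    std::vector<sched_split_info> splits_tp;
    for (const sched_split_info & split : splits_no_tp) {
        if (!split.needs_tp) {
            splits_tp.push_back(split); // regular split, executed on a single backend
            continue;
        }
        for (int b = 0; b < n_tp_backends; b++) {
            sched_split_info copy = split;
            copy.backend_id = b; // each copy works on its backend's slice of the data
            splits_tp.push_back(copy);
        }
    }
    return splits_tp;
}
```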
For the main PR the goal would be to get an implementation that is at least as fast as the current CUDA code for `--split-mode row` but does not need code specific to the CUDA backend. This would then make it possible to remove `ggml_cuda_op_mul_mat` without loss of functionality.