Conversation

@evanmiller

Companion PR to ggml-org/llama.cpp#2099. This PR adds support for distributed computation with new OP_SEND and OP_RECV operators for passing tensors between compute nodes via MPI.

The substantial changes to this repository are:

  • New GGML_MPI compile-time option
  • Replace char padding[4] with int tag in the ggml_tensor struct; this field indicates the source / destination node when passing tensors, while preserving the overall struct size (see the abbreviated sketch after this list)
  • New send and receive graph operators
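
An abbreviated view of the padding-to-tag change mentioned above (only the new field is shown; the real ggml_tensor has many more members):

```c
// Abbreviated sketch: all other ggml_tensor fields are omitted here.
struct ggml_tensor {
    // ... existing fields unchanged ...

    // was: char padding[4];
    int tag; // source/destination rank for the send/receive operators
};
```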

If MPI is not enabled at compile time, the send and receive operators will fail with assertion failures at graph-computation time. It is the responsibility of the caller to construct correct partial computation graphs on each MPI node.
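
For illustration, here is a minimal sketch of how a caller might build a different partial graph on each rank; the ggml_send / ggml_recv names and signatures below are placeholders for this example, not necessarily the exact constructors added here:

```c
// Minimal sketch only: ggml_send/ggml_recv and their signatures are assumed
// for illustration. Each MPI rank builds its own partial graph.
#include <mpi.h>
#include "ggml.h"

// inp: input tensor (rank 0), w: the weights loaded on this rank,
// ne0/ne1: expected shape of the tensor handed over between ranks
static struct ggml_tensor * build_partial_graph(
        struct ggml_context * ctx, struct ggml_cgraph * gf,
        struct ggml_tensor * inp, struct ggml_tensor * w,
        int64_t ne0, int64_t ne1) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // first half of the model: compute locally, then ship the result to rank 1
        struct ggml_tensor * mid = ggml_mul_mat(ctx, w, inp);
        struct ggml_tensor * out = ggml_send(ctx, mid, /*dst_rank=*/1);      // hypothetical
        ggml_build_forward_expand(gf, out);
        return out;
    } else {
        // second half: receive the intermediate tensor from rank 0 and continue
        struct ggml_tensor * mid = ggml_recv(ctx, /*src_rank=*/0, ne0, ne1); // hypothetical
        struct ggml_tensor * out = ggml_mul_mat(ctx, w, mid);
        ggml_build_forward_expand(gf, out);
        return out;
    }
}
```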

Currently only float32 tensors are supported. This is sufficient for llama.cpp, but support for other types can be added later. While GGML does not appear to support big-endian architectures, I'd like to keep the messages endian- and architecture-agnostic by using MPI's native data types where possible.
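
As a rough sketch of that float32 path (the surrounding function names are placeholders; only MPI_Send / MPI_Recv, MPI_FLOAT, ggml_nelements, and GGML_ASSERT are real APIs), the forward pass for a send or receive node could boil down to:

```c
// Rough sketch of the float32 send/receive kernels; function names are
// placeholders. The tensor's tag field would supply the peer rank.
#include <mpi.h>
#include "ggml.h"

static void compute_forward_send_f32(const struct ggml_tensor * src, int dst_rank, int msg_tag) {
    GGML_ASSERT(src->type == GGML_TYPE_F32);
    // MPI_FLOAT keeps the wire format endian- and architecture-agnostic
    MPI_Send(src->data, (int) ggml_nelements(src), MPI_FLOAT,
             dst_rank, msg_tag, MPI_COMM_WORLD);
}

static void compute_forward_recv_f32(struct ggml_tensor * dst, int src_rank, int msg_tag) {
    GGML_ASSERT(dst->type == GGML_TYPE_F32);
    MPI_Status status;
    MPI_Recv(dst->data, (int) ggml_nelements(dst), MPI_FLOAT,
             src_rank, msg_tag, MPI_COMM_WORLD, &status);
}
```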

I tested these changes thoroughly in the above-linked PR; in this repo I simply verified that the project builds.

@slaren commented Jul 4, 2023

I think there is some overlap between this and the plan to implement mixed CPU/GPU evaluation in llama.cpp by splitting the graph into multiple parts and running each of them in a different compute backend. MPI could be supported with the same system by creating an MPI compute backend for every node in the cluster and assigning a part of the graph to each one. It may be better to work on creating one common interface that can support both cases.
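
Purely for illustration, one shape such a common interface could take (every name in this sketch is invented and exists in neither ggml nor llama.cpp):

```c
// Illustrative only: a hypothetical common backend interface. This just
// sketches the idea of assigning each split of the graph to a backend,
// where an MPI node would be one backend type among CPU, CUDA, etc.
#include "ggml.h"

struct compute_backend {
    const char * name;                                                // e.g. "cpu", "cuda", "mpi:rank1"
    void       * ctx;                                                 // backend-specific state
    void      (* compute_graph)(void * ctx, struct ggml_cgraph * gf); // run one split of the full graph
};

// An MPI backend's compute_graph would ship its split to the remote rank and
// wait for the result; a CPU or CUDA backend would execute the split locally.
```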

@evanmiller

> I think there is some overlap between this and the plan to implement mixed CPU/GPU evaluation in llama.cpp by splitting the graph into multiple parts and running each of them in a different compute backend. MPI could be supported with the same system by creating an MPI compute backend for every node in the cluster and assigning a part of the graph to each one. It may be better to work on creating one common interface that can support both cases.

I suspect that "automatic distribution" across multiple hosts will be difficult to implement. Besides the graph serialization/deserialization headaches, the scheduler will need to reason about the appropriate cut-points based on tensor sizes in order to maximize data locality and minimize network communication. IMHO the developer is in a much better position to optimize the graph for memory usage and communication overhead across the cluster. If you look at the linked PR, I was able to distribute llama.cpp using send/receive primitives with a fairly small amount of code, but this hides the amount of thinking that went into choosing appropriate places to cut the computation graph.

@slaren commented Jul 4, 2023

The idea is not to make the splits automatically; the programmer will still need to choose where to make these splits, and the user will need to specify which backend to use for each split.
It would definitely be more complicated since it is a more general approach, but it would be more flexible, it would simplify the code for users, and it would avoid code duplication. Your approach also depends on mmap to load only the required parts of the model on each node, but that's not going to work with GPUs.

@evanmiller

OK, I think you have a much clearer idea of what a unified API would look like than I do. Would be great to include MPI in that discussion.

@slaren commented Jul 4, 2023

My goal for now is to implement this for CUDA in llama.cpp, and to show how computation can be split between the CPU and CUDA in a clean way that can be extended to other backends. I would like to implement that first as a proof of concept, and after that we can work on defining a common interface to all the compute backends, including CPU, CUDA, Vulkan, Metal, OpenCL, and MPI. That may take a few weeks at least, though.

@evanmiller

Closing in favor of the approach described in ggml-org/llama.cpp#2099 (comment).
