Conversation

@evanmiller

Companion PR to ggml-org/llama.cpp#2099. This PR adds support for distributed computation with new OP_SEND and OP_RECV operators for passing tensors between compute nodes via MPI.

The substantial changes to this repository are:

  • New GGML_MPI compile-time option
  • Replace char padding[4] with int tag in the ggml_tensor struct; this field indicates the source / destination node when passing tensors, while preserving the overall struct size (see the abbreviated sketch after this list)
  • New send and receive graph operators
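
An abbreviated view of the padding-to-tag change mentioned above (only the new field is shown; the real ggml_tensor has many more members):

```c
// Abbreviated sketch: all other ggml_tensor fields are omitted here.
struct ggml_tensor {
    // ... existing fields unchanged ...

    // was: char padding[4];
    int tag; // source/destination rank for the send/receive operators
};
```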

If MPI is not enabled at compile time, the send and receive operators will fail with assertion failures at graph-computation time. It is the responsibility of the caller to construct correct partial computation graphs on each MPI node.
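
For illustration, here is a minimal sketch of how a caller might build a different partial graph on each rank; the ggml_send / ggml_recv names and signatures below are placeholders for this example, not necessarily the exact constructors added here:

```c
// Minimal sketch only: ggml_send/ggml_recv and their signatures are assumed
// for illustration. Each MPI rank builds its own partial graph.
#include <mpi.h>
#include "ggml.h"

// inp: input tensor (rank 0), w: the weights loaded on this rank,
// ne0/ne1: expected shape of the tensor handed over between ranks
static struct ggml_tensor * build_partial_graph(
        struct ggml_context * ctx, struct ggml_cgraph * gf,
        struct ggml_tensor * inp, struct ggml_tensor * w,
        int64_t ne0, int64_t ne1) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // first half of the model: compute locally, then ship the result to rank 1
        struct ggml_tensor * mid = ggml_mul_mat(ctx, w, inp);
        struct ggml_tensor * out = ggml_send(ctx, mid, /*dst_rank=*/1);      // hypothetical
        ggml_build_forward_expand(gf, out);
        return out;
    } else {
        // second half: receive the intermediate tensor from rank 0 and continue
        struct ggml_tensor * mid = ggml_recv(ctx, /*src_rank=*/0, ne0, ne1); // hypothetical
        struct ggml_tensor * out = ggml_mul_mat(ctx, w, mid);
        ggml_build_forward_expand(gf, out);
        return out;
    }
}
```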

Currently only float32 tensors are supported. This is sufficient for llama.cpp, but support for other types can be added later. While GGML does not appear to support big-endian architectures, I'd like to keep the messages endian- and architecture-agnostic by using MPI's native data types where possible.
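
As a rough sketch of that float32 path (the surrounding function names are placeholders; only MPI_Send / MPI_Recv, MPI_FLOAT, ggml_nelements, and GGML_ASSERT are real APIs), the forward pass for a send or receive node could boil down to:

```c
// Rough sketch of the float32 send/receive kernels; function names are
// placeholders. The tensor's tag field would supply the peer rank.
#include <mpi.h>
#include "ggml.h"

static void compute_forward_send_f32(const struct ggml_tensor * src, int dst_rank, int msg_tag) {
    GGML_ASSERT(src->type == GGML_TYPE_F32);
    // MPI_FLOAT keeps the wire format endian- and architecture-agnostic
    MPI_Send(src->data, (int) ggml_nelements(src), MPI_FLOAT,
             dst_rank, msg_tag, MPI_COMM_WORLD);
}

static void compute_forward_recv_f32(struct ggml_tensor * dst, int src_rank, int msg_tag) {
    GGML_ASSERT(dst->type == GGML_TYPE_F32);
    MPI_Status status;
    MPI_Recv(dst->data, (int) ggml_nelements(dst), MPI_FLOAT,
             src_rank, msg_tag, MPI_COMM_WORLD, &status);
}
```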

I tested these changes thoroughly in the above-linked PR; in this repo I simply verified that the project builds.

@slaren commented Jul 4, 2023

I think there is some overlap between this and the plan to implement mixed CPU/GPU evaluation in llama.cpp by splitting the graph into multiple parts and running each of them in a different compute backend. MPI could be supported with the same system by creating an MPI compute backend for every node in the cluster and assigning a part of the graph to each one. It may be better to work on creating one common interface that can support both cases.
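
Purely for illustration, one shape such a common interface could take (every name in this sketch is invented and exists in neither ggml nor llama.cpp):

```c
// Illustrative only: a hypothetical common backend interface. This just
// sketches the idea of assigning each split of the graph to a backend,
// where an MPI node would be one backend type among CPU, CUDA, etc.
#include "ggml.h"

struct compute_backend {
    const char * name;                                                // e.g. "cpu", "cuda", "mpi:rank1"
    void       * ctx;                                                 // backend-specific state
    void      (* compute_graph)(void * ctx, struct ggml_cgraph * gf); // run one split of the full graph
};

// An MPI backend's compute_graph would ship its split to the remote rank and
// wait for the result; a CPU or CUDA backend would execute the split locally.
```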

@evanmiller

> I think there is some overlap between this and the plan to implement mixed CPU/GPU evaluation in llama.cpp by splitting the graph into multiple parts and running each of them in a different compute backend. MPI could be supported with the same system by creating an MPI compute backend for every node in the cluster and assigning a part of the graph to each one. It may be better to work on creating one common interface that can support both cases.

I suspect that "automatic distribution" across multiple hosts will be difficult to implement. Besides the graph serialization/deserialization headaches, the scheduler will need to reason about the appropriate cut-points based on tensor sizes in order to maximize data locality and minimize network communication. IMHO the developer is in a much better position to optimize the graph for memory usage and communication overhead across the cluster. If you look at the linked PR, I was able to distribute llama.cpp using send/receive primitives with a fairly small amount of code, but this hides the amount of thinking that went into choosing appropriate places to cut the computation graph.

@slaren commented Jul 4, 2023

The idea is not to make the splits automatically; the programmer will still need to choose where to make these splits, and the user will need to specify which backend to use for each split.
It would definitely be more complicated since it is a more general approach, but it would be more flexible, it would simplify the code for users, and it would avoid code duplication. Your approach also depends on mmap to load only the required parts of the model on each node, but that's not going to work with GPUs.

@evanmiller

OK, I think you have a much clearer idea of what a unified API would look like than I do. Would be great to include MPI in that discussion.

@slaren commented Jul 4, 2023

My goal for now is to implement this for CUDA in llama.cpp, and to show how computation can be split between the CPU and CUDA in a clean way that can be extended to other backends. I would like to implement that first as a proof of concept, and after that we can work on defining a common interface to all the compute backends, including CPU, CUDA, Vulkan, Metal, OpenCL, and MPI. That may take a few weeks at least, though.

@evanmiller

Closing in favor of the approach described in ggml-org/llama.cpp#2099 (comment).
