
rpc : copy tensors across servers #8032

Draft · wants to merge 2 commits into master
Conversation

rgerganov
Collaborator

This is an attempt to make copying tensors across servers more efficient. It introduces two new RPC commands:

  • HELLO - sent after establishing a connection to identify the remote party (client or server)
  • REMOTE_COPY_TENSOR - sent to the host that contains the source tensor, along with the destination tensor and the destination endpoint
```mermaid
sequenceDiagram
    Note over Scheduler: Copy X on Server A to Y on Server B
    Scheduler->>Server A: REMOTE_COPY_TENSOR
    Server A->>Server B: HELLO
    Server A->>Server B: SET_TENSOR
    Server B-->>Server A: 
    Server A-->>Scheduler: 
```

Start a dedicated backend thread in the rpc-server and use a message-passing interface for submitting work to it. This will enable async backend operations and cross-server communication.

Add a new command, REMOTE_COPY_TENSOR, for copying a tensor from one server to another.
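The dedicated backend thread with message passing described above could look roughly like the following sketch: callers enqueue work items and a single worker thread drains them in order. All class and member names here are illustrative, not the actual rpc-server code.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative work queue for a dedicated backend thread.
class work_queue {
public:
    // called from any thread to submit work
    void submit(std::function<void()> fn) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(std::move(fn));
        }
        m_cv.notify_one();
    }

    // executed by the dedicated backend thread; drains remaining
    // items before returning after stop() is called
    void run() {
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return m_stop || !m_queue.empty(); });
                if (m_stop && m_queue.empty()) {
                    return;
                }
                fn = std::move(m_queue.front());
                m_queue.pop_front();
            }
            fn();
        }
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_cv.notify_one();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
    bool m_stop = false;
};
```

Because results are produced on the worker thread, the caller either blocks on a future/condition variable or is notified via a callback; that hand-off is where async backend operations would plug in.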
@rgerganov rgerganov mentioned this pull request Jun 20, 2024
```diff
 GGML_CALL static bool ggml_backend_rpc_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
     // check if src and dst are on the same server
     ggml_backend_buffer_t src_buffer = src->buffer;
     ggml_backend_rpc_buffer_context * src_ctx = (ggml_backend_rpc_buffer_context *)src_buffer->context;
     ggml_backend_buffer_t dst_buffer = dst->buffer;
     ggml_backend_rpc_buffer_context * dst_ctx = (ggml_backend_rpc_buffer_context *)dst_buffer->context;
     if (src_ctx->sock != dst_ctx->sock) {
-        return false;
+        return remote_copy_tensor(src, dst);
     }
```
@slaren (Member) · Jun 21, 2024
In cpy_tensor you can only assume that the dst tensor is allocated in buffer. The src tensor may be allocated in any other buffer, including in a different buffer type from a different backend. You cannot assume that the type of src_buffer->context is ggml_backend_rpc_buffer_context because it may be a different buffer type, so you need to check for that.
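A minimal illustration of the check being asked for, using mock types: verify that src actually lives in an RPC buffer before casting its context. The real ggml API differs; `is_rpc_buffer`, `buft_name`, and the struct layouts here are stand-ins.

```cpp
#include <cassert>
#include <string>

// Mock of the ggml buffer interface, for illustration only.
struct ggml_backend_buffer {
    const char * buft_name; // stand-in for querying the buffer type's name
    void * context;
};

struct ggml_tensor {
    ggml_backend_buffer * buffer;
};

static bool is_rpc_buffer(const ggml_backend_buffer * buffer) {
    return std::string(buffer->buft_name) == "RPC";
}

// In cpy_tensor, only dst is guaranteed to be in this (RPC) buffer;
// src may come from any backend, so check before casting its context.
static bool rpc_cpy_tensor(const ggml_tensor * src, ggml_tensor * dst) {
    (void) dst;
    if (!is_rpc_buffer(src->buffer)) {
        return false; // let the caller fall back to the get/set tensor path
    }
    // now it is safe to cast src->buffer->context to the RPC context
    return true;
}
```

Returning false here is not an error: ggml falls back to copying through host memory when a backend cannot perform the copy directly.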

Comment on lines +458 to +462
memcpy(input.data(), &rpc_src, sizeof(rpc_src));
memcpy(input.data() + sizeof(rpc_src), &rpc_dst, sizeof(rpc_dst));
uint32_t dst_endpoint_size = dst_ctx->endpoint.size();
memcpy(input.data() + 2*sizeof(rpc_tensor), &dst_endpoint_size, sizeof(dst_endpoint_size));
memcpy(input.data() + 2*sizeof(rpc_tensor) + sizeof(dst_endpoint_size), dst_ctx->endpoint.c_str(), dst_endpoint_size);
@slaren (Member) · Jun 21, 2024
This kind of pattern is very quickly becoming unreadable, which makes the code very hard to review. My suggestion is to make structs for all the messages/commands.

@chraac · Jun 22, 2024
Agreed. Maybe we can have a common RPC command base structure that includes the size and the command enum; each command can then derive from it and append its own parameters, like this:

enum rpc_cmd {
    ...
}

struct rpc_cmd_base {
    uint32_t size;
    uint32_t version;  // version number for the rpc command
    rpc_cmd cmd;
};

struct rpc_cmd_xx: rpc_cmd_base  {
    uint32_t param1;
    ...
};
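Building on that suggestion, the memcpy sequence quoted above could collapse into a single packed request struct plus one length-prefixed string. All field names and the simplified rpc_tensor here are hypothetical, for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-layout request for REMOTE_COPY_TENSOR.
// rpc_tensor is reduced to a few fields for the sketch.
#pragma pack(push, 1)
struct rpc_tensor {
    uint64_t id;
    uint64_t data;
    uint64_t size;
};

struct rpc_msg_remote_copy {
    rpc_tensor src;
    rpc_tensor dst;
    uint32_t   endpoint_size; // followed by endpoint_size bytes on the wire
};
#pragma pack(pop)

// One struct copy plus the variable-length endpoint, instead of a
// chain of offset arithmetic at every call site.
static std::vector<uint8_t> serialize(const rpc_msg_remote_copy & msg,
                                      const std::string & endpoint) {
    std::vector<uint8_t> out(sizeof(msg) + endpoint.size());
    memcpy(out.data(), &msg, sizeof(msg));
    memcpy(out.data() + sizeof(msg), endpoint.data(), endpoint.size());
    return out;
}
```

The packed struct keeps the wire layout explicit while letting the compiler do the offset bookkeeping; the receiver can deserialize with a single memcpy into the same struct.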

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Jun 21, 2024
@lexasub (Contributor) commented Jan 25, 2025

@rgerganov any updates on RPC? I did some profiling, but found no ideas for a quick (small-diff) optimization. I'd like to see some refactoring of the llama.cpp RPC code.

@rgerganov (Collaborator, Author)

@lexasub I don't have any ideas for how to speed things up with a single RPC server. With multiple RPC servers, you can try to resurrect this patch and see if it makes things better for your use cases. My benchmarks back in the day didn't show any significant improvements, but I may have missed something.

@lexasub (Contributor) commented Jan 28, 2025

@rgerganov I attempted to rebase this branch to resolve conflicts with the latest upstream changes, but the scope of conflicts (especially in ggml.rpc.cpp and buffer context handling) suggests that manual adjustments might be unavoidable. I’ve started reworking some sections locally, but I’m concerned about diverging from your intended approach.

Question: Have you been working on a more up-to-date version of this branch? If so, could you share it or highlight key changes that need preservation? This would help ensure alignment and avoid redundant work.

@rgerganov (Collaborator, Author)

Question: Have you been working on a more up-to-date version of this branch?

No, I am not working on this and I don't have updates. If you are going to work on this, my recommendation is to prepare a real setup with at least 3 hosts connected on a physical network and perform some benchmarks to have a baseline.

Testing on the same physical host with servers running on localhost may not give relevant results.

@lexasub (Contributor) commented Feb 2, 2025

@rgerganov I've run into challenges implementing an output-queue "pipeline" for the ggml client-server architecture, because of how tightly the code that uses the output parameter of send_rpc_cmd is coupled to its call sites.

The parameter is intended to be written by the worker thread later, but integrating its usage at the appropriate point in the codebase has proven complex, particularly given my limited familiarity with ggml's architecture (it's unclear how and when to fetch the data from the thread for its eventual, fairly involved, usage).

While the current focus is on achieving basic functionality, I'm concerned about potential inefficiencies, such as blocking while waiting for the output to populate, which could hinder parallel processing on the server side.

The ongoing work can be tracked in the https://github.com/lexasub/llama.cpp/tree/async-rpc-squashed/ggml/src/ggml-rpc (draft)

@lexasub (Contributor) commented Feb 6, 2025

@rgerganov, I previously considered using gRPC here, but I can't yet say whether it would have the desired effect. Does the llama.cpp RPC protocol transmit a lot of metadata (like field names and delimiters), or is everything packed as efficiently as possible (not in terms of pragma pack, but in terms of field names, as I mentioned)? If a significant amount of metadata is currently being transmitted, I'm willing to research gRPC. We could also try compressing tensors before sending.

@rgerganov (Collaborator, Author)

My initial implementation of the RPC backend used gRPC, and switching to a custom binary serialization improved performance a lot: #6829 (comment)

5 participants