
rpc : copy tensors across servers #8032

Draft · wants to merge 2 commits into master
Conversation

rgerganov
Collaborator

This is an attempt to make copying tensors across servers more efficient. It introduces two new RPC commands:

  • HELLO - sent after establishing a connection to identify the remote party (client or server)
  • REMOTE_COPY_TENSOR - sent to the host that contains the source tensor, along with the destination tensor and the destination endpoint
```mermaid
sequenceDiagram
    Note over Scheduler: Copy X on Server A to Y on Server B
    Scheduler->>Server A: REMOTE_COPY_TENSOR
    Server A->>Server B: HELLO
    Server A->>Server B: SET_TENSOR
    Server B-->>Server A: 
    Server A-->>Scheduler: 
```

Start a dedicated backend thread in the rpc-server and use a message-passing interface for submitting work to it. This will enable async backend operations and cross-server communication.

Add a new command, REMOTE_COPY_TENSOR, for copying a tensor from one server to another.
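The dedicated backend thread with message passing described above could look roughly like the following sketch: callers enqueue work items and a single worker thread drains them in order. All class and member names here are illustrative, not the actual rpc-server code.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative work queue for a dedicated backend thread.
class work_queue {
public:
    // called from any thread to submit work
    void submit(std::function<void()> fn) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(std::move(fn));
        }
        m_cv.notify_one();
    }

    // executed by the dedicated backend thread; drains remaining
    // items before returning after stop() is called
    void run() {
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return m_stop || !m_queue.empty(); });
                if (m_stop && m_queue.empty()) {
                    return;
                }
                fn = std::move(m_queue.front());
                m_queue.pop_front();
            }
            fn();
        }
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_cv.notify_one();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
    bool m_stop = false;
};
```

Because results are produced on the worker thread, the caller either blocks on a future/condition variable or is notified via a callback; that hand-off is where async backend operations would plug in.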
@rgerganov rgerganov mentioned this pull request Jun 20, 2024
```diff
 GGML_CALL static bool ggml_backend_rpc_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
     // check if src and dst are on the same server
     ggml_backend_buffer_t src_buffer = src->buffer;
     ggml_backend_rpc_buffer_context * src_ctx = (ggml_backend_rpc_buffer_context *)src_buffer->context;
     ggml_backend_buffer_t dst_buffer = dst->buffer;
     ggml_backend_rpc_buffer_context * dst_ctx = (ggml_backend_rpc_buffer_context *)dst_buffer->context;
     if (src_ctx->sock != dst_ctx->sock) {
-        return false;
+        return remote_copy_tensor(src, dst);
     }
```
@slaren (Member) · Jun 21, 2024
In cpy_tensor you can only assume that the dst tensor is allocated in buffer. The src tensor may be allocated in any other buffer, including in a different buffer type from a different backend. You cannot assume that the type of src_buffer->context is ggml_backend_rpc_buffer_context because it may be a different buffer type, so you need to check for that.
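A minimal illustration of the check being asked for, using mock types: verify that src actually lives in an RPC buffer before casting its context. The real ggml API differs; `is_rpc_buffer`, `buft_name`, and the struct layouts here are stand-ins.

```cpp
#include <cassert>
#include <string>

// Mock of the ggml buffer interface, for illustration only.
struct ggml_backend_buffer {
    const char * buft_name; // stand-in for querying the buffer type's name
    void * context;
};

struct ggml_tensor {
    ggml_backend_buffer * buffer;
};

static bool is_rpc_buffer(const ggml_backend_buffer * buffer) {
    return std::string(buffer->buft_name) == "RPC";
}

// In cpy_tensor, only dst is guaranteed to be in this (RPC) buffer;
// src may come from any backend, so check before casting its context.
static bool rpc_cpy_tensor(const ggml_tensor * src, ggml_tensor * dst) {
    (void) dst;
    if (!is_rpc_buffer(src->buffer)) {
        return false; // let the caller fall back to the get/set tensor path
    }
    // now it is safe to cast src->buffer->context to the RPC context
    return true;
}
```

Returning false here is not an error: ggml falls back to copying through host memory when a backend cannot perform the copy directly.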

Comment on lines +458 to +462
memcpy(input.data(), &rpc_src, sizeof(rpc_src));
memcpy(input.data() + sizeof(rpc_src), &rpc_dst, sizeof(rpc_dst));
uint32_t dst_endpoint_size = dst_ctx->endpoint.size();
memcpy(input.data() + 2*sizeof(rpc_tensor), &dst_endpoint_size, sizeof(dst_endpoint_size));
memcpy(input.data() + 2*sizeof(rpc_tensor) + sizeof(dst_endpoint_size), dst_ctx->endpoint.c_str(), dst_endpoint_size);
@slaren (Member) · Jun 21, 2024
This kind of pattern is very quickly becoming unreadable, which makes the code very hard to review. My suggestion is to make structs for all the messages/commands.

@chraac · Jun 22, 2024
Agreed. Maybe we can have a common RPC command base structure that includes the size and the command enum; each command can then derive from it and append its own parameters, like this:

enum rpc_cmd {
    ...
}

struct rpc_cmd_base {
    uint32_t size;
    uint32_t version;  // version number for the rpc command
    rpc_cmd cmd;
};

struct rpc_cmd_xx: rpc_cmd_base  {
    uint32_t param1;
    ...
};
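Building on that suggestion, the memcpy sequence quoted above could collapse into a single packed request struct plus one length-prefixed string. All field names and the simplified rpc_tensor here are hypothetical, for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-layout request for REMOTE_COPY_TENSOR.
// rpc_tensor is reduced to a few fields for the sketch.
#pragma pack(push, 1)
struct rpc_tensor {
    uint64_t id;
    uint64_t data;
    uint64_t size;
};

struct rpc_msg_remote_copy {
    rpc_tensor src;
    rpc_tensor dst;
    uint32_t   endpoint_size; // followed by endpoint_size bytes on the wire
};
#pragma pack(pop)

// One struct copy plus the variable-length endpoint, instead of a
// chain of offset arithmetic at every call site.
static std::vector<uint8_t> serialize(const rpc_msg_remote_copy & msg,
                                      const std::string & endpoint) {
    std::vector<uint8_t> out(sizeof(msg) + endpoint.size());
    memcpy(out.data(), &msg, sizeof(msg));
    memcpy(out.data() + sizeof(msg), endpoint.data(), endpoint.size());
    return out;
}
```

The packed struct keeps the wire layout explicit while letting the compiler do the offset bookkeeping; the receiver can deserialize with a single memcpy into the same struct.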

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Jun 21, 2024
@lexasub (Contributor) commented Jan 25, 2025

@rgerganov any updates on RPC? I did some profiling, but found no ideas for a quick (small-diff) optimization. I'd like to see some refactoring of the llama.cpp RPC code.

@rgerganov (Collaborator, Author)

@lexasub I don't have any ideas for how to speed things up with a single RPC server. With multiple RPC servers, you can try to resurrect this patch and see if it makes things better for your use cases. My benchmarks back in the day didn't show any significant improvements, but I may have missed something.

@lexasub (Contributor) commented Jan 28, 2025

@rgerganov I attempted to rebase this branch to resolve conflicts with the latest upstream changes, but the scope of conflicts (especially in ggml.rpc.cpp and buffer context handling) suggests that manual adjustments might be unavoidable. I’ve started reworking some sections locally, but I’m concerned about diverging from your intended approach.

Question: Have you been working on a more up-to-date version of this branch? If so, could you share it or highlight key changes that need preservation? This would help ensure alignment and avoid redundant work.

@rgerganov (Collaborator, Author)

Question: Have you been working on a more up-to-date version of this branch?

No, I am not working on this and I don't have updates. If you are going to work on this, my recommendation is to prepare a real setup with at least 3 hosts connected on a physical network and perform some benchmarks to have a baseline.

Testing on the same physical host with servers running on localhost may not give relevant results.

@lexasub (Contributor) commented Feb 2, 2025

@rgerganov I've run into challenges implementing an output-queue "pipeline" for the ggml client-server architecture, because of how tightly the code that uses the output parameter of send_rpc_cmd is coupled to its call sites.

The parameter is intended to be written by the worker thread later, but integrating its usage at the appropriate point in the codebase has proven complex, particularly given my limited familiarity with ggml's architecture (it's unclear how and when to fetch the data from the thread for its eventual, fairly involved, usage).

While the current focus is on achieving basic functionality, I'm concerned about potential inefficiencies, such as blocking while waiting for the output to populate, which could hinder parallel processing on the server side.

The ongoing work can be tracked in the https://github.com/lexasub/llama.cpp/tree/async-rpc-squashed/ggml/src/ggml-rpc (draft)

@lexasub (Contributor) commented Feb 6, 2025

@rgerganov, I previously considered using gRPC here, but I can't yet say whether it would have the desired effect. Does the llama.cpp RPC protocol transmit a lot of metadata (like field names and delimiters), or is everything packed as efficiently as possible (not in terms of pragma pack, but in terms of field names, as I mentioned)? If a significant amount of metadata is currently being transmitted, I'm willing to research gRPC. We could also try compressing tensors before sending.

@rgerganov (Collaborator, Author)

My initial implementation of the RPC backend used gRPC, and switching to a custom binary serialization improved performance a lot: #6829 (comment)

5 participants