rpc : copy tensors across servers #8032
base: master
Conversation
Start a dedicated backend thread in the rpc-server and use a message-passing interface for submitting work to it. This will enable async backend operations and cross-server communication.
Add new cmd REMOTE_COPY_TENSOR for copying a tensor from one server to another.
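For illustration, here is a minimal sketch of the message-passing pattern described above. The names (rpc_backend_worker, submit) are made up for this example and are not the PR's actual symbols; the command names in the final comment are only examples of the work that would be queued.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// hypothetical single backend thread per rpc-server: callers enqueue closures and
// the thread executes them in order, so all backend work stays serialized
struct rpc_backend_worker {
    std::queue<std::function<void()>> tasks;
    std::mutex                        mtx;
    std::condition_variable           cv;
    bool                              stop = false;
    std::thread                       thr;

    rpc_backend_worker() : thr([this] { run(); }) {}

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return stop || !tasks.empty(); });
                if (stop && tasks.empty()) {
                    return;
                }
                task = std::move(tasks.front());
                tasks.pop();
            }
            task(); // e.g. a GRAPH_COMPUTE, SET_TENSOR or REMOTE_COPY_TENSOR handler
        }
    }

    ~rpc_backend_worker() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_one();
        thr.join();
    }
};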
  GGML_CALL static bool ggml_backend_rpc_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
      // check if src and dst are on the same server
      ggml_backend_buffer_t src_buffer = src->buffer;
      ggml_backend_rpc_buffer_context * src_ctx = (ggml_backend_rpc_buffer_context *)src_buffer->context;
      ggml_backend_buffer_t dst_buffer = dst->buffer;
      ggml_backend_rpc_buffer_context * dst_ctx = (ggml_backend_rpc_buffer_context *)dst_buffer->context;
      if (src_ctx->sock != dst_ctx->sock) {
-         return false;
+         return remote_copy_tensor(src, dst);
      }
In cpy_tensor you can only assume that the dst tensor is allocated in buffer. The src tensor may be allocated in any other buffer, including a buffer of a different type from a different backend. You cannot assume that the type of src_buffer->context is ggml_backend_rpc_buffer_context because it may be a different buffer type, so you need to check for that.
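A minimal sketch of the check this comment asks for, in the style other ggml backends use to recognize their own buffers. ggml_backend_buffer_is_rpc is an illustrative name, ggml_backend_rpc_buffer_get_name is assumed to be this buffer's get_name callback, and the elided same-server path is not shown.

// hypothetical helper: an RPC buffer is one whose interface uses this backend's get_name
static bool ggml_backend_buffer_is_rpc(ggml_backend_buffer_t buffer) {
    return buffer->iface.get_name == ggml_backend_rpc_buffer_get_name;
}

GGML_CALL static bool ggml_backend_rpc_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
    // only dst is guaranteed to be in this buffer; if src is not an RPC buffer,
    // return false so ggml falls back to a get_tensor/set_tensor copy
    if (!ggml_backend_buffer_is_rpc(src->buffer)) {
        return false;
    }
    ggml_backend_rpc_buffer_context * src_ctx = (ggml_backend_rpc_buffer_context *)src->buffer->context;
    ggml_backend_rpc_buffer_context * dst_ctx = (ggml_backend_rpc_buffer_context *)dst->buffer->context;
    if (src_ctx->sock != dst_ctx->sock) {
        return remote_copy_tensor(src, dst); // cross-server path from this PR
    }
    // same-server case: the existing copy path would go here (elided in this sketch)
    return false;
}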
memcpy(input.data(), &rpc_src, sizeof(rpc_src));
memcpy(input.data() + sizeof(rpc_src), &rpc_dst, sizeof(rpc_dst));
uint32_t dst_endpoint_size = dst_ctx->endpoint.size();
memcpy(input.data() + 2*sizeof(rpc_tensor), &dst_endpoint_size, sizeof(dst_endpoint_size));
memcpy(input.data() + 2*sizeof(rpc_tensor) + sizeof(dst_endpoint_size), dst_ctx->endpoint.c_str(), dst_endpoint_size);
This kind of pattern is very quickly becoming unreadable, which makes the code very hard to review. My suggestion is to make structs for all the messages/commands.
Agree, maybe we can have a common RPC command base structure that includes the size, a version number, and the command enum; each command then derives from it and appends its own parameters, like this example:
enum rpc_cmd {
...
}
struct rpc_cmd_base {
uint32_t size;
uint32_t version; // version number for the rpc command
rpc_cmd cmd;
};
struct rpc_cmd_xx: rpc_cmd_base {
uint32_t param1;
...
};
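To make the earlier memcpy sequence concrete, the remote-copy request could then be described by a struct like the one below. rpc_cmd_remote_copy_tensor is a hypothetical name; the fields simply mirror the quoted serialization, with the variable-length endpoint string following the fixed-size part on the wire.

// hypothetical message layout for REMOTE_COPY_TENSOR; the serialized request would be
// this struct followed by dst_endpoint_size bytes of the destination endpoint string
struct rpc_cmd_remote_copy_tensor : rpc_cmd_base {
    rpc_tensor src;                // source tensor descriptor
    rpc_tensor dst;                // destination tensor descriptor
    uint32_t   dst_endpoint_size;  // length of the endpoint string that follows
};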
@rgerganov any updates on RPC? I did some profiling, but I don't have ideas for a quick (small-diff) optimization. I'd like to see some refactoring of the llama.cpp RPC code.
@lexasub I don't have any ideas how to speed things up with a single RPC server. With multiple RPC servers, you can try to resurrect this patch and see if it makes things better for your use cases. My benchmarks back in the day didn't show any significant improvements, but I may have missed something.
@rgerganov I attempted to rebase this branch to resolve conflicts with the latest upstream changes, but the scope of the conflicts (especially in ggml-rpc.cpp and the buffer context handling) suggests that manual adjustments might be unavoidable. I've started reworking some sections locally, but I'm concerned about diverging from your intended approach. Question: have you been working on a more up-to-date version of this branch? If so, could you share it or highlight key changes that need to be preserved? This would help ensure alignment and avoid redundant work.
No, I am not working on this and I don't have updates. If you are going to work on this, my recommendation is to prepare a real setup with at least 3 hosts connected on a physical network and perform some benchmarks to have a baseline. Testing on the same physical host with servers running on localhost may not give relevant results.
@rgerganov Implementing an output-queue "pipeline" for the ggml client-server architecture has proven challenging because the code that consumes the output parameter of send_rpc_cmd sits right next to the call site. The output is meant to be written by the backend thread later, but deferring its use to the right point in the codebase is complex, especially with my limited familiarity with ggml's architecture (how and when should the caller fetch the data from the thread, given that the usage itself is complex?). The current focus is on getting it functional, but I'm concerned about inefficiencies such as blocking while waiting for the output to be populated, which could hinder parallel processing on the server side. The ongoing work can be tracked at https://github.com/lexasub/llama.cpp/tree/async-rpc-squashed/ggml/src/ggml-rpc (draft)
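One possible shape for this, sketched purely as an illustration: return a std::future for the output buffer so the caller only blocks at the point where it actually needs the reply. async_send_rpc_cmd is a made-up wrapper, and a real implementation would hand the closure to the dedicated backend thread rather than use std::async.

#include <cstdint>
#include <future>
#include <memory>
#include <vector>

// hypothetical async wrapper around the existing send_rpc_cmd(sock, cmd, input, output);
// socket_t and rpc_cmd are the types already used in ggml-rpc.cpp
static std::future<std::vector<uint8_t>> async_send_rpc_cmd(std::shared_ptr<socket_t> sock,
                                                            rpc_cmd cmd,
                                                            std::vector<uint8_t> input) {
    return std::async(std::launch::async, [sock, cmd, input = std::move(input)] {
        std::vector<uint8_t> output;
        send_rpc_cmd(sock, cmd, input, output); // existing synchronous helper
        return output;
    });
}

// usage: start the command, overlap other work, block only when the reply is needed
//   auto fut = async_send_rpc_cmd(sock, GET_TENSOR, input);
//   ... other work ...
//   std::vector<uint8_t> output = fut.get();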
@rgerganov, I previously considered using gRPC here, but I can't yet say whether it would have the desired effect. Does the llama.cpp RPC protocol transmit a lot of metadata (such as field names and delimiters), or is everything packed as efficiently as possible (not in terms of pragma pack, but in terms of field names, as I mentioned)? If a significant amount of metadata (like names) is currently being transmitted, I'm willing to research gRPC. We could also try compressing tensors before sending.
My initial implementation of the RPC backend was using gRPC and switching to a custom binary serialization improved the performance a lot: #6829 (comment)
This is an attempt to make copying tensors across servers more efficient. It introduces 2 new RPC commands:
HELLO - sent after establishing a connection to identify the remote party (client or server)
REMOTE_COPY_TENSOR - sent to the host which holds the source tensor, along with the destination tensor and the destination endpoint
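For illustration only, this is roughly how the server holding the source tensor might handle REMOTE_COPY_TENSOR under the description above. connect_to_endpoint and serialize_set_tensor are hypothetical helpers, and whether the PR reuses the existing SET_TENSOR command for the final write is an assumption.

#include <cstdint>
#include <string>
#include <vector>

// sketch of the source server's side of a cross-server copy (not the PR's actual code);
// src is assumed to have already been resolved from the rpc_tensor in the request
static bool handle_remote_copy_tensor(const ggml_tensor * src, const rpc_tensor & dst,
                                      const std::string & dst_endpoint) {
    // 1. read the source tensor from the local backend into a staging buffer
    std::vector<uint8_t> data(ggml_nbytes(src));
    ggml_backend_tensor_get(src, data.data(), 0, data.size());

    // 2. connect to the server that owns the destination tensor; HELLO would be sent
    //    here so the remote side knows a server (not a client) is connecting
    std::shared_ptr<socket_t> sock = connect_to_endpoint(dst_endpoint); // hypothetical helper

    // 3. push the data into the destination tensor; serialize_set_tensor is a
    //    hypothetical helper that packs (dst, offset, data) for the SET_TENSOR command
    std::vector<uint8_t> input = serialize_set_tensor(dst, data);
    std::vector<uint8_t> output;
    return send_rpc_cmd(sock, SET_TENSOR, input, output);
}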