[V1][P/D] A native implementation of xPyD based on P2P NCCL #18242
Conversation
I did initial performance benchmarking. It seems to be functional; please clean up the logs. Perf TL;DR: there is still room for optimization (~30 ms overhead for 10,000 tokens). Some performance numbers on a 2P1D workload compare this PR against Dynamo (an unofficial version from an old vLLM PR); see the attached tables.
LGTM. I can do a round of docs and write a one-click example after this PR gets merged.
@cyber-pioneer I didn't quite understand your question. In your table, when the request rate is 4, the P99 TTFT suddenly increases, indicating that the service has reached its bottleneck. The longer the pressure test runs, the larger the TTFT becomes.
Thanks for your response. You are right. When the request rate is set to 4, all latency metrics degrade severely. A rate of 2 might be a more balanced choice, especially for throughput.
/gemini review
Code Review
This pull request introduces a native implementation of xPyD based on P2P NCCL, enabling dynamic scaling for vLLM instances. It includes changes to the documentation, proxy server, NCCL wrapper, KV connector factory, P2P NCCL connector, tensor memory pool, and NCCL engine. The changes aim to improve performance and scalability for distributed inference.
An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.
Detailed Design
Overall Process
As shown in Figure 1, the overall process of this PD disaggregation solution is described through a request flow:

1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
2. The Proxy/Router selects a 1P1D pair, generates a corresponding `request_id` (rules to be introduced later), modifies the `max_tokens` in the HTTP request message to 1, and then forwards the request to the P instance.
3. The Proxy/Router also forwards the original HTTP request to the D instance.
4. The P instance performs Prefill, and the computed KV cache is transferred to the D instance; the D instance's `zmq_addr` can be resolved through the `request_id`.
5. The D instance receives the KV cache into a GPU memory buffer whose size is determined by the vLLM startup parameter `kv_buffer_size`. When the GPU buffer is full, the KV cache is stored in the local Tensor memory pool.
6. The D instance retrieves the KV cache from the GPU buffer or the Tensor memory pool, performs Decode without recomputing Prefill, and returns the result to the Proxy/Router, which forwards it to the client.
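To make steps 2 and 3 concrete, below is a minimal, illustrative sketch of the proxy-side forwarding logic. It is not the actual `disagg_prefill_proxy_xpyd.py`: the instance tables, the `X-Request-Id` header, and the `request_id` layout are assumptions made for illustration only.

```python
import itertools
import uuid

import requests

# Hypothetical registries, filled by the service-discovery thread described in
# the next section (http_addr -> zmq_addr).
PREFILL_INSTANCES = {"10.0.1.2:20001": "10.0.1.2:21001"}
DECODE_INSTANCES = {"10.0.1.3:20002": "10.0.1.3:22001"}

_prefill_cycle = itertools.cycle(list(PREFILL_INSTANCES.items()))
_decode_cycle = itertools.cycle(list(DECODE_INSTANCES.items()))


def handle_completion(request_json: dict) -> dict:
    """Round-robin 1P1D selection, then Prefill on P and Decode on D."""
    p_http, p_zmq = next(_prefill_cycle)
    d_http, d_zmq = next(_decode_cycle)

    # Hypothetical request_id that embeds both zmq_addrs; passing it via an
    # X-Request-Id header is also an assumption for this sketch.
    request_id = f"___prefill_addr_{p_zmq}___decode_addr_{d_zmq}_{uuid.uuid4().hex}"
    headers = {"X-Request-Id": request_id}

    # Step 2: force max_tokens=1 so the P instance only performs Prefill and
    # produces the KV cache.
    prefill_request = dict(request_json, max_tokens=1)
    requests.post(f"http://{p_http}/v1/completions",
                  json=prefill_request, headers=headers, timeout=300)

    # Step 3: forward the original request to the D instance, which consumes
    # the transferred KV cache and performs Decode only.
    response = requests.post(f"http://{d_http}/v1/completions",
                             json=request_json, headers=headers, timeout=300)
    return response.json()
```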
Proxy/Router (Demo)

A simple HTTP service acts as the entry point for client requests. It starts a background thread that listens for P/D instances reporting their HTTP IP and PORT as well as their ZMQ IP and PORT, and it maintains a dictionary of `http_addr -> zmq_addr`. The `http_addr` is the IP:PORT used for requests to the vLLM instance, while the `zmq_addr` is the address used for the KV cache handshake and metadata reception.

The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and for generating a corresponding `request_id` (an illustrative format is sketched at the end of this section). Currently, to quickly verify that xPyD works, a round-robin selection of 1P1D is used. In the future, a trie combined with the load status of the instances is planned for selecting appropriate P and D instances.

Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (currently every 3 seconds) to register (i.e., report `http_addr -> zmq_addr`) and keep the connection alive. If an instance crashes and fails to send a ping for a certain period of time, the Proxy/Router will remove the timed-out instance (this feature has not yet been developed).
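Below is a minimal sketch of the bookkeeping described above: the registration dictionary refreshed by heartbeats, the (not yet implemented) eviction of silent instances, and an illustrative `request_id` layout that embeds both `zmq_addr`s. The exact message fields and encoding are assumptions, not taken verbatim from this PR.

```python
import time

# http_addr -> (zmq_addr, last_heartbeat_time), maintained by the proxy's
# background listener thread.
prefill_registry: dict = {}
decode_registry: dict = {}

HEARTBEAT_TIMEOUT_S = 10.0  # instances ping every ~3 s, so longer silence is suspicious


def on_heartbeat(role: str, http_addr: str, zmq_addr: str) -> None:
    """Register (or refresh) an instance when its heartbeat packet arrives."""
    registry = prefill_registry if role == "P" else decode_registry
    registry[http_addr] = (zmq_addr, time.time())


def evict_stale_instances() -> None:
    """Drop instances that stopped pinging (the PR notes this is not developed yet)."""
    now = time.time()
    for registry in (prefill_registry, decode_registry):
        for http_addr, (_, last_seen) in list(registry.items()):
            if now - last_seen > HEARTBEAT_TIMEOUT_S:
                del registry[http_addr]


def make_request_id(prefill_zmq: str, decode_zmq: str, suffix: str) -> str:
    """Illustrative request_id: embedding both zmq_addrs lets the P and D
    instances resolve their counterpart directly from the request."""
    return f"___prefill_addr_{prefill_zmq}___decode_addr_{decode_zmq}_{suffix}"
```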
KV Cache Transfer Methods

There are three methods for KV cache transfer: PUT, GET, and PUT_ASYNC. The method is selected through the `send_type` field of `kv_connector_extra_config` in the `--kv-transfer-config` parameter. Both PUT and PUT_ASYNC involve the P instance actively sending the KV cache to the D instance. The difference is that PUT is a synchronous transfer that blocks the main process, while PUT_ASYNC sends the KV cache from a dedicated thread and therefore does not block the main process. In contrast, with GET the P instance saves the KV cache to its memory buffer after computing the Prefill, and the D instance actively retrieves it from the P instance once it has allocated space for it.

Experimental results have shown that the performance of these methods, from highest to lowest, is: PUT_ASYNC → GET → PUT.
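The control flow of the three modes can be summarized with the schematic sketch below. The `nccl_send` stub stands in for the actual NCCL transfer; the class and method names are illustrative, not the connector's real API.

```python
import queue
import threading


def nccl_send(kv_cache, dst_zmq_addr) -> None:
    """Stub standing in for the actual point-to-point NCCL transfer."""


class PSideTransfer:
    """Schematic view of PUT, PUT_ASYNC, and GET on the P instance."""

    def __init__(self) -> None:
        self._send_queue: queue.Queue = queue.Queue()
        self._get_buffer: dict = {}  # request_id -> KV cache awaiting a GET
        threading.Thread(target=self._sender_loop, daemon=True).start()

    def put(self, request_id, kv_cache, dst) -> None:
        # PUT: synchronous; blocks the main process until the transfer completes.
        nccl_send(kv_cache, dst)

    def put_async(self, request_id, kv_cache, dst) -> None:
        # PUT_ASYNC: enqueue and return; a dedicated thread performs the send,
        # so the main process is never blocked.
        self._send_queue.put((kv_cache, dst))

    def stage_for_get(self, request_id, kv_cache) -> None:
        # GET: keep the KV cache in the local memory buffer; the D instance
        # pulls it once it has allocated space for it.
        self._get_buffer[request_id] = kv_cache

    def _sender_loop(self) -> None:
        while True:
            kv_cache, dst = self._send_queue.get()
            nccl_send(kv_cache, dst)
```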
P2P Communication via ZMQ & NCCL
As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank or world size. This supports dynamic scaling (expansion and contraction) of instances with PD disaggregation: adding or removing P/D instances does not require a full system restart.

Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ server, which runs a dedicated thread listening on the `zmq_addr` address to receive control-flow requests from other instances. These include requests to establish an NCCL connection and requests to send KV cache metadata (such as tensor shapes and data types); the KV cache data itself is not transmitted over this channel.

When a P instance and a D instance transmit KV cache for the first time, they establish a ZMQ connection and an NCCL group. For subsequent KV cache transmissions, this ZMQ connection and NCCL group are reused. Each NCCL group consists of only two ranks, i.e., the world size is 2. This design is what enables dynamic scaling: as long as the counterpart's address is known, point-to-point KV cache transmission can be performed, without being restricted by rank or world size.
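Below is a sketch of what the control-flow side of such an engine could look like, assuming `pyzmq` for the ZMQ server and a hypothetical `create_two_rank_nccl_group` helper in place of the real NCCL wrapper; it is illustrative, not the `P2pNcclEngine` implementation.

```python
import threading

import zmq


def create_two_rank_nccl_group(nccl_unique_id: bytes, rank: int):
    """Hypothetical helper that builds a world-size-2 NCCL communicator."""


class HandshakeServer:
    """ZMQ server thread: accepts connection requests and KV cache metadata;
    the KV cache tensors themselves travel over the per-pair NCCL groups."""

    def __init__(self, zmq_addr: str) -> None:
        self._ctx = zmq.Context()
        self._sock = self._ctx.socket(zmq.REP)
        self._sock.bind(f"tcp://{zmq_addr}")
        self._comms: dict = {}  # remote zmq_addr -> cached 2-rank NCCL group
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self) -> None:
        while True:
            msg = self._sock.recv_pyobj()
            if msg["type"] == "new_connection":
                # First transfer for this P/D pair: create the 2-rank group and
                # cache it; later transfers reuse the socket and the group.
                comm = create_two_rank_nccl_group(msg["nccl_unique_id"], rank=1)
                self._comms[msg["remote_zmq_addr"]] = comm
            elif msg["type"] == "kv_metadata":
                # Tensor shapes and dtypes arrive here, not the tensors.
                pass
            self._sock.send_pyobj({"ok": True})
```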
NCCL Group Topology
Currently, only symmetric TP (Tensor Parallelism) is supported for KV cache transmission. Asymmetric TP and PP (Pipeline Parallelism) will be supported in the future. Figure 2 illustrates a 1P2D setup in which each instance has a TP degree of 2. There are a total of 7 NCCL groups: the three vLLM instances each have one internal NCCL group with TP=2; additionally, the 0th GPU of the P instance forms an NCCL group with the 0th GPU of each D instance, and likewise the 1st GPU of the P instance forms an NCCL group with the 1st GPU of each D instance.

Each NCCL group occupies a certain amount of GPU memory buffer for communication, the size of which is primarily influenced by the `NCCL_MAX_NCHANNELS` environment variable. With `NCCL_MAX_NCHANNELS=16`, an NCCL group typically occupies 100 MB; with `NCCL_MAX_NCHANNELS=8`, it usually takes up 52 MB. For large-scale xPyD configurations, such as DeepSeek's 96P144D, this implementation is currently not feasible. Moving forward, we are considering using RDMA for point-to-point communication and are also keeping an eye on UCCL.
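To make the counting and memory arithmetic above concrete, here is a small helper. It is my own back-of-the-envelope extrapolation using the per-group figures quoted in this section, and the per-GPU membership count assumes the symmetric-TP topology of Figure 2.

```python
def nccl_group_stats(num_p: int, num_d: int, tp: int, mb_per_group: int = 100):
    """Count NCCL groups in a symmetric-TP xPyD setup and estimate per-GPU
    communication-buffer memory.

    - every instance has one internal TP group: num_p + num_d groups
    - GPU i of each P instance forms a 2-rank group with GPU i of each D
      instance: num_p * num_d * tp cross-instance groups
    - so one P GPU belongs to 1 + num_d groups, and one D GPU to 1 + num_p
    """
    total_groups = (num_p + num_d) + num_p * num_d * tp
    mb_per_p_gpu = (1 + num_d) * mb_per_group
    mb_per_d_gpu = (1 + num_p) * mb_per_group
    return total_groups, mb_per_p_gpu, mb_per_d_gpu


# Figure 2's 1P2D with TP=2: 3 internal + 4 cross-instance groups = 7 in total.
print(nccl_group_stats(1, 2, 2))                      # (7, 300, 200)
# DeepSeek-scale 96P144D with NCCL_MAX_NCHANNELS=8 (~52 MB per group): each P GPU
# would pin roughly 7.5 GB just for NCCL buffers, which is why this is infeasible.
print(nccl_group_stats(96, 144, 1, mb_per_group=52))  # (14064, 7540, 5044)
```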
GPU Memory Buffer and Tensor Memory Pool

The trade-off in the size of the memory buffer is as follows. For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes, and it should not be too large; the same applies to P instances in GET mode. The memory buffer of a D instance temporarily stores KV cache sent by P instances; if it is too large, it reduces the KV cache space available for normal inference on the D instance, thereby decreasing the inference batch size and ultimately lowering output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the GPU memory size.

If the `--max-num-seqs` parameter of P instances is set to a large value, the large batch size makes the P instances generate a large amount of KV cache simultaneously. This may exceed the capacity of the D instances' memory buffer, resulting in KV cache loss. Once KV cache is lost, the D instance has to recompute Prefill, which is equivalent to performing Prefill twice; consequently, the time-to-first-token (TTFT) increases significantly and performance degrades.

To address this issue, I designed and developed a local Tensor memory pool for storing KV cache, inspired by the buddy system used in the Linux memory subsystem. Since server memory is sufficiently large, typically in the TB range, there is no need to consider prefix caching or block-based designs to reuse memory and save space. When the memory buffer is insufficient, KV cache can be stored directly in the Tensor memory pool, and D instances can subsequently retrieve it from there. The read/write speed is that of PCIe (roughly 21 GB/s for PCIe 4.0), which is usually faster than the Prefill speed; otherwise, solutions like Mooncake and LMCache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst case, my solution performs no worse than the normal situation with a cache store.
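The snippet below is only a much-simplified illustration of the idea (power-of-two size classes over pinned host memory, so spills and refills run at PCIe speed); it is not the buddy-system implementation from this PR.

```python
import torch


class SimpleTensorPool:
    """Toy host-side spill pool for KV cache tensors, for illustration only."""

    def __init__(self) -> None:
        self._free: dict = {}    # size class (bytes) -> list of free pinned buffers
        self._stored: dict = {}  # request_id -> stored flat copy

    def _acquire(self, nbytes: int) -> torch.Tensor:
        size = 1 << (nbytes - 1).bit_length()  # round up to a power of two
        bucket = self._free.setdefault(size, [])
        return bucket.pop() if bucket else torch.empty(
            size, dtype=torch.uint8, pin_memory=True)

    def put(self, request_id: str, kv: torch.Tensor) -> None:
        flat = kv.contiguous().view(torch.uint8).view(-1)
        buf = self._acquire(flat.numel())
        buf[: flat.numel()].copy_(flat)        # device -> pinned host, PCIe speed
        self._stored[request_id] = buf[: flat.numel()]

    def get(self, request_id: str, like: torch.Tensor) -> torch.Tensor:
        flat = self._stored.pop(request_id)
        out = torch.empty_like(like)
        out.view(torch.uint8).view(-1).copy_(flat)  # pinned host -> device
        # A real pool would return the power-of-two buffer to self._free here.
        return out
```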
Install vLLM
Run xPyD
Instructions

- `kv_buffer_size` (in bytes): the empirical value is 10% of the GPU memory size. It is related to the KV cache size: if it is too small, the GPU memory buffer for temporarily storing the received KV cache will overflow, causing the KV cache to spill into the Tensor memory pool and increasing latency; if it is too large, the KV cache available for inference will shrink, leading to a smaller batch size and decreased throughput.
- For P instances, when not using GET mode, `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive KV cache. When using GET mode, however, a larger `kv_buffer_size` is required because it must hold the KV cache to be sent to the D instances.
- You may need to adjust `kv_buffer_size` and `port` in the following commands (if there is a conflict).
- `PUT_ASYNC` offers the best performance and should be prioritized.
- `--port` must be consistent with the `http_port` in the `--kv-transfer-config`.
- The `disagg_prefill_proxy_xpyd.py` script uses port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances).
- The node running the proxy must have `quart` installed.
- To run across multiple nodes, modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config` accordingly.
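For instance, the 10% rule of thumb from the first item can be turned into a concrete value like this (assuming a single visible GPU and that PyTorch is installed):

```python
import torch

# kv_buffer_size ~= 10% of GPU memory (in bytes); on an 80 GB A800 this is ~8e9.
total_bytes = torch.cuda.get_device_properties(0).total_memory
kv_buffer_size = int(total_bytes * 0.10)
print(kv_buffer_size)

# For P instances in PUT/PUT_ASYNC mode, kv_buffer_size can simply be set to 1.
```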
Run 1P3D
Proxy (e.g. 10.0.1.1)
Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
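The exact launch commands are in the PR's example scripts. As a hedged sketch of how the per-instance `--kv-transfer-config` value could be assembled: the connector name and the `kv_role` strings below are assumptions, while `send_type`, `proxy_ip`, `proxy_port`, `http_port`, and `kv_buffer_size` come from the description above.

```python
import json


def kv_transfer_config(role: str, http_port: int, proxy_ip: str = "10.0.1.1",
                       proxy_port: int = 30001, kv_buffer_size: float = 8e9) -> str:
    """Build the JSON string passed to --kv-transfer-config for one instance."""
    return json.dumps({
        "kv_connector": "P2pNcclConnector",  # assumed connector name
        "kv_role": "kv_producer" if role == "P" else "kv_consumer",  # assumed values
        # P instances need no receive buffer in PUT/PUT_ASYNC mode (see Instructions).
        "kv_buffer_size": 1 if role == "P" else int(kv_buffer_size),
        "kv_connector_extra_config": {
            "send_type": "PUT_ASYNC",
            "proxy_ip": proxy_ip,
            "proxy_port": str(proxy_port),
            "http_port": str(http_port),     # must match this instance's --port
        },
    })


# e.g. Prefill1: vllm serve <model> --port 20001 --kv-transfer-config '<printed JSON>'
print(kv_transfer_config("P", 20001))
# e.g. Decode1:  vllm serve <model> --port 20002 --kv-transfer-config '<printed JSON>'
print(kv_transfer_config("D", 20002))
```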
Run 3P1D
Proxy (e.g. 10.0.1.1)
Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
Single request
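As an example, once the proxy is up on its client-facing port (10001, per the instructions above), a single completion request could be sent like this; the model name is a placeholder for whatever model the P/D instances serve:

```python
import requests

response = requests.post(
    "http://10.0.1.1:10001/v1/completions",
    json={
        "model": "your-model-name",   # placeholder: must match the served model
        "prompt": "San Francisco is a",
        "max_tokens": 64,
        "temperature": 0,
    },
    timeout=60,
)
print(response.json())
```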
Benchmark
Shut down
Test data
Scenario 1: 1K input & 1K output tokens, E2E P99 latency ~20s
1P5D (6×A800) vs vLLM (1×A800):
1P6D (7×A800) vs vLLM (1×A800):
Scenario 2: 1K input & 200 output tokens, E2E P99 latency ~4s
TODO
In this PR
In the following PRs