Description
Describe your usage question
Environment
Version: 0.3.7post2
Deployment: vLLM + Mooncake Store (Embedded mode)
Configuration:
Mooncake master flags:
--rpc_port {rpc_port} --rpc_address {mooncake_ip} --enable_http_metadata_server=true
--http_metadata_server_host={mooncake_ip} --http_metadata_server_port={http_metadata_server_port}
--rpc_thread_num 8 --default_kv_lease_ttl 10000 --eviction_ratio 0.05
--eviction_high_watermark_ratio 0.9 --metrics_port 9004
Mooncake store config:
{
  "local_hostname": "{mooncake_ip}",
  "metadata_server": "http://{mooncake_ip}:{http_metadata_server_port}/metadata",
  "global_segment_size": 32212254720,
  "local_buffer_size": 1073741824,
  "protocol": "tcp",
  "transfer_timeout": "20",
  "device_name": "",
  "master_server_address": "{mooncake_ip}:{rpc_port}",
  "fast_transfer": true,
  "fast_transfer_buffer_size": 1,
  "replica_num": 3
}
Problem Description:
Under continuous request load, scaling in (downscaling) vLLM instances causes inference failures.
Specifically, the following error is reported repeatedly during the client_ttl window:
tcp_transport.cpp:528] TcpTransport::startTransfer encountered an ASIO exception. Slice details - source_addr: 0xfff16fffff10, length: 47, opcode: 0, target_id: 5. Exception: connect: Connection refused
[P_0] : I20251220 18:02:33.944829 281444442173856 tcp_transport.cpp:479] [TcpTransport] Resolving host: 10.170.27.89, port: 15172
[P_0] : I20251220 18:02:33.944850 281444442173856 tcp_transport.cpp:487] [TcpTransport] Resolved endpoint: 10.170.27.89:15172
[P_0] : I20251220 18:02:33.944863 281444442173856 tcp_transport.cpp:492] [TcpTransport] Connecting to resolved endpoints...
[P_0] : E20251220 18:02:33.945003 281444442173856 tcp_transport.cpp:528] TcpTransport::startTransfer encountered an ASIO exception. Slice details - source_addr: 0xfff16fffff40, length: 47, opcode: 0, target_id: 5. Exception: connect: Connection refused
[P_0] : I20251220 18:02:33.945027 281444442173856 tcp_transport.cpp:479] [TcpTransport] Resolving host: fe80::1159:c2ee:1944:847b%enp67s0f5, port: 15320
[P_0] : I20251220 18:02:33.945060 281444442173856 tcp_transport.cpp:487] [TcpTransport] Resolved endpoint: fe80::1159:c2ee:1944:847b%enp67s0f5:15320
[P_0] : I20251220 18:02:33.945078 281444442173856 tcp_transport.cpp:492] [TcpTransport] Connecting to resolved endpoints...
[P_0] : I20251220 18:02:33.945183 281444442173856 tcp_transport.cpp:497] [TcpTransport] Connected to fe80::1159:c2ee:1944:847b%enp67s0f5:15320
[P_0] : I20251220 18:02:33.945207 281444442173856 tcp_transport.cpp:515] [TcpTransport] Initiating session. slice_addr=0xfff16fffff70, dest_addr=fff1e4033240, len=47, opcode=0
[P_0] : E20251220 18:02:33.945256 281444442173856 transfer_task.cpp:247] Transfer failed for batch 281413842521600 task 0 with status 6
[P_0] : E20251220 18:02:33.945265 281444442173856 client.cpp:727] Transfer failed for key: 9b72145d726168f8f9913fc69032f2b6c545a05df0a09d860ac67e4dfccb6a2d_metadata with error: -800
[P_0] : E20251220 18:02:33.945274 281444442173856 transfer_task.cpp:247] Transfer failed for batch 281413842148928 task 0 with status 6
[P_0] : E20251220 18:02:33.945280 281444442173856 client.cpp:727] Transfer failed for key: fec27c20a6c064576b9ceed6229b6aa3cb9871a388d21ff5d60b79eb8e189495_metadata with error: -800
[P_0] : I20251220 18:02:33.945276 281444391448992 tcp_transport.cpp:507] [TcpTransport] Slice transfer success.
[P_0] : E20251220 18:02:33.945303 281444442173856 transfer_task.cpp:247] Transfer failed for batch 281413842638256 task 0 with status 6
[P_0] : E20251220 18:02:33.945316 281444442173856 client.cpp:727] Transfer failed for key: 474b2e45ee1aee2fe0c9770f9ea7a88b85f9aee4b87ec2698837f60d853aa6d2_metadata with error: -800
[P_0] : E20251220 18:02:33.945322 281444442173856 transfer_task.cpp:247] Transfer failed for batch 281413843099968 task 0 with status 6
[P_0] : E20251220 18:02:33.945328 281444442173856 client.cpp:727] Transfer failed for key: 6453d0a8e126b344e2b3b7067734f5aded3ebccaf5adf6e23d03dbd1eafbae01_metadata with error: -800
[P_0] : E20251220 18:02:33.945333 281444442173856 transfer_task.cpp:247] Transfer failed for batch 281413842057520 task 0 with status 6
This error directly impacts inference availability.
Reproduction Scenario:
1. Continuously send inference requests (roughly the load loop sketched after this list).
2. Scale in (reduce) the number of vLLM instances.
3. Within the client_ttl period, some requests fail with Transfer failed for key errors.
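For reference, the request load during reproduction is roughly the following loop against the vLLM OpenAI-compatible endpoint (the URL, model name, and prompt are placeholders for our deployment):

```python
import time
import requests

# Minimal load generator kept running while vLLM instances are scaled in and out.
URL = "http://vllm-service:8000/v1/completions"     # placeholder endpoint
PAYLOAD = {"model": "my-model", "prompt": "Hello", "max_tokens": 64}

while True:
    try:
        resp = requests.post(URL, json=PAYLOAD, timeout=30)
        if resp.status_code != 200:
            print(f"request failed: {resp.status_code} {resp.text[:200]}")
    except requests.RequestException as exc:
        print(f"request error: {exc}")
    time.sleep(0.05)  # keep continuous pressure on the KV-cache transfer path
```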
I have two questions:
1. Graceful shutdown support
Does Mooncake support a graceful shutdown mechanism? For example, after a store client receives a termination signal, can it immediately stop serving KV requests and proactively migrate its KV data to other storage nodes, so that Transfer failed errors are avoided during scale-in? The pseudocode sketch below illustrates the behavior we are hoping for.
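This is only an illustration of the desired shutdown ordering (drain, migrate, deregister, exit); none of the helper names below are real Mooncake APIs, they are hypothetical stubs:

```python
import signal
import sys

# Hypothetical shutdown flow we would like from a store client on scale-in.
# The three helpers are stubs that only name the steps we are asking about.

def stop_accepting_kv_requests():
    print("stop serving new KV reads/writes")

def migrate_segments_to_peers():
    print("re-replicate locally held KV segments to surviving nodes")

def deregister_from_master():
    print("remove this segment from the master's routing metadata")

def on_sigterm(signum, frame):
    stop_accepting_kv_requests()
    migrate_segments_to_peers()
    deregister_from_master()
    sys.exit(0)   # only exit after in-flight transfers can no longer target this node

signal.signal(signal.SIGTERM, on_sigterm)
```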
2. Replica behavior during scale-in
With replica_num = 3, we scale in a single storage instance.
Our expectation is that if KV data cannot be fetched from the scaled-in node, retries should succeed by fetching the same KV data from another replica.
In practice, however, even after retrying for several minutes, some KV entries still cannot be retrieved (our retry logic is roughly the loop sketched below).
What could be the cause of this behavior?
Is this expected under the current Mooncake replication and consistency model?
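For completeness, our retry behavior is approximately the loop below; `store.get` here stands in for the actual store-client read call in our integration and is not a specific Mooncake API name:

```python
import time

def get_with_retry(store, key, attempts=10, backoff_s=0.5):
    """Retry a KV read, expecting another replica to serve it if one
    replica's node has been scaled in. `store.get` is a placeholder for
    the real read path in our client code."""
    for attempt in range(attempts):
        try:
            return store.get(key)
        except Exception as exc:  # e.g. the -800 transfer failure surfaced to us
            print(f"get({key!r}) attempt {attempt + 1} failed: {exc}")
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError(f"KV entry {key!r} unavailable after {attempts} attempts")
```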
Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation