Describe the Bug
I am using 8x H100 GPUs to test prefill/decode disaggregated serving performance, but the measured latency and throughput are worse than plain vLLM.
Results:
| Statistic | avg | min | max | p99 | p90 | p75 |
|---|---|---|---|---|---|---|
| Time To First Token (ms) | 9,909.68 | 1,676.80 | 14,073.58 | 14,049.07 | 12,651.56 | 12,555.37 |
| Time To Second Token (ms) | 374.48 | 17.36 | 2,741.04 | 2,416.21 | 1,459.71 | 573.52 |
| Request Latency (ms) | 15,939.69 | 4,143.97 | 29,710.23 | 26,518.76 | 17,887.69 | 16,162.14 |
| Inter Token Latency (ms) | 40.74 | 16.58 | 109.24 | 105.84 | 57.46 | 48.92 |
| Output Token Throughput Per User (tokens/sec/user) | 28.99 | 9.15 | 60.32 | 58.55 | 44.59 | 38.38 |
| Output Sequence Length (tokens) | 149.00 | 149.00 | 149.00 | 149.00 | 149.00 | 149.00 |
| Input Sequence Length (tokens) | 3,000.05 | 3,000.00 | 3,001.00 | 3,001.00 | 3,000.00 | 3,000.00 |
| Output Token Throughput (tokens/sec) | 292.03 | N/A | N/A | N/A | N/A | N/A |
| Request Throughput (per sec) | 1.96 | N/A | N/A | N/A | N/A | N/A |
| Request Count (count) | 128.00 | N/A | N/A | N/A | N/A | N/A |
This is much slower than running 2x vLLM instances with tp=4 each.
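For reference, a minimal sketch of what the "2x vLLM with tp=4" baseline could look like, assuming two independent vLLM OpenAI-compatible servers splitting the 8 GPUs; the ports, log paths, and exact flags here are illustrative assumptions, not the exact baseline commands used:

```bash
# Assumed baseline: two standalone vLLM servers, tp=4 each (flags illustrative)
model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python3 -m vllm.entrypoints.openai.api_server \
    --model "$model" --tensor-parallel-size 4 --max-model-len 3500 \
    --kv-cache-dtype fp8 --port 8300 >> /tmp/vllm_a.log 2>&1 &

CUDA_VISIBLE_DEVICES=4,5,6,7 nohup python3 -m vllm.entrypoints.openai.api_server \
    --model "$model" --tensor-parallel-size 4 --max-model-len 3500 \
    --kv-cache-dtype fp8 --port 8301 >> /tmp/vllm_b.log 2>&1 &
```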
Steps to Reproduce
Here is my startup script, based on the script referenced in #402:
```bash
# Launch one dynamo.vllm worker on the given GPUs.
# Args: <cuda devices> <port> <tp size> <kv-transfer-config JSON> <log file>
function _start_70b_dynamo_pdsep_benchmark_instance() {
    model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
    devices=$1
    port=$2        # currently not passed to the worker command below
    tpsize=$3
    kvconfig=$4
    LOGFILE=$5
    gpu_memory=0.85
    other_params=''   # extra flags; intentionally unquoted below so they word-split
    # Prefill (kv_producer) workers get prefill-specific flags and more GPU memory.
    if echo "$kvconfig" | grep -q kv_producer; then
        other_params='--is-prefill-worker --max-num-batched-tokens 3500'
        gpu_memory=0.95
    fi
    CUDA_VISIBLE_DEVICES=$devices nohup python3 \
        -m dynamo.vllm \
        --model "$model" \
        --max-model-len 3500 \
        --tensor-parallel-size "$tpsize" \
        --max-seq-len-to-capture 65536 \
        --no-enable-prefix-caching \
        --disable-log-requests \
        --kv-cache-dtype fp8 \
        --block-size 128 \
        --kv-transfer-config "$kvconfig" \
        $other_params \
        --gpu-memory-utilization "$gpu_memory" >> "$LOGFILE" 2>&1 &
}

LOGFILE=/tmp/test.log

# 4 prefill workers (kv_producer), one GPU each (tp=1)
kvconfig='{"kv_connector":"DynamoNixlConnector","kv_role":"kv_producer"}'
_start_70b_dynamo_pdsep_benchmark_instance 0 8100 1 "$kvconfig" "$LOGFILE"
_start_70b_dynamo_pdsep_benchmark_instance 1 8101 1 "$kvconfig" "$LOGFILE"
_start_70b_dynamo_pdsep_benchmark_instance 2 8102 1 "$kvconfig" "$LOGFILE"
_start_70b_dynamo_pdsep_benchmark_instance 3 8103 1 "$kvconfig" "$LOGFILE"

# 1 decode worker (kv_consumer) on GPUs 4-7 (tp=4)
kvconfig='{"kv_connector":"DynamoNixlConnector","kv_role":"kv_consumer"}'
_start_70b_dynamo_pdsep_benchmark_instance 4,5,6,7 8200 4 "$kvconfig" "$LOGFILE"

# OpenAI-compatible HTTP frontend
nohup python3 -m dynamo.frontend --http-port 8000 >> "$LOGFILE" 2>&1 &
```
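Once the workers and the frontend are up, the frontend is exercised over its OpenAI-compatible HTTP API on port 8000. A minimal smoke-test request could look like the sketch below; the `/v1/chat/completions` path and payload follow the standard OpenAI chat-completions schema and are an assumption, not the actual benchmark client that produced the table above:

```bash
# Hypothetical smoke test against the dynamo frontend started above (port 8000)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 150,
        "stream": false
      }'
```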
Expected Behavior
Performance comparable to the numbers reported in #402.
Actual Behavior
Latency and throughput are significantly worse than the vLLM baseline (see the results table above).
Environment
```
Name: ai-dynamo
Version: 0.4.0
Summary: Distributed Inference Framework
Home-page:
Author:
Author-email: "NVIDIA Inc." <sw-dl-dynamo@nvidia.com>
License: Apache-2.0
```
Additional Context
No response
Screenshots
No response