
[BUG]: Dynamo+vLLM pd performance is slower than vllm #2552

@guyuankan

Description


Describe the Bug

I am using 8xH100 GPUs to test prefill/decode disaggregated performance, but I found that it is slower than plain vLLM.

results:

┃                            Statistic ┃       avg ┃      min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│             Time To First Token (ms) │  9,909.68 │ 1,676.80 │ 14,073.58 │ 14,049.07 │ 12,651.56 │ 12,555.37 │
│            Time To Second Token (ms) │    374.48 │    17.36 │  2,741.04 │  2,416.21 │  1,459.71 │    573.52 │
│                 Request Latency (ms) │ 15,939.69 │ 4,143.97 │ 29,710.23 │ 26,518.76 │ 17,887.69 │ 16,162.14 │
│             Inter Token Latency (ms) │     40.74 │    16.58 │    109.24 │    105.84 │     57.46 │     48.92 │
│     Output Token Throughput Per User │     28.99 │     9.15 │     60.32 │     58.55 │     44.59 │     38.38 │
│                    (tokens/sec/user) │           │          │           │           │           │           │
│      Output Sequence Length (tokens) │    149.00 │   149.00 │    149.00 │    149.00 │    149.00 │    149.00 │
│       Input Sequence Length (tokens) │  3,000.05 │ 3,000.00 │  3,001.00 │  3,001.00 │  3,000.00 │  3,000.00 │
│ Output Token Throughput (tokens/sec) │    292.03 │      N/A │       N/A │       N/A │       N/A │       N/A │
│         Request Throughput (per sec) │      1.96 │      N/A │       N/A │       N/A │       N/A │       N/A │
│                Request Count (count) │    128.00 │      N/A │       N/A │       N/A │       N/A │       N/A │

It is much slower than two vLLM instances with tp=4.
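As a quick sanity check on the table above, the reported averages are internally consistent with the usual latency decomposition (request latency ≈ TTFT + (output length − 1) × inter-token latency, and aggregate throughput ≈ request throughput × output length). The numbers below are copied from the table; the decomposition itself is an approximation, so agreement is only expected to within rounding:

```python
# Sanity-check the reported averages from the benchmark table.
ttft_ms = 9909.68            # Time To First Token (avg)
itl_ms = 40.74               # Inter Token Latency (avg)
osl = 149.00                 # Output Sequence Length
request_latency_ms = 15939.69

# request_latency ≈ TTFT + (osl - 1) * ITL
predicted_latency_ms = ttft_ms + (osl - 1) * itl_ms
print(round(predicted_latency_ms, 2))  # ≈ 15939.20, close to the reported 15939.69

# aggregate throughput ≈ request throughput * output length
req_per_sec = 1.96
agg_tok_per_sec = 292.03
predicted_agg = req_per_sec * osl
print(round(predicted_agg, 2))  # ≈ 292.04, matching the reported 292.03
```

This suggests the table itself is self-consistent, and the slowdown is dominated by the very high TTFT (~9.9 s average), i.e. the prefill side.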

Steps to Reproduce

Here is my startup script, based on the script referenced in #402:

function _start_70b_dynamo_pdsep_benchmark_instance() {
  model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  devices=$1
  port=$2
  tpsize=$3
  kvconfig=$4
  LOGFILE=$5

  gpu_memory=0.85
  other_params=''
  # Prefill (kv_producer) workers get extra flags and more GPU memory.
  if echo "$kvconfig" | grep -q kv_producer; then
    other_params='--is-prefill-worker --max-num-batched-tokens 3500'
    gpu_memory=0.95
  fi

  CUDA_VISIBLE_DEVICES=$devices nohup python3 \
    -m dynamo.vllm \
    --model $model \
    --max-model-len 3500 \
    --tensor-parallel-size $tpsize \
    --max-seq-len-to-capture 65536 \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --kv-cache-dtype fp8 \
    --block-size 128 \
    --kv-transfer-config "$kvconfig" \
    $other_params \
    --gpu-memory-utilization $gpu_memory >> "$LOGFILE" 2>&1 &
}

LOGFILE=/tmp/test.log

# Four prefill (kv_producer) workers, tp=1, one per GPU 0-3.
kvconfig='{"kv_connector":"DynamoNixlConnector","kv_role":"kv_producer"}'
_start_70b_dynamo_pdsep_benchmark_instance 0 8100 1 "$kvconfig" $LOGFILE
_start_70b_dynamo_pdsep_benchmark_instance 1 8101 1 "$kvconfig" $LOGFILE
_start_70b_dynamo_pdsep_benchmark_instance 2 8102 1 "$kvconfig" $LOGFILE
_start_70b_dynamo_pdsep_benchmark_instance 3 8103 1 "$kvconfig" $LOGFILE

# One decode (kv_consumer) worker, tp=4 across GPUs 4-7.
kvconfig='{"kv_connector":"DynamoNixlConnector","kv_role":"kv_consumer"}'
_start_70b_dynamo_pdsep_benchmark_instance 4,5,6,7 8200 4 "$kvconfig" $LOGFILE

nohup python3 -m dynamo.frontend --http-port 8000 >> $LOGFILE 2>&1 &
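The two `--kv-transfer-config` values above differ only in `kv_role`. A minimal sketch of building them programmatically, in case anyone wants to parameterize the script further; the connector name and role values come from the script itself, while the helper function is purely illustrative:

```python
import json

def kv_transfer_config(role: str) -> str:
    """Build a --kv-transfer-config JSON string for a given KV role.

    "DynamoNixlConnector" and the kv_producer/kv_consumer roles are taken
    from the startup script above; this helper is just an illustration.
    """
    assert role in ("kv_producer", "kv_consumer")
    return json.dumps({"kv_connector": "DynamoNixlConnector", "kv_role": role})

producer_cfg = kv_transfer_config("kv_producer")
consumer_cfg = kv_transfer_config("kv_consumer")
print(producer_cfg)
```

Note that `json.dumps` inserts spaces after colons and commas, unlike the hand-written strings in the script; both are valid JSON and should parse identically.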

Expected Behavior

Performance comparable to what is reported in #402.

Actual Behavior

xxx

Environment

Name: ai-dynamo
Version: 0.4.0
Summary: Distributed Inference Framework
Home-page:
Author:
Author-email: "NVIDIA Inc." sw-dl-dynamo@nvidia.com
License: Apache-2.0

Additional Context

No response

Screenshots

No response

Labels

bug (Something isn't working)