Describe the Bug
I am using 8x H100 GPUs to test prefill/decode disaggregated serving performance, but the measured latency and throughput are worse than plain vLLM.
Results:
| Statistic | avg | min | max | p99 | p90 | p75 |
|---|---|---|---|---|---|---|
| Time To First Token (ms) | 9,909.68 | 1,676.80 | 14,073.58 | 14,049.07 | 12,651.56 | 12,555.37 |
| Time To Second Token (ms) | 374.48 | 17.36 | 2,741.04 | 2,416.21 | 1,459.71 | 573.52 |
| Request Latency (ms) | 15,939.69 | 4,143.97 | 29,710.23 | 26,518.76 | 17,887.69 | 16,162.14 |
| Inter Token Latency (ms) | 40.74 | 16.58 | 109.24 | 105.84 | 57.46 | 48.92 |
| Output Token Throughput Per User (tokens/sec/user) | 28.99 | 9.15 | 60.32 | 58.55 | 44.59 | 38.38 |
| Output Sequence Length (tokens) | 149.00 | 149.00 | 149.00 | 149.00 | 149.00 | 149.00 |
| Input Sequence Length (tokens) | 3,000.05 | 3,000.00 | 3,001.00 | 3,001.00 | 3,000.00 | 3,000.00 |
| Output Token Throughput (tokens/sec) | 292.03 | N/A | N/A | N/A | N/A | N/A |
| Request Throughput (per sec) | 1.96 | N/A | N/A | N/A | N/A | N/A |
| Request Count (count) | 128.00 | N/A | N/A | N/A | N/A | N/A |
This is much slower than running 2x vLLM instances with tp=4 each.
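For reference, a minimal sketch of what the "2x vLLM with tp=4" baseline could look like, assuming two independent vLLM OpenAI-compatible servers splitting the 8 GPUs; the ports, log paths, and exact flags here are illustrative assumptions, not the exact baseline commands used:

```bash
# Assumed baseline: two standalone vLLM servers, tp=4 each (flags illustrative)
model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python3 -m vllm.entrypoints.openai.api_server \
    --model "$model" --tensor-parallel-size 4 --max-model-len 3500 \
    --kv-cache-dtype fp8 --port 8300 >> /tmp/vllm_a.log 2>&1 &

CUDA_VISIBLE_DEVICES=4,5,6,7 nohup python3 -m vllm.entrypoints.openai.api_server \
    --model "$model" --tensor-parallel-size 4 --max-model-len 3500 \
    --kv-cache-dtype fp8 --port 8301 >> /tmp/vllm_b.log 2>&1 &
```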
Steps to Reproduce
Here is my startup script, based on the script referenced in #402:
```bash
# Launch one dynamo.vllm worker on the given GPUs.
# Args: <cuda devices> <port> <tp size> <kv-transfer-config JSON> <log file>
function _start_70b_dynamo_pdsep_benchmark_instance() {
    model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
    devices=$1
    port=$2        # currently not passed to the worker command below
    tpsize=$3
    kvconfig=$4
    LOGFILE=$5
    gpu_memory=0.85
    other_params=''   # extra flags; intentionally unquoted below so they word-split
    # Prefill (kv_producer) workers get prefill-specific flags and more GPU memory.
    if echo "$kvconfig" | grep -q kv_producer; then
        other_params='--is-prefill-worker --max-num-batched-tokens 3500'
        gpu_memory=0.95
    fi
    CUDA_VISIBLE_DEVICES=$devices nohup python3 \
        -m dynamo.vllm \
        --model "$model" \
        --max-model-len 3500 \
        --tensor-parallel-size "$tpsize" \
        --max-seq-len-to-capture 65536 \
        --no-enable-prefix-caching \
        --disable-log-requests \
        --kv-cache-dtype fp8 \
        --block-size 128 \
        --kv-transfer-config "$kvconfig" \
        $other_params \
        --gpu-memory-utilization "$gpu_memory" >> "$LOGFILE" 2>&1 &
}

LOGFILE=/tmp/test.log

# 4 prefill workers (kv_producer), one GPU each (tp=1)
kvconfig='{"kv_connector":"DynamoNixlConnector","kv_role":"kv_producer"}'
_start_70b_dynamo_pdsep_benchmark_instance 0 8100 1 "$kvconfig" "$LOGFILE"
_start_70b_dynamo_pdsep_benchmark_instance 1 8101 1 "$kvconfig" "$LOGFILE"
_start_70b_dynamo_pdsep_benchmark_instance 2 8102 1 "$kvconfig" "$LOGFILE"
_start_70b_dynamo_pdsep_benchmark_instance 3 8103 1 "$kvconfig" "$LOGFILE"

# 1 decode worker (kv_consumer) on GPUs 4-7 (tp=4)
kvconfig='{"kv_connector":"DynamoNixlConnector","kv_role":"kv_consumer"}'
_start_70b_dynamo_pdsep_benchmark_instance 4,5,6,7 8200 4 "$kvconfig" "$LOGFILE"

# OpenAI-compatible HTTP frontend
nohup python3 -m dynamo.frontend --http-port 8000 >> "$LOGFILE" 2>&1 &
```
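Once the workers and the frontend are up, the frontend is exercised over its OpenAI-compatible HTTP API on port 8000. A minimal smoke-test request could look like the sketch below; the `/v1/chat/completions` path and payload follow the standard OpenAI chat-completions schema and are an assumption, not the actual benchmark client that produced the table above:

```bash
# Hypothetical smoke test against the dynamo frontend started above (port 8000)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 150,
        "stream": false
      }'
```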
Expected Behavior
Performance comparable to the numbers reported in #402.
Actual Behavior
Latency and throughput are significantly worse than the vLLM baseline (see the results table above).
Environment
```
Name: ai-dynamo
Version: 0.4.0
Summary: Distributed Inference Framework
Home-page:
Author:
Author-email: "NVIDIA Inc." <sw-dl-dynamo@nvidia.com>
License: Apache-2.0
```
Additional Context
No response
Screenshots
No response