[Perf] Improve MLA multistream performance #1353

ApsarasX · 2025-06-22T07:10:14Z

What this PR does / why we need it?

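This PR should be merged after PR #1322.

According to the benchmark results below, this PR brings a performance gain of approximately 1%: output token throughput rises from 320.47 to 324.02 tok/s, and mean TPOT drops from 68.96 ms to 68.14 ms.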

Before Improvement

Profiling

Evaluation

# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
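The `--additional-config` value in the server launch command is a single-line JSON string, which is easy to mistype by hand. As a minimal sketch (not part of this PR), the same settings can be built as a Python dict and serialized; the key names are copied from the command above, and the use of `json` and `shlex` here is only an illustration of how to produce a correctly quoted argument.

```python
import json
import shlex

# Same settings as the --additional-config value in the server launch command.
additional_config = {
    "torchair_graph_config": {
        "enable_multistream_mla": True,   # the feature this PR optimizes
        "enabled": True,
        "use_cached_graph": True,
        "graph_batch_sizes": [24],        # matches --max-num-seqs 24
    },
    "ascend_scheduler_config": {"enabled": True},
    "expert_tensor_parallel_size": 16,
}

# Compact JSON, shell-quoted so it can be pasted after --additional-config.
config_json = json.dumps(additional_config, separators=(",", ":"))
print("--additional-config", shlex.quote(config_json))
```

Round-tripping the string through `json.loads` before launching is a cheap way to catch a missing brace or quote.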

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================

After Improvement

Profiling

Evaluation

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
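To make the before/after comparison concrete, the sketch below computes the relative change for a handful of metrics. The numbers are copied from the two benchmark reports above; the selection of metrics and the helper name `relative_change` are illustrative choices, not part of the benchmark tooling.

```python
# (before, after) values copied from the two benchmark reports in this PR.
metrics = {
    "Benchmark duration (s)": (958.59, 948.08),
    "Output token throughput (tok/s)": (320.47, 324.02),
    "Total token throughput (tok/s)": (1175.05, 1188.08),
    "Mean TTFT (ms)": (942.70, 1019.25),
    "P99 TTFT (ms)": (2008.73, 2661.52),
    "Mean TPOT (ms)": (68.96, 68.14),
    "Median ITL (ms)": (59.88, 59.04),
}


def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


changes = {name: relative_change(b, a) for name, (b, a) in metrics.items()}
for name, pct in changes.items():
    print(f"{name:34s} {pct:+6.2f}%")
```

With these inputs, throughput improves by roughly 1.1% and mean TPOT drops by roughly 1.2%, which matches the stated gain. Mean and P99 TTFT are higher in the "after" run; a single run per configuration cannot show whether that difference is noise or an effect of the change.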

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

Signed-off-by: ApsarasX <apsarax@outlook.com>

codecov · 2025-06-22T07:29:24Z

Codecov Report

Attention: Patch coverage is 16.66667% with 20 lines in your changes missing coverage. Please review.

Project coverage is 27.36%. Comparing base (097e714) to head (eecf066).
Report is 25 commits behind head on main.

Files with missing lines	Patch %	Lines
vllm_ascend/attention/mla_v1.py	20.00%	8 Missing ⚠️
vllm_ascend/models/deepseek_v2.py	14.28%	6 Missing ⚠️
vllm_ascend/utils.py	14.28%	6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (16.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
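The per-file figures in the table can be cross-checked against the overall patch coverage. The sketch below recovers the number of changed lines per file from the reported percentage and missing-line count, then recomputes the total; the values are copied from the table above, and the rounding step assumes the reported percentages are truncated or rounded to two decimals.

```python
# (patch coverage fraction, missing lines) per file, copied from the Codecov table.
files = {
    "vllm_ascend/attention/mla_v1.py": (0.20, 8),
    "vllm_ascend/models/deepseek_v2.py": (0.1428, 6),
    "vllm_ascend/utils.py": (0.1428, 6),
}

total_lines = 0
total_missing = 0
for path, (coverage, missing) in files.items():
    # missing = changed * (1 - coverage)  =>  changed = missing / (1 - coverage)
    changed = round(missing / (1.0 - coverage))
    total_lines += changed
    total_missing += missing
    print(f"{path}: {changed} changed lines, {changed - missing} covered")

patch_coverage = (total_lines - total_missing) / total_lines * 100.0
print(f"patch coverage: {patch_coverage:.2f}%")
```

Under that assumption the three files account for 24 changed lines with 20 missing, which is consistent with the reported 16.67% patch coverage.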

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1353      +/-   ##
==========================================
- Coverage   27.39%   27.36%   -0.03%     
==========================================
  Files          56       56              
  Lines        6191     6201      +10     
==========================================
+ Hits         1696     1697       +1     
- Misses       4495     4504       +9

Flag	Coverage Δ
unittests	`27.36% <16.66%> (-0.03%)`	⬇️


github-actions · 2025-06-25T12:13:08Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Handle with_prefill_across_dp for multistream mla

3e679e2

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

github-actions bot added the module:core label Jun 22, 2025

ApsarasX force-pushed the improve-mla-multistream branch from 315d51e to e932430 Compare June 22, 2025 07:12

[Perf] Improve MLA multistream performance

eecf066

Signed-off-by: ApsarasX <apsarax@outlook.com>

ApsarasX force-pushed the improve-mla-multistream branch from e932430 to eecf066 Compare June 22, 2025 07:14

github-actions bot added the merge-conflicts label Jun 25, 2025
