Pipeline prefill chunks #99


Merged
gty111 merged 3 commits into master from continue_prefill on Jul 18, 2025

Conversation

gty111 (Owner) commented on Jul 18, 2025

Reference: vllm-project/vllm#17080

This PR pipelines the prefill chunks of a single request across the pipeline-parallel stages, which greatly lowers latency for long prompts.
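For intuition, here is a minimal asyncio sketch of the effect, not the code in this PR: each pipeline stage is modeled as a worker connected by queues, so chunk k of one request can enter stage 0 while chunk k-1 is still in stage 1. All names and numbers (`stage_worker`, `CHUNK_SIZE`, the 10 ms per-stage delay) are illustrative assumptions.

```python
import asyncio

NUM_STAGES = 4     # mirrors the PP4 setup in the benchmark below
CHUNK_SIZE = 2048  # assumed chunked-prefill token budget

async def stage_worker(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    # Each stage handles one chunk at a time and forwards it immediately,
    # so successive chunks of the same request overlap across stages.
    while True:
        chunk = await in_q.get()
        if chunk is None:          # shutdown sentinel
            await out_q.put(None)
            return
        await asyncio.sleep(0.01)  # stand-in for this stage's forward pass
        await out_q.put(chunk)

async def pipelined_prefill(prompt_len: int) -> None:
    queues = [asyncio.Queue() for _ in range(NUM_STAGES + 1)]
    workers = [
        asyncio.create_task(stage_worker(queues[i], queues[i + 1]))
        for i in range(NUM_STAGES)
    ]
    # Enqueue every prefill chunk up front: chunk k enters stage 0
    # without waiting for chunk k-1 to clear the whole pipeline.
    num_chunks = (prompt_len + CHUNK_SIZE - 1) // CHUNK_SIZE
    for k in range(num_chunks):
        await queues[0].put(k)
    await queues[0].put(None)
    while (chunk := await queues[-1].get()) is not None:
        pass  # chunk finished the last stage; TTFT follows the final chunk
    for w in workers:
        await w

asyncio.run(pipelined_prefill(10_000))
```

With 5 chunks and 4 equal stages, the pipelined schedule finishes prefill in roughly (chunks + stages - 1) stage-times instead of chunks x stages, which is the same effect the TTFT numbers below show.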

Tested on 4 x RTX 4090, 32 requests at 0.5 req/s, PP=4, TP=1.

Before this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  94.16     
Total input tokens:                      314732    
Total generated tokens:                  8411      
Request throughput (req/s):              0.34      
Output token throughput (tok/s):         89.32     
Total Token throughput (tok/s):          3431.73   
---------------Time to First Token----------------
Mean TTFT (ms):                          4374.34   
Median TTFT (ms):                        4177.95   
P99 TTFT (ms):                           6679.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          75.85     
Median TPOT (ms):                        75.12     
P99 TPOT (ms):                           185.22    
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.82     
Median ITL (ms):                         42.95     
P99 ITL (ms):                            344.36    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22068.80  
Median E2EL (ms):                        20669.61  
P99 E2EL (ms):                           46538.70  
==================================================

After this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  91.59     
Total input tokens:                      314732    
Total generated tokens:                  8449      
Request throughput (req/s):              0.35      
Output token throughput (tok/s):         92.25     
Total Token throughput (tok/s):          3528.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          1200.22   
Median TTFT (ms):                        1086.74   
P99 TTFT (ms):                           2277.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.04     
Median TPOT (ms):                        52.14     
P99 TPOT (ms):                           90.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.49     
Median ITL (ms):                         39.60     
P99 ITL (ms):                            376.58    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          14306.62  
Median E2EL (ms):                        13230.37  
P99 E2EL (ms):                           32460.10  
==================================================

gty111 merged commit 1293c7e into master on Jul 18, 2025
2 checks passed
gty111 deleted the continue_prefill branch on Jul 18, 2025, 08:04