
Conversation


@LJH-LBJ LJH-LBJ commented Oct 22, 2025

Purpose

Add execution-time instrumentation for the EPD (Encode / Prefill / Decode) path.

The timing is broken into seven parts:

  1. vllm_proxy_transfer_to_pd_seconds: time from the proxy forwarding a request until the PD instance receives it
  2. vllm_proxy_transfer_to_encode_seconds: time from the proxy forwarding a request until the encoder instance receives it
  3. vllm_Encoder_request_queue_time_seconds: time a request spends queued on the encoder
  4. vllm_PD_request_queue_time_seconds: time a request spends queued on PD
  5. vllm_execute_mm_encoder_seconds: time spent running the encoder (from when the encoder computation is needed until its output is available)
  6. vllm_Encoder_cache_trans_seconds: time for the encoder cache to go from being stored to being fetched
  7. vllm_prefill_forward_seconds: time spent on the prefill forward pass

Recording methods (a sketch of computing the log-pair deltas follows this list):

  1. vllm_proxy_transfer_to_pd_seconds: delta between the logger outputs "Proxy sending prefill request" and "generation received proxy request"
  2. vllm_proxy_transfer_to_encode_seconds: delta between the logger outputs "Proxy sending encode request" and "encode received proxy request"
  3. vllm_Encoder_request_queue_time_seconds and vllm_PD_request_queue_time_seconds reuse the existing vLLM metric vllm:request_queue_time_seconds
  4. vllm_execute_mm_encoder_seconds: a new metric vllm:request_encoder_consume_time is added
  5. vllm_Encoder_cache_trans_seconds: delta between the logger outputs "Save cache successful for" and "Success load encoder cache for request_id"
  6. vllm_prefill_forward_seconds reuses the existing vLLM metric vllm:request_prefill_time_seconds
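
For the log-pair metrics (items 1, 2, and 5 above), the elapsed time is the difference between the timestamps of the matching send/receive (or save/load) log lines. Below is a minimal offline sketch of that calculation; the exact message wording, the timestamp format, and the single-pair matching (the real implementation would key on request_id) are assumptions, not code from this PR.

import re
from datetime import datetime
from typing import Optional

SEND_MSG = "Proxy sending prefill request"      # emitted by the proxy
RECV_MSG = "generation received proxy request"  # emitted by the PD worker
TS_RE = re.compile(r"^INFO (\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def parse_ts(line: str) -> Optional[datetime]:
    # The log prefix carries no year, so deltas are only meaningful
    # within a single run that does not cross a year boundary.
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), "%m-%d %H:%M:%S") if m else None

def transfer_seconds(log_lines: list) -> Optional[float]:
    send_ts = recv_ts = None
    for line in log_lines:
        if SEND_MSG in line and send_ts is None:
            send_ts = parse_ts(line)
        elif RECV_MSG in line and recv_ts is None:
            recv_ts = parse_ts(line)
    if send_ts and recv_ts:
        return (recv_ts - send_ts).total_seconds()
    return None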

Changes:
1. Add the performance metrics described above.
2. Add a /metrics endpoint to both the proxy and the worker.
   a. The endpoint automatically searches for an available port and prints it.

Switch: set TIMECOUNT_ENABLED=1 to enable timing, 0 to disable.

The port is printed when the instance is launched (http://127.0.0.1:XXXX).
Fetch the metrics with:
curl http://127.0.0.1:XXXX/metrics
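
For a quick look without a Prometheus server, the scraped text can be parsed directly. The snippet below is a minimal sketch using prometheus_client's text parser and reports each histogram's mean as sum/count; the port and the aggregation across label sets are illustrative assumptions, not code from this PR.

import urllib.request
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://127.0.0.1:9000/metrics"  # example port; use the one printed at startup

def histogram_means(url: str = METRICS_URL) -> dict:
    text = urllib.request.urlopen(url).read().decode()
    means = {}
    for family in text_string_to_metric_families(text):
        if family.type != "histogram":
            continue
        total = sum(s.value for s in family.samples if s.name.endswith("_sum"))
        count = sum(s.value for s in family.samples if s.name.endswith("_count"))
        if count:
            means[family.name] = total / count
    return means

if __name__ == "__main__":
    for name, mean in histogram_means().items():
        print(f"{name}: mean={mean:.4f}s")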

Test Plan

Test Result

INFO 10-25 17:15:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1667.6 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
INFO 10-25 17:15:10 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
INFO 10-25 17:15:20 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:30 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:40 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:50 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:16:00 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:16:10 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Junhong added 9 commits October 16, 2025 20:51
@LJH-LBJ LJH-LBJ added this to the EPD-main (1030) milestone Oct 22, 2025
@LJH-LBJ LJH-LBJ self-assigned this Oct 22, 2025
@LJH-LBJ LJH-LBJ added the enhancement New feature or request label Oct 22, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds timing and metrics for the EPD (Encode, Prefill, Decode) execution path. The changes span multiple files, from example code to the core scheduler and the metrics system. The implementation looks largely correct and follows the PR description. However, I found two instances of code duplication that should be addressed to improve code quality and maintainability: one in an example file and one in the core scheduler logic.

Comment on lines +515 to +520
if self.log_stats and TIMECOUNT_ENABLED and\
        request.request_id not in self._epd_encoder_reqs:
    # Record EPD encoder request
    self._epd_encoder_reqs.add(request.request_id)
    request.record_event(
        EngineCoreEventType.ENCODER_CONSUME_START)


high

This block that records the ENCODER_CONSUME_START event is identical to the code at lines 324-329 of this file. To improve maintainability and avoid duplication, consider extracting this logic into a private helper method and calling it from both places.
For example, you could add a helper like this:

def _record_encoder_start_event(self, request: Request):
    if self.log_stats and TIMECOUNT_ENABLED and \
            request.request_id not in self._epd_encoder_reqs:
        # Record EPD encoder request
        self._epd_encoder_reqs.add(request.request_id)
        request.record_event(EngineCoreEventType.ENCODER_CONSUME_START)

Then call self._record_encoder_start_event(request) in both places.

Junhong added 12 commits October 22, 2025 16:19
self.histogram_max_tokens_request.observe(
    finished_request.max_tokens_param)
self.histogram_encoder_consume_seconds.observe(
    finished_request.encoder_consume_time)
Collaborator

Please add the same handling to the corresponding method of class LoggingStatLogger; then the value can be printed directly from LoggingStatLogger's log method, without parsing the Prometheus fields.
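
What this could look like (illustrative sketch only, not code from this PR): mirror the Prometheus histogram in LoggingStatLogger so the mean encoder time appears in the periodic log line. The class and attribute names loosely follow vLLM v1's metrics loggers; encoder_consume_time comes from this PR's change to the finished-request stats, everything else is an assumption.

from statistics import mean

class LoggingStatLogger:  # simplified stand-in for the logger in vllm/v1/metrics/loggers.py
    def __init__(self):
        self.encoder_consume_times = []

    def record(self, iteration_stats):
        # Accumulate the per-request encoder time added by this PR.
        for finished_request in iteration_stats.finished_requests:
            self.encoder_consume_times.append(
                finished_request.encoder_consume_time)

    def log(self):
        if self.encoder_consume_times:
            print(f"Avg encoder consume time: {mean(self.encoder_consume_times):.4f}s")
            self.encoder_consume_times.clear()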

Collaborator Author


Done

if mm_data.request_id not in encoder_cache:
    encoder_cache[mm_data.request_id] = {}
encoder_cache[mm_data.request_id][input_id] = ec_cache
if TIMECOUNT_ENABLED:
Collaborator

Please log at the granularity of a request, not per input_id.
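
What request-granularity logging could look like (illustrative sketch only, not code from this PR): write every input's cache entry first, then emit a single "Save cache successful" line per request. The function and variable names (save_encoder_cache, mm_inputs) are assumptions.

import logging

logger = logging.getLogger(__name__)
TIMECOUNT_ENABLED = True  # stand-in for the PR's environment-variable switch

def save_encoder_cache(encoder_cache: dict, request_id: str, mm_inputs: dict) -> None:
    """Store every input's cache entry, then log once per request."""
    per_request = encoder_cache.setdefault(request_id, {})
    for input_id, ec_cache in mm_inputs.items():
        per_request[input_id] = ec_cache
    # One timing log line per request, not one per input_id.
    if TIMECOUNT_ENABLED:
        logger.info("Save cache successful for request %s (%d inputs)",
                    request_id, len(mm_inputs))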

Junhong added 8 commits October 27, 2025 15:14

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: TTFT : Staged Encoder (E) & PD Timing with Socket Reporting and Proxy Aggregation

3 participants