Staged Encoder (E) & PD Timing for zmq scheme #94
base: v0.9.1
Signed-off-by: Junhong <liujunhong11@huawei.com>
Code Review
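This PR adds timing and metrics to the EPD (encoder, prefill, decode) execution path. The changes span multiple files, from the example code to the core scheduler and the metrics system. The implementation looks largely correct and follows the PR description. However, I found two instances of code duplication that should be resolved to improve code quality and maintainability: one in the example file and one in the core scheduler logic.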
if self.log_stats and TIMECOUNT_ENABLED and\
        request.request_id not in self._epd_encoder_reqs:
    # Record EPD encoder request
    self._epd_encoder_reqs.add(request.request_id)
    request.record_event(
        EngineCoreEventType.ENCODER_CONSUME_START)
This code, which records the ENCODER_CONSUME_START event, is identical to the code at lines 324-329 of this file. To improve maintainability and avoid duplication, consider extracting this logic into a separate private method and calling that method from both places.
For example, you could create a helper method like this:
def _record_encoder_start_event(self, request: Request):
    if (self.log_stats and TIMECOUNT_ENABLED
            and request.request_id not in self._epd_encoder_reqs):
        # Record EPD encoder request
        self._epd_encoder_reqs.add(request.request_id)
        request.record_event(EngineCoreEventType.ENCODER_CONSUME_START)
Then call self._record_encoder_start_event(request) in both locations.
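The effect of the suggested helper can be checked outside vLLM with stand-in classes. In the sketch below, `FakeRequest`, `FakeScheduler`, and the one-member enum are hypothetical placeholders written for illustration; only the body of `_record_encoder_start_event` follows the suggestion above.

```python
from enum import Enum, auto

TIMECOUNT_ENABLED = True  # assumption: stands in for the PR's switch


class EngineCoreEventType(Enum):
    # Stand-in for the real enum; only the member used here is defined.
    ENCODER_CONSUME_START = auto()


class FakeRequest:
    """Hypothetical minimal request that just collects recorded events."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.events = []

    def record_event(self, event_type):
        self.events.append(event_type)


class FakeScheduler:
    """Hypothetical scheduler holding only the state the helper needs."""

    def __init__(self, log_stats=True):
        self.log_stats = log_stats
        self._epd_encoder_reqs = set()

    def _record_encoder_start_event(self, request):
        # Same condition as the duplicated block, split into early returns.
        if not (self.log_stats and TIMECOUNT_ENABLED):
            return
        if request.request_id in self._epd_encoder_reqs:
            return
        self._epd_encoder_reqs.add(request.request_id)
        request.record_event(EngineCoreEventType.ENCODER_CONSUME_START)


scheduler = FakeScheduler()
req = FakeRequest("req-0")
scheduler._record_encoder_start_event(req)
scheduler._record_encoder_start_event(req)  # second call records nothing new
```

Because the set membership check lives inside the helper, both call sites stay a single line and the event is still recorded at most once per request.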
self.histogram_max_tokens_request.observe(
    finished_request.max_tokens_param)
self.histogram_encoder_consume_seconds.observe(
    finished_request.encoder_consume_time)
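Please add the same handling to the corresponding function in class LoggingStatLogger. The value can then be printed from LoggingStatLogger's log method, and there is no need to parse the Prometheus fields.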
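A rough sketch of what that could look like: accumulate the per-request encoder consume time as requests finish, then report count and mean from the log method. `LoggingStatLoggerSketch` and `FinishedRequestStats` are hypothetical, simplified stand-ins and do not reproduce the real LoggingStatLogger interface; only the field name `encoder_consume_time` comes from the diff above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("epd_timing_sketch")


@dataclass
class FinishedRequestStats:
    # Hypothetical stand-in: only the field discussed in this thread.
    encoder_consume_time: float


class LoggingStatLoggerSketch:
    """Accumulates encoder consume times so log() can report them directly."""

    def __init__(self):
        self._encoder_consume_times = []

    def record(self, finished_requests):
        # Mirror the Prometheus observe() call with a plain in-memory list.
        for finished in finished_requests:
            self._encoder_consume_times.append(finished.encoder_consume_time)

    def mean_encoder_consume_time(self):
        times = self._encoder_consume_times
        return sum(times) / len(times) if times else 0.0

    def log(self):
        message = (
            f"Encoder consume time: count={len(self._encoder_consume_times)}, "
            f"mean={self.mean_encoder_consume_time():.4f}s"
        )
        logger.info(message)
        return message


stat_logger = LoggingStatLoggerSketch()
stat_logger.record([FinishedRequestStats(0.2), FinishedRequestStats(0.4)])
summary = stat_logger.log()
```

Keeping the values in the logger itself avoids a round trip through the Prometheus text format just to print a mean.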
Done
if mm_data.request_id not in encoder_cache:
    encoder_cache[mm_data.request_id] = {}
encoder_cache[mm_data.request_id][input_id] = ec_cache
if TIMECOUNT_ENABLED:
Print at request granularity, not at input_id granularity.
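One possible way to do this is to sum the per-input timings and emit a single line once every input of a request has arrived. The sketch below assumes the number of inputs per request is known up front; `PerRequestTimer` and its `expected_inputs` parameter are hypothetical and not part of the PR.

```python
from collections import defaultdict
from typing import Optional


class PerRequestTimer:
    """Hypothetical helper that reports once per request, not per input_id."""

    def __init__(self):
        self._elapsed = defaultdict(float)  # request_id -> summed seconds
        self._seen = defaultdict(int)       # request_id -> inputs received

    def add(self, request_id: str, seconds: float,
            expected_inputs: int) -> Optional[str]:
        self._elapsed[request_id] += seconds
        self._seen[request_id] += 1
        if self._seen[request_id] < expected_inputs:
            return None  # still waiting for more inputs of this request
        total = self._elapsed.pop(request_id)
        count = self._seen.pop(request_id)
        return (f"request {request_id}: {count} inputs, "
                f"encoder cache time {total:.3f}s")


timer = PerRequestTimer()
first = timer.add("r1", 0.25, expected_inputs=2)   # no output yet
second = timer.add("r1", 0.5, expected_inputs=2)   # one line for the request
```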
Purpose
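Time the EPD execution path.

The measured time is split into 6 major parts:
Recording method:
Changes:
1. Add performance metrics.
2. Add a /metrics endpoint to the proxy and the worker.
a. Automatically search for an available port and print it.
Switch: TIMECOUNT_ENABLED=1 or 0
The port is printed when the instance starts (http://127.0.0.1:XXXX).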
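For readers unfamiliar with the pattern, here is a minimal sketch of reading such a switch and picking a free port. It is not taken from the PR: `timecount_enabled` and `find_free_port` are hypothetical names, and binding to port 0 is just one common way to let the OS choose an unused port.

```python
import os
import socket


def timecount_enabled() -> bool:
    # Assumption: the switch is read from the environment and "1" means on.
    return os.environ.get("TIMECOUNT_ENABLED", "0") == "1"


def find_free_port(host: str = "127.0.0.1") -> int:
    # Binding to port 0 asks the OS to pick any currently unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if timecount_enabled():
    port = find_free_port()
    print(f"metrics endpoint: http://127.0.0.1:{port}/metrics")
```

Note that the port is released when the probe socket closes, so another process could claim it before the metrics server binds; a real server may prefer to bind once and keep the socket.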
Fetch the metrics:
curl http://127.0.0.1:XXXX/metrics
Test Plan
Test Result
INFO 10-25 17:15:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1667.6 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
INFO 10-25 17:15:10 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
INFO 10-25 17:15:20 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:30 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:40 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:15:50 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:16:00 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
INFO 10-25 17:16:10 [disagg_worker.py:78] DisaggWorker metrics:{'vllm:request_queue_time_seconds|engine=0': {'count': 10.0, 'mean': 0.5825753671117126}, 'vllm:request_prefill_time_seconds|engine=0': {'count': 10.0, 'mean': 0.2375835937447846}, 'vllm:e2e_request_latency_seconds|engine=0': {'count': 10.0, 'mean': 2.087323236465454}, 'vllm:request_encoder_consume_time_seconds|engine=0': {'count': 10.0, 'mean': 0.23731594048440458}}
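The DisaggWorker lines above report a count and a mean per histogram. As an illustration of how such values can be derived from Prometheus text output, the sketch below divides each histogram's `_sum` sample by its `_count` sample. `summarize_histograms` and the sample text are hypothetical (the sample values are made up for the example), and the key format merely imitates the log lines; this is not the code in disagg_worker.py.

```python
import re
from collections import defaultdict

# Made-up sample in Prometheus text exposition format.
SAMPLE = """\
# TYPE vllm:request_prefill_time_seconds histogram
vllm:request_prefill_time_seconds_bucket{engine="0",le="0.5"} 9.0
vllm:request_prefill_time_seconds_sum{engine="0"} 2.5
vllm:request_prefill_time_seconds_count{engine="0"} 10.0
"""

_SAMPLE_RE = re.compile(r'^([A-Za-z_:][\w:]*)_(sum|count)\{([^}]*)\}\s+(\S+)$')
_ENGINE_RE = re.compile(r'engine="([^"]*)"')


def summarize_histograms(text):
    """Return {"name|engine=N": {"count": c, "mean": sum / c}}."""
    totals = defaultdict(dict)
    for line in text.splitlines():
        match = _SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comments, buckets, and other sample types are skipped
        name, kind, labels, value = match.groups()
        engine = _ENGINE_RE.search(labels)
        key = f"{name}|engine={engine.group(1) if engine else '?'}"
        totals[key][kind] = float(value)
    summary = {}
    for key, parts in totals.items():
        count = parts.get("count", 0.0)
        if "sum" in parts and count > 0:
            summary[key] = {"count": count, "mean": parts["sum"] / count}
    return summary


result = summarize_histograms(SAMPLE)
```

The mean here is cumulative since process start, which is consistent with the values in the log staying constant once no new requests arrive.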