[v1][WIP] Metrics & Stats prototype #10651
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: 🚀
Signed-off-by: rickyx <rickyx@anyscale.com>
Force-pushed from 626d716 to 0a666e7
```python
except BaseException as e:
    logger.error(e)
    raise e

# TODO: can we eliminate these?
async def _run_stats_handler(self):
    while True:
```
This is where stats were polled from the backend.
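For orientation, a minimal sketch of such a frontend polling loop, assuming asyncio; `get_stats` and `record` below are hypothetical stand-ins for the EngineCore stats RPC and the frontend aggregation hook, not the PR's actual API:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_stats_polling_loop(
    get_stats: Callable[[], Awaitable[Any]],
    record: Callable[[Any], None],
    poll_interval_s: float = 1.0,
) -> None:
    # Pull a stats snapshot from the backend once per tick; each tick
    # costs one round-trip RPC to the EngineCore process.
    while True:
        await asyncio.sleep(poll_interval_s)
        stats = await get_stats()
        record(stats)
```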
```
@@ -344,6 +379,18 @@ def process_input_socket(self, input_path: str):
        request_type = type_frame.buffer
        request_data = data_frame.buffer

        if request_type == EngineCoreRequestType.STATS.value:
```
This is where the backend process responds. Note this is independent of the actual execution thread.
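For illustration, a hedged sketch of that branch; `EngineCoreRequestType.STATS` comes from the diff above, while the byte value, `stats_agent.snapshot()`, and `send_reply` are assumptions made up for the sketch:

```python
import enum
import pickle
from typing import Callable


class EngineCoreRequestType(enum.Enum):
    # Only STATS appears in the diff; the byte value is illustrative.
    STATS = b"\x03"


def handle_input_frame(
    request_type: bytes,
    stats_agent,  # hypothetical object holding a thread-safe snapshot
    send_reply: Callable[[bytes], None],
) -> None:
    # The IO thread answers from the agent's snapshot, so the
    # model-execution thread never blocks on a stats request.
    if request_type == EngineCoreRequestType.STATS.value:
        send_reply(pickle.dumps(stats_agent.snapshot()))
```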
This pull request has merge conflicts that must be resolved before it can be merged.
```
@@ -290,6 +303,9 @@ def schedule(self) -> "SchedulerOutput":
            free_encoder_input_ids=self.encoder_cache_manager.get_freed_ids(),
        )

        if self.log_stats:
```
One thing that got us in trouble in V0 is too many loops over these types of data structures. I wonder if the creation of the request stats could be done in the list comprehensions used above.
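A minimal sketch of that idea (all names here are illustrative, not the PR's actual scheduler code): the pass that already builds the scheduler output entries emits the stats records in the same iteration, avoiding a second loop over the scheduled requests.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RequestStats:
    # Illustrative fields only; the PR's RequestStats is richer.
    request_id: str
    num_scheduled_tokens: int


def build_outputs_and_stats(
    scheduled_reqs, log_stats: bool
) -> Tuple[List[Tuple[str, int]], Optional[List[RequestStats]]]:
    outputs: List[Tuple[str, int]] = []
    stats: Optional[List[RequestStats]] = [] if log_stats else None
    for req in scheduled_reqs:
        # Single pass: output entry and stats record from the same loop.
        outputs.append((req.request_id, req.num_scheduled_tokens))
        if stats is not None:
            stats.append(
                RequestStats(req.request_id, req.num_scheduled_tokens))
    return outputs, stats
```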
```
@@ -146,6 +170,7 @@ async def add_request(
        detokenizer_req, engine_core_req = self.processor.process_inputs(
            request_id, prompt, params, arrival_time, lora_request,
            trace_headers, prompt_adapter_request, priority)
        self.stats_manager.record_engine_input(engine_core_req)
```
Do we need this for `LLMEngine` as well?
Oh yes - the prototype was only on `AsyncLLM`, but it should be the same for the sync `LLMEngine`.
```python
        stat_logger.log(stats)


class ThreadSafeEngineStatsManager(EngineStatsManager):
```
There aren't any threads in the `AsyncLLM` class right now (it's just using asyncio), so the `ThreadSafe` variant is not needed.
Oh right - good catch. I wonder if in the future we should isolate the stats logger in another thread, since the stats logger could be pretty expensive (e.g. Prometheus logging is quite slow when it comes to histograms).
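A minimal sketch of that idea, assuming asyncio and a blocking `log()` on the wrapped logger (`inner_logger` is a hypothetical stand-in for the PR's stat logger): the expensive call is pushed onto a dedicated worker thread via `run_in_executor`, so the event loop is never stalled by histogram observations.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class OffloadedStatLogger:
    """Run a blocking stat logger (e.g. Prometheus histograms) on a
    dedicated thread instead of the event loop."""

    def __init__(self, inner_logger):
        self._inner = inner_logger
        self._executor = ThreadPoolExecutor(max_workers=1)

    async def log(self, stats) -> None:
        loop = asyncio.get_running_loop()
        # The slow observation happens off the event loop.
        await loop.run_in_executor(self._executor, self._inner.log, stats)
```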
```python
        #    on the background thread.
        # 2. the main event loop when a request is added (and input
        #    processed).
        self.stats_manager = ThreadSafeEngineStatsManager(
```
There aren't any threads in this class (just asyncio), so we don't need a thread-safe variant.
I think this design is on the right track. As discussed in Slack (https://vllm-dev.slack.com/archives/C07QTE01QFQ/p1732588846571919), I think that we should remove the `EngineStatsAgent` from the `EngineCore` and instead send the stats back in `EngineCoreOutputs` each step. This will simplify the state management in the `EngineCore` and reduce the number of RPCs we have to worry about, with the same amount of work being executed on the core busy loop as in the current design.
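Sketched roughly, that proposal amounts to an optional stats field piggybacked on the outputs message the frontend already receives every step, so no separate stats RPC is needed; all field names here are assumptions, not the final API:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchedulerStats:
    # Illustrative per-step snapshot fields.
    num_running_reqs: int = 0
    num_waiting_reqs: int = 0


@dataclass
class EngineCoreOutput:
    request_id: str
    new_token_ids: List[int] = field(default_factory=list)


@dataclass
class EngineCoreOutputs:
    outputs: List[EngineCoreOutput] = field(default_factory=list)
    # Piggybacked stats: read by the frontend from the message it
    # already gets each step, instead of via a dedicated RPC.
    stats: Optional[SchedulerStats] = None
```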
Requirements and Goals

Proposal and Design

- `RequestStats` and `RequestStatsUpdate` to capture request updates for metrics. They capture the entire request lifecycle (arrival -> input process -> engine core prefill/decode -> [preempt] -> detokenized); see the sketch after this list.
- An `EngineStatsAgent` that lives with the `EngineCore` to collect stats updates from various core components and scheduler/model outputs.
- An `EngineStatsManager` class that lives on the `AsyncLLM` engine and aggregates stats from the agent.

On Backend (EngineCore Process):

- `EngineStatsAgent` collects metrics in a thread-safe manner.

On Frontend (Main Process):

- `EngineStatsManager` aggregates stats from the backend and also collects stats on the engine frontend.
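To make the lifecycle concrete, a hypothetical sketch of what a stream of `RequestStatsUpdate` records might look like; the update types mirror the lifecycle stages named above, while the field names are assumptions:

```python
import enum
import time
from dataclasses import dataclass, field


class RequestUpdateType(enum.Enum):
    # One member per lifecycle stage from the description above.
    ARRIVED = enum.auto()
    INPUT_PROCESSED = enum.auto()
    PREFILLING = enum.auto()
    DECODING = enum.auto()
    PREEMPTED = enum.auto()
    DETOKENIZED = enum.auto()


@dataclass
class RequestStatsUpdate:
    request_id: str
    update_type: RequestUpdateType
    ts: float = field(default_factory=time.monotonic)


# Example: the updates an aggregator might fold into one RequestStats.
updates = [
    RequestStatsUpdate("req-0", RequestUpdateType.ARRIVED),
    RequestStatsUpdate("req-0", RequestUpdateType.INPUT_PROCESSED),
    RequestStatsUpdate("req-0", RequestUpdateType.DETOKENIZED),
]
```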
Benchmark Results

TL;DR: There is no observable perf regression when stats collection happens every 1 s for the workloads below, compared to stats collection turned off.
Throughput
Latency