[Feature]: Benchmark script with speculative decode metrics #7586

Closed as not planned
@cermeng

🚀 The feature, motivation and pitch

I would like to assess vLLM's speculative decoding performance, but I could not find an offline benchmark script, similar to benchmark_latency.py, for doing so. benchmark_latency.py reports end-to-end latency, but it does not report spec-decode metrics such as the time spent proposing, scoring, and verifying, or the acceptance rate.

Thanks to @cadedaniel's excellent contributions in #6963 and #3103, spec-decode metrics (scoring time, verification time, proposal time, and acceptance rate) now appear in the server logs.

However, these metrics are only visible in the online server logs, and they are gathered by an asynchronous collector, which could make them inaccurate. I am considering adding a script, benchmark_spec_decode.py, that benchmarks speculative decoding and captures more spec-decode metrics.
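As a rough illustration of the kind of synchronous collection such a script could do, here is a minimal sketch. `StageTimer` and `acceptance_rate` are hypothetical names invented for this example, not vLLM APIs, and the stage names ("proposal", "scoring", "verification") are just labels taken from the metrics listed above. On a GPU, the timed work would presumably also need to be synchronized before the timer stops, which this sketch does not attempt.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of timed calls

    @contextmanager
    def stage(self, name):
        # Time the enclosed block synchronously, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean_ms(self, name):
        # Mean duration of a stage in milliseconds; 0.0 if never timed.
        if self.counts[name] == 0:
            return 0.0
        return 1000.0 * self.totals[name] / self.counts[name]


def acceptance_rate(num_accepted_tokens, num_draft_tokens):
    """Fraction of draft tokens accepted; 0.0 when nothing was proposed."""
    if num_draft_tokens == 0:
        return 0.0
    return num_accepted_tokens / num_draft_tokens
```

A benchmark loop could then wrap each step in `with timer.stage("proposal"): ...` and print `timer.mean_ms("proposal")` once the run finishes.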

Proposal

Add a new field, spec_decode_metrics, to RequestMetrics:

vllm/vllm/sequence.py

Lines 87 to 112 in 9587b05

    class RequestMetrics:
        """Metrics associated with a request.

        Attributes:
            arrival_time: The time when the request arrived.
            first_scheduled_time: The time when the request was first scheduled.
            first_token_time: The time when the first token was generated.
            time_in_queue: The time the request spent in the queue.
            finished_time: The time when the request was finished.
            scheduler_time: The time spent in the scheduler when this request was
                            being considered by the scheduler.
            model_forward_time: The time spent in the model forward pass when this
                                request was in the batch.
            model_execute_time: The time spent in the model execute function. This
                                will include model forward, block/sync across
                                workers, cpu-gpu sync time and sampling time.
        """
        arrival_time: float
        last_token_time: float
        first_scheduled_time: Optional[float]
        first_token_time: Optional[float]
        time_in_queue: Optional[float]
        finished_time: Optional[float] = None
        scheduler_time: Optional[float] = None
        model_forward_time: Optional[float] = None
        model_execute_time: Optional[float] = None

We could also consolidate the SpecDecodeWorkerMetrics class to expose more spec-decode metrics:

    class SpecDecodeWorkerMetrics:
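
To make the proposed field concrete, here is one possible shape for it. This is only a sketch: `SpecDecodeRequestMetrics`, its attribute names, and `RequestMetricsSketch` are hypothetical and do not exist in vLLM. `RequestMetricsSketch` is a trimmed stand-in for RequestMetrics that shows only where the new optional field would sit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecDecodeRequestMetrics:
    """Hypothetical per-request spec-decode metrics (not part of vLLM)."""
    proposal_time: float = 0.0       # seconds spent proposing draft tokens
    scoring_time: float = 0.0        # seconds spent scoring the drafts
    verification_time: float = 0.0   # seconds spent verifying the drafts
    num_draft_tokens: int = 0
    num_accepted_tokens: int = 0

    @property
    def acceptance_rate(self) -> Optional[float]:
        # Undefined (None) when no draft tokens were proposed.
        if self.num_draft_tokens == 0:
            return None
        return self.num_accepted_tokens / self.num_draft_tokens


@dataclass
class RequestMetricsSketch:
    """Trimmed stand-in for RequestMetrics with the proposed field added."""
    arrival_time: float
    last_token_time: float
    finished_time: Optional[float] = None
    # Proposed addition: stays None when speculative decoding is disabled.
    spec_decode_metrics: Optional[SpecDecodeRequestMetrics] = None
```

Keeping the field optional with a default of None would leave existing call sites unchanged when speculative decoding is not in use.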

Alternatives

No response

Additional context

No response

Labels: feature request, stale
