[Feature]: Benchmark script with speculative decode metrics

### 🚀 The feature, motivation and pitch

I want to assess vLLM's speculative decoding performance, but I could not find an offline benchmark script, similar to [benchmark_latency.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_latency.py), that covers it. I can use [benchmark_latency.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_latency.py) to obtain end-to-end latency, but it does not report spec-decode metrics such as the time spent proposing, scoring, and verifying, or the acceptance rate.

Thanks to @cadedaniel's excellent contributions, such as https://github.com/vllm-project/vllm/pull/6963 and https://github.com/vllm-project/vllm/pull/3103, spec-decode metrics (scoring time, verification time, proposal time, and acceptance rate) now appear in the server logs.

However, these metrics are visible only in the online server's logs, and they are gathered by an asynchronous collector, which could make them inaccurate. I am considering adding a script, `benchmark_spec_decode.py`, for offline spec-decode benchmarking that captures more spec-decode metrics.
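One way such a script could avoid the asynchronous collector is to time each stage inline and aggregate at the end of the run. The following is only a minimal sketch under stated assumptions: `StageTimer`, `acceptance_rate`, and the stage names are hypothetical helpers invented for illustration, not existing vLLM APIs, and the timed blocks are placeholders for the real proposal, scoring, and verification calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class StageTimer:
    """Hypothetical helper: accumulates wall-clock time per named stage."""

    def __init__(self):
        # stage name -> list of elapsed durations in seconds
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        # Measured synchronously around the wrapped block, so the numbers
        # do not depend on a background collector's sampling interval.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append(time.perf_counter() - start)

    def summary(self):
        return {
            name: {
                "calls": len(vals),
                "total_s": sum(vals),
                "mean_s": mean(vals),
            }
            for name, vals in self.samples.items()
        }


def acceptance_rate(accepted_tokens, draft_tokens):
    """Fraction of draft tokens accepted; 0.0 when nothing was drafted."""
    return accepted_tokens / draft_tokens if draft_tokens else 0.0


if __name__ == "__main__":
    timer = StageTimer()
    for _ in range(3):
        # Placeholders: a real script would call into the draft and
        # target models here instead of doing nothing.
        with timer.stage("proposal"):
            pass
        with timer.stage("scoring"):
            pass
        with timer.stage("verification"):
            pass
    print(timer.summary())
    print(acceptance_rate(accepted_tokens=6, draft_tokens=8))
```

Whether wrapping the stages this way is practical depends on where the stage boundaries sit inside the spec-decode worker; GPU work would also need a device synchronization before each timestamp for the durations to be meaningful.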

### Proposal
Add a new `spec_decode_metrics` field to `RequestMetrics` https://github.com/vllm-project/vllm/blob/9587b050fba00c3c35da05d3512bf7e351914a50/vllm/sequence.py#L87-L112. We could also consolidate the `SpecDecodeWorkerMetrics` class so that it carries more spec-decode metrics https://github.com/vllm-project/vllm/blob/9587b050fba00c3c35da05d3512bf7e351914a50/vllm/spec_decode/metrics.py#L13
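As a rough illustration of the shape this could take, the sketch below adds an optional `spec_decode_metrics` field to a trimmed stand-in for `RequestMetrics`. The `SpecDecodeRequestMetrics` class and all of its field names are assumptions made for this example; they are not taken from vLLM's `SpecDecodeWorkerMetrics`, and the stand-in omits several of the real class's fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecDecodeRequestMetrics:
    """Hypothetical per-request spec-decode metrics (names are assumptions)."""
    proposal_time: float = 0.0      # seconds spent proposing draft tokens
    scoring_time: float = 0.0       # seconds spent scoring with the target
    verification_time: float = 0.0  # seconds spent verifying draft tokens
    draft_tokens: int = 0
    accepted_tokens: int = 0

    @property
    def acceptance_rate(self) -> Optional[float]:
        # Undefined when no draft tokens were proposed for this request.
        if self.draft_tokens == 0:
            return None
        return self.accepted_tokens / self.draft_tokens


@dataclass
class RequestMetrics:
    """Trimmed stand-in for vLLM's RequestMetrics with the proposed field."""
    arrival_time: float
    last_token_time: float
    first_scheduled_time: Optional[float]
    first_token_time: Optional[float]
    time_in_queue: Optional[float]
    finished_time: Optional[float] = None
    # Proposed addition: None when speculative decoding is disabled.
    spec_decode_metrics: Optional[SpecDecodeRequestMetrics] = None


if __name__ == "__main__":
    metrics = RequestMetrics(
        arrival_time=0.0,
        last_token_time=1.5,
        first_scheduled_time=0.1,
        first_token_time=0.4,
        time_in_queue=0.1,
        spec_decode_metrics=SpecDecodeRequestMetrics(
            proposal_time=0.2,
            scoring_time=0.5,
            verification_time=0.1,
            draft_tokens=8,
            accepted_tokens=6,
        ),
    )
    print(metrics.spec_decode_metrics.acceptance_rate)
```

Defaulting the new field to `None` would keep existing call sites that construct `RequestMetrics` unchanged, which seems like the least intrusive option.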


### Alternatives

_No response_

### Additional context

_No response_

Referenced snippet (`vllm/sequence.py`, lines 87-112 at commit `9587b05`):

	class RequestMetrics:
	    """Metrics associated with a request.

	    Attributes:
	        arrival_time: The time when the request arrived.
	        first_scheduled_time: The time when the request was first scheduled.
	        first_token_time: The time when the first token was generated.
	        time_in_queue: The time the request spent in the queue.
	        finished_time: The time when the request was finished.
	        scheduler_time: The time spent in the scheduler when this request was
	                        being considered by the scheduler.
	        model_forward_time: The time spent in the model forward pass when this
	                            request was in the batch.
	        model_execute_time: The time spent in the model execute function. This
	                            will include model forward, block/sync across
	                            workers, cpu-gpu sync time and sampling time.
	    """
	    arrival_time: float
	    last_token_time: float
	    first_scheduled_time: Optional[float]
	    first_token_time: Optional[float]
	    time_in_queue: Optional[float]
	    finished_time: Optional[float] = None
	    scheduler_time: Optional[float] = None
	    model_forward_time: Optional[float] = None
	    model_execute_time: Optional[float] = None

Issue #7586
