Description
🚀 The feature, motivation and pitch
I am looking to assess the performance of vLLM for speculative decoding, but I have been unable to find an offline benchmark script, similar to benchmark_latency.py, that would let me test speculative-decode performance. While I can use benchmark_latency.py to obtain e2e latency, it does not provide the spec-decode metrics such as the time spent on scoring, verification, and proposal, or the acceptance rate.
Thanks to @cadedaniel's excellent contributions such as #6963 and #3103, we are now able to see spec-decode metrics, including scoring time, verification time, proposal time, and acceptance rate, in the server logs.
However, these metrics are only visible in the online server logs, and they are gathered through an asynchronous collector, which can introduce inaccuracies. I am considering adding a script, benchmark_spec_decode.py, for offline spec-decode benchmarking in order to capture more spec-decode-related metrics; a rough sketch of what it could look like follows.
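Below is a minimal sketch of such a script, modeled on benchmark_latency.py. The model names are placeholders, and the final metric readout is exactly the piece this feature would add, so it is only indicated by a comment; nothing here should be read as an existing offline metrics API.

```python
# Sketch of the proposed benchmark_spec_decode.py (not a real vLLM script).
import time

from vllm import LLM, SamplingParams


def main():
    llm = LLM(
        model="meta-llama/Llama-2-7b-chat-hf",   # target model (placeholder)
        speculative_model="JackFram/llama-68m",   # draft model (placeholder)
        num_speculative_tokens=5,
    )
    sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
    prompts = ["The quick brown fox"] * 8

    # Warm up once so one-time initialization does not skew the measurement.
    llm.generate(prompts, sampling_params)

    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start

    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"e2e latency: {elapsed:.3f}s, "
          f"throughput: {total_tokens / elapsed:.1f} tok/s")
    # Proposed: additionally report spec-decode metrics (scoring time,
    # verification time, proposal time, acceptance rate) collected
    # synchronously from the worker, rather than from async server logs.


if __name__ == "__main__":
    main()
```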
Proposal
Add a new field spec_decode_metrics to RequestMetrics (lines 87 to 112 at 9587b05), typed as SpecDecodeWorkerMetrics (vllm/spec_decode/metrics.py, line 13 at 9587b05), to carry more metrics related to spec-decode.
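A minimal sketch of the proposed change, assuming RequestMetrics stays a dataclass; the surrounding fields are abbreviated and may not match the current definition exactly:

```python
from dataclasses import dataclass
from typing import Optional

from vllm.spec_decode.metrics import SpecDecodeWorkerMetrics


@dataclass
class RequestMetrics:
    # Existing fields, abbreviated for illustration.
    arrival_time: float
    first_token_time: Optional[float]
    finished_time: Optional[float]
    # Proposed: per-request spec-decode metrics (proposal, scoring, and
    # verification time plus the draft acceptance rate), populated only
    # when speculative decoding is enabled.
    spec_decode_metrics: Optional[SpecDecodeWorkerMetrics] = None
```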
Alternatives
No response
Additional context
No response