Description
After #1662 (initial metrics support) and #1756 (chat endpoint refactoring), it will become practical to include latency metrics that are important for production (courtesy of @Yard1):
- histogram of time to first token (TTFT), plus a gauge of the mean, in ms
- histogram of inter-token latency, plus a gauge of the mean, in ms
- histogram of end-to-end (e2e) time per request, plus a gauge of the mean, in ms
- gauge of mean tokens per second per request; we currently only track prefill and generation throughput, not per-request throughput
A natural place to add these would be the LLM engine or the chat completion API, whichever is less intrusive.
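
For concreteness, here is a minimal sketch of what these metrics could look like, assuming `prometheus_client` as the metrics backend; the metric names, bucket boundaries, and the `RequestLatencyTracker` helper are all illustrative, not part of vLLM:

```python
import time
from typing import Optional

from prometheus_client import Gauge, Histogram

# Metric names and bucket boundaries are placeholders, not vLLM's actual names.
LATENCY_BUCKETS_MS = (10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000)

TTFT_HIST = Histogram(
    "llm_time_to_first_token_ms",
    "Time from request arrival to first generated token (ms)",
    buckets=LATENCY_BUCKETS_MS,
)
ITL_HIST = Histogram(
    "llm_inter_token_latency_ms",
    "Latency between consecutive generated tokens (ms)",
    buckets=LATENCY_BUCKETS_MS,
)
E2E_HIST = Histogram(
    "llm_e2e_request_latency_ms",
    "End-to-end latency per request (ms)",
    buckets=LATENCY_BUCKETS_MS,
)
TOKENS_PER_S_MEAN = Gauge(
    "llm_request_tokens_per_second_mean",
    "Mean per-request generation throughput (tokens/s)",
)


class RequestLatencyTracker:
    """Per-request helper: call on_token() for every generated token and
    on_finish() when the request completes."""

    # Running-mean state shared across requests (process-wide).
    _num_finished = 0
    _rate_sum = 0.0

    def __init__(self) -> None:
        self.arrival = time.monotonic()
        self.last_token: Optional[float] = None
        self.num_tokens = 0

    def on_token(self) -> None:
        now = time.monotonic()
        if self.last_token is None:
            # First token of the request: record time to first token.
            TTFT_HIST.observe((now - self.arrival) * 1000)
        else:
            # Subsequent tokens: record inter-token latency.
            ITL_HIST.observe((now - self.last_token) * 1000)
        self.last_token = now
        self.num_tokens += 1

    def on_finish(self) -> None:
        elapsed = time.monotonic() - self.arrival
        E2E_HIST.observe(elapsed * 1000)
        if self.num_tokens and elapsed > 0:
            cls = RequestLatencyTracker
            cls._num_finished += 1
            cls._rate_sum += self.num_tokens / elapsed
            TOKENS_PER_S_MEAN.set(cls._rate_sum / cls._num_finished)
```

Note that Prometheus histograms already export `_sum` and `_count`, so the means for TTFT, inter-token latency, and e2e latency can be derived at query time (e.g. `rate(..._sum[5m]) / rate(..._count[5m])`); an explicit mean gauge is only strictly needed where no histogram exists, as with the per-request throughput above.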