Add latency metrics #1870

Closed
@simon-mo

Description

After #1662 (initial metrics support) and #1756 (refactoring the chat endpoint), it will become practical to include latency metrics that are important in production (courtesy of @Yard1):

  • histogram of time to first token, and gauge of the mean, in ms
  • histogram of inter-token latency, and gauge of the mean, in ms
  • histogram of e2e time per request, and gauge of the mean, in ms
  • gauge of mean tokens per second per request. we currently only track prefill and generation throughput, not request-level throughput.

A natural place to do this would be the LLM engine or the chat completion API, whichever is less intrusive.
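As a rough illustration of the quantities listed above, here is a minimal sketch of how the per-request values could be derived from timestamps before being fed into the histograms/gauges. All names (`RequestTiming`, field names, method names) are hypothetical, not part of the vLLM codebase:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps, in seconds since an arbitrary epoch."""
    arrival_s: float                       # when the request was received
    token_times_s: List[float] = field(default_factory=list)  # one entry per emitted token

    def ttft_ms(self) -> float:
        # Time to first token: first token timestamp minus arrival, in ms.
        return (self.token_times_s[0] - self.arrival_s) * 1000.0

    def mean_inter_token_latency_ms(self) -> float:
        # Mean gap between consecutive token emissions, in ms.
        gaps = [b - a for a, b in zip(self.token_times_s, self.token_times_s[1:])]
        return sum(gaps) / len(gaps) * 1000.0

    def e2e_ms(self) -> float:
        # End-to-end latency: last token timestamp minus arrival, in ms.
        return (self.token_times_s[-1] - self.arrival_s) * 1000.0

    def tokens_per_s(self) -> float:
        # Request-level throughput: tokens emitted over the request's wall time.
        return len(self.token_times_s) / (self.token_times_s[-1] - self.arrival_s)
```

Each value would be observed into a histogram (for TTFT, inter-token latency, and e2e time) and aggregated into the corresponding mean gauge; the tokens-per-second figure fills the request-level gap noted above.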
