kv,ac: add delay totals by cause to responses #135036
Labels: A-admission-control, A-kv, C-enhancement, T-admission-control (Admission Control), T-kv (KV Team)
Currently, a caller sending KV requests to the KV server knows little more than how long it waited for each response. If 30% of the time the caller spent waiting for a response was because AC made the request wait for CPU or IO resources, that is not visible to the caller. When a caller like BACKUP or LDR issues many such requests, it may appear slow to the user who ran it or is observing it, but the job currently has no mechanism to determine that it is being slowed by a specific resource capacity constraint, or to communicate that to a user who could act on it.
The user can observe the overall cluster overload metrics, which could give some clues that something is being delayed due to CPU or IO overload, but these metrics describe all work across all nodes, which makes the effect on any specific job harder to determine. These metrics are also grouped by the node delaying the work rather than by the work being delayed, further obscuring the relationship between them and any specific job.
Of course, we can also collect traces of specific requests to follow in detail where they spend time, including in queue delays. This too, however, is not a good fit for an operation like a job that sends thousands of requests: tracing the entire execution of every request just to learn the aggregate impact of CPU or IO limiting is prohibitively expensive, and it produces a vast amount of trace information that is not useful or relevant to the job unless a user is actively tracing it.
Instead, we could pull out two or perhaps three broad categories in which a request may be delayed due to user-controllable resources, such as CPU, IO and perhaps contention/latching, and return the time spent in each to the caller as a simple duration. The caller can then aggregate these durations, which would allow jobs and other user-facing operations to present clear, user-actionable messages directly to the user when they interact with the job or operation.
Jira issue: CRDB-44335