kv,ac: add delay totals by cause to responses #135036

Open
dt opened this issue Nov 12, 2024 · 1 comment
Labels
A-admission-control A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-admission-control Admission Control T-kv KV Team

Comments

@dt
Member

dt commented Nov 12, 2024

Currently, a caller sending KV requests to the KV server knows more or less only how long it waited for each response. If 30% of the time the caller spent waiting for a response was because AC made the request wait for CPU or IO resources, that is not currently visible to the caller. When a caller like BACKUP or LDR issues many such requests, it may appear slow to the user who ran it or is observing it, but the job currently has no mechanism to determine that it is being slowed by a specific resource capacity constraint, or to communicate that to the user, who could act on it.

The user can observe the overall cluster overload metrics, which could give some clues that something is being delayed due to CPU or IO overload, but these metrics describe all work across all nodes, which can make the effect on any specific job harder to determine. These metrics are also grouped by the node delaying the work rather than by the work delayed, further complicating the relationship between them and any specific job.

Of course, we can also collect traces of specific requests to follow in detail where they spend time, including in queue delays. This too, however, is not a perfect fit for an operation like a job that sends thousands of requests: tracing the entire execution of every request just to learn the aggregate impact of CPU or IO limiting is prohibitively expensive and produces a vast amount of trace information that is not useful or relevant to the job unless a user is actively tracing it.

Instead, we could pull out two or perhaps three broad categories in which a request may be delayed by user-controllable resources, such as CPU, IO, and perhaps contention/latching, and pass each to the caller as a simple duration to aggregate. This would allow jobs and other user-facing operations to present clear, user-actionable messages directly to the user when they interact with the job or operation.

Jira issue: CRDB-44335

@dt dt added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control T-admission-control Admission Control labels Nov 12, 2024
@sumeerbhola sumeerbhola added A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels Nov 12, 2024
@sumeerbhola
Collaborator

We need KV involvement as discussed in https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1730129705599459?thread_ts=1730121538.919249&cid=C01SRKWGHG8

@arulajmani
