admission: per tenant WorkQueue latency metrics #134987

Open
sumeerbhola opened this issue Nov 12, 2024 · 1 comment
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-admission-control Admission Control

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Nov 12, 2024

CockroachDB has some support for per-tenant metrics. In multi-tenant environments like CockroachDB Standard/Basic, tenants and cluster operators should be able to see queueing delay in admission control (AC) WorkQueues on a per-tenant basis, along with the aggregate across all tenants. Currently, WorkQueue metrics are segmented only by priority.

This will possibly need additional observability infrastructure, since a long-running kv server can cycle through thousands of tenants, and we should not keep exporting expensive histograms for tenants that are no longer active on that server. The typical solution to this problem in multi-tenant systems is to export delta metrics instead of cumulative metrics: when the delta for a timeseries is zero, nothing is exported. The number of timeseries then becomes proportional to the number of active tenants on a server.
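To make the delta-export idea concrete, here is a minimal Go sketch. It is not CockroachDB code: `DeltaExporter`, `Record`, `Export`, and `Delta` are hypothetical names invented for illustration, and a simple count-plus-sum pair stands in for the real latency histogram. The point it illustrates is only that state is kept for tenants with activity in the current interval, and that idle tenants produce no output.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Delta is the per-tenant record emitted for one export interval.
// Hypothetical type: count and sum stand in for a full histogram.
type Delta struct {
	TenantID  uint64
	Count     int64
	TotalWait time.Duration
}

// DeltaExporter accumulates queueing delay per tenant since the last export.
// Hypothetical type, not part of any CockroachDB package.
type DeltaExporter struct {
	mu  sync.Mutex
	cur map[uint64]Delta
}

func NewDeltaExporter() *DeltaExporter {
	return &DeltaExporter{cur: make(map[uint64]Delta)}
}

// Record adds one observed queueing delay for the given tenant.
func (e *DeltaExporter) Record(tenantID uint64, wait time.Duration) {
	e.mu.Lock()
	defer e.mu.Unlock()
	d := e.cur[tenantID]
	d.TenantID = tenantID
	d.Count++
	d.TotalWait += wait
	e.cur[tenantID] = d
}

// Export returns deltas only for tenants that had activity since the
// previous call, sorted by tenant ID, and then drops all per-tenant state.
// A tenant with a zero delta therefore costs neither memory nor a timeseries.
func (e *DeltaExporter) Export() []Delta {
	e.mu.Lock()
	defer e.mu.Unlock()
	out := make([]Delta, 0, len(e.cur))
	for _, d := range e.cur {
		out = append(out, d)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TenantID < out[j].TenantID })
	e.cur = make(map[uint64]Delta)
	return out
}

func main() {
	e := NewDeltaExporter()
	e.Record(1, 5*time.Millisecond)
	e.Record(1, 15*time.Millisecond)
	e.Record(2, time.Millisecond)
	for _, d := range e.Export() { // both tenants were active
		fmt.Printf("tenant=%d count=%d total=%s\n", d.TenantID, d.Count, d.TotalWait)
	}

	e.Record(2, 3*time.Millisecond)
	for _, d := range e.Export() { // only tenant 2 was active in this interval
		fmt.Printf("tenant=%d count=%d total=%s\n", d.TenantID, d.Count, d.TotalWait)
	}
}
```

With this shape, the receiving side is responsible for summing deltas if it wants cumulative values, which is the main trade-off relative to exporting cumulative metrics.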

The same approach can then be applied to replication AC queueing latency metrics.

Jira issue: CRDB-44325

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Nov 12, 2024
@aadityasondhi aadityasondhi added the T-admission-control Admission Control label Nov 12, 2024
@aadityasondhi
Collaborator

aadityasondhi commented Nov 12, 2024

We should probably first find out whether our observability infrastructure can handle the cardinality that these metrics would introduce.

@dhartunian do you have any thoughts on this? Specifically about this part:

This will possibly need additional observability infrastructure since a long running kv server can cycle through thousands of tenants, and we should not keep exporting expensive histograms for tenants that are not active on a kv server. The typical solution to this problem in multi-tenant systems is to export delta metrics instead of cumulative metrics, where when the delta is zero for a timeseries, nothing is exported. So the number of timeseries becomes proportional to the number of active tenants in a server.

cc @dshjoshi

2 participants