Add ResourceMonitor module in Cortex, and add ResourceBasedLimiter in Ingesters and StoreGateways #6674
Signed-off-by: Justin Jung <jungjust@amazon.com>
When choosing how to retrieve accurate CPU and heap data, I tested different metrics from https://pkg.go.dev/runtime/metrics and https://github.com/prometheus/procfs and compared them with the Kubernetes metrics to find the closest match. I thought it was unnecessary to comment on the different metrics I tried, but let me know if you believe I should mention them somewhere.
We should probably also mark this feature as experimental and mention it in https://cortexmetrics.io/docs/configuration/v1guarantees/#experimental-features
@yeya24 I've updated the PR to split the code into two parts:
Sample configuration
@SungJin1212 Maybe you can help take a look at this failure? I don't see how the 2 results are different.
if i.resourceBasedLimiter != nil {
	if err := i.resourceBasedLimiter.AcceptNewRequest(); err != nil {
		level.Warn(i.logger).Log("msg", "failed to accept request", "err", err)
Is it necessary to log the error here if query stats will report it?
Aren't query stats recorded once per request, with the errors aggregated somehow? I imagined that an ingester-level or store-gateway-level log would be helpful, since it relates to a breached pod utilization limit, not a query-level limit.
This seems related to the latest changes on metrics.
What this PR does:
This PR introduces the ability to throttle incoming query requests in ingesters and store gateways when their CPU or heap is under pressure.
Data stores (ingesters and store gateways) currently have no good way to limit and control resource allocation per query request. Resource consumption varies widely from one query request to another, so it's hard to define static limits that protect ingesters or store gateways from using more than 100% CPU or being OOM-killed.
I'm introducing two new experimental components:
ResourceMonitor is a new Cortex module that takes a snapshot of resource utilization (CPU and heap for now) every 100 milliseconds; other Cortex modules can read those values.
ResourceBasedLimiter is a new limiter added in Ingesters and StoreGateways that checks whether the utilization of any monitored resource is equal to or above the configured limit. For now, Ingesters and StoreGateways reject incoming query requests when the limit is reached.
Here is a test where a high TPS of queries exhausting ingester CPU was throttled by the new feature, stabilizing the ingester CPU at around the configured threshold of 40%.
Sample configurations:
Which issue(s) this PR fixes:
n/a
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]