Add ResourceMonitor module in Cortex, and add ResourceBasedLimiter in Ingesters and StoreGateways #6674
Merged: yeya24 merged 30 commits into cortexproject:master from justinjung04:resource-based-throttling on Apr 18, 2025
Commits
80b2d5c Add resource based throttling to ingesters and store gateways (justinjung04)
2121845 doc (justinjung04)
2b168fc Add automaxprocs (justinjung04)
56f8e57 nit (justinjung04)
9efbbd9 Add test for monitor (justinjung04)
30bbd3d fix tests (justinjung04)
fa56e65 changelog (justinjung04)
a2ffcdd Merge branch 'master' into resource-based-throttling (justinjung04)
5cccd60 fix test (justinjung04)
6e37330 remove interface (justinjung04)
08a6adf address comments (justinjung04)
067478b rename doc (justinjung04)
18fdf37 Make monitor more generic + separate scanners (justinjung04)
aa81155 fix tests (justinjung04)
a528a7a fix more tests (justinjung04)
42e52b3 remove monitor_test.go (justinjung04)
50993e1 move noop scanner to darwin scanner (justinjung04)
e56431e doc update (justinjung04)
eae4df7 doc (justinjung04)
fd19f5c lint (justinjung04)
f588d94 add debugging log on unsupported resource type (justinjung04)
6138a9d test (justinjung04)
7bd7ab9 add more error handling + resource_based_limiter_limit metric (justinjung04)
6da53e9 fix test (justinjung04)
a8d4218 fix test (justinjung04)
d6d3839 update changelog (justinjung04)
c68bbd2 Move noopScanner to scanner.go and fix RegisterFlagsWithPrefix (justinjung04)
025a93a Add limit breached metric + wrap error with 429 (justinjung04)
6ffef63 Add more validation and test on instance_limits (justinjung04)
7808940 Added _total to counter metric (justinjung04)
@@ -0,0 +1,56 @@
---
title: "Protecting Cortex from Heavy Queries"
linkTitle: "Protecting Cortex from Heavy Queries"
weight: 11
slug: protecting-cortex-from-heavy-queries
---

PromQL is powerful, and a single query can fetch a very wide range of data and process a very large number of samples. Heavy queries can cause:

1. CPU on any query component to be partially exhausted, increasing latency and causing incoming queries to queue up with a high chance of timing out.
2. CPU on any query component to be fully exhausted, slowing down garbage collection until the pod runs out of memory and is killed.
3. Heap memory on any query component to be exhausted, causing the pod to run out of memory and be killed.

It's important to protect Cortex components by setting appropriate limits and throttling configurations based on your infrastructure and the data your customers ingest.

## Static limits

There are a number of static limits that you can configure to block heavy queries from running.

### Max outstanding requests per tenant

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_outstanding_requests_per_tenant for details.

### Max data bytes fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_data_bytes_per_query for details.

### Max series fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_series_per_query for details.

### Max chunk bytes fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_chunk_bytes_per_query for details.

### Max samples fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#querier_config:~:text=max_samples for details.

## Resource-based throttling (Experimental)

Static limits protect Cortex components from specific query patterns, but they are not generic enough to cover every combination of bad query patterns. For example, a query might fetch relatively large postings, series, and chunks that each stay slightly below their individual limits. For a more generic safeguard, you can enable resource-based throttling by setting CPU and heap utilization thresholds.

Currently, resource-based throttling only rejects incoming query requests, returning HTTP status code 429 (Too Many Requests) when resource usage breaches the configured thresholds.

For example, the following configuration starts throttling query requests if either CPU or heap utilization is above 80%, leaving 20% of headroom for in-flight requests.

```
target: ingester
monitored_resources: cpu,heap
instance_limits:
  cpu_utilization: 0.8
  heap_utilization: 0.8
```

See https://cortexmetrics.io/docs/configuration/configuration-file/#:~:text=instance_limits for details.
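
The throttling rule described in the doc above can be sketched in Go. This is a minimal illustration, not the `ResourceBasedLimiter` added by this PR: `Thresholds`, `Usage`, `LimitError`, and `AcceptNewRequest` are hypothetical names, and the strict `>` comparison is an assumption taken from the "above 80%" wording. A threshold of 0 is treated as disabled, matching the flag help text.

```go
package main

import (
	"fmt"
	"net/http"
)

// Thresholds holds hypothetical utilization limits between 0 and 1.
// A value of 0 disables the corresponding check.
type Thresholds struct{ CPU, Heap float64 }

// Usage is a hypothetical snapshot of current utilization, between 0 and 1.
type Usage struct{ CPU, Heap float64 }

// LimitError records which resource breached its limit and the HTTP
// status a caller would return to the client.
type LimitError struct {
	Status      int
	Resource    string
	Utilization float64
	Limit       float64
}

func (e *LimitError) Error() string {
	return fmt.Sprintf("%s utilization %.2f is above limit %.2f", e.Resource, e.Utilization, e.Limit)
}

// AcceptNewRequest returns nil when a new query may proceed, or a
// *LimitError with status 429 for the first resource above its limit.
func AcceptNewRequest(u Usage, t Thresholds) error {
	checks := []struct {
		name         string
		usage, limit float64
	}{
		{"cpu", u.CPU, t.CPU},
		{"heap", u.Heap, t.Heap},
	}
	for _, c := range checks {
		// Assumption: strictly above the limit triggers throttling.
		if c.limit > 0 && c.usage > c.limit {
			return &LimitError{
				Status:      http.StatusTooManyRequests,
				Resource:    c.name,
				Utilization: c.usage,
				Limit:       c.limit,
			}
		}
	}
	return nil
}

func main() {
	limits := Thresholds{CPU: 0.8, Heap: 0.8}
	for _, u := range []Usage{{CPU: 0.5, Heap: 0.6}, {CPU: 0.9, Heap: 0.6}} {
		if err := AcceptNewRequest(u, limits); err != nil {
			fmt.Println("reject:", err)
			continue
		}
		fmt.Println("accept")
	}
}
```

Only new requests are checked, so work that is already in flight keeps running in the remaining headroom.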
@@ -0,0 +1,40 @@
package configs

import (
	"errors"
	"flag"
	"strings"

	"github.com/cortexproject/cortex/pkg/util/flagext"
	"github.com/cortexproject/cortex/pkg/util/resource"
)

type InstanceLimits struct {
	CPUUtilization  float64 `yaml:"cpu_utilization"`
	HeapUtilization float64 `yaml:"heap_utilization"`
}

func (cfg *InstanceLimits) RegisterFlagsWithPrefix(f *flag.FlagSet, prefix string) {
	f.Float64Var(&cfg.CPUUtilization, prefix+"instance-limits.cpu-utilization", 0, "EXPERIMENTAL: Max CPU utilization that this ingester can reach before rejecting new query request (across all tenants) in percentage, between 0 and 1. monitored_resources config must include the resource type. 0 to disable.")
	f.Float64Var(&cfg.HeapUtilization, prefix+"instance-limits.heap-utilization", 0, "EXPERIMENTAL: Max heap utilization that this ingester can reach before rejecting new query request (across all tenants) in percentage, between 0 and 1. monitored_resources config must include the resource type. 0 to disable.")
}

func (cfg *InstanceLimits) Validate(monitoredResources flagext.StringSliceCSV) error {
	if cfg.CPUUtilization > 1 || cfg.CPUUtilization < 0 {
		return errors.New("cpu_utilization must be between 0 and 1")
	}

	if cfg.CPUUtilization > 0 && !strings.Contains(monitoredResources.String(), string(resource.CPU)) {
		return errors.New("monitored_resources config must include \"cpu\" as well")
	}

	if cfg.HeapUtilization > 1 || cfg.HeapUtilization < 0 {
		return errors.New("heap_utilization must be between 0 and 1")
	}

	if cfg.HeapUtilization > 0 && !strings.Contains(monitoredResources.String(), string(resource.Heap)) {
		return errors.New("monitored_resources config must include \"heap\" as well")
	}

	return nil
}
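
`Validate` above checks membership with `strings.Contains` on the joined resource list, which is a substring match. Whether that matters depends on how `monitored_resources` values are validated elsewhere, which this excerpt does not show. As a hedged alternative, the sketch below compares whole comma-separated entries instead. `containsResource` and `validateUtilization` are hypothetical helpers, not part of this PR, and the plain `string` parameter stands in for `flagext.StringSliceCSV`.

```go
package main

import (
	"fmt"
	"strings"
)

// containsResource reports whether name is one of the comma-separated
// entries in csv, comparing whole entries rather than substrings.
func containsResource(csv, name string) bool {
	for _, entry := range strings.Split(csv, ",") {
		if strings.TrimSpace(entry) == name {
			return true
		}
	}
	return false
}

// validateUtilization applies the same two rules as Validate to one field:
// the value must lie in [0, 1], and a non-zero value requires the matching
// resource to be listed in monitored_resources.
func validateUtilization(field, resourceName string, value float64, monitoredCSV string) error {
	if value < 0 || value > 1 {
		return fmt.Errorf("%s must be between 0 and 1", field)
	}
	if value > 0 && !containsResource(monitoredCSV, resourceName) {
		return fmt.Errorf("monitored_resources config must include %q as well", resourceName)
	}
	return nil
}

func main() {
	// "heapster" is a made-up value used only to show the difference.
	fmt.Println(strings.Contains("heapster", "heap")) // substring match
	fmt.Println(containsResource("heapster", "heap")) // whole-entry match

	fmt.Println(validateUtilization("cpu_utilization", "cpu", 0.5, "heap"))
	fmt.Println(validateUtilization("heap_utilization", "heap", 0.5, "cpu,heap"))
}
```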
@@ -0,0 +1,64 @@
package configs

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

func Test_Validate(t *testing.T) {
	for name, tc := range map[string]struct {
		instanceLimits     InstanceLimits
		monitoredResources []string
		err                error
	}{
		"correct config should pass validation": {
			instanceLimits: InstanceLimits{
				CPUUtilization:  0.5,
				HeapUtilization: 0.5,
			},
			monitoredResources: []string{"cpu", "heap"},
			err:                nil,
		},
		"utilization config less than 0 should fail validation": {
			instanceLimits: InstanceLimits{
				CPUUtilization:  -0.5,
				HeapUtilization: 0.5,
			},
			monitoredResources: []string{"cpu", "heap"},
			err:                errors.New("cpu_utilization must be between 0 and 1"),
		},
		"utilization config greater than 1 should fail validation": {
			instanceLimits: InstanceLimits{
				CPUUtilization:  0.5,
				HeapUtilization: 1.5,
			},
			monitoredResources: []string{"cpu", "heap"},
			err:                errors.New("heap_utilization must be between 0 and 1"),
		},
		"missing cpu in monitored_resources config should fail validation": {
			instanceLimits: InstanceLimits{
				CPUUtilization: 0.5,
			},
			monitoredResources: []string{"heap"},
			err:                errors.New("monitored_resources config must include \"cpu\" as well"),
		},
		"missing heap in monitored_resources config should fail validation": {
			instanceLimits: InstanceLimits{
				HeapUtilization: 0.5,
			},
			monitoredResources: []string{"cpu"},
			err:                errors.New("monitored_resources config must include \"heap\" as well"),
		},
	} {
		t.Run(name, func(t *testing.T) {
			err := tc.instanceLimits.Validate(tc.monitoredResources)
			if tc.err != nil {
				require.Errorf(t, err, tc.err.Error())
			} else {
				require.NoError(t, err)
			}
		})
	}
}
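
One note on the assertion in the test above: `require.Errorf(t, err, msg)` only asserts that `err` is non-nil, and its third argument is the failure message, so the expected text is never compared with the returned error. testify's `require.EqualError(t, err, tc.err.Error())` would compare the messages. The sketch below shows the difference without testify. `passesNonNilCheck`, `passesMessageCheck`, `errCPURange`, and `wantHeapRange` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errCPURange stands in for an error returned by the code under test.
var errCPURange = errors.New("cpu_utilization must be between 0 and 1")

// wantHeapRange is the message a test case expected instead.
const wantHeapRange = "heap_utilization must be between 0 and 1"

// passesNonNilCheck mirrors an assertion that only requires some error.
func passesNonNilCheck(got error) bool {
	return got != nil
}

// passesMessageCheck also requires the error text to match exactly.
func passesMessageCheck(got error, want string) bool {
	return got != nil && got.Error() == want
}

func main() {
	// The wrong error still satisfies the non-nil check...
	fmt.Println(passesNonNilCheck(errCPURange))
	// ...but is caught once the message is compared.
	fmt.Println(passesMessageCheck(errCPURange, wantHeapRange))
}
```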