
Add ResourceMonitor module in Cortex, and add ResourceBasedLimiter in Ingesters and StoreGateways #6674


Merged
merged 30 commits from resource-based-throttling into cortexproject:master
Apr 18, 2025

Conversation

justinjung04
Contributor

@justinjung04 justinjung04 commented Mar 26, 2025

What this PR does:

This PR introduces the ability to throttle incoming query requests in ingesters and store gateways when their CPU and heap are under pressure.

Data stores (ingesters and store gateways) currently don't have good ways to limit and control resource allocation per query request. Resource consumption varies widely from one query request to another, so it's hard to define static limits that protect ingesters or store gateways from using more than 100% CPU or being OOM-killed.

I'm introducing two new experimental components:

  • ResourceMonitor is a new Cortex module that takes a snapshot of resource utilization (CPU and heap for now) every 100 milliseconds; other Cortex modules can read those values.
  • ResourceBasedLimiter is a new limiter added to Ingesters and StoreGateways, which checks whether the utilization of any monitored resource is at or above the configured limit. For now, Ingesters and StoreGateways reject incoming query requests when the limit is reached (a minimal sketch of this check follows below).
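For illustration, here is a minimal sketch of the check the limiter performs. The actual ResourceBasedLimiter lives in pkg/util/limiter; the interface, struct fields, and error message below are illustrative rather than the PR's exact API.

package limiter

import "fmt"

type Resource string

const (
    CPU  Resource = "cpu"
    Heap Resource = "heap"
)

// Reads the latest utilization values (0.0-1.0) sampled by a ResourceMonitor-like component.
type utilizationReader interface {
    GetCPUUtilization() float64
    GetHeapUtilization() float64
}

type ResourceBasedLimiter struct {
    reader utilizationReader
    limits map[Resource]float64 // e.g. {cpu: 0.8, heap: 0.8}
}

// AcceptNewRequest returns an error when any monitored resource is at or above its configured limit.
func (l *ResourceBasedLimiter) AcceptNewRequest() error {
    current := map[Resource]float64{
        CPU:  l.reader.GetCPUUtilization(),
        Heap: l.reader.GetHeapUtilization(),
    }
    for res, limit := range l.limits {
        if current[res] >= limit {
            return fmt.Errorf("%s utilization limit reached (limit: %.2f, utilization: %.2f)", res, limit, current[res])
        }
    }
    return nil
}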

Here is a test where a high TPS of queries that was exhausting ingester CPU got throttled by the new feature, stabilizing the ingester CPU around the configured threshold of 40%.

[Screenshot (2025-03-26): ingester CPU utilization stabilizing around the configured 40% threshold]

Sample configurations:

# config for ingester
-target=ingester
-monitored.resources=cpu,heap
-ingester.instance-limits.cpu-utilization=0.8
-ingester.instance-limits.heap-utilization=0.8
# config for store gateway
-target=store-gateway
-monitored.resources=cpu,heap
-store-gateway.instance-limits.cpu-utilization=0.8
-store-gateway.instance-limits.heap-utilization=0.8

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@justinjung04 justinjung04 force-pushed the resource-based-throttling branch from 30d1cba to 9efbbd9 Compare March 26, 2025 16:27
@justinjung04 justinjung04 changed the title Resource based throttling Add resource-thresholds to throttle query requests when the pods are under resource pressure. Mar 26, 2025
@justinjung04 justinjung04 changed the title Add resource-thresholds to throttle query requests when the pods are under resource pressure. Add resource-thresholds in ingesters and store gateways to throttle query requests when the pods are under resource pressure. Mar 26, 2025
@justinjung04 justinjung04 force-pushed the resource-based-throttling branch from 841d578 to 5cccd60 Compare March 26, 2025 21:55
@justinjung04
Contributor Author

When choosing how to retrieve accurate CPU and heap data, I tested different metrics from https://pkg.go.dev/runtime/metrics and https://github.com/prometheus/procfs and compared them against Kubernetes metrics to find the closest match. I thought it unnecessary to document the various metrics I tried, but let me know if you believe I should mention it somewhere.
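For reference, here is a minimal sketch of reading heap figures through the standard runtime/metrics package, one of the sources compared above. The metric keys are real runtime/metrics names, but this is not the PR's actual sampling code.

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    samples := []metrics.Sample{
        {Name: "/memory/classes/heap/objects:bytes"}, // bytes occupied by live heap objects
        {Name: "/gc/heap/goal:bytes"},                // heap size target for the next GC cycle
    }
    metrics.Read(samples)
    heapInUse := samples[0].Value.Uint64()
    gcGoal := samples[1].Value.Uint64()
    fmt.Printf("heap in use: %d bytes, GC goal: %d bytes\n", heapInUse, gcGoal)
}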

@justinjung04 justinjung04 force-pushed the resource-based-throttling branch from 2081854 to 08a6adf Compare March 31, 2025 22:45
Contributor

@yeya24 yeya24 left a comment


We should probably also mark this feature as experimental and mention it in https://cortexmetrics.io/docs/configuration/v1guarantees/#experimental-features

@justinjung04 justinjung04 changed the title Add resource-thresholds in ingesters and store gateways to throttle query requests when the pods are under resource pressure. Add monitored_resources config + ResourceBasedLimiter in ingesters and store gateways Apr 10, 2025
@justinjung04 justinjung04 changed the title Add monitored_resources config + ResourceBasedLimiter in ingesters and store gateways Add ResourceMonitor module in Cortex, and add ResourceBasedLimiter in Ingesters and StoreGateways Apr 10, 2025
@justinjung04
Contributor Author

justinjung04 commented Apr 15, 2025

@yeya24 I've updated the PR to split the code into two parts:

  • ResourceMonitor is a new Cortex module that takes a snapshot of CPU and heap utilization every 100 milliseconds; other Cortex modules can read those values (a sampling sketch follows this list).
  • ResourceBasedLimiter is a new limiter added to Ingesters and StoreGateways, which checks whether the utilization of any monitored resource is at or above the configured limit. For now, Ingesters and StoreGateways reject incoming query requests when the limit is reached.
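To make the 100 ms sampling concrete, here is a minimal sketch of a CPU utilization sampler built on the prometheus/procfs package mentioned earlier; the loop and the utilization formula are illustrative and not the PR's actual ResourceMonitor code.

package main

import (
    "fmt"
    "runtime"
    "time"

    "github.com/prometheus/procfs"
)

func main() {
    proc, err := procfs.Self()
    if err != nil {
        panic(err)
    }
    interval := 100 * time.Millisecond

    stat, err := proc.Stat()
    if err != nil {
        panic(err)
    }
    prevCPU := stat.CPUTime() // total CPU seconds consumed by this process so far

    for range time.Tick(interval) {
        stat, err := proc.Stat()
        if err != nil {
            continue
        }
        cpu := stat.CPUTime()
        // Utilization = CPU seconds spent during the interval, divided by the
        // interval length, normalized by the number of usable cores.
        util := (cpu - prevCPU) / interval.Seconds() / float64(runtime.NumCPU())
        prevCPU = cpu
        fmt.Printf("cpu utilization: %.2f\n", util)
    }
}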

Sample configuration:

# config for ingester
-target=ingester
-monitored.resources=cpu,heap
-ingester.instance-limits.cpu-utilization=0.8
-ingester.instance-limits.heap-utilization=0.8
# config for store gateway
-target=store-gateway
-monitored.resources=cpu,heap
-store-gateway.instance-limits.cpu-utilization=0.8
-store-gateway.instance-limits.heap-utilization=0.8

@yeya24
Contributor

yeya24 commented Apr 15, 2025

     query_fuzz_test.go:1790: case 926 results mismatch.
        instant query: (
            histogram_fraction(
              (-0.8770476003771546 == bool time()),
              scalar(deg({__name__="test_series_a"})),
              sort(
                ({__name__="test_series_a",series!~".*"} offset 3m1s or {__name__="test_series_a",job=~"te.*"})
              )
            )
          *
            {__name__="test_series_a"} offset 4m35s
        )
        res1: {job="test", series="0", status_code="200"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="0", status_code="400"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="0", status_code="500"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="0", status_code="502"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="1", status_code="400"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="1", status_code="404"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="1", status_code="502"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="2", status_code="200"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="2", status_code="404"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="2", status_code="500"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        res2: {job="test", series="0", status_code="200"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="0", status_code="400"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="0", status_code="500"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="0", status_code="502"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="1", status_code="400"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="1", status_code="404"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="1", status_code="502"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="2", status_code="200"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="2", status_code="404"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
        {job="test", series="2", status_code="500"} => Count: NaN, Sum: NaN, Buckets: [[-4,-2.82842712474619):NaN [-2.82842712474619,-2):NaN [-1.414213562373095,-1):NaN [-1,-0.7071067811865475):NaN (0.7071067811865475,1]:NaN (1,1.414213562373095]:NaN (2,2.82842712474619]:NaN (2.82842712474619,4]:NaN] @[1744316300.953]
    query_fuzz_test.go:1795: 
        	Error Trace:	/home/runner/work/cortex/cortex/integration/query_fuzz_test.go:1795
        	            				/home/runner/work/cortex/cortex/integration/query_fuzz_test.go:161
        	Error:      	finished query fuzzing tests
        	Test:       	TestNativeHistogramFuzz
        	Messages:   	1 test cases failed

@SungJin1212 Maybe you can help take a look at this failure? I don't see how the 2 results are different.

@SungJin1212
Member

@yeya24
When the Count is NaN, the == comparison returns false. I fixed it here: #6700.
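For context on why that happens: in Go, as in floating-point arithmetic generally, NaN never compares equal to anything, including itself, so an equality check against a NaN count is always false. A minimal illustration:

package main

import (
    "fmt"
    "math"
)

func main() {
    // NaN is not equal to anything, including itself.
    fmt.Println(math.NaN() == math.NaN()) // false
}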


if i.resourceBasedLimiter != nil {
    if err := i.resourceBasedLimiter.AcceptNewRequest(); err != nil {
        level.Warn(i.logger).Log("msg", "failed to accept request", "err", err)
Contributor


Is it necessary to log the error here if query stats will report it?

Contributor Author


Isn't query stats per request, with the errors aggregated somehow? I imagined an ingester-level or store-gateway-level log would be helpful here, since this error relates to a breached pod utilization limit rather than a query-level limit.

@yeya24
Contributor

yeya24 commented Apr 17, 2025

22:52:24 cortex-2: ts=2025-04-17T22:52:24.926522939Z caller=handler.go:83 level=warn component=cluster caller=cluster.go:262 time=2025-04-17T22:52:24.926515866Z msg="failed to join cluster" err="1 error occurred:\n\t* Failed to join 127.0.0.1:32773: dial tcp 127.0.0.1:32773: connect: connection refused\n\n"
22:52:24 cortex-2: ts=2025-04-17T22:52:24.926583732Z caller=multitenant.go:344 level=warn msg="unable to join gossip mesh while initializing cluster for high availability mode" err="1 error occurred:\n\t* Failed to join 127.0.0.1:32773: dial tcp 127.0.0.1:32773: connect: connection refused\n\n"
22:52:24 cortex-2: panic: duplicate metrics collector registration attempted
22:52:24 cortex-2: goroutine 1 [running]:
22:52:24 cortex-2: github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x51a7fe0, {0xc000d321a0?, 0x0?, 0x0?})
22:52:24 cortex-2: /__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/registry.go:406 +0x65
22:52:24 cortex-2: github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounterVec({{0x37b7c70?, 0x51a7fe0?}}, {{0x0, 0x0}, {0x0, 0x0}, {0x31e75d8, 0x2c}, {0x3223df6, 0x3f}, ...}, ...)
22:52:24 cortex-2: /__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:276 +0x163
22:52:24 cortex-2: github.com/cortexproject/cortex/pkg/util/limiter.NewResourceBasedLimiter({0x37afbe8, 0xc000f45408}, 0xc000ca2cc0, {0x37b7c70, 0x51a7fe0})
22:52:24 cortex-2: /__w/cortex/cortex/pkg/util/limiter/resource_based_limiter.go:42 +0x27d
22:52:24 cortex-2: github.com/cortexproject/cortex/pkg/ingester.New({{{{{...}, {...}, {...}, {...}}, _, _, _, {_, _, _}, ...}, ...}, ...}, ...)
22:52:24 cortex-2: /__w/cortex/cortex/pkg/ingester/ingester.go:793 +0xdf8
22:52:24 cortex-2: github.com/cortexproject/cortex/pkg/cortex.(*Cortex).initIngesterService(0xc000885008)
22:52:24 cortex-2: /__w/cortex/cortex/pkg/cortex/modules.go:448 +0x250
22:52:24 cortex-2: github.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule(0xc00000e5b8, {0x7ffe8d865f89, 0x3}, 0xc0011d9b40, 0xc000c816b0)
22:52:24 cortex-2: /__w/cortex/cortex/pkg/util/modules/modules.go:106 +0x21f
22:52:24 cortex-2: github.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices(0xc00000e5b8, {0xc001293cb0, 0x1, 0x3?})
22:52:24 cortex-2: /__w/cortex/cortex/pkg/util/modules/modules.go:78 +0xf4
22:52:24 cortex-2: github.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run(0xc000885008)
22:52:24 cortex-2: /__w/cortex/cortex/pkg/cortex/cortex.go:431 +0x39c
22:52:24 cortex-2: main.main()
22:52:24 cortex-2: /__w/cortex/cortex/cmd/cortex/main.go:202 +0xed0
    api_endpoints_test.go:86: 
        	Error Trace:	/home/runner/work/cortex/cortex/integration/api_endpoints_test.go:86
        	Error:      	Received unexpected error:
        	            	docker container cortex-2 failed to start: exit status 1
        	Test:       	TestConfigAPIEndpoint
22:52:47 Killing cortex-1

This seems related to the latest changes to the metrics registration.
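For reference, the panic above matches standard client_golang behavior: promauto's factory registers collectors via MustRegister, which panics when a collector with the same fully-qualified name is registered twice on the same registry. A minimal reproduction (not Cortex code) under that assumption:

package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

func main() {
    reg := prometheus.NewRegistry()
    newCounter := func() prometheus.Counter {
        // promauto.With(reg) registers the collector via MustRegister.
        return promauto.With(reg).NewCounter(prometheus.CounterOpts{
            Name: "example_requests_total",
            Help: "Example counter.",
        })
    }
    _ = newCounter()
    _ = newCounter() // second call panics: duplicate metrics collector registration attempted
}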

@justinjung04 justinjung04 force-pushed the resource-based-throttling branch from c0e5514 to 6ffef63 Compare April 17, 2025 23:28
@yeya24 yeya24 merged commit 6d1bd1b into cortexproject:master Apr 18, 2025
17 checks passed