fail early with ErrTooManySamples in rangeEval #9626

darshanime · 2021-10-30T08:13:31Z

account for the samples loaded in memory before eval of next expression

Signed-off-by: darshanime <deathbullet@gmail.com>

bwplotka

Looks generally nice!

Can we double check the performance and if we have unit test for that?

codesome

Though we detect and fail early, I am missing the part where currentSamples should have been updated in a similar way (I see that it has been removed entirely). We will need tests for that too.

beorn7 · 2023-08-15T13:55:26Z

Picking this up as part of our bug scrub… @darshanime are you still up for working on this? What do you think about @codesome's comment? Test coverage is also an issue that needs to be addressed before merging this.

bboreham · 2024-01-23T11:20:17Z

Closing as no update after 6 months. We noted at the bug-scrub that a lot of code in this area has changed due to native histograms, so it probably has to be re-coded from scratch.

darshanime · 2024-02-01T11:32:59Z

promql/engine_test.go

 			End:          time.Unix(220, 0),
 			Interval:     5 * time.Second,
-			PeakSamples:  5,
+			PeakSamples:  8,


note for reviewers: 8 because

4 samples of metricWith1SampleEvery10Seconds loaded at ts 201, 206, 211, 216

4 samples from 4 calls to funcTimestamp used to construct the response

darshanime · 2024-02-01T11:34:27Z

promql/engine.go


+			// Since we are copying the samples in memory here, check for ErrTooManySamples.
+			rangeEvalSamples += matrixes[i].TotalSamples()
+			if rangeEvalSamples+ev.currentSamples > ev.maxSamples {


note for reviewers: we check for ErrTooManySamples after evaluating each expression and fail early if maxSamples is breached.
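The idea in the note above can be sketched in isolation. This is a minimal illustration, not the engine code: `series`, `matrix`, `errTooManySamples` and `checkAfterEachExpr` are hypothetical stand-ins, and each sub-expression is modelled as a function so that later ones are visibly not evaluated once the limit is breached.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManySamples is a local stand-in for the engine's ErrTooManySamples.
var errTooManySamples = errors.New("query would load too many samples into memory")

// series and matrix are simplified stand-ins for the engine's result types.
type series struct{ points []float64 }
type matrix []series

// totalSamples counts the points across all series of the matrix.
func (m matrix) totalSamples() int {
	n := 0
	for _, s := range m {
		n += len(s.points)
	}
	return n
}

// checkAfterEachExpr evaluates the sub-expressions one by one, adds up their
// result sizes, and stops at the first one that pushes the total (plus the
// samples already accounted for) over maxSamples. It returns how many
// sub-expressions were evaluated.
func checkAfterEachExpr(exprs []func() matrix, currentSamples, maxSamples int) (int, error) {
	rangeEvalSamples := 0
	for i, eval := range exprs {
		rangeEvalSamples += eval().totalSamples()
		if rangeEvalSamples+currentSamples > maxSamples {
			return i + 1, errTooManySamples
		}
	}
	return len(exprs), nil
}

func main() {
	// Each hypothetical sub-expression yields one series with three points.
	expr := func() matrix { return matrix{{points: []float64{1, 2, 3}}} }

	evaluated, err := checkAfterEachExpr([]func() matrix{expr, expr, expr}, 0, 5)
	fmt.Println(evaluated, err)
}
```

With a limit of 5, the second sub-expression brings the running total to 6, so the third one is never evaluated.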

Signed-off-by: darshanime <deathbullet@gmail.com>

beorn7

I'm not very familiar with the internals of the PromQL engine, but I have the suspicion that this approach is doing something wrong. I need to dig deeper to find out more, but it would also be great if somebody with a better understanding of the engine could have a look.

beorn7 · 2024-02-14T17:18:50Z

promql/engine.go

 				if prepSeries != nil {
 					bufHelpers[i] = append(bufHelpers[i], seriesHelpers[i][si])
 				}
-				ev.currentSamples++


I think this was wrong before. In case of a histogram, we also need to add H.Size()/16.

Or maybe not, because we don't Copy the histogram, just the pointer to it. In that case, we need a comment here. Will think more about it.

Yeah, pretty sure the current state is correct because we only copy the pointer. I'll add a code comment here in a PR that I'm already preparing for related issues.
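The distinction between copying the histogram and copying only the pointer can be illustrated with a small sketch. The `histogram` and `hPoint` types below are hypothetical simplifications, not the engine's types; the point is only that a pointer copy shares the bucket data, while a deep copy duplicates it.

```go
package main

import "fmt"

// histogram is a hypothetical stand-in for a native histogram; only the
// bucket slice matters for this illustration.
type histogram struct{ buckets []float64 }

// hPoint is a simplified sample that refers to its histogram by pointer.
type hPoint struct {
	t int64
	h *histogram
}

// deepCopy duplicates the bucket data. A copy like this would occupy
// additional memory proportional to the histogram's size.
func (h *histogram) deepCopy() *histogram {
	c := &histogram{buckets: make([]float64, len(h.buckets))}
	copy(c.buckets, h.buckets)
	return c
}

func main() {
	orig := hPoint{t: 1, h: &histogram{buckets: []float64{1, 2, 3}}}

	shallow := orig                                 // copies the pointer only; buckets are shared
	deep := hPoint{t: orig.t, h: orig.h.deepCopy()} // duplicates the buckets

	orig.h.buckets[0] = 42
	fmt.Println(shallow.h == orig.h, shallow.h.buckets[0]) // true 42
	fmt.Println(deep.h == orig.h, deep.h.buckets[0])       // false 1
}
```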

beorn7 · 2024-02-14T17:23:41Z

promql/engine.go

 			matrixes[i] = val.(Matrix)

+			// Since we are copying the samples in memory here, check for ErrTooManySamples.
+			rangeEvalSamples += matrixes[i].TotalSamples()


I'm not sure that this is correct. After the eval call in line 1121, ev.currentSamples should be updated with the number of samples in the result. By just adding that size again, aren't we doing the same thing twice?

In the loop in line 1190 and following, we are just finding one sample for each time stamp, and that sample contributes to the peak. I'm not sure my understanding is correct, but it seems this line heavily overestimates the peak.
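The double-counting concern can be made concrete with a toy model. It assumes, as the comment above does, that the eval call already adds the result size to `currentSamples`; the `evaluator` type and `checkedTotal` function are hypothetical and only keep the counters.

```go
package main

import "fmt"

// evaluator is a toy model of the engine's evaluator that keeps only the counter.
type evaluator struct{ currentSamples int }

// eval pretends to evaluate a sub-expression yielding n samples and, per the
// assumption stated above, already adds them to currentSamples.
func (ev *evaluator) eval(n int) int {
	ev.currentSamples += n
	return n
}

// checkedTotal returns the samples counted by eval and the value that a check
// of the form rangeEvalSamples+currentSamples would compare against the limit.
func checkedTotal(resultSizes []int) (counted, compared int) {
	ev := &evaluator{}
	rangeEvalSamples := 0
	for _, n := range resultSizes {
		rangeEvalSamples += ev.eval(n)
	}
	return ev.currentSamples, rangeEvalSamples + ev.currentSamples
}

func main() {
	counted, compared := checkedTotal([]int{4})
	fmt.Println(counted, compared) // 4 8
}
```

Under that assumption, four result samples are compared against the limit as eight. Whether the second count is justified depends on whether the samples really exist twice in memory, which is the open question in this thread.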

beorn7 · 2024-02-14T18:35:26Z

@codesome maybe you can help us out here?

beorn7

Having thought more about this, I don't think the change in this PR is correct. I think it overcounts the samples kept in memory at peak. (And that's why all affected unit tests had to be adjusted up.)

I'm not sure it is even possible to implement a "pre-counting" correctly, but even if it is, I'm not sure it is worth it. Erroring out with an exceeded sample count isn't supposed to happen often, so we would only save resources in the rare case that a query actually exceeds the limit. Sure, it is nice to know this more quickly, but is that worth the added complexity?

I am not very familiar with the code touched here and I had to think about this PR for quite some time (which is not necessarily lost time because it helped me to get a bit more familiar with the code), but I would say that in the absence of somebody very familiar with this part of the codebase, let's rather close this PR. @darshanime if you feel confident you can solve this correctly (or maybe I got something wrong here and you can explain it to me), please let me know.

beorn7 · 2024-02-28T13:40:47Z

promql/engine_test.go

 			End:          time.Unix(220, 0),
 			Interval:     5 * time.Second,
-			PeakSamples:  5,
+			PeakSamples:  8,


I think the old value (5) is correct. We load 4 samples into memory for the four steps we evaluate. Then the timestamp function is evaluated in a loop for each, each time copying one sample, but those four copies are not kept in memory at the same time, but only one at a time. So peak of 5 is correct. (And this is in line with the suspicion I formulated above.)
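The two readings of the test value (8 in the PR, 5 in this comment) differ only in whether the per-step copies are assumed to stay alive. The following sketch models both with a hypothetical `tracker` of current and peak counts; it does not claim which reading matches the engine.

```go
package main

import "fmt"

// tracker records the current and the peak number of samples held in memory.
type tracker struct{ current, peak int }

func (t *tracker) add(n int) {
	t.current += n
	if t.current > t.peak {
		t.peak = t.current
	}
}

func (t *tracker) release(n int) { t.current -= n }

// peakSamples loads `loaded` samples, then copies one sample per step.
// If retainCopies is true, every copy stays alive until the end (the reading
// that yields 8); otherwise each copy is released before the next step
// (the reading that yields 5).
func peakSamples(loaded, steps int, retainCopies bool) int {
	var t tracker
	t.add(loaded)
	for i := 0; i < steps; i++ {
		t.add(1)
		if !retainCopies {
			t.release(1)
		}
	}
	return t.peak
}

func main() {
	fmt.Println(peakSamples(4, 4, true))  // 8
	fmt.Println(peakSamples(4, 4, false)) // 5
}
```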

beorn7 · 2024-02-28T15:07:34Z

BTW: The sub-queries are much harder to reason about. I cannot really explain the precise values in the tests. (Perhaps @codesome can, but he still has very limited availability.)

beorn7 · 2024-02-28T15:08:07Z

In any case, I'll create a follow-up PR where I add histograms to the tests, because reviewing this PR showed me that they are still missing.

darshanime

@beorn7, ptal, I have replied to your comments. I think this patch makes the accounting more accurate and prevents potential OOM.

darshanime · 2024-03-03T13:04:15Z

promql/engine_test.go

 			End:          time.Unix(220, 0),
 			Interval:     5 * time.Second,
-			PeakSamples:  5,
+			PeakSamples:  8,


> but those four copies are not kept in memory at the same time, but only one at a time.

I think this is incorrect: they are all stored in memory at the same time to construct the response matrix (see line 1243).

darshanime · 2024-03-03T13:04:22Z

promql/engine.go

 				if prepSeries != nil {
 					bufHelpers[i] = append(bufHelpers[i], seriesHelpers[i][si])
 				}
-				ev.currentSamples++


- ev.currentSamples++

This line is deleted because we have already accounted for all the samples loaded in line 1126.

darshanime · 2024-03-03T13:04:35Z

promql/engine.go

 			matrixes[i] = val.(Matrix)

+			// Since we are copying the samples in memory here, check for ErrTooManySamples.
+			rangeEvalSamples += matrixes[i].TotalSamples()


> By just adding that size again, aren't we doing the same thing twice?

We are doing that because we are copying the samples in line 1123

darshanime · 2024-03-03T13:08:05Z

promql/engine.go

 			matrixes[i] = val.(Matrix)

+			// Since we are copying the samples in memory here, check for ErrTooManySamples.
+			rangeEvalSamples += matrixes[i].TotalSamples()


Note: this check prevents a possible OOM, because we check maxSamples after loading each expression rather than only while processing the results after everything has been loaded.

beorn7 · 2024-03-07T19:41:47Z

Thanks for your comments. I'll look into them once I find time. (But I have to say that I feel heavily underqualified for this review. I guess we need the help of somebody who is more familiar with this sample counting, maybe @jesusvazquez or @codesome. I'll take it as a challenge to learn more about it, but it will take me a lot of time, and I'm heavily distracted by other tasks.)

beorn7 · 2024-03-26T12:08:30Z

Note to reviewers: #10369 is the PR that implemented the feature. This might be helpful to find out what the intended meaning of the stats fields is.

beorn7 · 2024-04-09T16:23:48Z

Still not enough capacity on my side to generate conclusive evidence here, but with #13744 in, it might be worth updating this PR once more because #13744 simplifies code that is part of the discussion here, so after #13744, it might be easier to reason about.

bboreham · 2024-10-08T11:46:18Z

Hello from the bug-scrub!

I can assign myself to review this, but I would like to know if you will update after #13744, @darshanime.

darshanime · 2025-01-05T11:39:31Z

I won't have time to get back to this right now; this can be closed for now.

darshanime requested review from codesome and roidelapluie as code owners October 30, 2021 08:13

bwplotka reviewed Nov 16, 2021

codesome reviewed Nov 17, 2021

stale bot added the stale label Jan 20, 2022

bboreham closed this Jan 23, 2024

bboreham reopened this Jan 30, 2024

darshanime force-pushed the samples_accounting branch 2 times, most recently from 3592c27 to 468e039 Compare February 1, 2024 11:27

darshanime commented Feb 1, 2024

darshanime force-pushed the samples_accounting branch from 468e039 to a5fdd7d Compare February 1, 2024 13:39

Fail early with ErrTooManySamples

caa3d9f

Signed-off-by: darshanime <deathbullet@gmail.com>

darshanime force-pushed the samples_accounting branch from a5fdd7d to caa3d9f Compare February 5, 2024 08:21

beorn7 reviewed Feb 14, 2024

beorn7 reviewed Feb 28, 2024

beorn7 mentioned this pull request Feb 28, 2024

Improve TestQueryStatistics and fix bugs exposed by it #13667

Merged

darshanime commented Mar 3, 2024

krajorama mentioned this pull request Jul 18, 2024

query-frontend: add experimental header for tracking throughput grafana/mimir#7966

Closed

bboreham self-assigned this Oct 8, 2024

darshanime closed this Jan 5, 2025
