fail early with ErrTooManySamples in rangeEval #9626
Conversation
bwplotka left a comment
Looks generally nice!
Can we double-check the performance, and do we have a unit test for that?
codesome left a comment
Though we detect and fail early, I am missing the parts where currentSamples should have been updated in a similar way (I see that it has been removed entirely). We will need tests for that too.
Picking this up as part of our bug scrub… @darshanime are you still up for working on this? What do you think about @codesome's comment? Test coverage is also an issue that needs to be addressed before merging this.
Closing, as there has been no update after 6 months. We noted at the bug scrub that a lot of code in this area has changed due to native histograms, so it probably has to be re-coded from scratch.
Force-pushed from 3592c27 to 468e039
```diff
 	End:      time.Unix(220, 0),
 	Interval: 5 * time.Second,
-	PeakSamples: 5,
+	PeakSamples: 8,
```
note for reviewers: 8 because
- 4 samples of `metricWith1SampleEvery10Seconds` loaded at ts 201, 206, 211, 216
- 4 samples from 4 calls to `funcTimestamp` used to construct the response
I think the old value (5) is correct. We load 4 samples into memory for the four steps we evaluate. Then the timestamp function is evaluated in a loop for each of them, each time copying one sample, but those four copies are not all kept in memory at the same time, only one at a time. So a peak of 5 is correct. (And this is in line with the suspicion I formulated above.)
> but those four copies are not kept in memory at the same time, but only one at a time.

I think this is incorrect; they are all stored in memory at the same time to construct the response matrix, see line 1243.
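To make the accounting being debated here concrete, here is a toy Go sketch (not engine code; all names and numbers are illustrative only) of the two readings: if every step's output sample is appended to the result matrix, the copies coexist with the 4 loaded samples (peak 8); if only one copy lived at a time, the peak would be 5.

```go
package main

import "fmt"

func main() {
	loaded := 4 // samples kept in memory for the four evaluated steps

	// Reading A: each step's output sample is appended to the slice that
	// becomes the response matrix, so the copies accumulate alongside the
	// loaded samples.
	responseMatrix := make([]float64, 0, 4)
	peakA := loaded
	for step := 0; step < 4; step++ {
		responseMatrix = append(responseMatrix, float64(step)) // copy retained
		if cur := loaded + len(responseMatrix); cur > peakA {
			peakA = cur
		}
	}

	// Reading B: only one in-flight copy exists at any moment.
	peakB := loaded + 1

	fmt.Println(peakA, peakB) // 8 5
}
```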
```diff
+	// Since we are copying the samples in memory here, check for ErrTooManySamples.
+	rangeEvalSamples += matrixes[i].TotalSamples()
+	if rangeEvalSamples+ev.currentSamples > ev.maxSamples {
```
note for reviewers: we check for ErrTooManySamples after evaluating each expression and fail early if maxSamples is breached.
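For context, here is a self-contained sketch of the early-exit idea in the diff above. The types are toy stand-ins I made up for illustration; the real code uses the engine's Matrix.TotalSamples(), ev.currentSamples, ev.maxSamples, and ErrTooManySamples in promql/engine.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManySamples stands in for the engine's ErrTooManySamples error.
var errTooManySamples = errors.New("query processing would load too many samples into memory")

// matrix is a toy stand-in for promql.Matrix.
type matrix struct{ samples int }

func (m matrix) TotalSamples() int { return m.samples }

// evaluator carries the running sample count and the configured limit,
// mirroring ev.currentSamples and ev.maxSamples.
type evaluator struct {
	currentSamples int
	maxSamples     int
}

// failEarly adds each evaluated matrix's samples to a running total and
// returns an error before any further copying once the limit is exceeded.
func failEarly(ev *evaluator, matrixes []matrix) error {
	rangeEvalSamples := 0
	for i := range matrixes {
		rangeEvalSamples += matrixes[i].TotalSamples()
		if rangeEvalSamples+ev.currentSamples > ev.maxSamples {
			return errTooManySamples
		}
	}
	return nil
}

func main() {
	ev := &evaluator{currentSamples: 90, maxSamples: 100}
	// The second matrix pushes the total to 102 > 100, so we fail early.
	fmt.Println(failEarly(ev, []matrix{{samples: 6}, {samples: 6}}))
}
```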
Force-pushed from 468e039 to a5fdd7d
Signed-off-by: darshanime <deathbullet@gmail.com>
Force-pushed from a5fdd7d to caa3d9f
beorn7 left a comment
I'm not very familiar with the internals of the PromQL engine, but I have the suspicion that this approach is doing something wrong. I need to dig deeper to find out more, but it would also be great if somebody with a better understanding of the engine could have a look.
```diff
 	if prepSeries != nil {
 		bufHelpers[i] = append(bufHelpers[i], seriesHelpers[i][si])
 	}
-	ev.currentSamples++
```
I think this was wrong before. In case of a histogram, we also need to add H.Size()/16.
Or maybe not, because we don't Copy the histogram, just the pointer to it. In that case, we need a comment here. Will think more about it.
Yeah, pretty sure the current state is correct because we only copy the pointer. I'll add a code comment here in a PR that I'm already preparing for related issues.
`- ev.currentSamples++`: this line is deleted because we have already accounted for all the samples loaded in line 1126.
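As a side note on the histogram accounting discussed in this thread, here is a toy sketch (my own stand-in types, not the engine's; it assumes H.Size() returns the histogram's size in bytes) of the cost rule being weighed: a float sample, or a histogram of which only the pointer is copied, counts as one sample, whereas a deep copy of a histogram would count as its byte size divided by 16.

```go
package main

import "fmt"

// histogram is a toy stand-in; the real engine would call H.Size() on a
// native-histogram value to get its byte size.
type histogram struct{ sizeBytes int }

// sampleCost models the accounting discussed above: 16 bytes is roughly the
// size of one float sample (timestamp + value), so a deep-copied histogram
// costs sizeBytes/16 "float sample equivalents".
func sampleCost(h *histogram, deepCopy bool) int {
	if h == nil || !deepCopy {
		// A float sample, or a histogram of which only the pointer is
		// copied, is accounted as a single sample (ev.currentSamples++).
		return 1
	}
	return h.sizeBytes / 16
}

func main() {
	fmt.Println(sampleCost(nil, false))                        // 1: plain float sample
	fmt.Println(sampleCost(&histogram{sizeBytes: 320}, false)) // 1: pointer copy only
	fmt.Println(sampleCost(&histogram{sizeBytes: 320}, true))  // 20: deep copy
}
```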
```diff
 	matrixes[i] = val.(Matrix)

+	// Since we are copying the samples in memory here, check for ErrTooManySamples.
+	rangeEvalSamples += matrixes[i].TotalSamples()
```
I'm not sure that this is correct. After the eval call in line 1121, ev.currentSamples should be updated with the number of samples in the result. By just adding that size again, aren't we doing the same thing twice?
In the loop in line 1190 and following, we are just finding one sample for each timestamp, and that sample contributes to the peak. I'm not sure my understanding is correct, but it seems this line heavily overestimates the peak.
> By just adding that size again, aren't we doing the same thing twice?

We are doing that because we are copying the samples in line 1123.
note: this check prevents a possible OOM error, since we check maxSamples after loading each expression, and not only while processing them after loading...
@codesome maybe you can help us out here?
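To illustrate the double-counting question in this thread with toy numbers (assumed, purely for illustration): if eval() has already added the result's samples to ev.currentSamples, adding TotalSamples() on top accounts the same samples twice; if the samples really are copied a second time while assembling the response, the doubled count matches what is resident in memory.

```go
package main

import "fmt"

func main() {
	const maxSamples = 1000
	resultSamples := 600 // samples in the matrix returned by eval()

	currentSamples := 0

	// eval() is assumed to have accounted the result already.
	currentSamples += resultSamples

	// The check added in this PR then accounts the same matrix again before
	// the copy loop.
	rangeEvalSamples := resultSamples

	// Whether this is double counting or honest accounting depends on
	// whether a second in-memory copy of those 600 samples really exists at
	// this point, which is the question raised in this thread.
	fmt.Println(rangeEvalSamples+currentSamples > maxSamples) // true: rejected at 1200 accounted samples
}
```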
beorn7 left a comment
Having thought more about this, I don't think the change in this PR is correct. I think it overcounts the samples kept in memory at peak. (And that's why all affected unit tests had to be adjusted up.)
I'm not sure it is even possible to implement such a "pre-counting" correctly, but even if it is, I'm not sure it is worth it. Erroring out with an exceeded sample count isn't supposed to happen often, so we would only save resources in the rare case that a query actually exceeds the limit. Sure, it is nice to know this more quickly, but is that worth the added complexity?
I am not very familiar with the code touched here, and I had to think about this PR for quite some time (which is not necessarily lost time, because it helped me get a bit more familiar with the code), but I would say, in the absence of somebody very familiar with this part of the codebase, let's rather close this PR. @darshanime, if you feel confident you can solve this correctly (or maybe I got something wrong here and you can explain it to me), please let me know.
BTW: The sub-queries are much harder to reason about. I cannot really explain the precise values in the tests. (Perhaps @codesome can, but he still has very limited availability.)
In any case, I'll create a follow-up PR where I add histograms to the tests, because reviewing this PR showed me that they are still missing.
darshanime left a comment
@beorn7, PTAL, I have replied to your comments. I think this patch makes the accounting more accurate and prevents a potential OOM.
Thanks for your comments. I'll look into them once I find time. (But I have to say that I feel heavily underqualified for this review. I guess we need the help of somebody who is more familiar with this sample counting, maybe @jesusvazquez or @codesome. I'll take it as a challenge to learn more about it, but it will take me a lot of time, and I'm heavily distracted with other tasks.)
Note to reviewers: #10369 is the PR that implemented the feature. This might be helpful for finding out what the intended meaning of the stats fields is.
Hello from the bug-scrub! I can assign myself to review this, but I would like to know if you will update after #13744, @darshanime.
I won't have time to get back to this right now; we can close it for now.
account for the samples loaded in memory before eval of next expression

Signed-off-by: darshanime <deathbullet@gmail.com>