Skip to content

Conversation

@darshanime
Copy link
Contributor

account for the samples loaded in memory before eval of next expression

Signed-off-by: darshanime deathbullet@gmail.com

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally nice!

Can we double check the performance and if we have unit test for that?

Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we detect and fail early, I am missing the parts where currentSamples should have been updated the similar way (I see that it is totally removed). We will need tests for that too.

@stale stale bot added the stale label Jan 20, 2022
@beorn7
Copy link
Member

beorn7 commented Aug 15, 2023

Picking this up as part of our bug scrub… @darshanime are you still up to working on this? What do you think about @codesome's comment? Test coverage is also an issue that needs to be addressed before merging this.

@bboreham
Copy link
Member

Closing as no update after 6 months. We noted at the bug-scrub that a lot of code in this area has changed due to native histograms, so it probably has to be re-coded from scratch.

@bboreham bboreham closed this Jan 23, 2024
@bboreham bboreham reopened this Jan 30, 2024
@darshanime darshanime force-pushed the samples_accounting branch 2 times, most recently from 3592c27 to 468e039 Compare February 1, 2024 11:27
End: time.Unix(220, 0),
Interval: 5 * time.Second,
PeakSamples: 5,
PeakSamples: 8,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for reviewers: 8 because

  • 4 samples of metricWith1SampleEvery10Seconds loaded at ts 201, 206, 211, 216
  • 4 samples from 4 calls to funcTimestamp used to construct the response

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old value (5) is correct. We load 4 samples into memory for the four steps we evaluate. Then the timestamp function is evaluated in a loop for each, each time copying one sample, but those four copies are not kept in memory at the same time, but only one at a time. So peak of 5 is correct. (And this is in line with the suspicion I formulated above.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but those four copies are not kept in memory at the same time, but only one at a time.

I think this is incorrect and they are all stored in memory at the same time to construct the response matrix, see 1243


// Since we are copying the samples in memory here, check for ErrTooManySamples.
rangeEvalSamples += matrixes[i].TotalSamples()
if rangeEvalSamples+ev.currentSamples > ev.maxSamples {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for reviewers: we check for ErrTooManySamples after evaluating each expression and fail early if maxSamples breached.

Signed-off-by: darshanime <deathbullet@gmail.com>
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the internals of the PromQL engine, but I have the suspicion that this approach is doing something wrong. I need to dig deeper to find out more, but it would also be great if somebody with a better understanding of the engine could have a look.

if prepSeries != nil {
bufHelpers[i] = append(bufHelpers[i], seriesHelpers[i][si])
}
ev.currentSamples++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was wrong before. In case of a histogram, we also need to add H.Size()/16.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe not, because we don't Copy the histogram, just the pointer to it. In that case, we need a comment here. Will think more about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, pretty sure the current state is correct because we only copy the pointer. I'll add a code comment here in a PR that I'm already preparing for related issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- ev.currentSamples++

this line is deleted because we have already accounted all the samples loaded in line 1126

matrixes[i] = val.(Matrix)

// Since we are copying the samples in memory here, check for ErrTooManySamples.
rangeEvalSamples += matrixes[i].TotalSamples()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is correct. After the eval call in line 1121, ev.currentSamples should be updated with the number of samples in the result. By just adding that size again, aren't we doing the same thing twice?

In the loop in line 1190 and following, we are just finding one sample for each time stamp, and that sample contributes to the peak. I'm not sure my understanding is correct, but it seems this line heavily overestimates the peak.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By just adding that size again, aren't we doing the same thing twice?

We are doing that because we are copying the samples in line 1123

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, this check prevents possible OOM error since we check maxSamples after loading each expression, and not while processing them post loading...

@beorn7
Copy link
Member

beorn7 commented Feb 14, 2024

@codesome maybe you can help us out here?

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought more about this, I don't think the change in this PR is correct. I think it overcounts the samples kept in memory at peak. (And that's why all affected unit tests had to be adjusted up.)

I'm not sure it is even possible to implement a "pre-counting" correctly, but even if it is, I'm not sure it is worth it. Error'ing out with an exceeded sample count isn't supposed to happen often, so we would only save resources in the rare case that a query actually exceeds the limit. Sure, nice to know this more quickly, but is that worth added complexity?

I am not very familiar with the code touched here and I had to think about this PR for quite some time (which is not necessarily lost time because it helped me to get a bit more familiar with the code), but I would say in lack of somebody very familiar with this part of the codebase, let's rather close this PR. @darshanime if you feel confident you can solve this correctly (or maybe I got something wrong here and you can explain it to me), please let me know.

End: time.Unix(220, 0),
Interval: 5 * time.Second,
PeakSamples: 5,
PeakSamples: 8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old value (5) is correct. We load 4 samples into memory for the four steps we evaluate. Then the timestamp function is evaluated in a loop for each, each time copying one sample, but those four copies are not kept in memory at the same time, but only one at a time. So peak of 5 is correct. (And this is in line with the suspicion I formulated above.)

@beorn7
Copy link
Member

beorn7 commented Feb 28, 2024

BTW: The sub-queries are much harder to reason with. I cannot really explain the precise values in the tests. (Perhaps @codesome can, but he has still very limited availability.)

@beorn7
Copy link
Member

beorn7 commented Feb 28, 2024

In any case, I'll create a follow-up PR where I had histograms to the tests, because reviewing this PR showed me that they are still missing.

Copy link
Contributor Author

@darshanime darshanime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beorn7, ptal, I have replied to your comments. I think this patch makes the accounting more accurate and prevents potential OOM.

End: time.Unix(220, 0),
Interval: 5 * time.Second,
PeakSamples: 5,
PeakSamples: 8,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but those four copies are not kept in memory at the same time, but only one at a time.

I think this is incorrect and they are all stored in memory at the same time to construct the response matrix, see 1243

if prepSeries != nil {
bufHelpers[i] = append(bufHelpers[i], seriesHelpers[i][si])
}
ev.currentSamples++
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- ev.currentSamples++

this line is deleted because we have already accounted all the samples loaded in line 1126

matrixes[i] = val.(Matrix)

// Since we are copying the samples in memory here, check for ErrTooManySamples.
rangeEvalSamples += matrixes[i].TotalSamples()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By just adding that size again, aren't we doing the same thing twice?

We are doing that because we are copying the samples in line 1123

matrixes[i] = val.(Matrix)

// Since we are copying the samples in memory here, check for ErrTooManySamples.
rangeEvalSamples += matrixes[i].TotalSamples()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, this check prevents possible OOM error since we check maxSamples after loading each expression, and not while processing them post loading...

@beorn7
Copy link
Member

beorn7 commented Mar 7, 2024

Thanks for your comments. I'll look into them once I find time. (But I have to say that I feel heavily underqualified for this review. I guess we need help of somebody who is more familiar with this sample counting, maybe @jesusvazquez or @codesome . I'll take it as a challenge to learn more about it, but it will take me a lot of time, and I'm heavily distracted with other tasks.)

@beorn7
Copy link
Member

beorn7 commented Mar 26, 2024

Note to reviewers: #10369 is the PR that implemented the feature. This might be helpful to find out what the intended meaning of the stats fields is.

@beorn7
Copy link
Member

beorn7 commented Apr 9, 2024

Still not enough capacity on my side to generate conclusive evidence here, but with #13744 in, it might be worth updating this PR once more because #13744 simplifies code that is part of the discussion here, so after #13744, it might be easier to reason about.

@bboreham
Copy link
Member

bboreham commented Oct 8, 2024

Hello from the bug-scrub!

I can assign myself to review this, but I would like to know if you will update after #13744, @darshanime.

@bboreham bboreham self-assigned this Oct 8, 2024
@darshanime
Copy link
Contributor Author

i wont have time to get back to this rn, can close for now

@darshanime darshanime closed this Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants