
bugfix: native histogram: data race in addExemplar #1609

Closed
wants to merge 4 commits

Conversation

krajorama
Member

@krajorama commented Aug 31, 2024

Fixes: #1605

The slice cap function is not thread-safe. Instead, reuse the ttl field and read it atomically, with -1 indicating that exemplars are turned off.

Includes the fix for #1607 from #1608.
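
For illustration, here is a minimal sketch of the idea. The names exemplar, nativeHistogram, and observeWithExemplar are simplified stand-ins, not the actual client_golang code:

```go
package sketch

import (
	"sync"
	"sync/atomic"
	"time"
)

// exemplar is a simplified stand-in for the real exemplar proto type.
type exemplar struct {
	value float64
	ts    time.Time
}

// nativeHistogram is a stand-in for the real histogram state.
type nativeHistogram struct {
	// ttl holds the exemplar TTL in nanoseconds, with -1 meaning
	// "exemplars turned off". Reading it via an atomic avoids the data
	// race that checking cap(h.exemplars) outside the lock would cause.
	ttl       atomic.Int64
	mtx       sync.Mutex
	exemplars []exemplar
}

func (h *nativeHistogram) observeWithExemplar(e exemplar) {
	if h.ttl.Load() == -1 {
		return // Exemplars disabled; never touch the slice.
	}
	h.mtx.Lock()
	defer h.mtx.Unlock()
	// The real addExemplar logic (TTL-based eviction, closest-pair
	// replacement) would run here; a plain append stands in for it.
	h.exemplars = append(h.exemplars, e)
}
```

The trade-off raised in the review below is that this atomic load now sits on the hot path of every ObserveWithExemplar call.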

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Instead of checking the non-thread-safe slice capacity.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
mdIdx was redundant when len(exemplars) > 1, so it was removed; rIdx
is enough.

Don't compare the timestamp of the incoming exemplar to the timestamp
of the minimal-distance exemplar. Most of the time the incoming
exemplar will be newer, and if it isn't, the previous code simply
replaced the exemplar one index after the minimal-distance exemplar,
which had an index-out-of-range bug and was essentially random anyway.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

# Conflicts:
#	prometheus/histogram.go
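
As an aside, here is a rough sketch of the replacement idea described in the commit message above. It assumes the exemplar slice is full and kept sorted by value, redeclares the simplified exemplar type from the earlier sketch, and uses a hypothetical helper name replaceClosest; the real addExemplar code in prometheus/histogram.go differs in detail:

```go
package sketch

import (
	"math"
	"time"
)

// exemplar is the same simplified stand-in type as in the earlier sketch.
type exemplar struct {
	value float64
	ts    time.Time
}

// replaceClosest finds the neighbouring pair of exemplars whose values are
// closest on a log scale and overwrites the older exemplar of that pair
// (rIdx) with the new exemplar n. The timestamp of n itself is never
// consulted, matching the commit message above.
func replaceClosest(exemplars []exemplar, n exemplar) {
	if len(exemplars) < 2 {
		return
	}
	var (
		md   = -1.0 // Smallest log-distance between neighbours so far; -1 means "none yet".
		rIdx = -1   // Index of the older exemplar within the closest pair.
		pLog = math.Log(exemplars[0].value)
	)
	for i := 1; i < len(exemplars); i++ {
		cLog := math.Log(exemplars[i].value)
		if diff := math.Abs(cLog - pLog); md == -1 || diff < md {
			md = diff
			rIdx = i - 1
			if exemplars[i].ts.Before(exemplars[i-1].ts) {
				rIdx = i // exemplars[i] is the older of the pair, so replace it instead.
			}
		}
		pLog = cLog
	}
	exemplars[rIdx] = n
}
```

The write goes to rIdx directly, so there is no rIdx+1 access that could run out of range.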
@krajorama marked this pull request as ready for review August 31, 2024 11:08
@krajorama requested a review from beorn7 August 31, 2024 11:08
@krajorama
Member Author

cc @fatsheep9146

@beorn7
Member

beorn7 commented Sep 1, 2024

Code looks good, but I'm not quite sure if we should do this kind of optimization via ttl as an atomic int. Those atomic operations still have a cost (and it could be significant on systems with many cores). The usual code path is to call ObserveWithExemplar on a histogram with configured exemplars, and just call Observe on a histogram where exemplars are not configured. The former case becomes even slower with this change here, while the latter case isn't affected at all.
I'm wondering if there is a legitimate use case for calling ObserveWithExemplar on a histogram with exemplars disabled. And even if so, if it is common enough to make this PR a net gain.

@krajorama
Member Author

> Code looks good, but I'm not quite sure if we should do this kind of optimization via ttl as an atomic int. Those atomic operations still have a cost (and it could be significant on systems with many cores). The usual code path is to call ObserveWithExemplar on a histogram with configured exemplars, and just call Observe on a histogram where exemplars are not configured. The former case becomes even slower with this change here, while the latter case isn't affected at all. I'm wondering if there is a legitimate use case for calling ObserveWithExemplar on a histogram with exemplars disabled. And even if so, if it is common enough to make this PR a net gain.

Well, TTL is constant, so in theory we don't need to use atomics. Pushed that version.
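
Roughly, the pushed version looks like this, again with the simplified stand-in names from the earlier sketch, now with ttl as a plain field rather than an atomic:

```go
package sketch

import (
	"sync"
	"time"
)

// exemplar is the same simplified stand-in as before.
type exemplar struct {
	value float64
	ts    time.Time
}

// nativeHistogram now keeps ttl as a plain field: it is assigned exactly
// once at construction time and never written again, so concurrent readers
// can check it without atomics or locks.
type nativeHistogram struct {
	ttl       time.Duration // -1 means exemplars are turned off.
	mtx       sync.Mutex
	exemplars []exemplar
}

func (h *nativeHistogram) observeWithExemplar(e exemplar) {
	if h.ttl == -1 {
		return // Exemplars disabled; no slice access on this path.
	}
	h.mtx.Lock()
	defer h.mtx.Unlock()
	h.exemplars = append(h.exemplars, e) // Placeholder for the real addExemplar logic.
}
```

Because ttl is effectively immutable after the constructor returns, the unsynchronized read is safe as long as the histogram itself is published to other goroutines safely, which this sketch assumes.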

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Member

@ArthurSens left a comment

Thanks for updating the test in the first commit to show the bug. Helps a lot with the review :)

I do understand that using ttl instead of cap(n.exemplars) solves the data race condition, so LGTM now that tests pass.

About the index-out-of-range problem: I'm having a hard time understanding the codebase, but I trust Beorn's expertise here. Out of scope for this PR, but I think more descriptive variable names could help.

Comment on lines +1722 to +1725

```go
md   = -1.0  // Logarithm of the delta of the closest pair of exemplars.
rIdx = -1    // Index of the older exemplar within the closest pair and where we need to insert the new exemplar.
cLog float64 // Logarithm of the current exemplar.
pLog float64 // Logarithm of the previous exemplar.
```
Member

Hmmm, totally out of scope for this PR, but more descriptive variable names would remove the necessity for commenting on their meaning 😕

I'm not sure if I'm missing something, but is there a reason to name these variables rIdx, cLog, pLog instead of removeIndex, currentLog or previousLog?

Member

Urgh, reading it again, replaceIndex is probably a better name for rIdx. Does it make nIdx obsolete? (Do I understand correctly that nIdx is newIndex? 😅)

Member

Historically, we had very different "tastes" playing out between various Prometheus developers, all the way from very long and descriptive names to "use one letter names whenever possible" (slightly exaggerated).

My observation is that Go developers tend towards shorter variable names (have a look at the standard library, for example). But we never explicitly stated a preference for the Prometheus project. In the absence of a specific guideline, I would go with general good programming practices like "the wider the scope of a variable, the more descriptive its name".

In this case, the variables are local to a (fairly long) function, so I wouldn't be too demanding about the descriptiveness of their names. Furthermore, even previousLog wouldn't really be self-explanatory. (Previous log line?) And a name like logarithmOfPreviousExemplar would clearly be too clunky. Given that the comment is needed anyway, I actually like the names as they are. Still very short, but better than just c, p, and r (which others might even prefer).

@ArthurSens
Member

Should we close #1608 in favor of this one?

@krajorama
Member Author

> Should we close #1608 in favor of this one?

I'd like to keep the discussion of the data race topic here. If Beorn says OK, I'll close this PR and just update the solution in #1608 with the one agreed here, and we can discuss the algorithm there.

@ArthurSens
Member

> > Should we close #1608 in favor of this one?

> I'd like to keep the discussion of the data race topic here. If Beorn says OK, I'll close this PR and just update the solution in #1608 with the one agreed here, and we can discuss the algorithm there.

ttl doesn't receive any writes after its creation. Looks safe and efficient to me :)

@krajorama
Member Author

> > > Should we close #1608 in favor of this one?

> > I'd like to keep the discussion of the data race topic here. If Beorn says OK, I'll close this PR and just update the solution in #1608 with the one agreed here, and we can discuss the algorithm there.

> ttl doesn't receive any writes after its creation. Looks safe and efficient to me :)

Ack, I'll update the other PR and we can just close this if we merge that.

krajorama added a commit to krajorama/client_golang that referenced this pull request Sep 2, 2024
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Member Author

krajorama commented Sep 3, 2024

Discussed on chat with @beorn7; using ttl == -1 is OK. Closing this as concluded. Solution delivered in #1607.

@krajorama closed this Sep 3, 2024
krajorama added a commit to krajorama/client_golang that referenced this pull request Sep 4, 2024
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Development

Successfully merging this pull request may close these issues.

Data race in native histogram
3 participants