ref(grouping): Use varying grouphash cache retention #104460
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master  #104460   +/-  ##
========================================
  Coverage   80.51%   80.51%
========================================
  Files        9348     9348
  Lines      399919   399929   +10
  Branches    25651    25651
========================================
+ Hits       322005   322022   +17
+ Misses      77466    77459    -7
  Partials      448      448
```
```python
def _get_cache_expiry(cache_key: str, cache_type: Literal["existence", "object"]) -> int:
    option_name = f"grouping.ingest_grouphash_{cache_type}_cache_expiry.trial_values"
    possible_cache_expiries = options.get(option_name)
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]
```
You're currently choosing the experimental group by hashing on `cache_key`, which is group-level granularity. This means that a user / project / organization could hit different experimental groups for different groups, which doesn't seem ideal... but given that (1) different organizations could be at different "hotness" levels, which could skew our numbers, and (2) this is a fairly backend-internal, shared cache, it should be fine.
TL;DR I talked myself out of my proposed change and think this is good now. Leaving comment for posterity. 👍
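For illustration, a minimal sketch of the bucketing behavior being discussed, with hypothetical trial values (not the actual option contents):

```python
# Illustrative trial values, in seconds.
possible_cache_expiries = [3600, 7200, 14400]

def pick_expiry(cache_key: str) -> int:
    # Within a given process, the same key always lands in the same bucket
    # (Python salts str hashes per process), but two keys belonging to the
    # same project or org can land in different buckets.
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]
```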
Great. Honestly, I could have just done it as a die-roll - `math.floor(random() * 3)` or something - because it really shouldn't depend on project/org/etc, for exactly the reason you state: that might introduce other variables into the experiment, because not all projects/orgs/etc are created equal. (It's even interesting to see that not all regions are created equal, either - some have consistently higher hit rates than others.)
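In Python, that die-roll alternative would be something like the following sketch (trial values illustrative):

```python
import random

# Illustrative trial values, in seconds.
possible_cache_expiries = [3600, 7200, 14400]

def pick_expiry_randomly() -> int:
    # Uniform random choice: no dependence on the cache key, project, or org.
    return random.choice(possible_cache_expiries)
```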
```python
def _get_cache_expiry(cache_key: str, cache_type: Literal["existence", "object"]) -> int:
    option_name = f"grouping.ingest_grouphash_{cache_type}_cache_expiry.trial_values"
    possible_cache_expiries = options.get(option_name)
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]
```
Bug: `_get_cache_expiry()` can cause a `ZeroDivisionError` if the `grouping.ingest_grouphash_cache_expiry.trial_values` option is an empty list.
Severity: CRITICAL | Confidence: High
🔍 Detailed Analysis
The `_get_cache_expiry()` function in `src/sentry/grouping/ingest/hashing.py` can raise a `ZeroDivisionError` if the `grouping.ingest_grouphash_cache_expiry.trial_values` option is configured as an empty list. This occurs because the modulo operation `hash(cache_key) % len(possible_cache_expiries)` attempts to divide by zero when `len(possible_cache_expiries)` is 0. The option is modifiable at runtime and lacks validation to prevent empty sequences, leading to an unhandled crash during event ingest.
💡 Suggested Fix
Add validation to the option to prevent empty sequences, or implement a defensive length check before the modulo operation in `_get_cache_expiry()`.
🤖 Prompt for AI Agent

```
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why
it's not valid.

Location: src/sentry/grouping/ingest/hashing.py#L213-L216

Potential issue: The `_get_cache_expiry()` function in
`src/sentry/grouping/ingest/hashing.py` can raise a `ZeroDivisionError` if the
`grouping.ingest_grouphash_cache_expiry.trial_values` option is configured as an
empty list. This occurs because the modulo operation
`hash(cache_key) % len(possible_cache_expiries)` attempts to divide by zero when
`len(possible_cache_expiries)` is 0. The option is modifiable at runtime and lacks
validation to prevent empty sequences, leading to an unhandled crash during event
ingest.
```
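For reference, a defensive version along the lines of the suggested fix might look like the sketch below; the fallback constant is hypothetical, not something from the PR:

```python
DEFAULT_CACHE_EXPIRY = 3600  # hypothetical fallback, in seconds

def _get_cache_expiry_defensive(cache_key: str, possible_cache_expiries: list[int]) -> int:
    if not possible_cache_expiries:
        # Avoid the modulo-by-zero the report describes.
        return DEFAULT_CACHE_EXPIRY
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]
```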
Yup, except that `ALLOW_EMPTY` isn't set on the option, so it will never be empty. So good instincts, Seer, but not applicable in this case.
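A rough sketch of what that means at registration time, assuming Sentry's option-registration API (the flags and default shown here are illustrative, not the actual registration): without `FLAG_ALLOW_EMPTY`, the option store rejects empty values.

```python
from sentry.options import FLAG_AUTOMATOR_MODIFIABLE, register

register(
    "grouping.ingest_grouphash_existence_cache_expiry.trial_values",
    default=[60 * 60],  # hypothetical default
    # No FLAG_ALLOW_EMPTY here, so the option can never be set to an empty list.
    flags=FLAG_AUTOMATOR_MODIFIABLE,
)
```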
…4537)

Currently, we only tag our `grouping.get_or_create_grouphashes.check_secondary_hash_existence` and `grouping.get_or_create_grouphashes.get_or_create_grouphash` metrics with the cache expiry time in cases where there's a cache hit. When there was only one possible expiry time, that worked fine, because anything that was a cache miss also can be assumed to have used that same expiry period. However, now that we're testing multiple expiry times at once[1], it's impossible to know which expiry time led to a given miss, which in turn throws off our hit-rate calculations.

To fix that problem, this PR adds the `expiry_seconds` tag to misses as well. Since by definition misses don't find anything in the cache, this also switches from using a cached expiry value to using the current expiry value. Finally, so that using the current value doesn't pollute results based on hits stored with an old value, it also introduces the use of cache versioning. That way, any time the list of potential cache times changes, no previously-stored values will be found.

[1] #104460
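A condensed sketch of that idea, assuming a Django-style cache client and Sentry's `metrics` helper (the function name and version constant are illustrative, not the exact code from the PR):

```python
from django.core.cache import cache
from sentry.utils import metrics

GROUPHASH_CACHE_VERSION = 2  # hypothetical: bump whenever trial_values changes

def record_existence_check(cache_key: str, current_expiry: int) -> bool:
    # Versioned read: entries stored under an older version are never found.
    hit = cache.get(cache_key, version=GROUPHASH_CACHE_VERSION) is not None
    # Tag hits AND misses with the expiry, so hit-rate math stays honest.
    metrics.incr(
        "grouping.get_or_create_grouphashes.check_secondary_hash_existence",
        tags={"cache_hit": str(hit), "expiry_seconds": str(current_expiry)},
    )
    return hit
```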
When we began to talk about caching grouphashes during ingest, one of the considerations was how much space in the cache we'd end up using. To give ourselves a ceiling, we took the worst-case scenario and made sure it wasn't going to be a problem. According to ops it wasn't, so we proceeded.
In that worst-case scenario, we never get a cache hit, every grouphash is eligible to be cached (has a group already assigned), and every event has two different grouphashes associated with it. In that case, every event we see would lead to two new hashes being added to the cache. We knew that wasn't actually going to be true - we definitely will get hits, we definitely will encounter new grouphashes (which we won't put in the cache because they aren't yet assigned to a group), and not every event has two grouphashes. That said, until we started doing the actual caching, we didn't know just how far from the truth that worst-case scenario was going to be.
Now that we have rolled it out, we can see that in fact hashes are being added to the cache only about 12% of the time. This means we can safely play around with longer cache retention times to see if it will meaningfully improve our hit rate. To make such experimentation easier, this PR introduces the ability to have the retention time be variable, so that we can compare different values against each other. Instead of a fixed number, we now will be choosing from a list of possible times. So that we can accurately report the expiry time which was used, we're also for now going to cache the time alongside the value itself. (I tried to cache them together as a tuple, but that ended up not working, so instead the time is cached separately with `_expiry` added onto the end of the cache key.)

Once we find the happy medium between hit rate and storage space, we can back out these changes and go back to using a fixed number. I've therefore left TODOs explaining how to do that so it will be easy to switch back.
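A minimal sketch of that parallel-key approach, assuming a Django-style cache client (the function name is hypothetical):

```python
from django.core.cache import cache

def cache_grouphash(cache_key: str, value: object, expiry_seconds: int) -> None:
    cache.set(cache_key, value, timeout=expiry_seconds)
    # Store the expiry under a sibling key, with the same lifetime, so hits
    # can later report which retention time produced them.
    cache.set(f"{cache_key}_expiry", expiry_seconds, timeout=expiry_seconds)
```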