
Conversation

@lobsterkatie (Member) commented Dec 5, 2025

When we began to talk about caching grouphashes during ingest, one of the considerations was how much space in the cache we'd end up using. To give ourselves a ceiling, we took the worst-case scenario and made sure it wasn't going to be a problem. According to ops it wasn't, so we proceeded.

In that worst-case scenario, we never get a cache hit, every grouphash is eligible to be cached (has a group already assigned), and every event has two different grouphashes associated with it. In that case, every event we see would lead to two new hashes being added to the cache. We knew that wasn't actually going to be true - we definitely will get hits, we definitely will encounter new grouphashes (which we won't put in the cache because they aren't yet assigned to a group), and not every event has two grouphashes. That said, until we started doing the actual caching, we didn't know just how far from the truth that worst-case scenario was going to be.

Now that we have rolled it out, we can see that in fact hashes are being added to the cache only about 12% of the time. This means we can safely play around with longer cache retention times, to see if they will meaningfully improve our hit rate. To make such experimentation easier, this PR makes the retention time variable, so that we can compare different values against each other. Instead of using a fixed number, we'll now choose from a list of possible times. So that we can accurately report which expiry time was used, for now we're also going to cache the time alongside the value itself. (I tried to cache them together as a tuple, but that ended up not working, so instead the time is cached separately, with `_expiry` added onto the end of the cache key.)
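For illustration, the write path the description implies might look roughly like this (a hypothetical sketch based on the paragraph above, not the PR's actual code; the helper name and the use of Django's `cache` are assumptions):

from django.core.cache import cache

def _cache_grouphash_object(cache_key: str, grouphash: object) -> None:
    # Pick one of the trial expiry times for this key (see _get_cache_expiry below)
    expiry = _get_cache_expiry(cache_key, "object")
    cache.set(cache_key, grouphash, timeout=expiry)
    # Caching value and expiry together as a tuple didn't work out, so the
    # expiry lives under its own key: the original key plus an `_expiry` suffix
    cache.set(f"{cache_key}_expiry", expiry, timeout=expiry)

Storing the expiry under a parallel key means a later cache hit can report which trial expiry produced it.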

Once we find the happy medium between hit rate and storage space, we can back out these changes and go back to using a fixed number. I've therefore left TODOs explaining how to do that so it will be easy to switch back.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 5, 2025
codecov bot commented Dec 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #104460   +/-   ##
========================================
  Coverage   80.51%    80.51%           
========================================
  Files        9348      9348           
  Lines      399919    399929   +10     
  Branches    25651     25651           
========================================
+ Hits       322005    322022   +17     
+ Misses      77466     77459    -7     
  Partials      448       448           

def _get_cache_expiry(cache_key: str, cache_type: Literal["existence", "object"]) -> int:
    # Each cache type has its own option listing the trial expiry values
    option_name = f"grouping.ingest_grouphash_{cache_type}_cache_expiry.trial_values"
    possible_cache_expiries = options.get(option_name)
    # Hash the cache key to deterministically pick one of the trial expiries
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]
Contributor
You're currently choosing the experimental group by hashing on cache_key, which is group-level granularity. This means that a user / project / organization could hit different experimental groups for different groups, which doesn't seem ideal... but given that (1) different organizations could be at different "hotness" levels, which could skew our numbers, and (2) this is a purely backend, shared cache, it should be fine.

TL;DR I talked myself out of my proposed change and think this is good now. Leaving comment for posterity. 👍

Member Author

Great. Honestly, I could have just done it as a die-roll - math.floor(random() * 3) or something - because it really shouldn't depend on project/org/etc, for exactly the reason you state: that might introduce other variables into the experiment, because not all projects/orgs/etc are created equal. (It's even interesting to see that not even all regions are created equal - some have consistently higher hit rates than others.)
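For what it's worth, a die-roll version might look like this (purely illustrative, not code from the PR; `options` is Sentry's option store):

import random

from sentry import options

def _get_cache_expiry_by_die_roll(cache_type: str) -> int:
    # Hypothetical alternative: pick a trial expiry uniformly at random,
    # with no dependence on cache key, project, org, or region
    option_name = f"grouping.ingest_grouphash_{cache_type}_cache_expiry.trial_values"
    return random.choice(options.get(option_name))

One thing hashing the key buys over a pure die roll is stability: the same grouphash always lands in the same expiry bucket, which matters once you try to attribute individual hits and misses to a particular expiry (see the follow-up commit below).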

@lobsterkatie lobsterkatie marked this pull request as ready for review December 5, 2025 17:59
@lobsterkatie lobsterkatie requested a review from a team as a code owner December 5, 2025 17:59
@lobsterkatie lobsterkatie merged commit 5e34163 into master Dec 5, 2025
68 checks passed
@lobsterkatie lobsterkatie deleted the kmclb-use-varying-grouphash-cache-retention branch December 5, 2025 17:59
Comment on lines +213 to +216
def _get_cache_expiry(cache_key: str, cache_type: Literal["existence", "object"]) -> int:
    option_name = f"grouping.ingest_grouphash_{cache_type}_cache_expiry.trial_values"
    possible_cache_expiries = options.get(option_name)
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]

Bug: _get_cache_expiry() can cause ZeroDivisionError if grouping.ingest_grouphash_cache_expiry.trial_values option is an empty list.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

The _get_cache_expiry() function in src/sentry/grouping/ingest/hashing.py can raise a ZeroDivisionError if the grouping.ingest_grouphash_cache_expiry.trial_values option is configured as an empty list. This occurs because the modulo operation hash(cache_key) % len(possible_cache_expiries) attempts to divide by zero when len(possible_cache_expiries) is 0. The option is modifiable at runtime and lacks validation to prevent empty sequences, leading to an unhandled crash during event ingest.

💡 Suggested Fix

Add validation to the option to prevent empty sequences, or implement a defensive length check before the modulo operation in _get_cache_expiry().
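A defensive version might look like this (an illustrative sketch, not the PR's code; the fallback constant is hypothetical):

from typing import Literal

from sentry import options

DEFAULT_CACHE_EXPIRY = 3600  # hypothetical fallback, in seconds

def _get_cache_expiry(cache_key: str, cache_type: Literal["existence", "object"]) -> int:
    option_name = f"grouping.ingest_grouphash_{cache_type}_cache_expiry.trial_values"
    possible_cache_expiries = options.get(option_name)
    # Guard against an empty or missing list so the modulo below can't divide by zero
    if not possible_cache_expiries:
        return DEFAULT_CACHE_EXPIRY
    return possible_cache_expiries[hash(cache_key) % len(possible_cache_expiries)]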


Member Author

Yup, except that ALLOW_EMPTY isn't set on the option, so it never will be empty. So good instincts, Seer, but not applicable in this case.

lobsterkatie added a commit that referenced this pull request Dec 9, 2025
…4537)

Currently, we only tag our `grouping.get_or_create_grouphashes.check_secondary_hash_existence` and `grouping.get_or_create_grouphashes.get_or_create_grouphash` metrics with the cache expiry time in cases where there's a cache hit. When there was only one possible expiry time, that worked fine, because anything that was a cache miss could also be assumed to have used that same expiry period. However, now that we're testing multiple expiry times at once[1], it's impossible to know which expiry time led to a given miss, which in turn throws off our hit-rate calculations.

To fix that problem, this PR adds the `expiry_seconds` tag to misses as well. Since by definition misses don't find anything in the cache, this also switches from using a cached expiry value to using the current expiry value. Finally, so that using the current value doesn't pollute results based on hits stored with an old value, it also introduces the use of cache versioning. That way, any time the list of potential cache times changes, no previously-stored values will be found.


[1] #104460
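To make the mechanism concrete, here's a rough sketch of what the commit describes, using the existence cache as an example (the function name and structure are assumptions; only the metric and option names come from the text above):

from django.core.cache import cache

from sentry import options
from sentry.utils import metrics

def _check_existence_with_tagged_metrics(cache_key: str) -> bool:
    expiry = _get_cache_expiry(cache_key, "existence")
    trial_values = options.get("grouping.ingest_grouphash_existence_cache_expiry.trial_values")
    # Version the cache on the current trial-values list, so that changing the
    # list means previously stored entries are simply never found again
    version = hash(tuple(trial_values))

    hit = cache.get(cache_key, version=version) is not None
    # Tag hits AND misses with an expiry; on a miss there's nothing cached to
    # report, so the current (deterministically chosen) expiry is used instead
    metrics.incr(
        "grouping.get_or_create_grouphashes.check_secondary_hash_existence",
        tags={"result": "hit" if hit else "miss", "expiry_seconds": expiry},
    )
    if not hit:
        cache.set(cache_key, True, timeout=expiry, version=version)
    return hit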
ryan953 pushed a commit that referenced this pull request Dec 9, 2025