-
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
chore: Split explorer index projects over 23h #106563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Split explorer index projects over 23h #106563
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| continue | ||
|
|
||
| if project.id % 23 != django_timezone.now().hour: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hour sharding uses modulo 23 but hours range 0-23
Medium Severity
The sharding logic project.id % 23 != django_timezone.now().hour has an off-by-one mismatch. project.id % 23 produces values 0-22 (23 distinct values), but django_timezone.now().hour produces values 0-23 (24 distinct values). When the current hour is 23, no projects will ever match because no project ID modulo 23 can equal 23. This means projects are never indexed during hour 23 (11pm) each day, creating a gap in the indexing schedule.
| continue | ||
|
|
||
| if project.id % 23 != django_timezone.now().hour: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hour check evaluated repeatedly can cause race condition
Medium Severity
The django_timezone.now().hour is called fresh for each project inside the RangeQuerySetWrapper loop. If the iteration takes time and crosses an hour boundary, projects evaluated before the boundary are checked against hour N, while projects evaluated after are checked against hour N+1. This causes some shard-N projects to be skipped (not processed for another 24 hours) and some shard-N+1 projects to be processed early (then again at the next scheduled run). The hour value needs to be captured once before the loop begins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems unlikely to happen in prod but probably good practice in case it cases instability in tests.
| Tuple of (project_id, organization_id) | ||
| """ | ||
| projects = Project.objects.filter(status=ObjectStatus.ACTIVE).select_related("organization") | ||
| current_hour = django_timezone.now().hour | ||
|
|
||
| for project in RangeQuerySetWrapper( | ||
| projects, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: The sharding logic project.id % 23 will never equal current_hour when the hour is 23, causing projects for that hour to be skipped.
Severity: MEDIUM
Suggested Fix
To ensure all 24 hours are utilized, change the modulo divisor from 23 to 24. The condition should be updated to project.id % 24 == current_hour.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/tasks/seer_explorer_index.py#L40-L46
Potential issue: The project filtering logic uses the expression `project.id % 23 ==
current_hour` to distribute the indexing workload across 23 hours of the day. However,
when the `current_hour` is 23, this condition can never be met because the result of a
modulo 23 operation is always an integer between 0 and 22, inclusive. As a result, any
projects that would have been scheduled for the 23rd hour will be skipped, and their
data will not be indexed by the Seer service. This is a silent failure that affects a
subset of projects.
Did we get this right? 👍 / 👎 to inform future reviews.
| continue | ||
|
|
||
| if project.id % 23 != django_timezone.now().hour: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems unlikely to happen in prod but probably good practice in case it cases instability in tests.
Flagpole evaluations take about 10ms, which lets us evaluate about 60k projects
in 10 min. This is not fast enough. This PR is doing a couple things: