
Filter reused guid:version strings from not blocked items in mlbf #22812

Merged (1 commit into master, Nov 1, 2024)

Conversation

@KevinMind (Contributor) commented Oct 31, 2024:

Fixes: mozilla/addons#15132

Description

Ensure duplicate guid:version strings are filtered out of not_blocked_items.
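As a rough illustration of the fix (a minimal sketch, not the actual MLBF implementation; the function name is made up): any hashed guid:version string that already appears in the blocked set must not appear again in the not-blocked set.

```python
# Illustrative only: drop any hashed "guid:version" string from the
# not-blocked list if it is already present in the blocked set.
def filter_not_blocked(blocked_items, not_blocked_items):
    blocked = set(blocked_items)
    return [item for item in not_blocked_items if item not in blocked]
```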

Context

After merging the earlier commit, it became clear on stage that we have duplicate hashed version strings in the MLBF data set.

Testing

Unit tested to cover the edge case.

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

@KevinMind requested review from a team, diox, and eviljeff, and removed the request for a team and diox, on November 1, 2024 08:55
@eviljeff (Member) left a comment:

Something for the next time we touch this code (soon!): we should have a check that there are no common items between the two sets, and raise an explicit error if there are. We spent way too long wondering whether there was some change in filter_cascade dependencies, because the error we got from there was an obscure one about error levels, and not "Don't have duplicates!"


The test assertions this review thread refers to:

(block_version,) = MLBF.hash_filter_inputs([(addon.guid, version)])
assert block_version in mlbf.data.blocked_items
assert block_version not in mlbf.data.not_blocked_items
@eviljeff (Member) commented on the assertions above:

Being paranoid: can we assert that (addon.guid, version) is in fetch_all_versions_from_db(MLBF._version_excludes)? i.e. that it would have been included otherwise?

@KevinMind (Contributor, author) replied:

I'd rather test the final data set than the underlying function. Or am I missing something in your ask?

@eviljeff (Member) replied on Nov 1, 2024:

I was looking for verification that the test is set up correctly: that we have a case where a guid, version pair would be picked up as not blocked and the same pair is blocked, but they don't both end up in the final data set. I suppose that's closer to a functional test than an integration test.
There are other ways to test that; it was just the first way I thought of.
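For illustration, a hedged sketch of that scenario; the factory helpers and the generate_from_db entry point are assumptions standing in for the repo's actual test utilities:

```python
def test_blocked_pair_not_repeated_in_not_blocked():
    # Assumed helpers: addon_factory / block_factory stand in for whatever
    # the real test suite uses to create an add-on and a matching block.
    addon = addon_factory(version_kw={'version': '1.0'})
    block_factory(guid=addon.guid, version_string='1.0')

    mlbf = MLBF.generate_from_db('test')  # assumed entry point
    (hashed,) = MLBF.hash_filter_inputs([(addon.guid, '1.0')])

    # The pair exists in the database, so it would have landed in
    # not_blocked_items had it not also been blocked; it must end up
    # in exactly one of the two sets.
    assert hashed in mlbf.data.blocked_items
    assert hashed not in mlbf.data.not_blocked_items
```

The last two assertions mirror the ones quoted in the thread above.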

@KevinMind (Contributor, author) commented:

> Something for the next time we touch this code (soon!): we should have a check that there are no common items between the two sets, and raise an explicit error if there are. […]

It's still not entirely clear whether this bug is the cause of the error we saw. I agree, though, that we could have a more explicit check that no entry ends up in multiple arrays; it's just not clear when or where that check should live.

@KevinMind merged commit f4da852 into master on Nov 1, 2024
31 checks passed
@KevinMind deleted the fix-duplicate-mlbf-items branch on November 1, 2024 10:21
@eviljeff (Member) commented on Nov 1, 2024:

> It's still not entirely clear whether this bug is the cause of the error we saw.

If I'm right we'll know soon 😄

> I agree, though, that we could have a more explicit check that no entry ends up in multiple arrays; it's just not clear when or where that check should live.

Probably just before where cascade.initialize errors in an obscure way, really. If it's gotten that far we've got a bug elsewhere that we can't (and shouldn't try to) recover from, but at least the error would be comprehensible in Sentry.
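A minimal sketch of what that guard could look like, assuming the cascade object is initialized with include/exclude arguments built from the two item lists (the surrounding names are illustrative):

```python
def initialize_cascade(cascade, blocked_items, not_blocked_items):
    # Fail loudly before handing overlapping data to filter_cascade,
    # instead of letting cascade.initialize() die with an obscure
    # message about error levels.
    overlap = set(blocked_items) & set(not_blocked_items)
    if overlap:
        raise ValueError(
            f'{len(overlap)} guid:version hashes appear in both the blocked '
            'and not-blocked sets; refusing to build the bloom filter.'
        )
    cascade.initialize(include=blocked_items, exclude=not_blocked_items)
```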

Successfully merging this pull request may close these issues.

[Bug]: Cannot generate bloom filter (probably due to duplicated entries in the include/exclude data set)