-
Notifications
You must be signed in to change notification settings - Fork 0
Adds categories to SuggestedResources and Patterns #213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why are these changes being introduced: * Expand the categorization process to account for matches by the SuggestedResource and SuggestedPattern Detectors Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-91 * https://mitlibraries.atlassian.net/browse/TCO-145 (partial) How does this address that need: * Adds ActiveRecord associations to Category for SuggestedPattern * Adds ActiveRecord associations to Category for SuggestedResource * Adds limited Dashboard for managing SuggestedResources. Only allows editing of Category to prevent extremely slow loading and poor UI if we allow editing linked Terms (i.e. we'll need custom UI eventually but for now will manage Terms for SuggestedResources in the console) * Adds Dashboard for SuggestedPatterns * Adds logic in Term model to add confidence scores for SuggestedPattern and SuggestedResource which are quite different from other Detectors so they could not follow the same model (they work on a per record basis rather than a Detector basis) Document any side effects to this change: * Changes unique index on Categorization to include confidence in order to address an edge case that prevented new Categorizations to be made even when one was not found
fadc2b4 to
c1d50ec
Compare
matt-bernhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've understood almost all of what's going on here, and it seems generally okay. I've got three questions, but am not sure that they rise to the level of requesting changes.
- Can you explain a bit more about the edge case which motivates the second migration (changing the uniqueness constraint on categorization records)?
- Given the logic changes around handling confidence scores in the term model, do we need any changes or additions to the tests for that method? I see the new tests around categorizations being created, but nothing addressing the ways that the scores are calculated.
- By assigning confidence values in code for all Suggested* records at 0.9, we're now putting these values in two places - as database records on the DetectorCategories model, and in code for Suggested* records. I'm slightly concerned at storing scores like this in two different places, and at two levels of detail (here we're assigning a blanket score for each Suggested* detector across all categories, while other Detectors have the option of getting scores specific to each category)
| remove_index :categorizations, [:term_id, :category_id, :detector_version] | ||
| remove_index :categorizations, [:category_id, :term_id, :detector_version] | ||
|
|
||
| add_index :categorizations, [:term_id, :category_id, :confidence, :detector_version], unique: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making sure that I'm following along...
This change is made because the previous uniqueness constraint on the combination of Terms and Categories is now too restrictive, because a Term can end up being joined to a category multiple times (via a traditional Detector and a SuggestedResource for example?)
I'm asking because I had thought we resolved the concern about multiple potential linkages between Terms and Categories by calculating an overall confidence value across all detections in the calculate_categorizations Term method, and I'm trying to make sure I understand what the gap is that this closes.
I might be misunderstanding what's going on here, however. I see the side effect note about there being an edge case that prevents categorizations from being created, and I'm trying to understand what that is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because confidence was not part of the unique index previously, we'd get a state in which there was no record but we could not create one. Changing confidence should not require triggering a DETECTOR_VERSION update as it is not a Detector, it's just our interpretation of what a Detection means. No code change in a detector should not require a version bump is another way to look at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure I totally understand this, but I think I follow it enough not to want to block the PR based on this. If something crystallizes for me down the road and I think there's a problem, we can work it out then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also feel free to bring it up in a 1:1 at some point as it may be easier to talk through synchronously :)
|
Re your question 3: I didn't like storing scores in multiple places either, and until you wrote it like that I didn't realize we could just add a confidence field to each SuggestedPattern/SuggestedResource. It's a bit awkward because it feels unlikely to change across each Pattern/Resource but maybe that can be addressed by setting a default value which would both make this consistent with other Detectors and allow for saying "this one particular SuggestedResource is pretty sketch and should be low confidence". |
Why are these changes being introduced: * Storing values in the models rather than in the code better aligns with confidence from detectors Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-91 How does this address that need: * Adds confidence fields for SuggestedPattern and SuggestedResources Document any side effects to this change: * It feels clunky to assign confidence to each Resource or Pattern. From a data model perspective if feels okay. From a practical standpoint I'm nervous the people managing Resources and Patterns may not know what value is appropriate.
|
@matt-bernhardt I've added a commit that moves confidence into the models. This allows us to control on a per record level, but is defaulted to .90 (which is the value I had initially in the code). If no category is assigned, a resource is not categorized and the confidence (regardless of what it is set at) does nothing. I find this a bit awkward, but I'm not sure I have a different idea at this time. |
|
I like this arrangement better, particularly the consistency in handling confidence values outside of code. Having a starting point defined via default value (which we expect to rely heavily on) is fine too. I agree that there's an awkwardness about what happens in a category isn't defined, but I see the tests to confirm that we know how that will be handled should it arise. Thanks - |
matt-bernhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this looks good to me.
![]()
Why are these changes being introduced:
Relevant ticket(s):
How does this address that need:
Document any side effects to this change:
Summary of changes (please refer to commit messages for full details)
Developer
Ticket(s)
https://mitlibraries.atlassian.net/browse/TCO-###
Accessibility
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
Documentation
ENV
Stakeholders
Dependencies and migrations
NO dependencies are updated
YES migrations are included
Reviewer
Code
added technical debt.
Documentation
(not just this pull request message).
Testing