Adds categories to SuggestedResources and Patterns #213

JPrevost · 2025-04-23T17:49:52Z

Why are these changes being introduced:

Expand the categorization process to account for matches by the SuggestedResource and SuggestedPattern Detectors

Relevant ticket(s):

How does this address that need:

Adds ActiveRecord associations to Category for SuggestedPattern
Adds ActiveRecord associations to Category for SuggestedResource
Adds limited Dashboard for managing SuggestedResources. Only allows editing of Category to prevent extremely slow loading and poor UI if we allow editing linked Terms (i.e. we'll need custom UI eventually but for now will manage Terms for SuggestedResources in the console)
Adds Dashboard for SuggestedPatterns
Adds logic in Term model to add confidence scores for SuggestedPattern and SuggestedResource which are quite different from other Detectors so they could not follow the same model (they work on a per record basis rather than a Detector basis)

Document any side effects to this change:

Changes unique index on Categorization to include confidence in order to address an edge case that prevented new Categorizations to be made even when one was not found

Summary of changes (please refer to commit messages for full details)

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-###

Accessibility

ANDI or Wave has been run in accordance to our guide and
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
There are no accessibility implications to this change

Documentation

Project documentation has been updated, and yard output previewed
No documentation changes are needed

ENV

All new ENV is documented in README.
All new ENV has been added to Heroku Pipeline, Staging and Prod.
ENV has not changed.

Stakeholders

Stakeholder approval has been confirmed
Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

I have confirmed that the code works as intended.
Any CodeClimate issues have been fixed or confirmed as
added technical debt.

Documentation

The commit message is clear and follows our guidelines
(not just this pull request message).
The documentation has been updated or is unnecessary.
New dependencies are appropriate or there were no changes.

Testing

There are appropriate tests covering any new functionality.
No additional test coverage is required.

Why are these changes being introduced: * Expand the categorization process to account for matches by the SuggestedResource and SuggestedPattern Detectors Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-91 * https://mitlibraries.atlassian.net/browse/TCO-145 (partial) How does this address that need: * Adds ActiveRecord associations to Category for SuggestedPattern * Adds ActiveRecord associations to Category for SuggestedResource * Adds limited Dashboard for managing SuggestedResources. Only allows editing of Category to prevent extremely slow loading and poor UI if we allow editing linked Terms (i.e. we'll need custom UI eventually but for now will manage Terms for SuggestedResources in the console) * Adds Dashboard for SuggestedPatterns * Adds logic in Term model to add confidence scores for SuggestedPattern and SuggestedResource which are quite different from other Detectors so they could not follow the same model (they work on a per record basis rather than a Detector basis) Document any side effects to this change: * Changes unique index on Categorization to include confidence in order to address an edge case that prevented new Categorizations to be made even when one was not found

matt-bernhardt

I think I've understood almost all of what's going on here, and it seems generally okay. I've got three questions, but am not sure that they rise to the level of requesting changes.

Can you explain a bit more about the edge case which motivates the second migration (changing the uniqueness constraint on categorization records)?
Given the logic changes around handling confidence scores in the term model, do we need any changes or additions to the tests for that method? I see the new tests around categorizations being created, but nothing addressing the ways that the scores are calculated.
By assigning confidence values in code for all Suggested* records at 0.9, we're now putting these values in two places - as database records on the DetectorCategories model, and in code for Suggested* records. I'm slightly concerned at storing scores like this in two different places, and at two levels of detail (here we're assigning a blanket score for each Suggested* detector across all categories, while other Detectors have the option of getting scores specific to each category)

matt-bernhardt · 2025-04-25T18:56:14Z

db/migrate/20250423125733_update_categorization_indexes.rb

+    remove_index :categorizations, [:term_id, :category_id, :detector_version]
+    remove_index :categorizations, [:category_id, :term_id, :detector_version]
+
+    add_index :categorizations, [:term_id, :category_id, :confidence, :detector_version], unique: true


Making sure that I'm following along...

This change is made because the previous uniqueness constraint on the combination of Terms and Categories is now too restrictive, because a Term can end up being joined to a category multiple times (via a traditional Detector and a SuggestedResource for example?)

I'm asking because I had thought we resolved the concern about multiple potential linkages between Terms and Categories by calculating an overall confidence value across all detections in the calculate_categorizations Term method, and I'm trying to make sure I understand what the gap is that this closes.

I might be misunderstanding what's going on here, however. I see the side effect note about there being an edge case that prevents categorizations from being created, and I'm trying to understand what that is.

Because confidence was not part of the unique index previously, we'd get a state in which there was no record but we could not create one. Changing confidence should not require triggering a DETECTOR_VERSION update as it is not a Detector, it's just our interpretation of what a Detection means. No code change in a detector should not require a version bump is another way to look at this.

I'm still not sure I totally understand this, but I think I follow it enough not to want to block the PR based on this. If something crystallizes for me down the road and I think there's a problem, we can work it out then.

Also feel free to bring it up in a 1:1 at some point as it may be easier to talk through synchronously :)

test/models/term_test.rb

app/models/term.rb

JPrevost · 2025-04-29T13:55:54Z

Re your question 3: I didn't like storing scores in multiple places either, and until you wrote it like that I didn't realize we could just add a confidence field to each SuggestedPattern/SuggestedResource. It's a bit awkward because it feels unlikely to change across each Pattern/Resource but maybe that can be addressed by setting a default value which would both make this consistent with other Detectors and allow for saying "this one particular SuggestedResource is pretty sketch and should be low confidence".

Why are these changes being introduced: * Storing values in the models rather than in the code better aligns with confidence from detectors Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-91 How does this address that need: * Adds confidence fields for SuggestedPattern and SuggestedResources Document any side effects to this change: * It feels clunky to assign confidence to each Resource or Pattern. From a data model perspective if feels okay. From a practical standpoint I'm nervous the people managing Resources and Patterns may not know what value is appropriate.

JPrevost · 2025-04-30T19:39:14Z

@matt-bernhardt I've added a commit that moves confidence into the models. This allows us to control on a per record level, but is defaulted to .90 (which is the value I had initially in the code). If no category is assigned, a resource is not categorized and the confidence (regardless of what it is set at) does nothing. I find this a bit awkward, but I'm not sure I have a different idea at this time.

matt-bernhardt · 2025-05-05T17:18:06Z

I like this arrangement better, particularly the consistency in handling confidence values outside of code. Having a starting point defined via default value (which we expect to rely heavily on) is fine too.

I agree that there's an awkwardness about what happens in a category isn't defined, but I see the tests to confirm that we know how that will be handled should it arise.

Thanks -

matt-bernhardt

Thanks - this looks good to me.

JPrevost requested review from jazairi and matt-bernhardt April 23, 2025 17:51

mitlib temporarily deployed to tacos-api-pipeline-pr-213 April 23, 2025 17:51 Inactive

JPrevost temporarily deployed to tacos-api-pipeline-pr-213 April 23, 2025 17:59 Inactive

JPrevost temporarily deployed to tacos-api-pipeline-pr-213 April 23, 2025 18:00 Inactive

JPrevost force-pushed the record-level-categorizations branch from fadc2b4 to c1d50ec Compare April 24, 2025 12:18

JPrevost temporarily deployed to tacos-api-pipeline-pr-213 April 24, 2025 12:18 Inactive

matt-bernhardt self-assigned this Apr 24, 2025

matt-bernhardt reviewed Apr 25, 2025

View reviewed changes

JPrevost added 2 commits April 29, 2025 16:22

Fixup test names

4d5ed34

JPrevost requested a review from matt-bernhardt April 30, 2025 19:37

matt-bernhardt approved these changes May 5, 2025

View reviewed changes

JPrevost merged commit abe9015 into main May 5, 2025
6 checks passed

JPrevost deleted the record-level-categorizations branch May 5, 2025 17:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds categories to SuggestedResources and Patterns #213

Adds categories to SuggestedResources and Patterns #213

Uh oh!

JPrevost commented Apr 23, 2025 •

edited

Loading

Uh oh!

matt-bernhardt left a comment

Uh oh!

matt-bernhardt Apr 25, 2025

Uh oh!

JPrevost Apr 29, 2025

Uh oh!

matt-bernhardt May 5, 2025

Uh oh!

JPrevost May 5, 2025

Uh oh!

Uh oh!

Uh oh!

JPrevost commented Apr 29, 2025

Uh oh!

JPrevost commented Apr 30, 2025

Uh oh!

matt-bernhardt commented May 5, 2025

Uh oh!

matt-bernhardt left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Adds categories to SuggestedResources and Patterns #213

Adds categories to SuggestedResources and Patterns #213

Uh oh!

Conversation

JPrevost commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Developer

Ticket(s)

Accessibility

Documentation

ENV

Stakeholders

Dependencies and migrations

Reviewer

Code

Documentation

Testing

Uh oh!

matt-bernhardt left a comment

Choose a reason for hiding this comment

Uh oh!

matt-bernhardt Apr 25, 2025

Choose a reason for hiding this comment

Uh oh!

JPrevost Apr 29, 2025

Choose a reason for hiding this comment

Uh oh!

matt-bernhardt May 5, 2025

Choose a reason for hiding this comment

Uh oh!

JPrevost May 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JPrevost commented Apr 29, 2025

Uh oh!

JPrevost commented Apr 30, 2025

Uh oh!

matt-bernhardt commented May 5, 2025

Uh oh!

matt-bernhardt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JPrevost commented Apr 23, 2025 •

edited

Loading