Skip to content

Conversation

@JPrevost
Copy link
Member

@JPrevost JPrevost commented Apr 23, 2025

Why are these changes being introduced:

  • Expand the categorization process to account for matches by the SuggestedResource and SuggestedPattern Detectors

Relevant ticket(s):

How does this address that need:

  • Adds ActiveRecord associations to Category for SuggestedPattern
  • Adds ActiveRecord associations to Category for SuggestedResource
  • Adds limited Dashboard for managing SuggestedResources. Only allows editing of Category to prevent extremely slow loading and poor UI if we allow editing linked Terms (i.e. we'll need custom UI eventually but for now will manage Terms for SuggestedResources in the console)
  • Adds Dashboard for SuggestedPatterns
  • Adds logic in Term model to add confidence scores for SuggestedPattern and SuggestedResource which are quite different from other Detectors so they could not follow the same model (they work on a per record basis rather than a Detector basis)

Document any side effects to this change:

  • Changes unique index on Categorization to include confidence in order to address an edge case that prevented new Categorizations to be made even when one was not found

Summary of changes (please refer to commit messages for full details)

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-###

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-213 April 23, 2025 17:51 Inactive
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-213 April 23, 2025 17:59 Inactive
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-213 April 23, 2025 18:00 Inactive
Why are these changes being introduced:

* Expand the categorization process to account for matches by the
  SuggestedResource and SuggestedPattern Detectors

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-91
* https://mitlibraries.atlassian.net/browse/TCO-145 (partial)

How does this address that need:

* Adds ActiveRecord associations to Category for SuggestedPattern
* Adds ActiveRecord associations to Category for SuggestedResource
* Adds limited Dashboard for managing SuggestedResources. Only allows
  editing of Category to prevent extremely slow loading and poor UI
  if we allow editing linked Terms (i.e. we'll need custom UI eventually
  but for now will manage Terms for SuggestedResources in the console)
* Adds Dashboard for SuggestedPatterns
* Adds logic in Term model to add confidence scores for SuggestedPattern
  and SuggestedResource which are quite different from other Detectors
  so they could not follow the same model (they work on a per record
  basis rather than a Detector basis)

Document any side effects to this change:

* Changes unique index on Categorization to include confidence in order
  to address an edge case that prevented new Categorizations to be made
  even when one was not found
@JPrevost JPrevost force-pushed the record-level-categorizations branch from fadc2b4 to c1d50ec Compare April 24, 2025 12:18
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-213 April 24, 2025 12:18 Inactive
@matt-bernhardt matt-bernhardt self-assigned this Apr 24, 2025
Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've understood almost all of what's going on here, and it seems generally okay. I've got three questions, but am not sure that they rise to the level of requesting changes.

  1. Can you explain a bit more about the edge case which motivates the second migration (changing the uniqueness constraint on categorization records)?
  2. Given the logic changes around handling confidence scores in the term model, do we need any changes or additions to the tests for that method? I see the new tests around categorizations being created, but nothing addressing the ways that the scores are calculated.
  3. By assigning confidence values in code for all Suggested* records at 0.9, we're now putting these values in two places - as database records on the DetectorCategories model, and in code for Suggested* records. I'm slightly concerned at storing scores like this in two different places, and at two levels of detail (here we're assigning a blanket score for each Suggested* detector across all categories, while other Detectors have the option of getting scores specific to each category)

remove_index :categorizations, [:term_id, :category_id, :detector_version]
remove_index :categorizations, [:category_id, :term_id, :detector_version]

add_index :categorizations, [:term_id, :category_id, :confidence, :detector_version], unique: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure that I'm following along...

This change is made because the previous uniqueness constraint on the combination of Terms and Categories is now too restrictive, because a Term can end up being joined to a category multiple times (via a traditional Detector and a SuggestedResource for example?)

I'm asking because I had thought we resolved the concern about multiple potential linkages between Terms and Categories by calculating an overall confidence value across all detections in the calculate_categorizations Term method, and I'm trying to make sure I understand what the gap is that this closes.


I might be misunderstanding what's going on here, however. I see the side effect note about there being an edge case that prevents categorizations from being created, and I'm trying to understand what that is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because confidence was not part of the unique index previously, we'd get a state in which there was no record but we could not create one. Changing confidence should not require triggering a DETECTOR_VERSION update as it is not a Detector, it's just our interpretation of what a Detection means. No code change in a detector should not require a version bump is another way to look at this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure I totally understand this, but I think I follow it enough not to want to block the PR based on this. If something crystallizes for me down the road and I think there's a problem, we can work it out then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feel free to bring it up in a 1:1 at some point as it may be easier to talk through synchronously :)

@JPrevost
Copy link
Member Author

Re your question 3: I didn't like storing scores in multiple places either, and until you wrote it like that I didn't realize we could just add a confidence field to each SuggestedPattern/SuggestedResource. It's a bit awkward because it feels unlikely to change across each Pattern/Resource but maybe that can be addressed by setting a default value which would both make this consistent with other Detectors and allow for saying "this one particular SuggestedResource is pretty sketch and should be low confidence".

Why are these changes being introduced:

* Storing values in the models rather than in the code better aligns
  with confidence from detectors

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-91

How does this address that need:

* Adds confidence fields for SuggestedPattern and SuggestedResources

Document any side effects to this change:

* It feels clunky to assign confidence to each Resource or Pattern. From
  a data model perspective if feels okay. From a practical standpoint
  I'm nervous the people managing Resources and Patterns may not know
  what value is appropriate.
@JPrevost JPrevost requested a review from matt-bernhardt April 30, 2025 19:37
@JPrevost
Copy link
Member Author

@matt-bernhardt I've added a commit that moves confidence into the models. This allows us to control on a per record level, but is defaulted to .90 (which is the value I had initially in the code). If no category is assigned, a resource is not categorized and the confidence (regardless of what it is set at) does nothing. I find this a bit awkward, but I'm not sure I have a different idea at this time.

@matt-bernhardt
Copy link
Member

I like this arrangement better, particularly the consistency in handling confidence values outside of code. Having a starting point defined via default value (which we expect to rely heavily on) is fine too.

I agree that there's an awkwardness about what happens in a category isn't defined, but I see the tests to confirm that we know how that will be handled should it arise.

Thanks - :shipit:

Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - this looks good to me.

:shipit:

@JPrevost JPrevost merged commit abe9015 into main May 5, 2025
6 checks passed
@JPrevost JPrevost deleted the record-level-categorizations branch May 5, 2025 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants