Add term features and counts to GraphQL and models #264

JPrevost · 2025-07-23T17:35:59Z

Why are these changes being introduced:

To support new term-related features and counts in the GraphQL API and backend models.
To allow Notebooks to access Tacos extracted features directly via GraphQL.

Relevant ticket(s):

https://mitlibraries.atlassian.net/browse/TCO-163

How does this address that need:

Adds CountsType and FeaturesType GraphQL types.
Updates SearchEventType and TermType to expose new fields.
Enhances detector models (citation, journal, standard_identifiers) to return data
Adds .features method to Term model

Document any side effects to this change:

Updates development environment config and test fixtures to be less verbose
StandardIdentifer based features might be appropriate to be boolean instead of a string to reflect typical features in machine learning. However, I wanted to allow the team to see this initial implementation and provide feedback before making that decision.

Summary of changes (please refer to commit messages for full details)

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-###

Accessibility

ANDI or Wave has been run in accordance to our guide and
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
There are no accessibility implications to this change

Documentation

Project documentation has been updated, and yard output previewed
No documentation changes are needed

ENV

All new ENV is documented in README.
All new ENV has been added to Heroku Pipeline, Staging and Prod.
ENV has not changed.

Stakeholders

Stakeholder approval has been confirmed
Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

I have confirmed that the code works as intended.
Any CodeClimate issues have been fixed or confirmed as
added technical debt.

Documentation

The commit message is clear and follows our guidelines
(not just this pull request message).
The documentation has been updated or is unnecessary.
New dependencies are appropriate or there were no changes.

Testing

There are appropriate tests covering any new functionality.
No additional test coverage is required.

Why are these changes being introduced: * To support new term-related features and counts in the GraphQL API and backend models. * To allow Notebooks to access Tacos extracted features directly via GraphQL. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-163 How does this address that need: * Adds `CountsType` and `FeaturesType` GraphQL types. * Updates `SearchEventType` and `TermType` to expose new fields. * Enhances `detector` models (`citation`, `journal`, `standard_identifiers`) to return data * Adds `.features` method to Term model Document any side effects to this change: * Updates development environment config and test fixtures to be less verbose * StandardIdentifer based features might be appropriate to be boolean instead of a string to reflect typical features in machine learning. However, I wanted to allow the team to see this initial implementation and provide feedback before making that decision.

matt-bernhardt

I like the direction this is heading, but I think there are some changes needed before merging.

The biggest issue is that adding a quotation_marks element to Detector::Citation will break the integration with the lambda, unless we also extend the extract_features method in Detector::MlCitation to strip it out (the same way we drop the characters feature). My preference would be to leave this change out of this PR, and solely focus on the GraphQL changes - but I'm open to being convinced that we should add it here.
There are some tweaks to the GraphQL Types that I think are needed to match the structure being defined in the Term.feature method.
I think the Detector::MlCitation class needs its .record() method updated in a similar fashion to the other three classes - without this, I don't think we're ever going to get a non-nil response back from the GraphQL endpoint.

Beyond the requested changes above, I also would like to consider whether we need tests for the various detector .record() methods that are returning something real. I don't feel so strongly about this that it's a requested change however.

There are a handful of other questions in the details, but I think what I've written above covers the most important bits. In general, though, I think this is a good change to be working toward, and I'm excited to see it take shape.

matt-bernhardt · 2025-07-23T20:53:58Z

app/models/detector/journal.rb

      )

-      nil
+      result


Question: This comment applies equally to all three detector subclasses - it looks like we're able to change these returns from the .record() methods, going from returning nil to returning a value, and no tests need to be updated? This makes sense when I stop and think that the old approach just returned nil in all cases, so there was no behavior to test - but it feels a bit strange to me now that there's something coming back and we don't have a test for it.

We do have a test for it, but admittedly they are arguably side effects in that we test the .feature gets populated as expected which comes from all of these new returns. I'll think about whether explicit tests make sense or not.

app/models/detector/journal.rb

matt-bernhardt · 2025-07-24T14:16:04Z

app/models/detector/citation.rb

      lastnames: /[A-Z][a-z]+[.,]/,
-      quotes: /&quot;.*?&quot;/
+      quotes: /&quot;.*?&quot;/,
+      quotation_marks: /".*"/


Requested change: This appears to break integration with the lambda, as it results in an unexpected feature being presented in the payload. If I delete the existing cassettes and re-run the test suite (with the lambda running locally), I see error messages in the cassettes: Additional properties are not allowed (''quotation_marks'' was unexpected). I see the same behavior when I hook up the review app to the Dev1 lambda.

This is an aside, but we might need to tweak our test setup to alert that cassettes need regenerated (probably including consideration of the payload and not just the URL being targeted).

Repeating what I said in slack here for the record. This was not intended to make this PR. Will remove it an open a ticket for us to consider adding it. I really want a "any types of quotes" rollup rather than two separate checks.

matt-bernhardt · 2025-07-24T14:21:52Z

app/models/detector/citation.rb

    def self.record(term)
      cit = Detector::Citation.new(term.phrase)
-      return unless cit.detection?
+      return cit.features unless cit.detection?


Question: Seeing that the guard clause returns the same thing as the ultimate return value makes me wonder whether the guard clause is needed. Is it preferable to repeat a return statement in multiple places, or to wrap the Detection.find_or_create_by block in a conditional of some sort?

Good point. This felt a bit ick to me but I didn't give it enough thought on how to avoid it. I'll take a closer look.

app/models/detector/citation.rb

matt-bernhardt · 2025-07-24T17:31:46Z

app/graphql/types/features_type.rb

+    field :isbn, String, null: true
+    field :issn, String, null: true
+    field :journal, String, null: true
+    field :ml_citation, Boolean, null: true


I'm not sure whether I'm getting the expected value back from this in GraphQL or not. When I try submitting the search term from the "citation with all the things" fixture, I see everything come back except a null for the mlCitation value (and the journal value, but I don't have those populated locally).

Looking a bit more, I think the null is happening because the Detector::MlCitation class' .record method is still set to return nil in this PR - so if we extend the pattern you're using here for citation to ml_citation, that might resolve this.

Yeah, I never testing the ml_citation thing as I was too lazy to figure out how to actually run that locally. I'll take a look to see how to make it work as expected.

I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me...

app/graphql/types/features_type.rb

app/models/term.rb

test/models/term_test.rb

matt-bernhardt · 2025-07-24T17:53:36Z

test/models/term_test.rb

+    assert_not_nil features[:isbn]
+    assert_not_nil features[:issn]
+    assert_nil features[:journal]
+    assert_nil features[:ml_citation]


It feels like we need a test to cover the expected cases where this value is not nil, and I don't see one here? Or should this assertion instead be assert_not_nil ?

I've pushed a commit to add the test referenced here, which we reviewed in our discussion earlier today.

This was not intended to be in this changeset

Also adds descriptions

JPrevost · 2025-07-28T14:18:22Z

@matt-bernhardt I've made some changes based on your review (solid feedback thanks!). I'm not entirely confident how to test the ml_citation populates and was hoping you could help me understand how to test that. (Slack/zoom?)

I'm also no longer fond of the self.record method names for all of our Detectors now that those methods also return extracted features. . I'm proposing we leave the poor name in place for now and sit with it for a bit, but am open to changes.

JPrevost · 2025-07-30T15:28:04Z

app/graphql/types/counts_type.rb

+
+module Types
+  class CountsType < Types::BaseObject
+    description 'Features extracted from the input term that are counts. Useful for machine learning.'


Add LCSH counts ticket to return to later

This makes a handful of changes that Jeremy and I discussed while we met to discuss the PR: - Updating the return from Detector::MlCitation.record method to focus on the boolean, rather than the Detector::MlCitation instance. - Adding a test to the Term suite to focus on how a citation will cause the features[:ml_citation] element to be not nil - Wrapping the test of all extracted features values in a block to enable the citation detector lambda, and a cassette to capture its response. - Adding the cassette referenced above

matt-bernhardt

Thanks for the conversation earlier today about this PR. Your latest round of changes resolved a lot of my concerns, and the commit I just pushed should wrap up the last remaining loose ends that we reviewed.

From my perspective, this PR should now be ready to merge - but please check my work on the last commit as well.

matt-bernhardt · 2025-07-30T14:11:46Z

app/graphql/types/counts_type.rb

-    field :pages, Integer, null: true
+    field :lastnames, Integer, null: true, description: 'Count of lastnames in the input. Not recommended for use in production.'
+    field :no, Integer, null: true, description: 'Count of `no` in the input'
+    field :pages, Integer, null: true, description: 'Count of `pages` in the input'


This feature isn't looking for pages but page ranges a la 143-252. I'm not sure whether that distinction is worth changing this sentence for, however?

I'd like to eventually make these docs more transparent as to what the actual detector is doing but I couldn't think of a way to do that in a sustainable way.

matt-bernhardt · 2025-07-30T14:50:10Z

app/graphql/types/features_type.rb

+    field :isbn, String, null: true
+    field :issn, String, null: true
+    field :journal, String, null: true
+    field :ml_citation, Boolean, null: true


I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me...

matt-bernhardt · 2025-07-30T14:59:04Z

app/models/detector/ml_citation.rb

+        )
+      end
+
+      result


result at this moment isn't a boolean, but an instance of Detector::MlCitation. This seems to be causing problems later in the application's response processing, such that the GraphQL endpoint is always returning True for the ml_citation result.

It feels like the better response would be result.detections (although the plural feels awkward to me in this context)

I've pushed a commit to make this change after our discussion earlier today.

matt-bernhardt · 2025-07-30T15:00:32Z

app/models/term.rb

-    Detector::Citation.record(self)
-    Detector::StandardIdentifiers.record(self)
-    Detector::Journal.record(self)
+    @ml_citation = Detector::MlCitation.record(self) if Detector::MlCitation.expected_env?


This seems fine, with the caveat that the response from Detector::MlCitation.record(self) is currently not a boolean but an instance of Detector::MlCitation it seems. I think the right place to tweak is within that class (see comment above), so what's here should be fine.

matt-bernhardt · 2025-07-30T17:28:45Z

test/models/term_test.rb

+    assert_not_nil features[:isbn]
+    assert_not_nil features[:issn]
+    assert_nil features[:journal]
+    assert_nil features[:ml_citation]


I've pushed a commit to add the test referenced here, which we reviewed in our discussion earlier today.

JPrevost · 2025-07-30T18:01:06Z

@matt-bernhardt thanks for the patient review, code fix, and extra tests!

sentry · 2025-08-05T08:25:27Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ Non-10 or 13 digit sequence detected as ISBN in tacos-prod

_{Did you find this useful? React with a 👍 or 👎}

mitlib temporarily deployed to tacos-api-pi-tco-163-te-saosbk July 23, 2025 17:37 Inactive

JPrevost closed this Jul 23, 2025

JPrevost reopened this Jul 23, 2025

mitlib temporarily deployed to tacos-api-pipeline-pr-264 July 23, 2025 17:52 Inactive

JPrevost requested review from jazairi and matt-bernhardt July 23, 2025 18:08

matt-bernhardt self-assigned this Jul 23, 2025

matt-bernhardt requested changes Jul 24, 2025

View reviewed changes

JPrevost added 4 commits July 28, 2025 09:10

Removes additional quote detection

d3c49de

This was not intended to be in this changeset

Remove recursive Type unintended fields

3819f44

Also adds descriptions

Adds barcode to features type

c5a0615

Rework citation.record conditional

cc3656e

JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 13:30 Inactive

Adds tests for Citation record returning data

d037ed9

JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 13:58 Inactive

Simplify journal record conditioanal logic

23852a5

JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 14:12 Inactive

Return boolean from ml_citation record

3c54f34

JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 14:14 Inactive

JPrevost commented Jul 30, 2025

View reviewed changes

mitlib temporarily deployed to tacos-api-pipeline-pr-264 July 30, 2025 17:28 Inactive

matt-bernhardt approved these changes Jul 30, 2025

View reviewed changes

JPrevost merged commit 1b0a034 into main Jul 30, 2025
5 checks passed

JPrevost deleted the tco-163-term-features branch July 30, 2025 18:55

                     )
-                    nil
+                    result

+                      )
+                    end
+                    result

Add term features and counts to GraphQL and models #264

Add term features and counts to GraphQL and models #264

Uh oh!

Conversation

JPrevost commented Jul 23, 2025 • edited by matt-bernhardt Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Developer

Ticket(s)

Accessibility

Documentation

ENV

Stakeholders

Dependencies and migrations

Reviewer

Code

Documentation

Testing

Uh oh!

matt-bernhardt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JPrevost commented Jul 28, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matt-bernhardt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JPrevost commented Jul 30, 2025

Uh oh!

Uh oh!

sentry bot commented Aug 5, 2025

Suspect Issues

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

JPrevost commented Jul 23, 2025 •

edited by matt-bernhardt

Loading