Skip to content

Conversation

@JPrevost
Copy link
Member

@JPrevost JPrevost commented Jul 23, 2025

Why are these changes being introduced:

  • To support new term-related features and counts in the GraphQL API and backend models.
  • To allow Notebooks to access Tacos extracted features directly via GraphQL.

Relevant ticket(s):

How does this address that need:

  • Adds CountsType and FeaturesType GraphQL types.
  • Updates SearchEventType and TermType to expose new fields.
  • Enhances detector models (citation, journal, standard_identifiers) to return data
  • Adds .features method to Term model

Document any side effects to this change:

  • Updates development environment config and test fixtures to be less verbose
  • StandardIdentifer based features might be appropriate to be boolean instead of a string to reflect typical features in machine learning. However, I wanted to allow the team to see this initial implementation and provide feedback before making that decision.

Summary of changes (please refer to commit messages for full details)

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-###

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

Why are these changes being introduced:

* To support new term-related features and counts in the GraphQL API and backend models.
* To allow Notebooks to access Tacos extracted features directly via GraphQL.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-163

How does this address that need:

* Adds `CountsType` and `FeaturesType` GraphQL types.
* Updates `SearchEventType` and `TermType` to expose new fields.
* Enhances `detector` models (`citation`, `journal`, `standard_identifiers`)
  to return data
* Adds `.features` method to Term model

Document any side effects to this change:

* Updates development environment config and test fixtures to be less
  verbose
* StandardIdentifer based features might be appropriate to be boolean
  instead of a string to reflect typical features in machine learning.
  However, I wanted to allow the team to see this initial implementation
  and provide feedback before making that decision.
@mitlib mitlib temporarily deployed to tacos-api-pi-tco-163-te-saosbk July 23, 2025 17:37 Inactive
@JPrevost JPrevost closed this Jul 23, 2025
@JPrevost JPrevost reopened this Jul 23, 2025
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-264 July 23, 2025 17:52 Inactive
@matt-bernhardt matt-bernhardt self-assigned this Jul 23, 2025
Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction this is heading, but I think there are some changes needed before merging.

  • The biggest issue is that adding a quotation_marks element to Detector::Citation will break the integration with the lambda, unless we also extend the extract_features method in Detector::MlCitation to strip it out (the same way we drop the characters feature). My preference would be to leave this change out of this PR, and solely focus on the GraphQL changes - but I'm open to being convinced that we should add it here.
  • There are some tweaks to the GraphQL Types that I think are needed to match the structure being defined in the Term.feature method.
  • I think the Detector::MlCitation class needs its .record() method updated in a similar fashion to the other three classes - without this, I don't think we're ever going to get a non-nil response back from the GraphQL endpoint.

Beyond the requested changes above, I also would like to consider whether we need tests for the various detector .record() methods that are returning something real. I don't feel so strongly about this that it's a requested change however.

There are a handful of other questions in the details, but I think what I've written above covers the most important bits. In general, though, I think this is a good change to be working toward, and I'm excited to see it take shape.

)

nil
result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: This comment applies equally to all three detector subclasses - it looks like we're able to change these returns from the .record() methods, going from returning nil to returning a value, and no tests need to be updated? This makes sense when I stop and think that the old approach just returned nil in all cases, so there was no behavior to test - but it feels a bit strange to me now that there's something coming back and we don't have a test for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a test for it, but admittedly they are arguably side effects in that we test the .feature gets populated as expected which comes from all of these new returns. I'll think about whether explicit tests make sense or not.

lastnames: /[A-Z][a-z]+[.,]/,
quotes: /".*?"/
quotes: /".*?"/,
quotation_marks: /".*"/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested change: This appears to break integration with the lambda, as it results in an unexpected feature being presented in the payload. If I delete the existing cassettes and re-run the test suite (with the lambda running locally), I see error messages in the cassettes: Additional properties are not allowed (''quotation_marks'' was unexpected). I see the same behavior when I hook up the review app to the Dev1 lambda.

This is an aside, but we might need to tweak our test setup to alert that cassettes need regenerated (probably including consideration of the payload and not just the URL being targeted).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeating what I said in slack here for the record. This was not intended to make this PR. Will remove it an open a ticket for us to consider adding it. I really want a "any types of quotes" rollup rather than two separate checks.

def self.record(term)
cit = Detector::Citation.new(term.phrase)
return unless cit.detection?
return cit.features unless cit.detection?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Seeing that the guard clause returns the same thing as the ultimate return value makes me wonder whether the guard clause is needed. Is it preferable to repeat a return statement in multiple places, or to wrap the Detection.find_or_create_by block in a conditional of some sort?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This felt a bit ick to me but I didn't give it enough thought on how to avoid it. I'll take a closer look.

field :isbn, String, null: true
field :issn, String, null: true
field :journal, String, null: true
field :ml_citation, Boolean, null: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether I'm getting the expected value back from this in GraphQL or not. When I try submitting the search term from the "citation with all the things" fixture, I see everything come back except a null for the mlCitation value (and the journal value, but I don't have those populated locally).

Looking a bit more, I think the null is happening because the Detector::MlCitation class' .record method is still set to return nil in this PR - so if we extend the pattern you're using here for citation to ml_citation, that might resolve this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I never testing the ml_citation thing as I was too lazy to figure out how to actually run that locally. I'll take a look to see how to make it work as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me...

assert_not_nil features[:isbn]
assert_not_nil features[:issn]
assert_nil features[:journal]
assert_nil features[:ml_citation]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like we need a test to cover the expected cases where this value is not nil, and I don't see one here? Or should this assertion instead be assert_not_nil ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit to add the test referenced here, which we reviewed in our discussion earlier today.

@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 13:30 Inactive
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 13:58 Inactive
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 14:12 Inactive
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-264 July 28, 2025 14:14 Inactive
@JPrevost
Copy link
Member Author

@matt-bernhardt I've made some changes based on your review (solid feedback thanks!). I'm not entirely confident how to test the ml_citation populates and was hoping you could help me understand how to test that. (Slack/zoom?)

I'm also no longer fond of the self.record method names for all of our Detectors now that those methods also return extracted features. . I'm proposing we leave the poor name in place for now and sit with it for a bit, but am open to changes.


module Types
class CountsType < Types::BaseObject
description 'Features extracted from the input term that are counts. Useful for machine learning.'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add LCSH counts ticket to return to later

This makes a handful of changes that Jeremy and I discussed while we
met to discuss the PR:

- Updating the return from Detector::MlCitation.record method to focus
  on the boolean, rather than the Detector::MlCitation instance.

- Adding a test to the Term suite to focus on how a citation will cause
  the features[:ml_citation] element to be not nil

- Wrapping the test of all extracted features values in a block to
  enable the citation detector lambda, and a cassette to capture its
  response.

- Adding the cassette referenced above
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-264 July 30, 2025 17:28 Inactive
Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the conversation earlier today about this PR. Your latest round of changes resolved a lot of my concerns, and the commit I just pushed should wrap up the last remaining loose ends that we reviewed.

From my perspective, this PR should now be ready to merge - but please check my work on the last commit as well.

field :pages, Integer, null: true
field :lastnames, Integer, null: true, description: 'Count of lastnames in the input. Not recommended for use in production.'
field :no, Integer, null: true, description: 'Count of `no` in the input'
field :pages, Integer, null: true, description: 'Count of `pages` in the input'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature isn't looking for pages but page ranges a la 143-252. I'm not sure whether that distinction is worth changing this sentence for, however?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to eventually make these docs more transparent as to what the actual detector is doing but I couldn't think of a way to do that in a sustainable way.

field :isbn, String, null: true
field :issn, String, null: true
field :journal, String, null: true
field :ml_citation, Boolean, null: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me...

)
end

result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result at this moment isn't a boolean, but an instance of Detector::MlCitation. This seems to be causing problems later in the application's response processing, such that the GraphQL endpoint is always returning True for the ml_citation result.

It feels like the better response would be result.detections (although the plural feels awkward to me in this context)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit to make this change after our discussion earlier today.

Detector::Citation.record(self)
Detector::StandardIdentifiers.record(self)
Detector::Journal.record(self)
@ml_citation = Detector::MlCitation.record(self) if Detector::MlCitation.expected_env?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine, with the caveat that the response from Detector::MlCitation.record(self) is currently not a boolean but an instance of Detector::MlCitation it seems. I think the right place to tweak is within that class (see comment above), so what's here should be fine.

assert_not_nil features[:isbn]
assert_not_nil features[:issn]
assert_nil features[:journal]
assert_nil features[:ml_citation]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit to add the test referenced here, which we reviewed in our discussion earlier today.

@JPrevost
Copy link
Member Author

@matt-bernhardt thanks for the patient review, code fix, and extra tests!

@JPrevost JPrevost merged commit 1b0a034 into main Jul 30, 2025
5 checks passed
@JPrevost JPrevost deleted the tco-163-term-features branch July 30, 2025 18:55
@sentry
Copy link

sentry bot commented Aug 5, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants