-
Notifications
You must be signed in to change notification settings - Fork 0
Add term features and counts to GraphQL and models #264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why are these changes being introduced: * To support new term-related features and counts in the GraphQL API and backend models. * To allow Notebooks to access Tacos extracted features directly via GraphQL. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-163 How does this address that need: * Adds `CountsType` and `FeaturesType` GraphQL types. * Updates `SearchEventType` and `TermType` to expose new fields. * Enhances `detector` models (`citation`, `journal`, `standard_identifiers`) to return data * Adds `.features` method to Term model Document any side effects to this change: * Updates development environment config and test fixtures to be less verbose * StandardIdentifer based features might be appropriate to be boolean instead of a string to reflect typical features in machine learning. However, I wanted to allow the team to see this initial implementation and provide feedback before making that decision.
matt-bernhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the direction this is heading, but I think there are some changes needed before merging.
- The biggest issue is that adding a
quotation_markselement to Detector::Citation will break the integration with the lambda, unless we also extend theextract_featuresmethod in Detector::MlCitation to strip it out (the same way we drop the characters feature). My preference would be to leave this change out of this PR, and solely focus on the GraphQL changes - but I'm open to being convinced that we should add it here. - There are some tweaks to the GraphQL Types that I think are needed to match the structure being defined in the Term.feature method.
- I think the Detector::MlCitation class needs its .record() method updated in a similar fashion to the other three classes - without this, I don't think we're ever going to get a non-nil response back from the GraphQL endpoint.
Beyond the requested changes above, I also would like to consider whether we need tests for the various detector .record() methods that are returning something real. I don't feel so strongly about this that it's a requested change however.
There are a handful of other questions in the details, but I think what I've written above covers the most important bits. In general, though, I think this is a good change to be working toward, and I'm excited to see it take shape.
| ) | ||
|
|
||
| nil | ||
| result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: This comment applies equally to all three detector subclasses - it looks like we're able to change these returns from the .record() methods, going from returning nil to returning a value, and no tests need to be updated? This makes sense when I stop and think that the old approach just returned nil in all cases, so there was no behavior to test - but it feels a bit strange to me now that there's something coming back and we don't have a test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have a test for it, but admittedly they are arguably side effects in that we test the .feature gets populated as expected which comes from all of these new returns. I'll think about whether explicit tests make sense or not.
app/models/detector/citation.rb
Outdated
| lastnames: /[A-Z][a-z]+[.,]/, | ||
| quotes: /".*?"/ | ||
| quotes: /".*?"/, | ||
| quotation_marks: /".*"/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested change: This appears to break integration with the lambda, as it results in an unexpected feature being presented in the payload. If I delete the existing cassettes and re-run the test suite (with the lambda running locally), I see error messages in the cassettes: Additional properties are not allowed (''quotation_marks'' was unexpected). I see the same behavior when I hook up the review app to the Dev1 lambda.
This is an aside, but we might need to tweak our test setup to alert that cassettes need regenerated (probably including consideration of the payload and not just the URL being targeted).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeating what I said in slack here for the record. This was not intended to make this PR. Will remove it an open a ticket for us to consider adding it. I really want a "any types of quotes" rollup rather than two separate checks.
app/models/detector/citation.rb
Outdated
| def self.record(term) | ||
| cit = Detector::Citation.new(term.phrase) | ||
| return unless cit.detection? | ||
| return cit.features unless cit.detection? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Seeing that the guard clause returns the same thing as the ultimate return value makes me wonder whether the guard clause is needed. Is it preferable to repeat a return statement in multiple places, or to wrap the Detection.find_or_create_by block in a conditional of some sort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This felt a bit ick to me but I didn't give it enough thought on how to avoid it. I'll take a closer look.
| field :isbn, String, null: true | ||
| field :issn, String, null: true | ||
| field :journal, String, null: true | ||
| field :ml_citation, Boolean, null: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether I'm getting the expected value back from this in GraphQL or not. When I try submitting the search term from the "citation with all the things" fixture, I see everything come back except a null for the mlCitation value (and the journal value, but I don't have those populated locally).
Looking a bit more, I think the null is happening because the Detector::MlCitation class' .record method is still set to return nil in this PR - so if we extend the pattern you're using here for citation to ml_citation, that might resolve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I never testing the ml_citation thing as I was too lazy to figure out how to actually run that locally. I'll take a look to see how to make it work as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me...
test/models/term_test.rb
Outdated
| assert_not_nil features[:isbn] | ||
| assert_not_nil features[:issn] | ||
| assert_nil features[:journal] | ||
| assert_nil features[:ml_citation] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like we need a test to cover the expected cases where this value is not nil, and I don't see one here? Or should this assertion instead be assert_not_nil ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a commit to add the test referenced here, which we reviewed in our discussion earlier today.
This was not intended to be in this changeset
Also adds descriptions
|
@matt-bernhardt I've made some changes based on your review (solid feedback thanks!). I'm not entirely confident how to test the ml_citation populates and was hoping you could help me understand how to test that. (Slack/zoom?) I'm also no longer fond of the self.record method names for all of our Detectors now that those methods also return extracted features. . I'm proposing we leave the poor name in place for now and sit with it for a bit, but am open to changes. |
|
|
||
| module Types | ||
| class CountsType < Types::BaseObject | ||
| description 'Features extracted from the input term that are counts. Useful for machine learning.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add LCSH counts ticket to return to later
This makes a handful of changes that Jeremy and I discussed while we met to discuss the PR: - Updating the return from Detector::MlCitation.record method to focus on the boolean, rather than the Detector::MlCitation instance. - Adding a test to the Term suite to focus on how a citation will cause the features[:ml_citation] element to be not nil - Wrapping the test of all extracted features values in a block to enable the citation detector lambda, and a cassette to capture its response. - Adding the cassette referenced above
matt-bernhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the conversation earlier today about this PR. Your latest round of changes resolved a lot of my concerns, and the commit I just pushed should wrap up the last remaining loose ends that we reviewed.
From my perspective, this PR should now be ready to merge - but please check my work on the last commit as well.
| field :pages, Integer, null: true | ||
| field :lastnames, Integer, null: true, description: 'Count of lastnames in the input. Not recommended for use in production.' | ||
| field :no, Integer, null: true, description: 'Count of `no` in the input' | ||
| field :pages, Integer, null: true, description: 'Count of `pages` in the input' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature isn't looking for pages but page ranges a la 143-252. I'm not sure whether that distinction is worth changing this sentence for, however?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to eventually make these docs more transparent as to what the actual detector is doing but I couldn't think of a way to do that in a sustainable way.
| field :isbn, String, null: true | ||
| field :issn, String, null: true | ||
| field :journal, String, null: true | ||
| field :ml_citation, Boolean, null: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me...
app/models/detector/ml_citation.rb
Outdated
| ) | ||
| end | ||
|
|
||
| result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result at this moment isn't a boolean, but an instance of Detector::MlCitation. This seems to be causing problems later in the application's response processing, such that the GraphQL endpoint is always returning True for the ml_citation result.
It feels like the better response would be result.detections (although the plural feels awkward to me in this context)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a commit to make this change after our discussion earlier today.
| Detector::Citation.record(self) | ||
| Detector::StandardIdentifiers.record(self) | ||
| Detector::Journal.record(self) | ||
| @ml_citation = Detector::MlCitation.record(self) if Detector::MlCitation.expected_env? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine, with the caveat that the response from Detector::MlCitation.record(self) is currently not a boolean but an instance of Detector::MlCitation it seems. I think the right place to tweak is within that class (see comment above), so what's here should be fine.
test/models/term_test.rb
Outdated
| assert_not_nil features[:isbn] | ||
| assert_not_nil features[:issn] | ||
| assert_nil features[:journal] | ||
| assert_nil features[:ml_citation] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a commit to add the test referenced here, which we reviewed in our discussion earlier today.
|
@matt-bernhardt thanks for the patient review, code fix, and extra tests! |
Suspect IssuesThis pull request was deployed and Sentry observed the following issues:
Did you find this useful? React with a 👍 or 👎 |
Why are these changes being introduced:
Relevant ticket(s):
How does this address that need:
CountsTypeandFeaturesTypeGraphQL types.SearchEventTypeandTermTypeto expose new fields.detectormodels (citation,journal,standard_identifiers) to return data.featuresmethod to Term modelDocument any side effects to this change:
Summary of changes (please refer to commit messages for full details)
Developer
Ticket(s)
https://mitlibraries.atlassian.net/browse/TCO-###
Accessibility
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
Documentation
ENV
Stakeholders
Dependencies and migrations
NO dependencies are updated
NO migrations are included
Reviewer
Code
added technical debt.
Documentation
(not just this pull request message).
Testing