-
Notifications
You must be signed in to change notification settings - Fork 0
Add term features and counts to GraphQL and models #264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
4380f5d
d3c49de
3819f44
c5a0615
cc3656e
d037ed9
23852a5
3c54f34
037be3d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| module Types | ||
| class CountsType < Types::BaseObject | ||
| description 'Features extracted from the input term that are counts. Useful for machine learning.' | ||
|
|
||
| field :apa_volume_issue, Integer, null: true, description: 'Count of apa volume number pattern in the input' | ||
| field :brackets, Integer, null: true, description: 'Count of brackets in the input' | ||
| field :characters, Integer, null: true, description: 'Count of characters in the input' | ||
| field :colons, Integer, null: true, description: 'Count of colons in the input' | ||
| field :commas, Integer, null: true, description: 'Count of commas in the input' | ||
| field :lastnames, Integer, null: true, description: 'Count of lastnames in the input. Not recommended for use in production.' | ||
| field :no, Integer, null: true, description: 'Count of `no` in the input' | ||
| field :pages, Integer, null: true, description: 'Count of `pages` in the input' | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This feature isn't looking for
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd like to eventually make these docs more transparent as to what the actual detector is doing but I couldn't think of a way to do that in a sustainable way. |
||
| field :periods, Integer, null: true, description: 'Count of Periods in the input' | ||
| field :pp, Integer, null: true, description: 'Count of `pp` in the input' | ||
| field :quotes, Integer, null: true, description: 'Count of " in the input' | ||
| field :semicolons, Integer, null: true, description: 'Count of Semicolons in the input' | ||
| field :vol, Integer, null: true, description: 'Count of `vol` in the input' | ||
| field :words, Integer, null: true, description: 'Count of Words in the input' | ||
| field :year_parens, Integer, null: true, description: 'Count of (year) in the input' | ||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| module Types | ||
| class FeaturesType < Types::BaseObject | ||
JPrevost marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| description 'Features extracted from the input term. Useful for machine learning.' | ||
|
|
||
| field :barcode, String, null: true | ||
| field :counts, CountsType, null: true | ||
| field :doi, String, null: true | ||
| field :isbn, String, null: true | ||
| field :issn, String, null: true | ||
| field :journal, String, null: true | ||
| field :ml_citation, Boolean, null: true | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure whether I'm getting the expected value back from this in GraphQL or not. When I try submitting the search term from the "citation with all the things" fixture, I see everything come back except a null for the mlCitation value (and the journal value, but I don't have those populated locally). Looking a bit more, I think the null is happening because the Detector::MlCitation class' .record method is still set to return nil in this PR - so if we extend the pattern you're using here for citation to ml_citation, that might resolve this.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I never testing the ml_citation thing as I was too lazy to figure out how to actually run that locally. I'll take a look to see how to make it work as expected.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think there's still a tweak needed to the Term.record_detections method, but I'll detail that there. The features_type setup looks correct to me... |
||
| field :pmid, String, null: true | ||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -45,18 +45,19 @@ def self.partial_term_match(phrase) | |
| # @note This does not care whether multiple matching journals are detected. If _any_ match is found, a Detection | ||
| # record is created. The uniqueness constraint on Detection records would make multiple detections irrelevant. | ||
| # | ||
| # @return nil | ||
| # @return [Set of Journal] A set of ActiveRecord Journal records. | ||
matt-bernhardt marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| def self.record(term) | ||
| result = full_term_match(term.phrase) | ||
| return unless result.any? | ||
|
|
||
| Detection.find_or_create_by( | ||
| term:, | ||
| detector: Detector.where(name: 'Journal').first, | ||
| detector_version: ENV.fetch('DETECTOR_VERSION', 'unset') | ||
| ) | ||
| if result.any? | ||
| Detection.find_or_create_by( | ||
| term:, | ||
| detector: Detector.where(name: 'Journal').first, | ||
| detector_version: ENV.fetch('DETECTOR_VERSION', 'unset') | ||
| ) | ||
| end | ||
|
|
||
| nil | ||
| result | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: This comment applies equally to all three detector subclasses - it looks like we're able to change these returns from the
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We do have a test for it, but admittedly they are arguably side effects in that we test the |
||
| end | ||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -48,12 +48,16 @@ def cluster | |
| # The record_detections method is the one-stop method to call every Detector's record method that is defined within | ||
| # the application. | ||
| # | ||
| # @note This method is called by the calculate_categorizations method, so it is not necessary to call it separately. | ||
| # It is also called by the features method to ensure that the latest detections are always available if | ||
| # calculate_categorizations is not called first. | ||
| # | ||
| # @return nil | ||
| def record_detections | ||
| Detector::MlCitation.record(self) if Detector::MlCitation.expected_env? | ||
| Detector::Citation.record(self) | ||
| Detector::StandardIdentifiers.record(self) | ||
| Detector::Journal.record(self) | ||
| @ml_citation = Detector::MlCitation.record(self) if Detector::MlCitation.expected_env? | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems fine, with the caveat that the response from |
||
| @citation_summary = Detector::Citation.record(self) | ||
| @std_identifiers = Detector::StandardIdentifiers.record(self) | ||
| @journal = Detector::Journal.record(self) | ||
| Detector::Lcsh.record(self) | ||
| @suggested_resource_category = Detector::SuggestedResource.record(self) | ||
| @suggested_pattern_category = Detector::SuggestedResourcePattern.record(self) | ||
|
|
@@ -91,6 +95,45 @@ def calculate_categorizations | |
| end | ||
| end | ||
|
|
||
| # Extracted features from various detectors are returned as a hash. This method is used to provide a summary of | ||
| # the Term's features, which are used for machine learning and other purposes. | ||
| # | ||
| # @return [Hash] a hash of features extracted from the Term | ||
| # | ||
| # @note This method will call record_detections if the @citation_summary is not already populated. | ||
| # This is to ensure that the features are always up-to-date with the latest detections. | ||
| def features | ||
matt-bernhardt marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| record_detections if @citation_summary.blank? | ||
| { | ||
| counts: { | ||
| apa_volume_issue: @citation_summary[:apa_volume_issue] || 0, | ||
| vol: @citation_summary[:vol] || 0, | ||
| no: @citation_summary[:no] || 0, | ||
| pages: @citation_summary[:pages] || 0, | ||
| pp: @citation_summary[:pp] || 0, | ||
| year_parens: @citation_summary[:year_parens] || 0, | ||
| brackets: @citation_summary[:brackets] || 0, | ||
| lastnames: @citation_summary[:lastnames] || 0, | ||
|
|
||
| characters: @citation_summary[:characters] || 0, | ||
| colons: @citation_summary[:colons] || 0, | ||
| commas: @citation_summary[:commas] || 0, | ||
| quotes: @citation_summary[:quotes] || 0, | ||
| periods: @citation_summary[:periods] || 0, | ||
| semicolons: @citation_summary[:semicolons] || 0, | ||
| words: @citation_summary[:words] || 0 | ||
| }, | ||
|
|
||
| barcode: @std_identifiers[:barcode], | ||
| doi: @std_identifiers[:doi], | ||
| isbn: @std_identifiers[:isbn], | ||
| issn: @std_identifiers[:issn], | ||
| pmid: @std_identifiers[:pmid], | ||
| journal: @journal&.first&.name, | ||
| ml_citation: @ml_citation | ||
| } | ||
| end | ||
|
|
||
| private | ||
|
|
||
| # register_fingerprint method gets called before a Term record is saved, ensuring that Terms should always have a | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add LCSH counts ticket to return to later