-
Notifications
You must be signed in to change notification settings - Fork 0
Integrate with external citation detector via Detector::MlCitation class #243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
c76ad56
d3f2b9b
99636da
6c58aa6
2fa42d2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -10,7 +10,7 @@ class Detector | |
| # hallmarks of being a citation. | ||
| # Phrases whose score is higher than the REQUIRED_SCORE value can be registered as a Detection. | ||
| class Citation | ||
| attr_reader :score, :subpatterns, :summary | ||
| attr_reader :features, :score, :subpatterns, :summary | ||
|
|
||
| # shared singleton methods | ||
| extend Detector::BulkChecker | ||
|
|
@@ -67,10 +67,13 @@ def detection? | |
| # @return Nothing intentional. Data is written to Hashes `@subpatterns`, `@summary`, | ||
| # and `@score` during processing. | ||
| def initialize(phrase) | ||
| @features = {} | ||
| @subpatterns = {} | ||
| @summary = {} | ||
| pattern_checker(phrase) | ||
| summarize(phrase) | ||
| extract_features | ||
| @subpatterns.delete_if { |_, v| v == [] } | ||
| @score = calculate_score | ||
| end | ||
|
|
||
|
|
@@ -135,13 +138,25 @@ def commas(phrase) | |
| phrase.count(',') | ||
| end | ||
|
|
||
| # This converts the already-built @subpatterns and @summary instance variables into the @features instance variable, | ||
| # which has a format suitable for sending to our prediction algorithm. | ||
| def extract_features | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Future refactor to consider (not recommending we take this on now). I'm curious why instance variables and attr_accessors are being used rather than just making this a public method and renaming it
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I think this is an option to consider, but agree with you that it should be delayed a bit. I'd be very interested in exploring the refactor to shift the feature extraction a bit, and considering this option in that context. |
||
| # Need to create a separate instance variable, so use .deep_dup | ||
| @features = @subpatterns.deep_dup | ||
| # Convert the @subpattern structure of {no: = [], pages: ['194-204']} (a hash of matched substrings, with some | ||
| # empty) into {no: 0, pages: 1} (a hash of integers, some zero) | ||
| @features = @features.transform_values(&:length) | ||
| # Now join the re-shaped hash with the @summary variable, so everything is in one place. | ||
| @features = @features.merge(summary) | ||
| end | ||
|
|
||
| # This builds one of the two main components of the Citation detector - the subpattern report. It uses each of the | ||
| # regular expressions in the CITATION_PATTERNS constant, extracting all matches using the scan method. | ||
| # | ||
| # @return hash | ||
| def pattern_checker(phrase) | ||
| CITATION_PATTERNS.each_pair do |type, pattern| | ||
| @subpatterns[type.to_sym] = scan(pattern, phrase) if scan(pattern, phrase).present? | ||
| @subpatterns[type.to_sym] = scan(pattern, phrase) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you find the previous conditional did nothing useful here? I have no concerns specifically with removing the conditional but I'm curious if you realized it was never needed or if something changed to no longer need it.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mentioned this just above about the new |
||
| end | ||
| end | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,152 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| class Detector | ||
| class MlCitation | ||
| attr_reader :detections | ||
|
|
||
| # For now the initialize method just needs to consult the external lambda. | ||
| # | ||
| # @param phrase String. Often a `Term.phrase`. | ||
| # @return Nothing intentional. Data is written to Hash `@detections` during processing. | ||
| def initialize(phrase) | ||
| return unless self.class.expected_env? | ||
|
|
||
| response = fetch(phrase) | ||
| @detections = response unless response == 'Error' | ||
| end | ||
|
|
||
| def detection? | ||
| @detections == true | ||
| end | ||
|
|
||
| # expected_env? confirms that all three required environment variables are defined. It is provided for the Term | ||
| # model to check prior to calling because this is still an optional extension to TACOS. If this method returns | ||
| # false, the Term model will fall back to the initial citation detector. | ||
| # | ||
| # @return Boolean | ||
| def self.expected_env? | ||
| Rails.logger.error('No lambda URL defined') if lambda_url.nil? | ||
|
|
||
| Rails.logger.error('No lambda path defined') if lambda_path.nil? | ||
|
|
||
| Rails.logger.error('No lambda secret defined') if lambda_secret.nil? | ||
|
|
||
| [lambda_url, lambda_path, lambda_secret].all?(&:present?) | ||
| end | ||
|
|
||
| # The record method runs a supplied term through the detector via its initialize method, which consults the lambda. | ||
| # If a positive result is received, a Detection is registered. | ||
| # | ||
| # @param term [Term] | ||
| # @return nil | ||
| def self.record(term) | ||
| result = Detector::MlCitation.new(term.phrase) | ||
| return unless result.detection? | ||
|
|
||
| # Detections are registered to the "MlCitation" detector for now, but may end up replacing the "Citation" detector | ||
| # in a future step. | ||
| Detection.find_or_create_by( | ||
| term:, | ||
| detector: Detector.where(name: 'MlCitation').first, | ||
| detector_version: ENV.fetch('DETECTOR_VERSION', 'unset') | ||
| ) | ||
|
|
||
| nil | ||
| end | ||
|
|
||
| # lambda_path reads and returns the value of one environment variable. | ||
| # | ||
| # @note This is a public class method because the entire class ends up getting called in both class and instance | ||
| # contexts, due to how detectors are built. The ideal state would be a private method, but that would require | ||
| # changing how the class calls itself via the fetch method. | ||
| # | ||
| # @see Detector::MlCitation.expected_env? | ||
| # @see Detector::MlCitation.fetch | ||
| # @return String or nil | ||
| def self.lambda_path | ||
| ENV.fetch('DETECTOR_LAMBDA_PATH', nil) | ||
| end | ||
|
|
||
| # lambda_secret reads and returns the value of one environment variable. | ||
| # | ||
| # @note This is a public class method because the entire class ends up getting called in both class and instance | ||
| # contexts, due to how detectors are built. The ideal state would be a private method, but that would require | ||
| # changing how the class calls itself via the fetch method. | ||
| # | ||
| # @see Detector::MlCitation.expected_env? | ||
| # @see Detector::MlCitation.fetch | ||
| # @return String or nil | ||
| def self.lambda_secret | ||
| ENV.fetch('DETECTOR_LAMBDA_CHALLENGE_SECRET', nil) | ||
| end | ||
|
|
||
| # lambda_url reads and returns the value of one environment variable. | ||
| # | ||
| # @note This is a public class method because the entire class ends up getting called in both class and instance | ||
| # contexts, due to how detectors are built. The ideal state would be a private method, but that would require | ||
| # changing how the class calls itself via the fetch method. | ||
| # | ||
| # @see Detector::MlCitation.expected_env? | ||
| # @see Detector::MlCitation.fetch | ||
| # @return String or nil | ||
| def self.lambda_url | ||
| ENV.fetch('DETECTOR_LAMBDA_URL', nil) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do these three lambda methods get called anywhere outside this method? They feel like that should move to the private methods if not.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They don't. I wanted them to be private methods, but was having problems getting them referenced as private methods in some cases. the I'm not particularly happy about this arrangement, so I'm happy to see if I missed something.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, the TL;DR is that it turns out they need to be public. I've tried making them private, and tried making them protected. Just moving them below the Ditto with trying a protected method. The issue seems to boil down to the variety of ways that this method gets called. Externally, everything gets called as class methods - but then within the record method, the class dog-foods itself via instantiation, and the initialize method in turn calls expected_env? - which means that method gets used in both contexts, which means that the methods it calls being private invites a world of pain and suffering. The sound you hear is me forcing myself not to just refactor all of our detector classes to avoid this dual-context situation. It probably needs to get added to our list of tech debt, but for right now having these be public methods is the way to go.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Interesting. There is no harm other than confusing ourselves later in having them public methods if things get weird with private class methods. If they aren't private, I'd suggest we add docstrings. I know they are fairly well named and simple so if you feel docs makes them worse, I won't push hard on this and am actually approving the PR as-is without this change. |
||
| end | ||
|
|
||
| private | ||
|
|
||
| # define_lambda connects to the detector lambda. | ||
| # | ||
| # @return Faraday connection | ||
| def define_lambda | ||
| Faraday.new( | ||
| url: self.class.lambda_url, | ||
| params: {} | ||
| ) | ||
| end | ||
|
|
||
| # define_payload defines the Hash that will be sent to the lambda. | ||
| # | ||
| # @return Hash | ||
| def define_payload(phrase) | ||
| { | ||
| action: 'predict', | ||
| features: extract_features(phrase), | ||
| challenge_secret: self.class.lambda_secret | ||
| } | ||
| end | ||
|
|
||
| # extract_features passes the search phrase through the citation detector, and massages the resulting features object | ||
| # to correspond with what the lambda expects. | ||
| # | ||
| # @return Hash | ||
| def extract_features(phrase) | ||
| features = Detector::Citation.new(phrase).features | ||
| features[:apa] = features.delete :apa_volume_issue | ||
| features[:year] = features.delete :year_parens | ||
| features.delete :characters | ||
| features | ||
| end | ||
|
|
||
| # Fetch handles the communication with the detector lambda: defining the connection, building the payload, and any | ||
| # error handling with the response. | ||
| # | ||
| # @return Boolean or 'Error' | ||
| def fetch(phrase) | ||
| lambda = define_lambda | ||
| payload = define_payload(phrase) | ||
|
|
||
| response = lambda.post(self.class.lambda_path, payload.to_json) | ||
|
|
||
| if response.status == 200 | ||
| JSON.parse(response.body)['response'] == 'true' | ||
| else | ||
| Rails.logger.error(response.body) | ||
JPrevost marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| Sentry.set_extras({ body: response.body }) | ||
| Sentry.capture_message('Non-200 response received from detector lambda') | ||
|
|
||
| 'Error' | ||
| end | ||
| end | ||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,9 @@ barcode: | |
| citation: | ||
| name: 'Citation' | ||
|
|
||
| mlcitation: | ||
| name: 'MlCitation' | ||
|
|
||
| doi: | ||
| name: 'DOI' | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to do this cleanup after we use that data in the
extract_featuresmethod and not before?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cleanup is needed because we're dropping the conditional in on line 144 (the next comment). The current sequence only populates
@subpatternsfor cases where there's a match:{lastnames: ["Hello,"]}However, for the
@featuresvariable we need those unmatched cases to be zeros.With this PR, the initial state of
@subpatternswould be{apa_volume_issue: [], no: [], pages: [], pp: [], vol: [], year_parens: [], brackets: [], lastnames: ["Hello,"], quotes: []}In that state, the
extract_featurescommand can very easily convert that to{apa_volume_issue: 0, no: 0, pages: 0, pp: 0, vol: 0, year_parens: 0, brackets: 0, lastnames: 1, quotes: 0}(and then merge that with the
@summaryvariable for the full set of features).This line then completes the cleanup of
@subpatterns, and gets it back to just{lastnames: ["Hello,"]}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit awkward, but I'll willing to consider it as such right now and move on :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a middle-ground state, and should be simplified in the near future. While the detector lambda isn't actually in service because we're working through the deploy process (and the ticket to implement the prediction capability isn't complete), we need to preserve the legacy citation detector.
Once the new system is fully in place, we will be free to refactor the feature extraction to not use this type of approach.