Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,10 @@ changes, this is the signal which indicates that terms need to be re-evaluated.

### Optional

`DETECTOR_LAMBDA_CHALLENGE_SECRET`: The secret phrase required by the external citation detector to process any request. If not present, the detector will not respond.
`DETECTOR_LAMBDA_PATH`: The path specified by the external citation detector for prediction requests. If not present, the citation detector will not be consulted.
`DETECTOR_LAMBDA_URL`: The address for an external citation detector, if present. If not present, the citation detector will not be consulted.

`LIBKEY_KEY`: LibKey API key. Required if `LIBKEY_DOI` or `LIBKEY_PMID` are set.
`LIBKEY_ID`: LibKey Library ID. Required if `LIBKEY_DOI` or `LIBKEY_PMID` are set.
`LIBKEY_DOI`: If set, use LibKey for DOI metadata lookups. If not set, Unpaywall is used.
Expand Down
19 changes: 17 additions & 2 deletions app/models/detector/citation.rb
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ class Detector
# hallmarks of being a citation.
# Phrases whose score is higher than the REQUIRED_SCORE value can be registered as a Detection.
class Citation
attr_reader :score, :subpatterns, :summary
attr_reader :features, :score, :subpatterns, :summary

# shared singleton methods
extend Detector::BulkChecker
Expand Down Expand Up @@ -67,10 +67,13 @@ def detection?
# @return Nothing intentional. Data is written to Hashes `@subpatterns`, `@summary`,
# and `@score` during processing.
def initialize(phrase)
@features = {}
@subpatterns = {}
@summary = {}
pattern_checker(phrase)
summarize(phrase)
extract_features
@subpatterns.delete_if { |_, v| v == [] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to do this cleanup after we use that data in the extract_features method and not before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cleanup is needed because we're dropping the conditional in on line 144 (the next comment). The current sequence only populates @subpatterns for cases where there's a match:

{lastnames: ["Hello,"]}

However, for the @features variable we need those unmatched cases to be zeros.

With this PR, the initial state of @subpatterns would be

{apa_volume_issue: [], no: [], pages: [], pp: [], vol: [], year_parens: [], brackets: [], lastnames: ["Hello,"], quotes: []}

In that state, the extract_features command can very easily convert that to
{apa_volume_issue: 0, no: 0, pages: 0, pp: 0, vol: 0, year_parens: 0, brackets: 0, lastnames: 1, quotes: 0}

(and then merge that with the @summary variable for the full set of features).

This line then completes the cleanup of @subpatterns, and gets it back to just

{lastnames: ["Hello,"]}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit awkward, but I'll willing to consider it as such right now and move on :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a middle-ground state, and should be simplified in the near future. While the detector lambda isn't actually in service because we're working through the deploy process (and the ticket to implement the prediction capability isn't complete), we need to preserve the legacy citation detector.

Once the new system is fully in place, we will be free to refactor the feature extraction to not use this type of approach.

@score = calculate_score
end

Expand Down Expand Up @@ -135,13 +138,25 @@ def commas(phrase)
phrase.count(',')
end

# This converts the already-built @subpatterns and @summary instance variables into the @features instance variable,
# which has a format suitable for sending to our prediction algorithm.
def extract_features
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future refactor to consider (not recommending we take this on now).

I'm curious why instance variables and attr_accessors are being used rather than just making this a public method and renaming it features. The reason I suggest we not change this now (even if you agree with it, which you might not) is that other aspects of this model are using instance_variables and attr_accessors just like this instead of moving some methods to public so we'd be refactor a fair bit... which might be better to do when we either refactor the application to be more feature centric or decide not to do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is an option to consider, but agree with you that it should be delayed a bit. I'd be very interested in exploring the refactor to shift the feature extraction a bit, and considering this option in that context.

# Need to create a separate instance variable, so use .deep_dup
@features = @subpatterns.deep_dup
# Convert the @subpattern structure of {no: = [], pages: ['194-204']} (a hash of matched substrings, with some
# empty) into {no: 0, pages: 1} (a hash of integers, some zero)
@features = @features.transform_values(&:length)
# Now join the re-shaped hash with the @summary variable, so everything is in one place.
@features = @features.merge(summary)
end

# This builds one of the two main components of the Citation detector - the subpattern report. It uses each of the
# regular expressions in the CITATION_PATTERNS constant, extracting all matches using the scan method.
#
# @return hash
def pattern_checker(phrase)
CITATION_PATTERNS.each_pair do |type, pattern|
@subpatterns[type.to_sym] = scan(pattern, phrase) if scan(pattern, phrase).present?
@subpatterns[type.to_sym] = scan(pattern, phrase)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you find the previous conditional did nothing useful here? I have no concerns specifically with removing the conditional but I'm curious if you realized it was never needed or if something changed to no longer need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this just above about the new @subpatterns.delete_if { ... } block, but - no, this conditional was exactly what we needed at the time, but it was preventing me from generating the full set of features for cases that the conditional blocked. So I'm proposing we remove the check from this method, and instead perform the cleanup a step later, after we've populated the @features variable via extract_features.

end
end

Expand Down
152 changes: 152 additions & 0 deletions app/models/detector/ml_citation.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
# frozen_string_literal: true

class Detector
class MlCitation
attr_reader :detections

# For now the initialize method just needs to consult the external lambda.
#
# @param phrase String. Often a `Term.phrase`.
# @return Nothing intentional. Data is written to Hash `@detections` during processing.
def initialize(phrase)
return unless self.class.expected_env?

response = fetch(phrase)
@detections = response unless response == 'Error'
end

def detection?
@detections == true
end

# expected_env? confirms that all three required environment variables are defined. It is provided for the Term
# model to check prior to calling because this is still an optional extension to TACOS. If this method returns
# false, the Term model will fall back to the initial citation detector.
#
# @return Boolean
def self.expected_env?
Rails.logger.error('No lambda URL defined') if lambda_url.nil?

Rails.logger.error('No lambda path defined') if lambda_path.nil?

Rails.logger.error('No lambda secret defined') if lambda_secret.nil?

[lambda_url, lambda_path, lambda_secret].all?(&:present?)
end

# The record method runs a supplied term through the detector via its initialize method, which consults the lambda.
# If a positive result is received, a Detection is registered.
#
# @param term [Term]
# @return nil
def self.record(term)
result = Detector::MlCitation.new(term.phrase)
return unless result.detection?

# Detections are registered to the "MlCitation" detector for now, but may end up replacing the "Citation" detector
# in a future step.
Detection.find_or_create_by(
term:,
detector: Detector.where(name: 'MlCitation').first,
detector_version: ENV.fetch('DETECTOR_VERSION', 'unset')
)

nil
end

# lambda_path reads and returns the value of one environment variable.
#
# @note This is a public class method because the entire class ends up getting called in both class and instance
# contexts, due to how detectors are built. The ideal state would be a private method, but that would require
# changing how the class calls itself via the fetch method.
#
# @see Detector::MlCitation.expected_env?
# @see Detector::MlCitation.fetch
# @return String or nil
def self.lambda_path
ENV.fetch('DETECTOR_LAMBDA_PATH', nil)
end

# lambda_secret reads and returns the value of one environment variable.
#
# @note This is a public class method because the entire class ends up getting called in both class and instance
# contexts, due to how detectors are built. The ideal state would be a private method, but that would require
# changing how the class calls itself via the fetch method.
#
# @see Detector::MlCitation.expected_env?
# @see Detector::MlCitation.fetch
# @return String or nil
def self.lambda_secret
ENV.fetch('DETECTOR_LAMBDA_CHALLENGE_SECRET', nil)
end

# lambda_url reads and returns the value of one environment variable.
#
# @note This is a public class method because the entire class ends up getting called in both class and instance
# contexts, due to how detectors are built. The ideal state would be a private method, but that would require
# changing how the class calls itself via the fetch method.
#
# @see Detector::MlCitation.expected_env?
# @see Detector::MlCitation.fetch
# @return String or nil
def self.lambda_url
ENV.fetch('DETECTOR_LAMBDA_URL', nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these three lambda methods get called anywhere outside this method? They feel like that should move to the private methods if not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't. I wanted them to be private methods, but was having problems getting them referenced as private methods in some cases. the .expected_env? method that references these (among other references in this model) gets called both as instance and class methods, and making this public was the only way I could make sure that they were reliably available in all cases.

I'm not particularly happy about this arrangement, so I'm happy to see if I missed something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the TL;DR is that it turns out they need to be public. I've tried making them private, and tried making them protected. Just moving them below the private declaration seems like it works, but CodeClimate and Rubocop point out that this isn't the way to make private class methods. Using the actual private_class_method prefix makes the linter happy, but then tests fail galore.

Ditto with trying a protected method.

The issue seems to boil down to the variety of ways that this method gets called. Externally, everything gets called as class methods - but then within the record method, the class dog-foods itself via instantiation, and the initialize method in turn calls expected_env? - which means that method gets used in both contexts, which means that the methods it calls being private invites a world of pain and suffering.

The sound you hear is me forcing myself not to just refactor all of our detector classes to avoid this dual-context situation. It probably needs to get added to our list of tech debt, but for right now having these be public methods is the way to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. There is no harm other than confusing ourselves later in having them public methods if things get weird with private class methods. If they aren't private, I'd suggest we add docstrings. I know they are fairly well named and simple so if you feel docs makes them worse, I won't push hard on this and am actually approving the PR as-is without this change.

end

private

# define_lambda connects to the detector lambda.
#
# @return Faraday connection
def define_lambda
Faraday.new(
url: self.class.lambda_url,
params: {}
)
end

# define_payload defines the Hash that will be sent to the lambda.
#
# @return Hash
def define_payload(phrase)
{
action: 'predict',
features: extract_features(phrase),
challenge_secret: self.class.lambda_secret
}
end

# extract_features passes the search phrase through the citation detector, and massages the resulting features object
# to correspond with what the lambda expects.
#
# @return Hash
def extract_features(phrase)
features = Detector::Citation.new(phrase).features
features[:apa] = features.delete :apa_volume_issue
features[:year] = features.delete :year_parens
features.delete :characters
features
end

# Fetch handles the communication with the detector lambda: defining the connection, building the payload, and any
# error handling with the response.
#
# @return Boolean or 'Error'
def fetch(phrase)
lambda = define_lambda
payload = define_payload(phrase)

response = lambda.post(self.class.lambda_path, payload.to_json)

if response.status == 200
JSON.parse(response.body)['response'] == 'true'
else
Rails.logger.error(response.body)
Sentry.set_extras({ body: response.body })
Sentry.capture_message('Non-200 response received from detector lambda')

'Error'
end
end
end
end
1 change: 1 addition & 0 deletions app/models/term.rb
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ def cluster
#
# @return nil
def record_detections
Detector::MlCitation.record(self) if Detector::MlCitation.expected_env?
Detector::Citation.record(self)
Detector::StandardIdentifiers.record(self)
Detector::Journal.record(self)
Expand Down
6 changes: 6 additions & 0 deletions db/seeds.rb
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
Detector.find_or_create_by(name: 'Journal')
Detector.find_or_create_by(name: 'SuggestedResource')
Detector.find_or_create_by(name: 'Citation')
Detector.find_or_create_by(name: 'MlCitation')
Detector.find_or_create_by(name: 'Barcode')
Detector.find_or_create_by(name: 'SuggestedResourcePattern')

Expand Down Expand Up @@ -75,6 +76,11 @@
category: Category.find_by(name: 'Informational'),
confidence: 0.7
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'MlCitation'),
category: Category.find_by(name: 'Transactional'),
confidence: 0.95
)
DetectorCategory.find_or_create_by(
detector: Detector.find_by(name: 'PMID'),
category: Category.find_by(name: 'Transactional'),
Expand Down
5 changes: 5 additions & 0 deletions test/fixtures/detector_categories.yml
Original file line number Diff line number Diff line change
Expand Up @@ -48,3 +48,8 @@ eight:
detector: barcode
category: transactional
confidence: 0.95

nine:
detector: mlcitation
category: transactional
confidence: 0.95
3 changes: 3 additions & 0 deletions test/fixtures/detectors.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,9 @@ barcode:
citation:
name: 'Citation'

mlcitation:
name: 'MlCitation'

doi:
name: 'DOI'

Expand Down
27 changes: 26 additions & 1 deletion test/models/detector/citation_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,12 @@

class Detector
class CitationTest < ActiveSupport::TestCase
test 'detector::citation exposes three instance variables' do
test 'detector::citation exposes four instance variables' do
t = terms('citation')
result = Detector::Citation.new(t.phrase)

assert_predicate result.features, :present?

assert_predicate result.score, :present?

assert_predicate result.summary, :present?
Expand Down Expand Up @@ -196,6 +198,29 @@ class CitationTest < ActiveSupport::TestCase
assert_operator 0, :<, result.score
end

test 'features instance method is a hash of integers' do
result = Detector::Citation.new('simple search phrase')

assert_instance_of(Hash, result.features)

assert(result.features.all? { |_, v| v.integer? })
end

test 'features instance method includes all elements of citation detector regardless of search string' do
result_simple = Detector::Citation.new('simple')
result_complex = Detector::Citation.new('Science Education and Cultural Diversity: Mapping the Field. Studies in Science Education, 24(1), 49–73.')

assert_equal result_simple.features.length, result_complex.features.length
end

test 'features instance method should include all elements of citation patterns and summary thresholds' do
patterns = Detector::Citation.const_get :CITATION_PATTERNS
summary = Detector::Citation.const_get :SUMMARY_THRESHOLDS
result = Detector::Citation.new('simple')

assert_equal (patterns.length + summary.length), result.features.length
end

test 'detection? convenience method returns true for obvious citations' do
result = Detector::Citation.new(terms('citation').phrase)

Expand Down
Loading