Integrate with external citation detector via Detector::MlCitation class #243
Conversation
**Why are these changes being introduced:**

In order to integrate TACOS with the external lambda, we need to send a hash of the extracted features. This is done implicitly within the existing citation detector, but not exposed in the format the lambda will need.

**Relevant ticket(s):**

* https://mitlibraries.atlassian.net/browse/tco-183

**How does this address that need:**

This adds a fourth instance variable to the citation detector: `@features`. This variable is in the format needed by the lambda, and is derived from the existing variables `@subpatterns` and `@summary`. There is a new private method which handles this compilation, which does a semi-complex bit of manipulation. I've tried to comment its operation thoroughly.

**Document any side effects to this change:**

The instance variables provided by the citation detector now have some duplication, as everything in the `@summary` variable is also contained in the `@features` variable. Ultimately we'll probably be able to remove the older instance variables, but that will be a separate refactor.
**Why are these changes being introduced:**

We need a detector class that will handle communication with the new external machine learning algorithm for citation detection.

**Relevant ticket(s):**

* https://mitlibraries.atlassian.net/browse/tco-183

**How does this address that need:**

This creates a new detector, `Detector::MlCitation`. Along with this class, we create seeds and fixtures which treat this as a _separate_ detector from the existing citation detector (see side effects). As of this commit, the detector is not integrated with the rest of the application, as the term model's `record_detections` method has not yet been changed.

There are also tests, along with three VCR cassettes which simulate its responses in the test suite. Please note, though, that the detector requires a trio of environment variables to be set in order to do any work. Those variables have not been set in `.env.test`, because that would cause a large number of tests that have nothing to do with this feature to need cassettes. Instead, there is a helper function that needs to be called for any test that needs the new ML citation detector. For local development, you'll want to have the lambda running on your laptop somewhere (see the tacos-dtectors-lambdas repo for instructions).

**Document any side effects to this change:**

I'm not sure whether we will want this to be a separate / parallel detector, or whether it should be a replacement for the existing citation detector while updating the app's DETECTOR_VERSION variable. This sort of significant upgrade to functionality is the use case we envisioned for that variable in the first place, because having this new capability should mean that we re-process terms that had already been evaluated and categorized.
**Why are these changes being introduced:**

The term model will need to interact with the new detector, but only when we are ready to use it.

**Relevant ticket(s):**

* https://mitlibraries.atlassian.net/browse/tco-183

**How does this address that need:**

This adds the call to `Detector::MlCitation.record(term)` in the term model's `record_detections` method, but guarded by a check of the environment variables. As I write this, the detector has not been deployed to AWS, so that guard is relied upon heavily. It is also why these three env vars have not been defined anywhere in Heroku. There is one new test on the term model confirming both sides of this guard clause.

**Document any side effects to this change:**

None, I don't think.
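The guard described here can be sketched in isolation. This is a toy stand-in, not the actual TACOS code: the real `expected_env?` checks three environment variables, and `record_detections` lives on the Term model alongside the other detectors.

```ruby
# Toy sketch of the guarded call described in the commit message.
# Detector::MlCitation.record only runs when the env var check passes.
module Detector
  class MlCitation
    def self.expected_env?
      # Simplified: the real check covers three environment variables.
      ENV['DETECTOR_LAMBDA_URL'].to_s != ''
    end

    def self.record(term)
      @recorded = term
    end

    def self.recorded = @recorded
  end
end

# Stand-in for the Term model's record_detections method.
def record_detections(term)
  Detector::MlCitation.record(term) if Detector::MlCitation.expected_env?
end
```

With the env var absent, the detector is never invoked; once it is set, the call goes through, which matches the "both sides of this guard clause" test described above.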
JPrevost
left a comment
This looks good. It's a bit hard to test without something to call locally but I think I'm okay with that for now and I'll spend more time on the integration aspects when the lambda is a bit further along.
Not all of my comments require change, but I suspect something in them will drive you to want to make small tweaks. If not, let me know and I'll switch the status to approved.
```ruby
# This converts the already-built @subpatterns and @summary instance variables into the @features instance variable,
# which has a format suitable for sending to our prediction algorithm.
def extract_features
```
Future refactor to consider (not recommending we take this on now).
I'm curious why instance variables and attr_accessors are being used rather than just making this a public method and renaming it `features`. The reason I suggest we not change this now (even if you agree with it, which you might not) is that other aspects of this model are using instance variables and attr_accessors just like this instead of moving some methods to public, so we'd be refactoring a fair bit... which might be better to do when we either refactor the application to be more feature centric or decide not to do that.
Yeah, I think this is an option to consider, but agree with you that it should be delayed a bit. I'd be very interested in exploring the refactor to shift the feature extraction a bit, and considering this option in that context.
```ruby
pattern_checker(phrase)
summarize(phrase)
extract_features
@subpatterns.delete_if { |_, v| v == [] }
```
Is there a reason to do this cleanup after we use that data in the extract_features method and not before?
This cleanup is needed because we're dropping the conditional on line 144 (the next comment). The current sequence only populates `@subpatterns` for cases where there's a match:

```ruby
{lastnames: ["Hello,"]}
```

However, for the `@features` variable we need those unmatched cases to be zeros. With this PR, the initial state of `@subpatterns` would be

```ruby
{apa_volume_issue: [], no: [], pages: [], pp: [], vol: [], year_parens: [], brackets: [], lastnames: ["Hello,"], quotes: []}
```

In that state, the `extract_features` command can very easily convert that to

```ruby
{apa_volume_issue: 0, no: 0, pages: 0, pp: 0, vol: 0, year_parens: 0, brackets: 0, lastnames: 1, quotes: 0}
```

(and then merge that with the `@summary` variable for the full set of features). This line then completes the cleanup of `@subpatterns`, and gets it back to just

```ruby
{lastnames: ["Hello,"]}
```
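The sequence described in this thread can be reproduced as a toy Ruby snippet. The subpattern hash comes from the discussion above; the `summary` values are invented for illustration, since only the hash shapes matter here:

```ruby
# Toy reconstruction of the extract_features sequence discussed above.
subpatterns = {
  apa_volume_issue: [], no: [], pages: [], pp: [], vol: [],
  year_parens: [], brackets: [], lastnames: ['Hello,'], quotes: []
}
summary = { words: 1, commas: 1 } # hypothetical @summary values

# extract_features: unmatched patterns become 0, matched become 1,
# then merge with the summary for the full feature hash.
features = subpatterns.transform_values { |matches| matches.empty? ? 0 : 1 }
                      .merge(summary)

# The final delete_if restores @subpatterns to matched entries only.
subpatterns.delete_if { |_, v| v == [] }
```

After this runs, `features[:lastnames]` is `1` while every unmatched pattern is `0`, and `subpatterns` is back to `{lastnames: ["Hello,"]}`, which is the ordering constraint the comment below asks about.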
This feels a bit awkward, but I'm willing to consider it as such right now and move on :)
I think this is a middle-ground state, and should be simplified in the near future. While the detector lambda isn't actually in service because we're working through the deploy process (and the ticket to implement the prediction capability isn't complete), we need to preserve the legacy citation detector.
Once the new system is fully in place, we will be free to refactor the feature extraction to not use this type of approach.
```diff
 def pattern_checker(phrase)
   CITATION_PATTERNS.each_pair do |type, pattern|
-    @subpatterns[type.to_sym] = scan(pattern, phrase) if scan(pattern, phrase).present?
+    @subpatterns[type.to_sym] = scan(pattern, phrase)
```
Did you find the previous conditional did nothing useful here? I have no concerns specifically with removing the conditional but I'm curious if you realized it was never needed or if something changed to no longer need it.
I mentioned this just above about the new @subpatterns.delete_if { ... } block, but - no, this conditional was exactly what we needed at the time, but it was preventing me from generating the full set of features for cases that the conditional blocked. So I'm proposing we remove the check from this method, and instead perform the cleanup a step later, after we've populated the @features variable via extract_features.
`app/models/detector/ml_citation.rb` (Outdated):

```ruby
external_data = fetch(phrase)
return if external_data == 'Error'

@detections = external_data
```
This was a bit confusing to me as I found myself scanning for a method external_data rather than looking above for the variable.
Would you consider something like:

```ruby
@detections = fetch(phrase)
return if external_data == 'Error'
```
Let me give that a shot - I like the shorter method, and this may also allow us to do better error handling. Thanks!
I'm using this comment thread as a bit of code therapy, but may end up saying some of this in the ticket if it feels like it'd be useful long-term.
That specific arrangement doesn't work because `external_data` never gets populated, so the return clause can't do anything. I could change it to `return if @detections == 'Error'`, but... without anything else in the method I'm not sure what that would accomplish. The entire method could just be

```ruby
def initialize(phrase)
  return unless self.class.expected_env?
  @detections = fetch(phrase)
end
```

That has the advantage of being nice and simple, so maybe that's the winner. What I'm proposing initially, though, tries to prevent errors from being passed upward - hence the early return prior to `@detections` being populated in case something goes wrong getting a response back from the lambda. I'm not sure that the Term model or anything farther up the application needs to know that the communication with the citation detector went haywire?

The categorization process, the record-keeping in the term model, and the GraphQL response can all do their work without any input from this detector, so my thinking initially was that this detector either communicates a result back, or it handles errors on its own (with the coming addition of sending them to Sentry).

If that makes sense to you, I'm open to trying something like this?
```ruby
def initialize(phrase)
  return unless self.class.expected_env?
  return if fetch(phrase) == 'Error'
  @detections = fetch(phrase)
end
```

(This may yet need some work, as trying it locally causes three tests to fail with cassette problems. But it should be manageable)
Ah... I think I see why the cassette problems happen. I was anticipating maybe there being some caching magic in place that would prevent multiple calls to fetch(phrase) from actually hitting the lambda twice - but that doesn't seem to be the case.
That leaves me with wanting:

- Hit the lambda once, and check the result.
- Return quickly, changing nothing, if that result is anything other than a valid response.
- Populate the `@detections` instance variable if the result is a valid response.
Ooh... I think I've got it. I'll add the new option to my next commit.
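The option being described (call the lambda once, inspect the result, and only populate `@detections` on success) might look like the sketch below. The fetcher is injected here so the snippet runs standalone; the real class would call its own `fetch(phrase)`:

```ruby
# Sketch: fetch once into a local variable, bail early on error,
# otherwise keep the response. The fetcher lambda stands in for the
# real fetch(phrase) HTTP call to the detector lambda.
class MlCitationSketch
  attr_reader :detections

  def self.expected_env? = true # assume the env vars are present for this sketch

  def initialize(phrase, fetcher)
    return unless self.class.expected_env?
    response = fetcher.call(phrase) # hit the lambda exactly once
    return if response == 'Error'   # leave @detections nil on failure
    @detections = response
  end
end
```

Because the response lands in a local before the error check, the lambda is only hit once per initialization, which avoids the double-fetch that was tripping the VCR cassettes.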
```ruby
end

def self.lambda_url
  ENV.fetch('DETECTOR_LAMBDA_URL', nil)
```
Do these three lambda methods get called anywhere outside this method? If not, it feels like they should move to the private methods.
They don't. I wanted them to be private methods, but was having problems getting them referenced as private methods in some cases. The `.expected_env?` method that references these (among other references in this model) gets called both as instance and class methods, and making these public was the only way I could make sure that they were reliably available in all cases.
I'm not particularly happy about this arrangement, so I'm happy to see if I missed something.
Ok, the TL;DR is that it turns out they need to be public. I've tried making them private, and tried making them protected. Just moving them below the private declaration seems like it works, but CodeClimate and Rubocop point out that this isn't the way to make private class methods. Using the actual private_class_method prefix makes the linter happy, but then tests fail galore.
Ditto with trying a protected method.
The issue seems to boil down to the variety of ways that this method gets called. Externally, everything gets called as class methods - but then within the record method, the class dog-foods itself via instantiation, and the initialize method in turn calls expected_env? - which means that method gets used in both contexts, which means that the methods it calls being private invites a world of pain and suffering.
The sound you hear is me forcing myself not to just refactor all of our detector classes to avoid this dual-context situation. It probably needs to get added to our list of tech debt, but for right now having these be public methods is the way to go.
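The failure mode described here is easy to reproduce in isolation. This is a minimal sketch, not the actual detector code; the method names just mirror the discussion:

```ruby
# A class method made private via private_class_method works fine when
# called from another class method with an implicit receiver, but an
# instance reaching it through self.class hits the access check.
class EnvExample
  def self.lambda_url = 'http://example.invalid'
  private_class_method :lambda_url

  # Class-method context: implicit receiver, so the private call is allowed.
  def self.expected_env_from_class?
    !lambda_url.nil?
  end

  # Instance context: explicit receiver (self.class), so this raises
  # NoMethodError for the private class method.
  def expected_env_from_instance?
    !self.class.lambda_url.nil?
  end
end
```

This is the dual-context problem in miniature: as long as `expected_env?` is invoked from both class and instance contexts, the env-reading methods it calls have to stay public.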
Interesting. There is no harm other than confusing ourselves later in having them public methods if things get weird with private class methods. If they aren't private, I'd suggest we add docstrings. I know they are fairly well named and simple so if you feel docs makes them worse, I won't push hard on this and am actually approving the PR as-is without this change.
```ruby
    end
  end
end
```
I wonder if we should leave this helper in the ml_citation_test file for now until it is needed elsewhere. It seems unlikely that we'd actually need this helper anywhere else so it feels a bit more complex to have to look here rather than just having it at the top of the file being tested? This is not a strong feeling so if you feel it is better here that works for me.
Oh, I see. You use it in both ml_citation_test and term_test already. I'm leaving my comment so you can see what I was thinking... however, I now think this is the correct location for this helper.
Yep - if it weren't for this test (and the realization that more tests at this model might rely on that method), I'd have implemented it in ml_citation_test.
```ruby
  end
end

# These next three tests may not be needed, since the expected_env? method is tested above.
```
This comment feels a bit fragile, as someone could insert a test in between any of these at some point. Maybe either drop the tests you feel are redundant, or add a more focused comment to each test, such as:

```ruby
# This test may not be needed, since the expected_env? method is also tested explicitly
```
Yeah, that makes sense. I'll take a fresh look at these three tests, and at least wrap them individually, if not remove them.
```ruby
  end
end

test 'lookup returns nil when challenge_secret is wrong' do
```
This test is doing a lot of work by handling "all" of the possible error states. I'm not suggesting we test every possible error that could come back, but I think a test with a name that makes clear it is testing non-200 status would be great.

```ruby
test 'non 200 http status responses result in no detection' do
  ...
end
```
I'm not particularly devoted to the name I used for this test, but it might be good to talk about the way you're interpreting this? I had thought that the name I used was clear about what was being tested, and that this wasn't a general "test all the error states" type of test.
In fact, I'm a little worried about whether I'm ignoring too many error states with this PR, or whether error handling is implemented thoroughly enough.
There's some good points for improvement in here, so I'll be tweaking some things based on your feedback. I've also tried to answer some of the questions you've asked - but there's one discussion about the error handling test(s) that might be good to have this afternoon? In the meantime, I'll make some tweaks and let you know when it's ready for re-review.
This responds to several of the notes received during code review. In no particular order:

- The initialize method is refactored to be somewhat shorter, with better variable names and hopefully a clearer method of handling errors received from the fetch method.
- Sentry logging is explicitly invoked during error handling from the lambda. I'm trying the .set_extras method here to see how that behaves.
- Three likely-unnecessary tests have been removed, because the public method .expected_env? is already tested elsewhere.
- The error handling test is renamed to cover all non-200 responses, rather than just the specific means used to trigger a non-200 response. Later work to differentiate between non-200 responses is still available should we want it.

Still to-do is to dig in on the env-reading class methods, to see why things go sideways when I make them private.
Ok, I think I've gotten the changes made that feel feasible without picking at some very dangerous threads - and it's ready for re-review on Monday.
JPrevost
left a comment
There's a bit more technical debt being committed here than ideal for a brand new class, but it feels better than holding off until we change all the things. 🤞🏻
I've added docstrings to those three must-be-public methods in the ml_citation model, with a note about why they need to be public. I'm going to merge this now after our discussion yesterday.
Now that the external citation detector is capable of receiving a hash of features, and returning a boolean response predicting a citation, we can integrate the TACOS application with that service.
This achieves the integration via a few steps:

- The existing citation detector now compiles its extracted signals into a new instance variable, `@features`.
- A new detector, `Detector::MlCitation`, takes those features, assembles a payload, and communicates with the lambda. It then processes the response, and registers positive results as detections. This detector is guarded by a defensive clause, `.expected_env?`, which confirms that three environment variables are populated prior to taking any real actions.

One thing to note about this feature is that the needed env vars are described as optional, and kept out of `.env.test`. Adding them would require that a large number of tests that don't relate to this feature get cassettes, and a large number of future tests would likewise need cassettes unrelated to the feature under test. Instead, this defines a helper function that allows only relevant tests to define those env vars. We're using the helper clause pattern rather than a setup method because a few different test suites will occasionally need this feature - so this minimizes code repetition.

I'm not sure whether this is relevant, but I also want to note that the external ml_citation detector currently always returns True, because it doesn't yet have the algorithm implemented. As a result, there's a test added here that is skipped because it will rely on the lambda being able to return either true or false.
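The opt-in env helper pattern described above might look something like this sketch. Only `DETECTOR_LAMBDA_URL` appears in the diff, so the other variable names and all the values here are placeholders:

```ruby
# Sketch of a test helper that sets the detector env vars only for the
# tests that need them, restoring the previous state afterwards so
# unrelated tests never see the vars (and never need cassettes).
module MlCitationEnvHelper
  # Placeholder names/values; only DETECTOR_LAMBDA_URL is named in the diff.
  DETECTOR_ENV = {
    'DETECTOR_LAMBDA_URL' => 'http://localhost:3000/predict',
    'DETECTOR_LAMBDA_CHALLENGE_SECRET' => 'test-secret',
    'DETECTOR_LAMBDA_TOKEN' => 'test-token'
  }.freeze

  # Run a block with the detector env vars set, then restore whatever
  # was there before (including deleting vars that were absent).
  def with_ml_citation_env
    previous = DETECTOR_ENV.keys.to_h { |k| [k, ENV[k]] }
    DETECTOR_ENV.each { |k, v| ENV[k] = v }
    yield
  ensure
    previous.each { |k, v| v.nil? ? ENV.delete(k) : ENV[k] = v }
  end
end
```

Because the helper restores the prior state in an `ensure` block, a test that raises mid-block still leaves the environment clean for the rest of the suite.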
Extra questions:

- Citation record at a later step.

Developer
- Ticket(s): https://mitlibraries.atlassian.net/browse/TCO-183
- Accessibility: all issues introduced by these changes have been resolved or opened as new issues (link to those issues in the Pull Request details above)
- Documentation: ENV
- Stakeholders
- Dependencies and migrations:
  - NO dependencies are updated
  - NO migrations are included
Reviewer
- Code: added technical debt.
- Documentation (not just this pull request message).
- Testing