Conversation

Member

@matt-bernhardt matt-bernhardt commented Jun 2, 2025

Now that the external citation detector is capable of receiving a hash of features, and returning a boolean response predicting a citation, we can integrate the TACOS application with that service.

This PR achieves that integration in a few steps:

  1. Feature extraction stays in the legacy citation detector, which gains a fourth instance variable, @features.
  2. We add a new, parallel detector to the application, Detector::MlCitation, which takes those features, assembles a payload, and communicates with the lambda. It then processes the response and registers positive results as detections. The detector is guarded by a defensive clause, .expected_env?, which confirms that three environment variables are populated before doing any real work.
  3. The term model now consults the new detector, leveraging that guard clause so that this code can be deployed before the lambda itself is.

One thing to note about this feature: the needed env vars are documented as optional and kept out of .env.test. Adding them there would force a large number of tests unrelated to this feature to need cassettes, both now and for any tests written in the future. Instead, this defines a helper method so that only the relevant tests set those env vars. We're using the helper-clause pattern rather than a setup method because a few different test suites will occasionally need it, which minimizes repetition.
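For illustration, a minimal sketch of that helper (the helper name, the secret and path variable names, and the values are assumptions; only DETECTOR_LAMBDA_URL appears elsewhere in this PR):

    # Hypothetical test helper: sets the three lambda env vars for the
    # duration of a block, so .env.test can stay untouched and only tests
    # that exercise Detector::MlCitation opt in.
    def with_lambda_env
      keys = %w[DETECTOR_LAMBDA_URL DETECTOR_LAMBDA_CHALLENGE_SECRET DETECTOR_LAMBDA_PATH]
      originals = keys.index_with { |key| ENV[key] }
      ENV['DETECTOR_LAMBDA_URL'] = 'http://localhost:3000/lambda'
      ENV['DETECTOR_LAMBDA_CHALLENGE_SECRET'] = 'fake-secret'
      ENV['DETECTOR_LAMBDA_PATH'] = '/predict'
      yield
    ensure
      originals.each { |key, value| ENV[key] = value }
    end

A test that needs the detector wraps its body in with_lambda_env { ... }, while every other test runs without the env vars and never needs a cassette.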

I'm not sure whether this is relevant, but I also want to note that the external ml_citation detector currently always returns true, because the prediction algorithm isn't implemented yet. As a result, one test added here is skipped, because it relies on the lambda being able to return either true or false.

Extra questions:

  1. If it helps, the approach adopted here was not the first pattern I considered. An alternative approach, using a lookup class rather than a detector class, can be found in DO NOT MERGE - lambda integration via lookup pattern #245 .
  2. This PR adds a new detector, leaving the legacy citation detector in place. I'm not sure this is needed - we build the DETECTOR_VERSION env var into the application to enable significant changes to detector behavior, which this change arguably fulfills. The change in this PR is a more incremental step toward that way of working, so I'm happy to propose this small step that still leaves the opportunity to move this new detector into the existing Citation record at a later step.

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-183

Accessibility

  • ANDI or Wave has been run in accordance with our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 2, 2025 13:37 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 2, 2025 13:45 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 3, 2025 20:54 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 14:35 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 16:35 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 16:45 Inactive
** Why are these changes being introduced:

In order to integrate TACOS with the external lambda, we need to send a
hash of the extracted features. This is done implicitly within the
existing citation detector, but not exposed in the format the lambda
will need.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-183

** How does this address that need:

This adds a fourth instance variable to the citation detector: @features.
This variable is in the format needed by the lambda, and is derived from
the existing variables @subpatterns and @summary.

There is a new private method which handles this compilation, which does
a semi-complex bit of manipulation. I've tried to comment its operation
thoroughly.
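As a rough sketch (an assumed implementation, not the exact code in the detector), the compilation amounts to collapsing each @subpatterns list into a 0/1 flag and merging in the @summary counts:

    # Hypothetical sketch: each subpattern list becomes 1 if anything
    # matched and 0 otherwise, then the summary counts are merged in to
    # form the lambda-ready feature hash.
    def extract_features
      flags = @subpatterns.transform_values { |matches| matches.empty? ? 0 : 1 }
      @features = flags.merge(@summary)
    end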

** Document any side effects to this change:

The instance variables provided by the citation detector now have some
duplication, as everything in the @summary variable is also contained
in the @features variable. Ultimately we'll probably be able to remove
the older instance variables, but that will be a separate refactor.
** Why are these changes being introduced:

We need a detector class that will handle communication with the new
external machine learning algorithm for citation detection.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-183

** How does this address that need:

This creates a new detector, Detector::MlCitation. Along with this
class, we create seeds and fixtures which treat this as a _separate_
detector from the existing citation detector (see side effects).

As of this commit, the detector is not integrated with the rest of the
application, as the term model's record_detections method has not yet
been changed.

There are also tests, along with three VCR cassettes which simulate its
responses in the test suite. Please note, though, that the detector
requires a trio of environment variables to be set in order to do any
work. Those variables have not been set in .env.test, because that will
cause a large number of tests that have nothing to do with this feature
to now need cassettes. Instead, there is a helper function that needs
to be called for any test that needs the new ML citation detector.

For local development, you'll want the lambda running locally on your
laptop (see the tacos-detectors-lambdas repo for instructions).
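As an illustration, one of those cassette-backed tests might look roughly like this (the cassette name, the with_lambda_env helper, the phrase, and the detections reader are all assumptions):

    # Hypothetical cassette-backed test; relies on a helper along the lines
    # of the with_lambda_env sketch earlier in this PR to set the env vars.
    test 'ml citation detector records a detection for a citation-like phrase' do
      with_lambda_env do
        VCR.use_cassette('ml_citation_true') do
          result = Detector::MlCitation.new('Science and Human Behavior. Macmillan; 1953.')
          refute_nil result.detections
        end
      end
    end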

** Document any side effects to this change:

I'm not sure whether we want this to be a separate / parallel detector, or
whether it should replace the existing citation detector while updating
the app's DETECTOR_VERSION variable. This sort of significant
upgrade to functionality is the sort of use case we envisioned for that
variable in the first place, because having this new capability should
mean that we re-process terms that had already been evaluated and
categorized.
** Why are these changes being introduced:

The term model will need to interact with the new detector, but only
when we are ready to use it.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-183

** How does this address that need:

This adds the call to Detector::MlCitation.record(term) in the term
model's record_detections method, guarded by a check of the
environment variables.

As I write this, the detector has not been deployed to AWS, so that
guard is relied upon heavily. It is also why these three env vars have
not been defined anywhere in Heroku.

There is one new test on the term model confirming both sides of this
guard clause.
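A minimal sketch of that guarded call (the surrounding body of record_detections is assumed; only the guard and the MlCitation call come from this PR):

    # Hypothetical sketch of Term#record_detections: the new detector is
    # only consulted when .expected_env? confirms the lambda env vars exist.
    def record_detections
      Detector::Citation.record(self)
      Detector::MlCitation.record(self) if Detector::MlCitation.expected_env?
      # ...the other existing detectors are unchanged...
    end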

** Document any side effects to this change:

None, I don't think.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 17:41 Inactive
@matt-bernhardt matt-bernhardt changed the title REWORD Integrate with external citation detector via Detector::MlCitation class Jun 4, 2025
@JPrevost JPrevost self-assigned this Jun 4, 2025
Member

@JPrevost JPrevost left a comment

This looks good. It's a bit hard to test without something to call locally but I think I'm okay with that for now and I'll spend more time on the integration aspects when the lambda is a bit further along.

Not all of my comments require change, but I suspect something in them will drive you to want to make small tweaks. If not, let me know and I'll switch the status to approved.


# This converts the already-built @subpatterns and @summary instance variables into the @features instance variable,
# which has a format suitable for sending to our prediction algorithm.
def extract_features
Member

Future refactor to consider (not recommending we take this on now).

I'm curious why instance variables and attr_accessors are being used rather than just making this a public method and renaming it features. The reason I suggest we not change this now (even if you agree with it, which you might not) is that other aspects of this model are using instance_variables and attr_accessors just like this instead of moving some methods to public so we'd be refactor a fair bit... which might be better to do when we either refactor the application to be more feature centric or decide not to do that.

Member Author

Yeah, I think this is an option to consider, but agree with you that it should be delayed a bit. I'd be very interested in exploring the refactor to shift the feature extraction a bit, and considering this option in that context.

pattern_checker(phrase)
summarize(phrase)
extract_features
@subpatterns.delete_if { |_, v| v == [] }
Member

Is there a reason to do this cleanup after we use that data in the extract_features method and not before?

Member Author

This cleanup is needed because we're dropping the conditional on line 144 (the next comment). The current sequence only populates @subpatterns for cases where there's a match:

{lastnames: ["Hello,"]}

However, for the @features variable we need those unmatched cases to be zeros.

With this PR, the initial state of @subpatterns would be

{apa_volume_issue: [], no: [], pages: [], pp: [], vol: [], year_parens: [], brackets: [], lastnames: ["Hello,"], quotes: []}

In that state, the extract_features command can very easily convert that to
{apa_volume_issue: 0, no: 0, pages: 0, pp: 0, vol: 0, year_parens: 0, brackets: 0, lastnames: 1, quotes: 0}

(and then merge that with the @summary variable for the full set of features).

This line then completes the cleanup of @subpatterns, and gets it back to just

{lastnames: ["Hello,"]}
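In code form, the ordering being described is roughly (the annotations are mine; the calls themselves come from the diff above):

    pattern_checker(phrase)   # @subpatterns => { lastnames: ["Hello,"], quotes: [], ... }
    summarize(phrase)         # @summary populated with counts
    extract_features          # @features => { lastnames: 1, quotes: 0, ... } merged with @summary
    @subpatterns.delete_if { |_, v| v == [] }  # @subpatterns back to { lastnames: ["Hello,"] }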

Member

This feels a bit awkward, but I'm willing to consider it as such right now and move on :)

Member Author

I think this is a middle-ground state, and should be simplified in the near future. While the detector lambda isn't actually in service because we're working through the deploy process (and the ticket to implement the prediction capability isn't complete), we need to preserve the legacy citation detector.

Once the new system is fully in place, we will be free to refactor the feature extraction to not use this type of approach.

def pattern_checker(phrase)
CITATION_PATTERNS.each_pair do |type, pattern|
@subpatterns[type.to_sym] = scan(pattern, phrase) if scan(pattern, phrase).present?
@subpatterns[type.to_sym] = scan(pattern, phrase)
Member

Did you find the previous conditional did nothing useful here? I have no concerns specifically with removing the conditional but I'm curious if you realized it was never needed or if something changed to no longer need it.

Member Author

I mentioned this just above about the new @subpatterns.delete_if { ... } block, but - no, this conditional was exactly what we needed at the time, but it was preventing me from generating the full set of features for cases that the conditional blocked. So I'm proposing we remove the check from this method, and instead perform the cleanup a step later, after we've populated the @features variable via extract_features.

external_data = fetch(phrase)
return if external_data == 'Error'

@detections = external_data
Member

This was a bit confusing to me as I found myself scanning for a method external_data rather than looking above for the variable.

Would you consider something like:

@detections = fetch(phrase)
return if external_data == 'Error'

Member Author

Let me give that a shot - I like the shorter method, and this may also allow us to do better error handling. Thanks!

Member Author

I'm using this comment thread as a bit of code therapy, but may end up saying some of this in the ticket if it feels like it'd be useful long-term.

That specific arrangement doesn't work because external_data never gets populated, so the return clause can't do anything. I could change it to return if @detections == 'Error', but... without anything else in the method I'm not sure what that would accomplish. The entire method could just be

def initialize(phrase)
  return unless self.class.expected_env?
  
  @detections = fetch(phrase)
end

That has the advantage of being nice and simple, so maybe that's the winner. What I'm proposing initially, though, tries to prevent errors from being passed upward - hence the early return prior to @detections being populated in case something goes wrong getting a response back from the lambda. I'm not sure that the Term model or anything farther up the application needs to know that the communication with the citation detector went haywire?

The categorization process, the record-keeping in the term model, and the GraphQL response can all do their work without any input from this detector, so my thinking initially was that this detector either communicates a result back, or it handles errors on its own (with the coming addition of sending them to Sentry).

If that makes sense to you, I'm open to trying something like this?

    def initialize(phrase)
      return unless self.class.expected_env?

      return if fetch(phrase) == 'Error'

      @detections = fetch(phrase)
    end

(This may yet need some work, as trying it locally causes three tests to fail with cassette problems. But it should be manageable)

Member Author

Ah... I think I see why the cassette problems happen. I was anticipating maybe there being some caching magic in place that would prevent multiple calls to fetch(phrase) from actually hitting the lambda twice - but that doesn't seem to be the case.

That leaves me with wanting:

  1. Hit the lambda once, and check the result.
  2. Return quickly, changing nothing, if that result is anything other than a valid response
  3. Populate the @detections instance variable if the result is a valid response

Ooh... I think I've got it. I'll add the new option to my next commit.
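Something along these lines would cover all three points (a sketch; the commit that actually landed may differ):

    # Hypothetical sketch: call the lambda once via fetch, bail out early on
    # anything other than a valid response, and only then populate @detections.
    def initialize(phrase)
      return unless self.class.expected_env?

      result = fetch(phrase)
      return if result == 'Error'

      @detections = result
    end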

end

def self.lambda_url
ENV.fetch('DETECTOR_LAMBDA_URL', nil)
Member

Do these three lambda methods get called anywhere outside this method? They feel like they should move to private methods if not.

Member Author

They don't. I wanted them to be private methods, but was having problems getting them referenced as private methods in some cases. The .expected_env? method that references these (among other references in this model) gets called in both instance and class contexts, and making these public was the only way I could make sure that they were reliably available in all cases.

I'm not particularly happy about this arrangement, so I'm happy to see if I missed something.

Member Author

Ok, the TL;DR is that it turns out they need to be public. I've tried making them private, and tried making them protected. Just moving them below the private declaration seems like it works, but CodeClimate and Rubocop point out that this isn't the way to make private class methods. Using the actual private_class_method prefix makes the linter happy, but then tests fail galore.

Ditto with trying a protected method.

The issue seems to boil down to the variety of ways that this method gets called. Externally, everything gets called as class methods - but then within the record method, the class dog-foods itself via instantiation, and the initialize method in turn calls expected_env? - which means that method gets used in both contexts, which means that the methods it calls being private invites a world of pain and suffering.

The sound you hear is me forcing myself not to just refactor all of our detector classes to avoid this dual-context situation. It probably needs to get added to our list of tech debt, but for right now having these be public methods is the way to go.
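For reference, a sketch of the dual-context arrangement being described (only lambda_url is confirmed in this thread; the other reader names and the body of expected_env? are assumptions):

    # Hypothetical sketch: the env readers stay public because expected_env?
    # is reached both externally as a class method and via self.class from
    # inside initialize.
    module Detector
      class MlCitation
        def self.lambda_url
          ENV.fetch('DETECTOR_LAMBDA_URL', nil)
        end

        def self.lambda_path
          ENV.fetch('DETECTOR_LAMBDA_PATH', nil)
        end

        def self.lambda_secret
          ENV.fetch('DETECTOR_LAMBDA_CHALLENGE_SECRET', nil)
        end

        def self.expected_env?
          [lambda_url, lambda_path, lambda_secret].none?(&:nil?)
        end

        def initialize(phrase)
          return unless self.class.expected_env?
          # ...fetch from the lambda and populate @detections...
        end
      end
    end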

Member

Interesting. There is no harm, other than confusing ourselves later, in keeping these as public methods if things get weird with private class methods. If they aren't private, I'd suggest we add docstrings. I know they are fairly well named and simple, so if you feel docs make them worse, I won't push hard on this and am actually approving the PR as-is without this change.

end
end
end

Member

I wonder if we should leave this helper in the ml_citation_test file for now until it is needed elsewhere. It seems unlikely that we'd actually need this helper anywhere else, so it feels a bit more complex to have to look here rather than just having it at the top of the file being tested? This is not a strong feeling, so if you feel it is better here, that works for me.

Member

Oh, I see. You use it in both ml_citation_test and term_test already. I'm leaving my comment so you can see what I was thinking... however, I now think this is the correct location for this helper.

Member Author

Yep - if it weren't for this test (and the realization that more tests at this model might rely on that method), I'd have implemented it in ml_citation_test.

end
end

# These next three tests may not be needed, since the expected_env? method is tested above.
Member

This comment feels a bit fragile, as someone could insert a test in between any of these at some point. Maybe either drop the tests you feel are redundant, or add a more focused comment to each test, such as:
This test may not be needed, since the expected_env? method is also tested explicitly

Member Author

Yeah, that makes sense. I'll take a fresh look at these three tests, and at least wrap them individually, if not remove them.

end
end

test 'lookup returns nil when challenge_secret is wrong' do
Member

This test is doing a lot of work by handling "all" of the possible error states. I'm not suggesting we test every possible error that could come back, but I think a test with a name that makes clear it is testing non-200 status would be great.

test 'non 200 http status responses result in no detection' do
...
end

Member Author

I'm not particularly devoted to the name I used for this test, but it might be good to talk about the way you're interpreting this? I had thought that the name I used was clear about what was being tested, and that this wasn't a general "test all the error states" type of test.

In fact, I'm a little worried about whether I'm ignoring too many error states with this PR, or whether error handling is implemented thoroughly enough.

@matt-bernhardt
Member Author

There are some good points for improvement in here, so I'll be tweaking some things based on your feedback. I've also tried to answer some of the questions you've asked - but there's one discussion about the error handling test(s) that might be good to have this afternoon?

In the meantime, I'll make some tweaks and let you know when it's ready for re-review.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 6, 2025 19:12 Inactive
This responds to several of the notes received during code review. In
no particular order:

- The initialize method is refactored to be somewhat shorter, with better
variable names and hopefully a clearer way of handling errors
received from the fetch method.

- Sentry logging is explicitly invoked during error handling from the
lambda. I'm trying the .set_extras method here to see how that behaves.

- Three likely-unnecessary tests have been removed, because the public
method .expected_env? is already tested elsewhere.

- The error handling test is renamed to cover all non-200 responses,
rather than just the specific means used to trigger a non-200 response.
Later work to differentiate between non-200 responses is still available
should we want it.

Still to-do is to dig in on the env-reading class methods, to see why
things go sideways when I make them private.
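For the Sentry note above, a rough sketch of the shape of that call (the method name, extras hash, and message text are placeholders, not the code that actually landed):

    # Hypothetical error-handling sketch using sentry-ruby: attach context via
    # set_extras, then record a message instead of raising up to the Term model.
    def report_lambda_error(status, body)
      Sentry.set_extras(lambda_status: status, lambda_response: body)
      Sentry.capture_message('Non-200 response from the ML citation lambda')
    end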
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 6, 2025 19:27 Inactive
@matt-bernhardt
Copy link
Member Author

Ok, I think I've gotten the changes made that feel feasible without picking at some very dangerous threads - and it's ready for re-review on Monday.

@matt-bernhardt matt-bernhardt requested a review from JPrevost June 6, 2025 22:12
Member

@JPrevost JPrevost left a comment

There's a bit more technical debt being committed here than ideal for a brand new class, but it feels better than holding off until we change all the things. 🤞🏻

@matt-bernhardt
Member Author

I've added docstrings to those three must-be-public methods in the ml_citation model, with a note about why they need to be public. I'm going to merge this now after our discussion yesterday.

@matt-bernhardt matt-bernhardt merged commit 2afd833 into main Jun 12, 2025
6 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-183 branch June 12, 2025 14:30