Conversation

Member

@matt-bernhardt matt-bernhardt commented Jun 2, 2025

Now that the external citation detector is capable of receiving a hash of features, and returning a boolean response predicting a citation, we can integrate the TACOS application with that service.

This PR achieves that integration in a few steps:

  1. Feature extraction stays in the legacy citation detector, which gains a fourth instance variable, @features.
  2. We add a new, parallel detector to the application, Detector::MlCitation, which takes those features, assembles a payload, and communicates with the lambda. It then processes the response and registers positive results as detections. The detector is guarded by a defensive clause, .expected_env?, which confirms that three environment variables are populated before doing any real work.
  3. The term model now consults the new detector, leveraging that guard clause so that this code can be deployed before the lambda itself is.

One thing to note about this feature: the needed env vars are documented as optional and kept out of .env.test. Adding them there would force a large number of tests unrelated to this feature to need cassettes, both now and for any tests written in the future. Instead, this defines a helper method so that only the relevant tests set those env vars. We're using the helper-clause pattern rather than a setup method because a few different test suites will occasionally need it, which minimizes repetition.
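For illustration, a minimal sketch of that helper (the helper name, the secret and path variable names, and the values are assumptions; only DETECTOR_LAMBDA_URL appears elsewhere in this PR):

    # Hypothetical test helper: sets the three lambda env vars for the
    # duration of a block, so .env.test can stay untouched and only tests
    # that exercise Detector::MlCitation opt in.
    def with_lambda_env
      keys = %w[DETECTOR_LAMBDA_URL DETECTOR_LAMBDA_CHALLENGE_SECRET DETECTOR_LAMBDA_PATH]
      originals = keys.index_with { |key| ENV[key] }
      ENV['DETECTOR_LAMBDA_URL'] = 'http://localhost:3000/lambda'
      ENV['DETECTOR_LAMBDA_CHALLENGE_SECRET'] = 'fake-secret'
      ENV['DETECTOR_LAMBDA_PATH'] = '/predict'
      yield
    ensure
      originals.each { |key, value| ENV[key] = value }
    end

A test that needs the detector wraps its body in with_lambda_env { ... }, while every other test runs without the env vars and never needs a cassette.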

I'm not sure whether this is relevant, but I also want to note that the external ml_citation detector currently always returns true, because the prediction algorithm isn't implemented yet. As a result, one test added here is skipped, because it relies on the lambda being able to return either true or false.

Extra questions:

  1. If it helps, the approach adopted here was not the first pattern I considered. An alternative approach, using a lookup class rather than a detector class, can be found in DO NOT MERGE - lambda integration via lookup pattern #245 .
  2. This PR adds a new detector, leaving the legacy citation detector in place. I'm not sure this is needed - we build the DETECTOR_VERSION env var into the application to enable significant changes to detector behavior, which this change arguably fulfills. The change in this PR is a more incremental step toward that way of working, so I'm happy to propose this small step that still leaves the opportunity to move this new detector into the existing Citation record at a later step.

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-183

Accessibility

  • ANDI or Wave has been run in accordance with our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 2, 2025 13:37 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 2, 2025 13:45 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 3, 2025 20:54 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 14:35 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 16:35 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 16:45 Inactive
** Why are these changes being introduced:

In order to integrate TACOS with the external lambda, we need to send a
hash of the extracted features. This is done implicitly within the
existing citation detector, but not exposed in the format the lambda
will need.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-183

** How does this address that need:

This adds a fourth instance variable to the citation detector: @features.
This variable is in the format needed by the lambda, and is derived from
the existing variables @subpatterns and @summary.

There is a new private method which handles this compilation, which does
a semi-complex bit of manipulation. I've tried to comment its operation
thoroughly.
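As a rough sketch (an assumed implementation, not the exact code in the detector), the compilation amounts to collapsing each @subpatterns list into a 0/1 flag and merging in the @summary counts:

    # Hypothetical sketch: each subpattern list becomes 1 if anything
    # matched and 0 otherwise, then the summary counts are merged in to
    # form the lambda-ready feature hash.
    def extract_features
      flags = @subpatterns.transform_values { |matches| matches.empty? ? 0 : 1 }
      @features = flags.merge(@summary)
    end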

** Document any side effects to this change:

The instance variables provided by the citation detector now have some
duplication, as everything in the @summary variable is also contained
in the @features variable. Ultimately we'll probably be able to remove
the older instance variables, but that will be a separate refactor.
** Why are these changes being introduced:

We need a detector class that will handle communication with the new
external machine learning algorithm for citation detection.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-183

** How does this address that need:

This creates a new detector, Detector::MlCitation. Along with this
class, we create seeds and fixtures which treat this as a _separate_
detector from the existing citation detector (see side effects).

As of this commit, the detector is not integrated with the rest of the
application, as the term model's record_detections method has not yet
been changed.

There are also tests, along with three VCR cassettes which simulate its
responses in the test suite. Please note, though, that the detector
requires a trio of environment variables to be set in order to do any
work. Those variables have not been set in .env.test, because that will
cause a large number of tests that have nothing to do with this feature
to now need cassettes. Instead, there is a helper function that needs
to be called for any test that needs the new ML citation detector.

For local development, you'll want the lambda running locally on your
laptop (see the tacos-detectors-lambdas repo for instructions).
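As an illustration, one of those cassette-backed tests might look roughly like this (the cassette name, the with_lambda_env helper, the phrase, and the detections reader are all assumptions):

    # Hypothetical cassette-backed test; relies on a helper along the lines
    # of the with_lambda_env sketch earlier in this PR to set the env vars.
    test 'ml citation detector records a detection for a citation-like phrase' do
      with_lambda_env do
        VCR.use_cassette('ml_citation_true') do
          result = Detector::MlCitation.new('Science and Human Behavior. Macmillan; 1953.')
          refute_nil result.detections
        end
      end
    end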

** Document any side effects to this change:

I'm not sure whether we want this to be a separate / parallel detector, or
whether it should replace the existing citation detector while updating
the app's DETECTOR_VERSION variable. This sort of significant
upgrade to functionality is the sort of use case we envisioned for that
variable in the first place, because having this new capability should
mean that we re-process terms that had already been evaluated and
categorized.
** Why are these changes being introduced:

The term model will need to interact with the new detector, but only
when we are ready to use it.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-183

** How does this address that need:

This adds the call to Detector::MlCitation.record(term) in the term
model's record_detections method, guarded by a check of the
environment variables.

As I write this, the detector has not been deployed to AWS, so that
guard is relied upon heavily. It is also why these three env vars have
not been defined anywhere in Heroku.

There is one new test on the term model confirming both sides of this
guard clause.
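A minimal sketch of that guarded call (the surrounding body of record_detections is assumed; only the guard and the MlCitation call come from this PR):

    # Hypothetical sketch of Term#record_detections: the new detector is
    # only consulted when .expected_env? confirms the lambda env vars exist.
    def record_detections
      Detector::Citation.record(self)
      Detector::MlCitation.record(self) if Detector::MlCitation.expected_env?
      # ...the other existing detectors are unchanged...
    end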

** Document any side effects to this change:

None, I don't think.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 4, 2025 17:41 Inactive
@matt-bernhardt matt-bernhardt changed the title REWORD Integrate with external citation detector via Detector::MlCitation class Jun 4, 2025
@JPrevost JPrevost self-assigned this Jun 4, 2025
Member

@JPrevost JPrevost left a comment

This looks good. It's a bit hard to test without something to call locally but I think I'm okay with that for now and I'll spend more time on the integration aspects when the lambda is a bit further along.

Not all of my comments require change, but I suspect something in them will drive you to want to make small tweaks. If not, let me know and I'll switch the status to approved.


# This converts the already-built @subpatterns and @summary instance variables into the @features instance variable,
# which has a format suitable for sending to our prediction algorithm.
def extract_features
Member

Future refactor to consider (not recommending we take this on now).

I'm curious why instance variables and attr_accessors are being used rather than just making this a public method and renaming it features. The reason I suggest we not change this now (even if you agree with it, which you might not) is that other aspects of this model are using instance_variables and attr_accessors just like this instead of moving some methods to public so we'd be refactor a fair bit... which might be better to do when we either refactor the application to be more feature centric or decide not to do that.

Member Author

Yeah, I think this is an option to consider, but agree with you that it should be delayed a bit. I'd be very interested in exploring the refactor to shift the feature extraction a bit, and considering this option in that context.

pattern_checker(phrase)
summarize(phrase)
extract_features
@subpatterns.delete_if { |_, v| v == [] }
Member

Is there a reason to do this cleanup after we use that data in the extract_features method and not before?

Member Author

This cleanup is needed because we're dropping the conditional on line 144 (the next comment). The current sequence only populates @subpatterns for cases where there's a match:

{lastnames: ["Hello,"]}

However, for the @features variable we need those unmatched cases to be zeros.

With this PR, the initial state of @subpatterns would be

{apa_volume_issue: [], no: [], pages: [], pp: [], vol: [], year_parens: [], brackets: [], lastnames: ["Hello,"], quotes: []}

In that state, the extract_features command can very easily convert that to
{apa_volume_issue: 0, no: 0, pages: 0, pp: 0, vol: 0, year_parens: 0, brackets: 0, lastnames: 1, quotes: 0}

(and then merge that with the @summary variable for the full set of features).

This line then completes the cleanup of @subpatterns, and gets it back to just

{lastnames: ["Hello,"]}
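In code form, the ordering being described is roughly (the annotations are mine; the calls themselves come from the diff above):

    pattern_checker(phrase)   # @subpatterns => { lastnames: ["Hello,"], quotes: [], ... }
    summarize(phrase)         # @summary populated with counts
    extract_features          # @features => { lastnames: 1, quotes: 0, ... } merged with @summary
    @subpatterns.delete_if { |_, v| v == [] }  # @subpatterns back to { lastnames: ["Hello,"] }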

Member

This feels a bit awkward, but I'm willing to consider it as such right now and move on :)

Member Author

I think this is a middle-ground state, and should be simplified in the near future. While the detector lambda isn't actually in service because we're working through the deploy process (and the ticket to implement the prediction capability isn't complete), we need to preserve the legacy citation detector.

Once the new system is fully in place, we will be free to refactor the feature extraction to not use this type of approach.

def pattern_checker(phrase)
CITATION_PATTERNS.each_pair do |type, pattern|
@subpatterns[type.to_sym] = scan(pattern, phrase) if scan(pattern, phrase).present?
@subpatterns[type.to_sym] = scan(pattern, phrase)
Member

Did you find the previous conditional did nothing useful here? I have no concerns specifically with removing the conditional but I'm curious if you realized it was never needed or if something changed to no longer need it.

Member Author

I mentioned this just above about the new @subpatterns.delete_if { ... } block, but - no, this conditional was exactly what we needed at the time, but it was preventing me from generating the full set of features for cases that the conditional blocked. So I'm proposing we remove the check from this method, and instead perform the cleanup a step later, after we've populated the @features variable via extract_features.

external_data = fetch(phrase)
return if external_data == 'Error'

@detections = external_data
Member

This was a bit confusing to me as I found myself scanning for a method external_data rather than looking above for the variable.

Would you consider something like:

@detections = fetch(phrase)
return if external_data == 'Error'

Member Author

Let me give that a shot - I like the shorter method, and this may also allow us to do better error handling. Thanks!

Member Author

I'm using this comment thread as a bit of code therapy, but may end up saying some of this in the ticket if it feels like it'd be useful long-term.

That specific arrangement doesn't work because external_data never gets populated, so the return clause can't do anything. I could change it to return if @detections == 'Error', but... without anything else in the method I'm not sure what that would accomplish. The entire method could just be

def initialize(phrase)
  return unless self.class.expected_env?
  
  @detections = fetch(phrase)
end

That has the advantage of being nice and simple, so maybe that's the winner. What I'm proposing initially, though, tries to prevent errors from being passed upward - hence the early return prior to @detections being populated in case something goes wrong getting a response back from the lambda. I'm not sure that the Term model or anything farther up the application needs to know that the communication with the citation detector went haywire?

The categorization process, the record-keeping in the term model, and the GraphQL response can all do their work without any input from this detector, so my thinking initially was that this detector either communicates a result back, or it handles errors on its own (with the coming addition of sending them to Sentry).

If that makes sense to you, I'm open to trying something like this?

    def initialize(phrase)
      return unless self.class.expected_env?

      return if fetch(phrase) == 'Error'

      @detections = fetch(phrase)
    end

(This may yet need some work, as trying it locally causes three tests to fail with cassette problems. But it should be manageable)

Member Author

Ah... I think I see why the cassette problems happen. I was anticipating maybe there being some caching magic in place that would prevent multiple calls to fetch(phrase) from actually hitting the lambda twice - but that doesn't seem to be the case.

That leaves me with wanting:

  1. Hit the lambda once, and check the result.
  2. Return quickly, changing nothing, if that result is anything other than a valid response
  3. Populate the @detections instance variable if the result is a valid response

Ooh... I think I've got it. I'll add the new option to my next commit.
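Something along these lines would cover all three points (a sketch; the commit that actually landed may differ):

    # Hypothetical sketch: call the lambda once via fetch, bail out early on
    # anything other than a valid response, and only then populate @detections.
    def initialize(phrase)
      return unless self.class.expected_env?

      result = fetch(phrase)
      return if result == 'Error'

      @detections = result
    end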

end

def self.lambda_url
ENV.fetch('DETECTOR_LAMBDA_URL', nil)
Member

Do these three lambda methods get called anywhere outside this method? They feel like they should move to private methods if not.

Member Author

They don't. I wanted them to be private methods, but was having problems getting them referenced as private methods in some cases. The .expected_env? method that references these (among other references in this model) gets called in both instance and class contexts, and making these public was the only way I could make sure that they were reliably available in all cases.

I'm not particularly happy about this arrangement, so I'm happy to see if I missed something.

Member Author

Ok, the TL;DR is that it turns out they need to be public. I've tried making them private, and tried making them protected. Just moving them below the private declaration seems like it works, but CodeClimate and Rubocop point out that this isn't the way to make private class methods. Using the actual private_class_method prefix makes the linter happy, but then tests fail galore.

Ditto with trying a protected method.

The issue seems to boil down to the variety of ways that this method gets called. Externally, everything gets called as class methods - but then within the record method, the class dog-foods itself via instantiation, and the initialize method in turn calls expected_env? - which means that method gets used in both contexts, which means that the methods it calls being private invites a world of pain and suffering.

The sound you hear is me forcing myself not to just refactor all of our detector classes to avoid this dual-context situation. It probably needs to get added to our list of tech debt, but for right now having these be public methods is the way to go.
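For reference, a sketch of the dual-context arrangement being described (only lambda_url is confirmed in this thread; the other reader names and the body of expected_env? are assumptions):

    # Hypothetical sketch: the env readers stay public because expected_env?
    # is reached both externally as a class method and via self.class from
    # inside initialize.
    module Detector
      class MlCitation
        def self.lambda_url
          ENV.fetch('DETECTOR_LAMBDA_URL', nil)
        end

        def self.lambda_path
          ENV.fetch('DETECTOR_LAMBDA_PATH', nil)
        end

        def self.lambda_secret
          ENV.fetch('DETECTOR_LAMBDA_CHALLENGE_SECRET', nil)
        end

        def self.expected_env?
          [lambda_url, lambda_path, lambda_secret].none?(&:nil?)
        end

        def initialize(phrase)
          return unless self.class.expected_env?
          # ...fetch from the lambda and populate @detections...
        end
      end
    end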

Member

Interesting. There is no harm, other than confusing ourselves later, in keeping these as public methods if things get weird with private class methods. If they aren't private, I'd suggest we add docstrings. I know they are fairly well named and simple, so if you feel docs make them worse, I won't push hard on this and am actually approving the PR as-is without this change.

end
end
end

Member

I wonder if we should leave this helper in the ml_citation_test file for now until it is needed elsewhere. It seems unlikely that we'd actually need this helper anywhere else, so it feels a bit more complex to have to look here rather than just having it at the top of the file being tested? This is not a strong feeling, so if you feel it is better here, that works for me.

Member

Oh, I see. You use it in both ml_citation_test and term_test already. I'm leaving my comment so you can see what I was thinking... however, I now think this is the correct location for this helper.

Member Author

Yep - if it weren't for this test (and the realization that more tests at this model might rely on that method), I'd have implemented it in ml_citation_test.

end
end

# These next three tests may not be needed, since the expected_env? method is tested above.
Member

This comment feels a bit fragile, as someone could insert a test in between any of these at some point. Maybe either drop the tests you feel are redundant, or add a more focused comment to each test, such as:
This test may not be needed, since the expected_env? method is also tested explicitly

Member Author

Yeah, that makes sense. I'll take a fresh look at these three tests, and at least wrap them individually, if not remove them.

end
end

test 'lookup returns nil when challenge_secret is wrong' do
Member

This test is doing a lot of work by handling "all" of the possible error states. I'm not suggesting we test every possible error that could come back, but I think a test with a name that makes clear it is testing non-200 status would be great.

test 'non 200 http status responses result in no detection' do
...
end

Member Author

I'm not particularly devoted to the name I used for this test, but it might be good to talk about the way you're interpreting this? I had thought that the name I used was clear about what was being tested, and that this wasn't a general "test all the error states" type of test.

In fact, I'm a little worried about whether I'm ignoring too many error states with this PR, or whether error handling is implemented thoroughly enough.

@matt-bernhardt
Member Author

There are some good points for improvement in here, so I'll be tweaking some things based on your feedback. I've also tried to answer some of the questions you've asked - but there's one discussion about the error handling test(s) that might be good to have this afternoon?

In the meantime, I'll make some tweaks and let you know when it's ready for re-review.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 6, 2025 19:12 Inactive
This responds to several of the notes received during code review. In
no particular order:

- The initialize method is refactored to be somewhat shorter, with better
variable names and hopefully a clearer way of handling errors
received from the fetch method.

- Sentry logging is explicitly invoked during error handling from the
lambda. I'm trying the .set_extras method here to see how that behaves.

- Three likely-unnecessary tests have been removed, because the public
method .expected_env? is already tested elsewhere.

- The error handling test is renamed to cover all non-200 responses,
rather than just the specific means used to trigger a non-200 response.
Later work to differentiate between non-200 responses is still available
should we want it.

Still to-do is to dig in on the env-reading class methods, to see why
things go sideways when I make them private.
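For the Sentry note above, a rough sketch of the shape of that call (the method name, extras hash, and message text are placeholders, not the code that actually landed):

    # Hypothetical error-handling sketch using sentry-ruby: attach context via
    # set_extras, then record a message instead of raising up to the Term model.
    def report_lambda_error(status, body)
      Sentry.set_extras(lambda_status: status, lambda_response: body)
      Sentry.capture_message('Non-200 response from the ML citation lambda')
    end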
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-243 June 6, 2025 19:27 Inactive
@matt-bernhardt
Copy link
Member Author

Ok, I think I've gotten the changes made that feel feasible without picking at some very dangerous threads - and it's ready for re-review on Monday.

@matt-bernhardt matt-bernhardt requested a review from JPrevost June 6, 2025 22:12
Member

@JPrevost JPrevost left a comment

There's a bit more technical debt being committed here than ideal for a brand new class, but it feels better than holding off until we change all the things. 🤞🏻

@matt-bernhardt
Member Author

I've added docstrings to those three must-be-public methods in the ml_citation model, with a note about why they need to be public. I'm going to merge this now after our discussion yesterday.

@matt-bernhardt matt-bernhardt merged commit 2afd833 into main Jun 12, 2025
6 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-183 branch June 12, 2025 14:30