Skip to content

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented May 7, 2025

Purpose and background context

This PR removes feature flags from TIM.

Before the parquet dataset, indexing and deleting were performed separately via CLI commands:

  • bulk-index
    • expected a JSON file with an array of objects, each object a TIMDEX record
  • bulk-delete
    • expected a JSON file with an array of objects, each object containing a timdex_record_id to remove from the index

Now that we are using the parquet dataset, which includes an action column indicating if that record should be indexed or deleted, the need for distinct CLI commands is not needed. Now a single bulk-update command exists, expecting a run_id which provides a subset of records from the parquet dataset to index and delete from Opensearch.

Note the commit 280262f which updates the VCR cassettes for indexing and deleting records. This commit has lots of file churn, but is really just re-recording the VCR cassettes to use the parquet dataset as input vs JSON + TXT files.

How can a reviewer manually see the effects of these changes?

1- Set AWS Dev credentials for TimdexManagers

2- Set env vars:

WORKSPACE=dev
TIMDEX_OPENSEARCH_ENDPOINT=search-timdex-dev-fgby3dckzlfmni2len2wbdf444.us-east-1.es.amazonaws.com

3- Run bulk-update command:

pipenv run tim bulk-update \
--run-date 2025-05-07 \
--run-id 85cfe316-089c-4639-a5af-c861a7321493 \
--index libguides-2025-05-07t20-45-01 \
s3://timdex-extract-dev-222053980223/dataset

Successful StepFunction run with feature flag removed TIM code in Dev: https://222053980223-lritm6hn.us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/v2/executions/details/arn:aws:states:us-east-1:222053980223:execution:timdex-ingest-dev:85cfe316-089c-4639-a5af-c861a7321493.

Includes new or updated dependencies?

YES | NO

Changes expectations for external applications?

YES | NO

What are the relevant tickets?

  • Include links to Jira Software and/or Jira Service Management tickets here.

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

ghukill added 3 commits May 6, 2025 10:18
Why these changes are being introduced:

As of pipenv 2025.0.1 the use of `pipenv check` would throw
an error, indicating that the library `safety` was not installed.
It worked to run `pipenv check --auto-install` which would
temporarily install `safety`, but this was not ideal for multiple
reasons.

First, we anticipate potentially moving away from `pipenv`.

Second, it appears that `safety` is moving to a pay / subscription
model.

Third, it remains a little obfuscated what `pipenv check` is actually
doing.

As this new situation affects all builds in Github Actions CI,
we need a way to scan for vulnerabilities that ideally is not
a massive overhaul of our vulnerability scanning approach.

How this addresses that need:

`pip-audit` is a nice standalone, open-source library that
performs very similar work to `safety`.

This commit replaces `pipenv check` (which was `safety` under
the hood) with `pip-audit`.

Side effects of this change:
* Builds will be successful in Github Actions

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1240
Why these changes are being introduced:

During development work of establishing the parquet dataset
architecture for TIMDEX ETL, feature flags were used in
various applications to remain backwards compatible.  Now
that the parquet dataset architecture is established, the
feature flags are no longer needed.

How this addresses that need:
* Removes 'bulk-index' and 'bulk-delete' CLI commands, preferring
only 'bulk-update' which expects a subset of records from the parquet
dataset to index and delete (per the 'action' column in the dataset).

Side effects of this change:
* Indexing from a JSON array of TIMDEX Record objects is no longer
supported, neither is a deleting from a JSON array of objects
with a single 'timdex_record_id' field.  All indexing and deleting
is expected to originate from parquet dataset records.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-491
Why these changes are being introduced:

With the removal of CLI commands bulk-index and bulk-delete, the test
suite needed some updates.

How this addresses that need:
* Recreate VCR cassettes for bulk indexing and deleting by using
local parquet dataset fixtures
* Remove helpers no longer needed; parsing previous JSON and TXT
files for input

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-491
@coveralls
Copy link

Pull Request Test Coverage Report for Build 14909955215

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-1.8%) to 97.783%

Files with Coverage Reduction New Missed Lines %
tim/helpers.py 1 98.36%
tim/opensearch.py 6 96.45%
Totals Coverage Status
Change from base Build 13315645537: -1.8%
Covered Lines: 397
Relevant Lines: 406

💛 - Coveralls

@ghukill ghukill marked this pull request as ready for review May 8, 2025 15:24
@ghukill ghukill requested a review from a team May 8, 2025 15:24
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and works as expected!

@ghukill ghukill merged commit 719df76 into main May 9, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants