Conversation

Contributor

@ghukill ghukill commented Jun 4, 2025

Purpose and background context

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using only ETL dataset data.

This command could be helpful for development, where it would be handy to see the current state of a source in a local Opensearch instance. Or, it could be helpful for disaster recovery where we want to refresh a source in Opensearch without performing the extract + transform stages again.

How this addresses that need:

  • Adds new reindex-source CLI command
    • create new index
    • promote and alias the new index
    • read current records for a source from dataset and bulk index them

This CLI command is not really doing anything novel; it just chains pre-existing functionality in a particular order.
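The chain of steps above can be sketched as follows. This is a minimal, runnable illustration only: every helper below is a stand-in for the pre-existing tim / TDA functionality being chained, not the real API.

```python
# Minimal runnable sketch of the reindex-source flow. Every helper below is a
# stand-in for pre-existing tim / TDA functionality, not the real API.

def create_index(source: str) -> str:
    # In tim, the new index gets a timestamped name; "-new" is a placeholder.
    return f"{source}-new"

def promote_index(index: str, extra_aliases: list[str]) -> list[str]:
    # The new index becomes primary for the source alias, plus any extra aliases.
    source_alias = index.rsplit("-", 1)[0]
    return [source_alias, *extra_aliases]

def current_records(source: str):
    # Stand-in for TIMDEXDataset.load(current_records=True, source=source)
    # followed by read_transformed_records_iter(action="index").
    yield from ({"timdex_record_id": f"{source}:{i}"} for i in range(3))

def reindex_source(source: str, extra_aliases: list[str]) -> dict:
    index = create_index(source)                       # 1. create new index
    aliases = promote_index(index, extra_aliases)      # 2. promote and alias it
    created = sum(1 for _ in current_records(source))  # 3+4. read current records
    return {"index": index, "aliases": aliases, "created": created}

result = reindex_source("dspace", ["timdex"])
```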

How can a reviewer manually see the effects of these changes?

1- Follow the instructions to get Opensearch running locally, e.g.

Set .env file:

OPENSEARCH_INITIAL_ADMIN_PASSWORD=SuperSecret42!

Run with docker compose:

docker pull opensearchproject/opensearch:latest
docker pull opensearchproject/opensearch-dashboards:latest
docker compose up

2- Set AWS prod TimdexManagers role credentials (we will only be reading)

3- Perform a full refresh of dspace locally:

pipenv run tim --verbose reindex-source \
-s dspace \
s3://timdex-extract-prod-300442551476/dataset

The source dspace could be swapped out for any other source and this would work as-is. Note there is no need to think about dates, index names, or anything else; the CLI command handles all of this and ensures that a new index alias is available for that source with all current records.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

index_results = {"created": 0, "updated": 0, "errors": 0, "total": 0}

td = TIMDEXDataset(location=dataset_path)
td.load(current_records=True, source=source)
Contributor Author

Here is where we use the new load(current_records=True).

td.load(current_records=True, source=source)

# bulk index records
records_to_index = td.read_transformed_records_iter(action="index")
Contributor Author

@ghukill ghukill Jun 4, 2025

And here, we limit to only action="index" for bulk indexing. We do not need action="delete" records: if a delete is a record's current version, that record doesn't (and shouldn't) exist in Opensearch at all!

NOTE: this was actually what revealed the bug where non-current records could be yielded if filtering was applied, as it is here. This has been resolved.

@ghukill ghukill marked this pull request as ready for review June 4, 2025 17:26
Member

@JPrevost JPrevost left a comment

Functionally, I love it.

Some thoughts.

Outside the scope of this PR, I'd love a command or hint in the existing cmd help that shows available sources. I guessed that aspace was a source instead of archivesspace but I could have guessed the other way for example

Can we go further into not having to think and allow keywords like prod/stage/dev and have TIM understand the S3 path to the dataset? I suspect "yes, but we'd need to code that path in a way that would go against our not publishing s3 bucket names so it's a bad idea". But it would be cool :)

@ghukill
Contributor Author

ghukill commented Jun 4, 2025

@JPrevost

Functionally, I love it.

Whoo-hoo!

Outside the scope of this PR, I'd love a command or hint in the existing cmd help that shows available sources. I guessed that aspace was a source instead of archivesspace but I could have guessed the other way for example

I copy pasted the --source CLI argument from another command, and just now realized it's actually a click.Choice() with a list of valid options, which results in this:

$ pipenv run tim --verbose reindex-source --help

                                                                                                                                            
 Usage: tim reindex-source [OPTIONS] DATASET_PATH

 Perform a full refresh for a source in Opensearch for all current records.

 This CLI command performs the following:
   1. creates a new index for the source
   2. promotes this index as the primary for the source alias, and adds it to
      any other aliases passed (e.g. 'timdex')
   3. uses the TDA library to yield only current records from the parquet
      dataset for the source
   4. bulk indexes these records to the new Opensearch index

 The net effect is a full refresh for a source in Opensearch, ensuring only
 current, non-deleted versions of records are used from the parquet dataset.
                                                                                                                                            
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --source  -s  [alma|aspace|dspace|gismit|gisogm|libguides|jpal|researchd  TIMDEX Source to fully reindex in Opensearch. [required]    │
│                  atabases|whoas|zenodo]                                                                                                  │
│    --alias   -a  TEXT                                                        Alias to promote the index to in addition to the primary    │
│                                                                              alias. May be repeated to promote the index to multiple     │
│                                                                              aliases at once.                                            │
│    --help    -h                                                              Show this message and exit.                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Is that sufficient? I think the type=click.Choice(VALID_SOURCES) in this project is a nice model, and honestly we should do that more to support this kind of CLI --help affordance.
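A minimal illustration of that pattern (abbreviated, hypothetical source list and command; tim defines its own VALID_SOURCES):

```python
import click

# Abbreviated source list, purely illustrative; tim defines VALID_SOURCES itself.
VALID_SOURCES = ["alma", "aspace", "dspace"]

@click.command()
@click.option(
    "-s",
    "--source",
    required=True,
    type=click.Choice(VALID_SOURCES),
    help="TIMDEX Source to fully reindex in Opensearch.",
)
def reindex_source(source: str) -> None:
    # click.Choice rejects unknown values and renders the options in --help.
    click.echo(f"Reindexing: {source}")
```

click.Choice both validates input and, crucially for the --help affordance discussed here, renders the valid options inline in the generated help text.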

Can we go further into not having to think and allow keywords like prod/stage/dev and have TIM understand the S3 path to the dataset? I suspect "yes, but we'd need to code that path in a way that would go against our not publishing s3 bucket names so it's a bad idea". But it would be cool :)

Agreed.

This was a bit of an artifact of the previous architecture where the pipeline lambda told TIM where to find the precise JSON file(s) and/or TXT file(s) it would be indexing from. Now, TIM just needs to know where the dataset is and what the run_id is.

As such, we could evolve this further:

  • require an env var like TIMDEX_DATASET_LOCATION
  • have pipeline lambda stop passing the location

This doesn't really help your point about saying "prod" or "dev", but kind of explains why it's present in the CLI command in the first place.

I think it gets weird when TIM is aware of different environments. But, once the AWS credentials are set, I think it could be reasonable to say, "for this environment, where is the TIMDEX dataset?" We could store that in a secret, and we could likely use that throughout the ETL pipeline.

It's tangential to what you've raised here, but I think very much related.

@JPrevost
Member

JPrevost commented Jun 4, 2025

Is that sufficient? I think the type=click.Choice(VALID_SOURCES) in this project is a nice model, and honestly we should do that more to support this kind of CLI --help affordance.

Yes. Love it.

edit: I wanted to acknowledge that you are showing me that this already exists and is exactly what I was looking for :)

@ghukill
Contributor Author

ghukill commented Jun 4, 2025

Is that sufficient? I think the type=click.Choice(VALID_SOURCES) in this project is a nice model, and honestly we should do that more to support this kind of CLI --help affordance.

Yes. Love it.

edit: I wanted to acknowledge that you are showing me that this already exists and is exactly what I was looking for :)

FWIW, I genuinely had forgotten it was already built-in. Even small things would help here, like the PR notes suggesting first running --help: that would a) increase awareness it's there and b) surface its helpfulness during code review.

@jonavellecuerdo
Contributor

@ghukill The changes look good and I approved the PR, but I was wondering if you could make a small update in the commit message. Currently it reads:

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using only ETL dataset data.

I was a bit thrown off by the phrasing (highlighted in bold) 🤔 because technically any records in OpenSearch are always pulled from the ETL dataset, but what I gathered it intended to say was

...performs a full refresh of a source in OpenSearch by simply retrieving records from the ETL dataset without having to run the ETL pipeline.

While this is mentioned in the third sentence of the commit, the first sentence could be clarified a bit!

Also, in the event that we do have to reindex a source, is the method of executing the command the way it is outlined in the PR description -- running pipenv run tim reindex-source in a terminal with AWS credentials? 🤔

Why these changes are being introduced:

Now that the TDA library can yield only the current records for a source from
the parquet dataset, this opens up the opportunity for a fairly simple CLI command
that performs a full refresh of a source in Opensearch using data already present
in the ETL dataset; no extract or transform work is required.

This command could be helpful for development, where it would be handy to see
the current state of a source in a local Opensearch instance.  Or, it could be
helpful for disaster recovery where we want to refresh a source in Opensearch
without performing the extract + transform stages again.

How this addresses that need:
* Adds new reindex-source CLI command
  * create new index
  * promote and alias the new index
  * read current records for a source from dataset and bulk index them

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-495
@ghukill ghukill force-pushed the TIMX-495-full-reindex-cli-command branch from 52a5d5d to 75412fc on June 6, 2025 12:55
@coveralls

Pull Request Test Coverage Report for Build 15491041464

Details

  • 21 of 23 (91.3%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 97.436%

Changes Missing Coverage:

  File       | Covered Lines | Changed/Added Lines | %
  tim/cli.py | 21            | 23                  | 91.3%

Totals:

  Change from base Build 15427147015: -0.3%
  Covered Lines: 418
  Relevant Lines: 429

💛 - Coveralls

@ghukill
Contributor Author

ghukill commented Jun 6, 2025

I was a bit thrown off by the phrasing (highlighted in bold) 🤔 because technically any records in OpenSearch are always pulled from the ETL dataset, but what I gathered it intended to say was

...performs a full refresh of a source in OpenSearch by simply retrieving records from the ETL dataset without having to run the ETL pipeline.

While this is mentioned in the third sentence of the commit, the first sentence could be clarified a bit!

Happy to update: force-pushed an updated commit message. Hope the following reads a bit clearer:

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using data already present in the ETL dataset; no extract or transform work is required.

Per your other question,

Also, in the event that we do have to reindex a source, is the method of executing the command the way it is outlined in the PR description -- running pipenv run tim reindex-source in a terminal with AWS credentials? 🤔

Yep! It's invoked just like any other CLI command for this application. I think it remains to be seen how / when / why we use this functionality, but at the very least a developer could perform the operation in a pinch.

@JPrevost
Member

JPrevost commented Jun 6, 2025

One thing I'm interested in exploring with this type of command in place is the ability to copy the parquet dataset across environments and reload data into stage/dev/local without having to regenerate the data. This will come in particularly helpful if we end up storing text embeddings in the dataset as they will be expensive to regenerate. If we've already calculated them in production, and just want them in stage for some testing of something, s3 copy then reloading a source should be a huge time and resource savings. That may not be the intent of this command, but it starts to make it easier to do things like that.

@ghukill
Contributor Author

ghukill commented Jun 6, 2025

One thing I'm interested in exploring with this type of command in place is the ability to copy the parquet dataset across environments and reload data into stage/dev/local without having to regenerate the data. This will come in particularly helpful if we end up storing text embeddings in the dataset as they will be expensive to regenerate. If we've already calculated them in production, and just want them in stage for some testing of something, s3 copy then reloading a source should be a huge time and resource savings. That may not be the intent of this command, but it starts to make it easier to do things like that.

100% agree. I can think of hacky ways to copy data across AWS accounts, but it would be valuable to look at efficient, performant, and cost effective ways to do that. Then, yeah, this command in TIM would allow for quick indexing of data from that ETL dataset.

Getting back to @jonavellecuerdo's question about how this is invoked, I think there could be value in an AWS asset -- e.g. a Step Function -- that would bulk reindex all sources. That could happen in parallel without any issue, probably getting down to 20-25 minutes for a full refresh in a target Opensearch instance.
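That fan-out could be sketched locally with a thread pool driving the real CLI. The source list and dataset path below are placeholders, and the runner is injectable so nothing here actually shells out:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Illustrative values: abbreviated source list and a placeholder dataset path.
SOURCES = ["alma", "aspace", "dspace", "gismit"]
DATASET = "s3://example-bucket/dataset"

def build_command(source: str) -> list[str]:
    return ["pipenv", "run", "tim", "reindex-source", "-s", source, DATASET]

def reindex_all(run=lambda cmd: subprocess.run(cmd).returncode) -> dict[str, int]:
    # Sources are independent of each other, so they can refresh in parallel;
    # a Step Function would fan out the same way, one branch per source.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        codes = pool.map(lambda source: run(build_command(source)), SOURCES)
        return dict(zip(SOURCES, codes))
```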

@JPrevost
Member

JPrevost commented Jun 6, 2025

Imagine a Flask app in a lambda with an option to choose a dataset from any tier (dev/stage/prod) and reindex selectable sources into any tier that is at or below the dataset (prod can load into prod/stage/dev; stage can load into stage/dev; dev can load into dev). 🤔

Yes, I can over engineer anything if given sufficient time.

Contributor

@ehanson8 ehanson8 left a comment

Looks great! The functionality is very easy to understand.

) -> None:
"""Perform a full refresh for a source in Opensearch for all current records.

This CLI command performs the following:
Contributor

Perfect overview of the command!

@ghukill ghukill merged commit c0ce8ba into main Jun 6, 2025
3 checks passed