Conversation

Contributor

@ghukill ghukill commented Jun 4, 2025

Purpose and background context

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using only ETL dataset data.

This command could be helpful for development, where it would be handy to see the current state of a source in a local Opensearch instance. Or, it could be helpful for disaster recovery where we want to refresh a source in Opensearch without performing the extract + transform stages again.

How this addresses that need:

  • Adds new reindex-source CLI command
    • create new index
    • promote and alias the new index
    • read current records for a source from dataset and bulk index them

This CLI command is not really doing anything novel; it just chains pre-existing functionality in a particular order.
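The chain of steps above can be sketched as follows. This is a minimal, runnable illustration only: every helper below is a stand-in for the pre-existing tim / TDA functionality being chained, not the real API.

```python
# Minimal runnable sketch of the reindex-source flow. Every helper below is a
# stand-in for pre-existing tim / TDA functionality, not the real API.

def create_index(source: str) -> str:
    # In tim, the new index gets a timestamped name; "-new" is a placeholder.
    return f"{source}-new"

def promote_index(index: str, extra_aliases: list[str]) -> list[str]:
    # The new index becomes primary for the source alias, plus any extra aliases.
    source_alias = index.rsplit("-", 1)[0]
    return [source_alias, *extra_aliases]

def current_records(source: str):
    # Stand-in for TIMDEXDataset.load(current_records=True, source=source)
    # followed by read_transformed_records_iter(action="index").
    yield from ({"timdex_record_id": f"{source}:{i}"} for i in range(3))

def reindex_source(source: str, extra_aliases: list[str]) -> dict:
    index = create_index(source)                       # 1. create new index
    aliases = promote_index(index, extra_aliases)      # 2. promote and alias it
    created = sum(1 for _ in current_records(source))  # 3+4. read current records
    return {"index": index, "aliases": aliases, "created": created}

result = reindex_source("dspace", ["timdex"])
```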

How can a reviewer manually see the effects of these changes?

1- Follow the instructions to get Opensearch running locally, e.g.

Set .env file:

OPENSEARCH_INITIAL_ADMIN_PASSWORD=SuperSecret42!

Run with docker compose:

docker pull opensearchproject/opensearch:latest
docker pull opensearchproject/opensearch-dashboards:latest
docker compose up

2- Set AWS prod TimdexManagers role credentials (we will only be reading)

3- Perform a full refresh of dspace locally:

pipenv run tim --verbose reindex-source \
-s dspace \
s3://timdex-extract-prod-300442551476/dataset

The source dspace could be swapped out for any other source and this would work as-is. Note there is no need to think about dates, index names, or anything else; the CLI command handles all of this and ensures that a new index alias is available for that source with all current records.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

index_results = {"created": 0, "updated": 0, "errors": 0, "total": 0}

td = TIMDEXDataset(location=dataset_path)
td.load(current_records=True, source=source)
Contributor Author

Here is where we use the new load(current_records=True).

td.load(current_records=True, source=source)

# bulk index records
records_to_index = td.read_transformed_records_iter(action="index")
Contributor Author

@ghukill ghukill Jun 4, 2025

And here, we limit to only action="index" for bulk indexing. We do not need action="delete" records: if a delete is a record's current version, that record doesn't (and shouldn't) exist in Opensearch at all!

NOTE: this was actually what revealed the bug where non-current records could be yielded if filtering was applied, as it is here. This has been resolved.

@ghukill ghukill marked this pull request as ready for review June 4, 2025 17:26
Member

@JPrevost JPrevost left a comment

Functionally, I love it.

Some thoughts.

Outside the scope of this PR, I'd love a command or hint in the existing cmd help that shows available sources. I guessed that aspace was a source instead of archivesspace but I could have guessed the other way for example

Can we go further into not having to think and allow keywords like prod/stage/dev and have TIM understand the S3 path to the dataset? I suspect "yes, but we'd need to code that path in a way that would go against our not publishing s3 bucket names so it's a bad idea". But it would be cool :)

@ghukill
Contributor Author

ghukill commented Jun 4, 2025

@JPrevost

Functionally, I love it.

Whoo-hoo!

Outside the scope of this PR, I'd love a command or hint in the existing cmd help that shows available sources. I guessed that aspace was a source instead of archivesspace but I could have guessed the other way for example

I copy pasted the --source CLI argument from another command, and just now realized it's actually a click.Choice() with a list of valid options, which results in this:

$ pipenv run tim --verbose reindex-source --help

                                                                                                                                            
 Usage: tim reindex-source [OPTIONS] DATASET_PATH

 Perform a full refresh for a source in Opensearch for all current records.

 This CLI command performs the following:
   1. creates a new index for the source
   2. promotes this index as the primary for the source alias, and adds it to
      any other aliases passed (e.g. 'timdex')
   3. uses the TDA library to yield only current records from the parquet
      dataset for the source
   4. bulk indexes these records to the new Opensearch index

 The net effect is a full refresh for a source in Opensearch, ensuring only
 current, non-deleted versions of records are used from the parquet dataset.
                                                                                                                                            
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --source  -s  [alma|aspace|dspace|gismit|gisogm|libguides|jpal|researchd  TIMDEX Source to fully reindex in Opensearch. [required]    │
│                  atabases|whoas|zenodo]                                                                                                  │
│    --alias   -a  TEXT                                                        Alias to promote the index to in addition to the primary    │
│                                                                              alias. May be repeated to promote the index to multiple     │
│                                                                              aliases at once.                                            │
│    --help    -h                                                              Show this message and exit.                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Is that sufficient? I think the type=click.Choice(VALID_SOURCES) in this project is a nice model, and honestly we should do that more to support this kind of CLI --help affordance.
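A minimal illustration of that pattern (abbreviated, hypothetical source list and command; tim defines its own VALID_SOURCES):

```python
import click

# Abbreviated source list, purely illustrative; tim defines VALID_SOURCES itself.
VALID_SOURCES = ["alma", "aspace", "dspace"]

@click.command()
@click.option(
    "-s",
    "--source",
    required=True,
    type=click.Choice(VALID_SOURCES),
    help="TIMDEX Source to fully reindex in Opensearch.",
)
def reindex_source(source: str) -> None:
    # click.Choice rejects unknown values and renders the options in --help.
    click.echo(f"Reindexing: {source}")
```

click.Choice both validates input and, crucially for the --help affordance discussed here, renders the valid options inline in the generated help text.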

Can we go further into not having to think and allow keywords like prod/stage/dev and have TIM understand the S3 path to the dataset? I suspect "yes, but we'd need to code that path in a way that would go against our not publishing s3 bucket names so it's a bad idea". But it would be cool :)

Agreed.

This was a bit of an artifact of the previous architecture where the pipeline lambda told TIM where to find the precise JSON file(s) and/or TXT file(s) it would be indexing from. Now, TIM just needs to know where the dataset is and what the run_id is.

As such, we could evolve this further:

  • require an env var like TIMDEX_DATASET_LOCATION
  • have pipeline lambda stop passing the location

This doesn't really help your point about saying "prod" or "dev", but kind of explains why it's present in the CLI command in the first place.

I think it gets weird when TIM is aware of different environments. But, once the AWS credentials are set, I think it could be reasonable to say, "for this environment, where is the TIMDEX dataset?" We could store that in a secret, and we could likely use that throughout the ETL pipeline.

It's tangential to what you've raised here, but I think very much related.

@JPrevost
Member

JPrevost commented Jun 4, 2025

Is that sufficient? I think the type=click.Choice(VALID_SOURCES) in this project is a nice model, and honestly we should do that more to support this kind of CLI --help affordance.

Yes. Love it.

edit: I wanted to acknowledge that you are showing me that this already exists and is exactly what I was looking for :)

@ghukill
Contributor Author

ghukill commented Jun 4, 2025

Is that sufficient? I think the type=click.Choice(VALID_SOURCES) in this project is a nice model, and honestly we should do that more to support this kind of CLI --help affordance.

Yes. Love it.

edit: I wanted to acknowledge that you are showing me that this already exists and is exactly what I was looking for :)

FWIW, I genuinely had forgotten it was already built-in. Even small things would help here, like the PR notes suggesting first running --help: that would a) increase awareness it's there and b) surface its helpfulness during code review.

@jonavellecuerdo
Contributor

@ghukill The changes look good and I approved the PR, but I was wondering if you could make a small update in the commit message. Currently it reads:

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using only ETL dataset data.

I was a bit thrown off by the phrasing (highlighted in bold) 🤔 because technically any records in OpenSearch are always pulled from the ETL dataset, but what I gathered it intended to say was

...performs a full refresh of a source in OpenSearch by simply retrieving records from the ETL dataset without having to run the ETL pipeline.

While this is mentioned in the third sentence of the commit, the first sentence could be clarified a bit!

Also, in the event that we do have to reindex a source, is the method of executing the command the way it is outlined in the PR description -- running pipenv run tim reindex-source in a terminal with AWS credentials? 🤔

Why these changes are being introduced:

Now that the TDA library can yield only the current records for a source from
the parquet dataset, this opens up the opportunity for a fairly simple CLI command
that performs a full refresh of a source in Opensearch using data already present
in the ETL dataset; no extract or transform work is required.

This command could be helpful for development, where it would be handy to see
the current state of a source in a local Opensearch instance.  Or, it could be
helpful for disaster recovery where we want to refresh a source in Opensearch
without performing the extract + transform stages again.

How this addresses that need:
* Adds new reindex-source CLI command
  * create new index
  * promote and alias the new index
  * read current records for a source from dataset and bulk index them

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-495
@ghukill ghukill force-pushed the TIMX-495-full-reindex-cli-command branch from 52a5d5d to 75412fc on June 6, 2025 12:55
@coveralls

Pull Request Test Coverage Report for Build 15491041464

Details

  • 21 of 23 (91.3%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 97.436%

Changes Missing Coverage:

  File       | Covered Lines | Changed/Added Lines | %
  tim/cli.py | 21            | 23                  | 91.3%

Totals:

  Change from base Build 15427147015: -0.3%
  Covered Lines: 418
  Relevant Lines: 429

💛 - Coveralls

@ghukill
Contributor Author

ghukill commented Jun 6, 2025

I was a bit thrown off by the phrasing (highlighted in bold) 🤔 because technically any records in OpenSearch are always pulled from the ETL dataset, but what I gathered it intended to say was

...performs a full refresh of a source in OpenSearch by simply retrieving records from the ETL dataset without having to run the ETL pipeline.

While this is mentioned in the third sentence of the commit, the first sentence could be clarified a bit!

Happy to update: force-pushed an updated commit message. Hope the following reads a bit clearer:

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using data already present in the ETL dataset; no extract or transform work is required.

Per your other question,

Also, in the event that we do have to reindex a source, is the method of executing the command the way it is outlined in the PR description -- running pipenv run tim reindex-source in a terminal with AWS credentials? 🤔

Yep! It's invoked just like any other CLI command for this application. I think it remains to be seen how / when / why we use this functionality, but at the very least a developer could perform the operation in a pinch.

@JPrevost
Member

JPrevost commented Jun 6, 2025

One thing I'm interested in exploring with this type of command in place is the ability to copy the parquet dataset across environments and reload data into stage/dev/local without having to regenerate the data. This will come in particularly helpful if we end up storing text embeddings in the dataset as they will be expensive to regenerate. If we've already calculated them in production, and just want them in stage for some testing of something, s3 copy then reloading a source should be a huge time and resource savings. That may not be the intent of this command, but it starts to make it easier to do things like that.

@ghukill
Contributor Author

ghukill commented Jun 6, 2025

One thing I'm interested in exploring with this type of command in place is the ability to copy the parquet dataset across environments and reload data into stage/dev/local without having to regenerate the data. This will come in particularly helpful if we end up storing text embeddings in the dataset as they will be expensive to regenerate. If we've already calculated them in production, and just want them in stage for some testing of something, s3 copy then reloading a source should be a huge time and resource savings. That may not be the intent of this command, but it starts to make it easier to do things like that.

100% agree. I can think of hacky ways to copy data across AWS accounts, but it would be valuable to look at efficient, performant, and cost effective ways to do that. Then, yeah, this command in TIM would allow for quick indexing of data from that ETL dataset.

Getting back to @jonavellecuerdo's question about how this is invoked, I think there could be value in an AWS asset -- e.g. a Step Function -- that would bulk reindex all sources. That could happen in parallel without any issue, probably getting down to 20-25 minutes for a full refresh in a target Opensearch instance.
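That fan-out could be sketched locally with a thread pool driving the real CLI. The source list and dataset path below are placeholders, and the runner is injectable so nothing here actually shells out:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Illustrative values: abbreviated source list and a placeholder dataset path.
SOURCES = ["alma", "aspace", "dspace", "gismit"]
DATASET = "s3://example-bucket/dataset"

def build_command(source: str) -> list[str]:
    return ["pipenv", "run", "tim", "reindex-source", "-s", source, DATASET]

def reindex_all(run=lambda cmd: subprocess.run(cmd).returncode) -> dict[str, int]:
    # Sources are independent of each other, so they can refresh in parallel;
    # a Step Function would fan out the same way, one branch per source.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        codes = pool.map(lambda source: run(build_command(source)), SOURCES)
        return dict(zip(SOURCES, codes))
```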

@JPrevost
Member

JPrevost commented Jun 6, 2025

Imagine a Flask app in a lambda with an option to choose a dataset from any tier (dev/stage/prod) and reindex selectable sources into any tier that is at or below the dataset (prod can load into prod/stage/dev; stage can load into stage/dev; dev can load into dev). 🤔

Yes, I can over engineer anything if given sufficient time.

Contributor

@ehanson8 ehanson8 left a comment

Looks great! The functionality is very easy to understand.

) -> None:
"""Perform a full refresh for a source in Opensearch for all current records.

This CLI command performs the following:
Contributor

Perfect overview of the command!

@ghukill ghukill merged commit c0ce8ba into main Jun 6, 2025
3 checks passed