TIMX 495 - add new reindex-source CLI command #365
```python
index_results = {"created": 0, "updated": 0, "errors": 0, "total": 0}

td = TIMDEXDataset(location=dataset_path)
td.load(current_records=True, source=source)
```
Here is where we use the new `load(current_records=True)`.
```python
td.load(current_records=True, source=source)

# bulk index records
records_to_index = td.read_transformed_records_iter(action="index")
```
And here, we limit to only `action="index"` records for bulk indexing. We do not need `action="delete"` records, because if they are the current version, they don't/shouldn't exist in Opensearch!
NOTE: this was actually what revealed the bug where non-current records could be yielded when filtering was applied, as it is here. This has been resolved.
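The selection logic described in this comment can be sketched as a small, self-contained simulation (field names and helpers here are hypothetical, not the actual TDA library API): only the latest version of each record is kept, and current versions whose action is "delete" are excluded from bulk indexing.

```python
# Illustrative only: why bulk indexing uses action="index" records alone.
records = [
    {"timdex_record_id": "alma:1", "run": 1, "action": "index"},
    {"timdex_record_id": "alma:1", "run": 2, "action": "index"},   # current
    {"timdex_record_id": "alma:2", "run": 1, "action": "index"},
    {"timdex_record_id": "alma:2", "run": 2, "action": "delete"},  # current, deleted
]

def current_records(rows):
    """Keep only the most recent version of each record."""
    latest = {}
    for row in rows:
        rid = row["timdex_record_id"]
        if rid not in latest or row["run"] > latest[rid]["run"]:
            latest[rid] = row
    return list(latest.values())

def records_to_index(rows):
    """Current records with action='index'; deletions never reach Opensearch."""
    return [r for r in current_records(rows) if r["action"] == "index"]

print([r["timdex_record_id"] for r in records_to_index(records)])  # ['alma:1']
```

The deleted record `alma:2` simply never appears in the new index, so no explicit delete action is needed.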
JPrevost left a comment:
Functionally, I love it.
Some thoughts.
Outside the scope of this PR, I'd love a command, or a hint in the existing command help, that shows available sources. I guessed that aspace was a source instead of archivesspace, but I could have guessed the other way, for example.
Can we go further into not having to think and allow keywords like prod/stage/dev and have TIM understand the S3 path to the dataset? I suspect "yes, but we'd need to code that path in a way that would go against our not publishing s3 bucket names so it's a bad idea". But it would be cool :)
Whoo-hoo! I copy-pasted the output of `$ pipenv run tim --verbose reindex-source --help`:
```shell
Usage: tim reindex-source [OPTIONS] DATASET_PATH

  Perform a full refresh for a source in Opensearch for all current records.

  This CLI command performs the following:
    1. creates a new index for the source
    2. promotes this index as the primary for the source alias, and added to
       any other aliases passed (e.g. 'timdex')
    3. uses the TDA library to yield only current records from the parquet
       dataset for the source
    4. bulk index these records to the new Opensearch index

  The net effect is a full refresh for a source in Opensearch, ensuring only
  current, non-deleted versions of records are used from the parquet dataset.

Options:
  -s, --source [alma|aspace|dspace|gismit|gisogm|libguides|jpal|
                researchdatabases|whoas|zenodo]
                        TIMDEX Source to fully reindex in Opensearch.
                        [required]
  -a, --alias TEXT      Alias to promote the index to in addition to the
                        primary alias. May be repeated to promote the index to
                        multiple aliases at once.
  -h, --help            Show this message and exit.
```

Is that sufficient? I think the
Agreed. This was a bit of an artifact of the previous architecture, where the pipeline lambda told TIM where to find the precise JSON file(s) and/or TXT file(s) it would be indexing from. Now, TIM just needs to know where the dataset is and what the source is. As such, we could evolve this further:
This doesn't really help your point about saying "prod" or "dev", but kind of explains why it's present in the CLI command in the first place. I think it gets weird when TIM is aware of different environments. But, once the AWS credentials are set, I think it could be reasonable to say, "for this environment, where is the TIMDEX dataset?" We could store that in a secret, and we could likely use that throughout the ETL pipeline. It's tangential to what you've raised here, but I think very much related.
Yes. Love it. edit: I wanted to acknowledge that you are showing me that this already exists and is exactly what I was looking for :)
FWIW, I genuinely had forgotten it was already built-in. Even small things like if the PR notes suggested first running the
@ghukill The changes look good and I approved the PR, but I was wondering if you could make a small update to the commit message. Currently it reads:
I was a bit thrown off by the phrasing (highlighted in bold) 🤔 because technically any records in OpenSearch are always pulled from the ETL dataset, but what I gathered it intended to say was
While this is mentioned in the third sentence of the commit, the first sentence could be clarified a bit! Also, in the event that we do have to reindex a source, is the method of executing the command the way it is outlined in the PR description -- running
Why these changes are being introduced:

Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using data already present in the ETL dataset; no extract or transform work is required.

This command could be helpful for development, where it would be handy to see the current state of a source in a local Opensearch instance. Or, it could be helpful for disaster recovery, where we want to refresh a source in Opensearch without performing the extract + transform stages again.

How this addresses that need:
* Adds new reindex-source CLI command:
  * create new index
  * promote and alias the new index
  * read current records for a source from the dataset and bulk index them

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-495
Force-pushed 52a5d5d to 75412fc.
Pull Request Test Coverage Report for Build 15491041464
💛 - Coveralls
Happy to update - force-pushed an updated commit message. Hope the following reads a bit clearer:
Per your other question:
Yep! It's invoked just like any other CLI command for this application. I think it remains to be seen how / when / why we use this functionality, but at the very least a developer could perform the operation in a pinch.
One thing I'm interested in exploring with this type of command in place is the ability to copy the parquet dataset across environments and reload data into stage/dev/local without having to regenerate the data. This will be particularly helpful if we end up storing text embeddings in the dataset, as they will be expensive to regenerate. If we've already calculated them in production, and just want them in stage for some testing of something, an S3 copy and then reloading a source should be a huge time and resource savings. That may not be the intent of this command, but it starts to make it easier to do things like that.
100% agree. I can think of hacky ways to copy data across AWS accounts, but it would be valuable to look at efficient, performant, and cost-effective ways to do that. Then, yeah, this command in TIM would allow for quick indexing of data from that ETL dataset. Getting back to @jonavellecuerdo's question about how this is invoked, I think there could be value in an AWS asset -- e.g. a StepFunction -- that would bulk reindex all sources. That could happen in parallel without any issue, probably getting down to 20-25 minutes for a full refresh in a target Opensearch instance.
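A rough sketch of that parallel fan-out in plain Python (the `reindex_source` stub is a stand-in for the real TIM command, and a StepFunction would orchestrate this differently):

```python
from concurrent.futures import ThreadPoolExecutor

# Source names taken from the CLI help output above.
SOURCES = ["alma", "aspace", "dspace", "gismit", "gisogm",
           "libguides", "jpal", "researchdatabases", "whoas", "zenodo"]

def reindex_source(source: str) -> str:
    # Stand-in for: create index, promote aliases, bulk index current records.
    return f"{source}: reindexed"

# Each source writes to its own new index, so all can run concurrently.
with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
    results = list(pool.map(reindex_source, SOURCES))
```

Because every source gets a fresh index of its own, there is no shared write target and the runs are safely independent.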
Imagine a Flask app in a lambda with an option to choose a dataset from any tier (dev/stage/prod) and reindex selectable sources into any tier that is at or below the dataset (prod can load into prod/stage/dev; stage can load into stage/dev; dev can load into dev). 🤔 Yes, I can over-engineer anything if given sufficient time.
ehanson8 left a comment:
Looks great! Functionality is very easy to understand
```python
) -> None:
    """Perform a full refresh for a source in Opensearch for all current records.

    This CLI command performs the following:
```
Perfect overview of the command!
Purpose and background context
Now that the TDA library can yield only the current records for a source from the parquet dataset, this opens up the opportunity for a fairly simple CLI command that performs a full refresh of a source in Opensearch using only ETL dataset data.
This command could be helpful for development, where it would be handy to see the current state of a source in a local Opensearch instance. Or, it could be helpful for disaster recovery where we want to refresh a source in Opensearch without performing the extract + transform stages again.
How this addresses that need:
* Adds new `reindex-source` CLI command

This CLI command is not really doing anything novel, just chaining pre-existing functionality and running it in a particular order.
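As a rough illustration of that chaining (all names here are invented; TIM's real index-naming and alias-promotion internals will differ), the create-and-promote sequence might be simulated like this:

```python
from datetime import datetime, timezone

aliases: dict[str, str] = {}  # alias name -> index it currently points at

def create_index(source: str) -> str:
    """Invent a new timestamped index name for the source."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    return f"{source}-{timestamp}"

def promote_index(index: str, source: str, extra_aliases: list[str]) -> None:
    """Point the source alias (and any extras) at the new index."""
    for alias in [source, *extra_aliases]:
        aliases[alias] = index  # repointing an alias demotes the old index

new_index = create_index("dspace")
promote_index(new_index, "dspace", extra_aliases=["timdex"])
```

After promotion, bulk indexing the current records into `new_index` completes the refresh; readers querying the alias see the new data without needing to know the index name.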
How can a reviewer manually see the effects of these changes?

1. Follow the instructions to get Opensearch running locally, e.g. set the `.env` file and run with docker compose
2. Set AWS prod `TimdexManagers` role credentials (we will only be reading)
3. Perform a full refresh of `dspace` locally

The source `dspace` could be switched out for any other source and work as-is. Note there is no need to think about dates, index names, or anything else; the CLI command handles all of this and ensures that a new index alias is available for that source with all current records.

Includes new or updated dependencies?
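For step 1, a minimal local Opensearch setup might look like the following docker compose sketch (illustrative only; the repository's own docker compose file and documented `.env` values are authoritative):

```yaml
# Hypothetical single-node Opensearch for local development.
# Security is disabled here purely for local convenience.
services:
  opensearch:
    image: opensearchproject/opensearch:2
    environment:
      - discovery.type=single-node
      - plugins.security.disabled=true
    ports:
      - "9200:9200"
```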
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)