USE 355 - index current embeddings for a source #377
Conversation
Why these changes are being introduced:

The first pass at bulk updating pre-existing documents, encapsulated in the `bulk-update-embeddings` command, required passing a `--run-id` to target a specific ETL run. This aligns with the most common use case of indexing embeddings within an ETL run. However, we now have use cases for indexing all current embeddings for a given source into Opensearch. These current embeddings may span multiple ETL runs.

How this addresses that need:

Updates the `bulk-update-embeddings` CLI command to require only `--source`, defaulting to retrieving all current embeddings for that source. This logic is identical to what `reindex-source` was already doing, but is decoupled from re-indexing the documents themselves, which is not always required.

While working on this, it was decided that raising an exception for a missing document when performing updates is not ideal. Some sources have indexing issues, and we have historically skipped those records. When we get to bulk updates, it's possible that we have embeddings for documents that were never indexed; we will now log and skip them in a similar fashion.

Side effects of this change:
* CLI supports ad-hoc indexing of all current embeddings for a source

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
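For context, a minimal sketch of what the option change could look like, assuming a Click-based CLI. The command body, messages, and the `__main__` guard are illustrative only, not the project's actual implementation; only the command name, `--source`, and `--run-id` come from this PR.

```python
import click


@click.command()
@click.option("--source", required=True, help="TIMDEX source to update, e.g. 'libguides'.")
@click.option(
    "--run-id",
    default=None,
    help="Optional ETL run id; when omitted, all current embeddings for the source are targeted.",
)
def bulk_update_embeddings(source: str, run_id: str | None) -> None:
    """Bulk update pre-existing documents with embeddings for a source."""
    if run_id:
        scope = f"embeddings from ETL run {run_id}"
    else:
        # New default: all current embeddings for the source, which may span
        # multiple ETL runs (mirrors what reindex-source already retrieved).
        scope = f"all current embeddings for source '{source}'"
    click.echo(f"Updating {scope}")


if __name__ == "__main__":
    bulk_update_embeddings()
```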
Pull Request Test Coverage Report for Build 21405528265
💛 - Coveralls
Why these changes are being introduced:

When bulk updating documents, the result can be "noop", which means no operation was performed. This can happen if the update would have zero effect.

How this addresses that need:

Handle `result=noop` during bulk updates and set a new "skipped" result counter in the results.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
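A minimal sketch of how a `noop` bulk response can be tallied into a "skipped" counter, assuming the opensearch-py bulk helpers. The function and action shape are illustrative; the counter names mirror the results shown later in this PR but this is not the repository's actual code.

```python
from opensearchpy import OpenSearch, helpers


def bulk_update(client: OpenSearch, actions: list[dict]) -> dict:
    """Tally bulk update results, counting 'noop' responses as skipped."""
    results = {"updated": 0, "skipped": 0, "errors": 0, "total": 0}
    # raise_on_error=False so a failed item (e.g. a missing document) is
    # counted rather than aborting the whole updating pass.
    for ok, item in helpers.streaming_bulk(client, actions, raise_on_error=False):
        results["total"] += 1
        response = item.get("update", {})
        if not ok:
            results["errors"] += 1
        elif response.get("result") == "noop":
            # Opensearch reports 'noop' when the update would not change the document.
            results["skipped"] += 1
        else:
            results["updated"] += 1
    return results
```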
@ghukill Can you clarify what the cause of those 16 errors was? 🤔 Are they records that were deleted since the last time we created embeddings for the source?
Those errors happen each time we do a "full" run for this source.

But in the context of this PR, I think it revealed that for some reason we may occasionally have documents missing in Opensearch, and we should not fail an entire updating pass -- e.g. for embeddings -- just because a subset don't have anything to update.

I fully acknowledge this may be at odds with discussions or decisions when that updating logic was written, when it felt correct to immediately throw an error for a missing document, but this felt like a good example of where that behavior is not ideal.

In short: it feels like "updating" work, e.g. embeddings, should update docs that exist, but not fail entirely if some documents don't exist in Opensearch.
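This log-and-skip behavior is what the PR settles on. A minimal, illustrative sketch of how a missing-document bulk item might be detected: the 404 status and `document_missing_exception` type are the standard Opensearch bulk-error fields for updating an absent document, but the helper name and sample item below are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def is_missing_document_error(item: dict) -> bool:
    """True when a bulk update item failed because the target document is absent."""
    response = item.get("update", {})
    error = response.get("error") or {}
    return response.get("status") == 404 or error.get("type") == "document_missing_exception"


if __name__ == "__main__":
    # Hypothetical failed bulk item: log and skip instead of raising.
    item = {
        "update": {
            "_id": "doc-123",
            "status": 404,
            "error": {"type": "document_missing_exception"},
        }
    }
    if is_missing_document_error(item):
        logger.warning("Skipping update for missing document: %s", item["update"]["_id"])
```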
Purpose and background context
This PR allows us to use TIM to bulk update pre-existing documents for a given source with all current embeddings for that source.
As noted in both the ticket and commits, the first pass focused only on indexing a single ETL run, which remains the common path in the ETL StepFunction. But until we're relying on the ETL StepFunction entirely, there is value in having the ability to update a source in Opensearch with current embeddings for that source. More specifically, to do so without fully reindexing the source as `reindex-source` would do.

How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS credentials
2- Run a bulk update for a small source like `libguides`, using the `bulk-update-embeddings` command with `--source libguides`.

When complete, observe the following results:

`{"updated": 0, "skipped": 279, "errors": 0, "total": 279}`

Because embeddings already existed, and were identical to the ones used for updating, they are all skipped.
You could instead perform a full re-index of a source, e.g. `gismit`, with `reindex-source`.

Results:

`{"index": {"created": 2043, "updated": 0, "errors": 16, "total": 2059}, "update": {"updated": 2043, "errors": 16, "total": 2059, "skipped": 0}}`

This is interesting for a couple of reasons:
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-355
Code review