USE 355 - index current embeddings for a source #377
Conversation
Why these changes are being introduced:

The first pass at bulk updating pre-existing documents, encapsulated in the `bulk-update-embeddings` command, required passing a `--run-id` to target a specific ETL run. This aligns with the most common use case of indexing embeddings within an ETL run. However, we now have use cases for indexing all current embeddings for a given source into Opensearch. These current embeddings may span multiple ETL runs.

How this addresses that need:

Updates the `bulk-update-embeddings` CLI command to require only `--source`, defaulting to retrieving all current embeddings for that source. This logic is identical to what `reindex-source` was already doing, but is decoupled from re-indexing the documents themselves, which is not always required.

While working on this, it was decided that raising an exception for a missing document when performing updates is not ideal. Some sources have indexing issues, and we have historically skipped those records. When we get to bulk updates, it's possible that we have embeddings for documents that were never indexed; we will now log and skip them in a similar fashion.

Side effects of this change:
* CLI supports ad-hoc indexing of all current embeddings for a source

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
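For context, a minimal sketch of what the option change could look like, assuming a Click-based CLI. The command body, messages, and the `__main__` guard are illustrative only, not the project's actual implementation; only the command name, `--source`, and `--run-id` come from this PR.

```python
import click


@click.command()
@click.option("--source", required=True, help="TIMDEX source to update, e.g. 'libguides'.")
@click.option(
    "--run-id",
    default=None,
    help="Optional ETL run id; when omitted, all current embeddings for the source are targeted.",
)
def bulk_update_embeddings(source: str, run_id: str | None) -> None:
    """Bulk update pre-existing documents with embeddings for a source."""
    if run_id:
        scope = f"embeddings from ETL run {run_id}"
    else:
        # New default: all current embeddings for the source, which may span
        # multiple ETL runs (mirrors what reindex-source already retrieved).
        scope = f"all current embeddings for source '{source}'"
    click.echo(f"Updating {scope}")


if __name__ == "__main__":
    bulk_update_embeddings()
```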
Pull Request Test Coverage Report for Build 21405528265
💛 - Coveralls
Why these changes are being introduced:

When bulk updating documents, the result can be "noop", which means no operation was performed. This can happen if the update would have zero effect.

How this addresses that need:

Handle `result=noop` during bulk updates and set a new "skipped" result counter in the results.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
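A minimal sketch of how a `noop` bulk response can be tallied into a "skipped" counter, assuming the opensearch-py bulk helpers. The function and action shape are illustrative; the counter names mirror the results shown later in this PR but this is not the repository's actual code.

```python
from opensearchpy import OpenSearch, helpers


def bulk_update(client: OpenSearch, actions: list[dict]) -> dict:
    """Tally bulk update results, counting 'noop' responses as skipped."""
    results = {"updated": 0, "skipped": 0, "errors": 0, "total": 0}
    # raise_on_error=False so a failed item (e.g. a missing document) is
    # counted rather than aborting the whole updating pass.
    for ok, item in helpers.streaming_bulk(client, actions, raise_on_error=False):
        results["total"] += 1
        response = item.get("update", {})
        if not ok:
            results["errors"] += 1
        elif response.get("result") == "noop":
            # Opensearch reports 'noop' when the update would not change the document.
            results["skipped"] += 1
        else:
            results["updated"] += 1
    return results
```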
@ghukill Can you clarify what the cause of those 16 errors was? 🤔 Are they records that were deleted since the last time we created embeddings for the source?
Those errors happen each time we do a "full" run for this source.

But in the context of this PR, I think it revealed that for some reason we may occasionally have documents missing in Opensearch, and we should not fail an entire updating pass -- e.g. for embeddings -- just because a subset don't have anything to update.

I fully acknowledge this may be at odds with discussions or decisions when that updating logic was written, when it felt correct to immediately throw an error for a missing document, but this felt like a good example of where that behavior is not ideal.

In short: it feels like "updating" work, e.g. embeddings, should update docs that exist, but not fail entirely if some documents don't exist in Opensearch.
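This log-and-skip behavior is what the PR settles on. A minimal, illustrative sketch of how a missing-document bulk item might be detected: the 404 status and `document_missing_exception` type are the standard Opensearch bulk-error fields for updating an absent document, but the helper name and sample item below are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def is_missing_document_error(item: dict) -> bool:
    """True when a bulk update item failed because the target document is absent."""
    response = item.get("update", {})
    error = response.get("error") or {}
    return response.get("status") == 404 or error.get("type") == "document_missing_exception"


if __name__ == "__main__":
    # Hypothetical failed bulk item: log and skip instead of raising.
    item = {
        "update": {
            "_id": "doc-123",
            "status": 404,
            "error": {"type": "document_missing_exception"},
        }
    }
    if is_missing_document_error(item):
        logger.warning("Skipping update for missing document: %s", item["update"]["_id"])
```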
Purpose and background context
This PR allows us to use TIM to bulk update pre-existing documents for a given source with all current embeddings for that source.
As noted in both the ticket and commits, the first pass focused only on indexing a single ETL run, which remains the common path in the ETL StepFunction. But until we're relying on the ETL StepFunction entirely, there is value in having the ability to update a source in Opensearch with current embeddings for that source. More specifically, to do so without fully reindexing the source as `reindex-source` would do.

How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS credentials
2- Run a bulk update for a small source like `libguides`, using the `bulk-update-embeddings` command with `--source libguides`.

When complete, observe the following results:

`{"updated": 0, "skipped": 279, "errors": 0, "total": 279}`

Because embeddings already existed, and were identical to the ones used for updating, they are all skipped.
You could instead perform a full re-index of a source, e.g. `gismit`, with `reindex-source`.

Results:

`{"index": {"created": 2043, "updated": 0, "errors": 16, "total": 2059}, "update": {"updated": 2043, "errors": 16, "total": 2059, "skipped": 0}}`

This is interesting for a couple of reasons:
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-355
Code review