Skip to content

Conversation

@ghukill
Copy link
Collaborator

@ghukill ghukill commented Jan 26, 2026

Purpose and background context

This PR allows us to create embeddings for all, or a subset of, current records for a given source. Formerly the CLI command create-embeddings only supported creating embeddings for a particular ETL run_id.

While this likely won't get used much in the day-to-day ETL runs, it'll be quite helpful for out-of-band runs. For example, this functionality is getting used right now to generate embeddings for all sources in Dev1.

Additional updates in this PR:

  • pin our HuggingFace model downloads to specific revisions
  • pin to transformers<5.x after a major 5.0 release which had breaking changes for our model

How can a reviewer manually see the effects of these changes?

1- Set Dev1 credentials in terminal

2- Ensure that a model is downloaded (wipe /tmp/te-model if desired):

embeddings --verbose download-model \
--model-uri opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte \
--model-path /tmp/te-model

3- Create embeddins for current libguides source records:

embeddings --verbose create-embeddings \
--model-uri opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte \
--model-path /tmp/te-model \
--strategy full_record \
--dataset-location=s3://timdex-extract-dev-222053980223/dataset \
--source=libguides \
--record-limit 10
  • the last --record-limit flag is optional to keep the job smaller

The use of --source tells the create-embeddings command to retrieve current records for the source versus a single ETL run.

Includes new or updated dependencies?

YES

  • updated dependencies
  • pinned transformers<=5.0

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

The original ergonomics of the CLI command `create-embeddings` was to target
a specific ETL run with `--run-id`, optionally limiting by offset and limit.  This
is essentially all we need for normal day-to-day ETL usage.

But there are situations where we might want to create embeddings for a source,
and we'd really only want the current records of the source, outside of an ETL
context (e.g. new embeddings model or strategy, no need to re-ETL all the records).

How this addresses that need:

Add `--source` flag to the `create-embeddings` command, which is mutually exclusive
with the `--run-id` flag.  By using the `--source` flag, it's implying you want to
create embeddings for current records from that source.  It can still be limited by
offset and limit.

Side effects of this change:
* Ability to run ad-hoc embeddings creation for a source.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-351
Why these changes are being introduced:

The transformers library recently had a 5.0 release,
https://github.com/huggingface/transformers/releases/tag/v5.0.0.  This currently has breaking
changes with our current model opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte.

How this addresses that need:

Until Opensearch patches the model to work with transformers 5.x, we will pin to
transformers<=5.0 which is version 4.57.6 at this time.  Confirmed this works
both local bare metal and locally built Docker containers.

Side effects of this change:
* We will want to keep an eye on the model and transformers library
compatibility to remove this pin when possible.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-351
@ghukill ghukill force-pushed the USE-351-create-embeddings-for-source-current-records branch from 5df25bd to f93e025 Compare January 26, 2026 19:00
@ghukill ghukill marked this pull request as ready for review January 26, 2026 19:33
@ghukill ghukill requested a review from a team January 26, 2026 20:14
@ehanson8 ehanson8 self-assigned this Jan 27, 2026
Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected and code looks good, approved!

Comment on lines +270 to +273
if run_id and source:
raise click.UsageError("Use either '--run-id' or '--source', not both.")
if not run_id and not source:
raise click.UsageError(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent use of exceptions to force expected usage!

@ghukill ghukill merged commit db2679c into main Jan 27, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants