Support custom embedding callables and make embedding checkpointing and CSV saves more resilient #66

Merged
hemanth-asirvatham merged 1 commit into main from add-custom-embeddings-fn-support
Feb 26, 2026

@hemanth-asirvatham
Collaborator

Motivation

  • Allow callers to supply custom per-text or bulk embedding callables so embedding behaviour can be overridden in nested tasks and during testing.
  • Improve robustness when persisting large embedding checkpoints and large CSV outputs so long runs can resume and partial writes do not leave stale artifacts.
  • Route embedding overrides cleanly through high-level APIs into the tasks that actually compute embeddings without leaking them into response calls.

Description

  • Extended public APIs (seed, ideate, deduplicate, merge) to accept embedding_fn and get_all_embeddings_fn and propagate these into nested tasks where appropriate.
  • Updated Seed, Deduplicate, and Merge task implementations to capture embedding overrides from kwargs and forward them into get_all_embeddings calls while ensuring embedding overrides are not forwarded to get_all_responses calls.
  • Reworked get_all_embeddings in gabriel.utils.openai_utils to support a custom per-text embedding_fn or a full-driver get_all_embeddings_fn, to normalize diverse return shapes, to validate custom callable signatures, and to continue using internal batching/retry/checkpointing logic when a per-text override is supplied.
  • Added robust checkpoint save/load helpers for embeddings (_save_embeddings_checkpoint, _load_embeddings_checkpoint) that fall back to split pickle parts and merge them on load, plus a result normalization helper (_normalize_embedding_result) and split-part discovery utilities.
  • Improved CSV persistence in gabriel.utils.file_utils.save_dataframe_with_fallback by adding detection/removal of stale split parts, a prioritized set of fallback chunk sizes (fallback_chunk_sizes), repeated attempts with progressively smaller chunk sizes, and helpers for finding/removing split files.
  • Minor API housekeeping: ensure Seed._request_entities does not pass embedding kwargs into response calls.
  • Tests added and updated to cover routing of embedding overrides, custom embedding callables and drivers, split-checkpoint fallback behaviour, and CSV split-fallback behaviour (tests/test_get_all_embeddings.py, tests/test_save_dataframe_fallback.py, tests/test_api_cached_loading.py, tests/test_basic.py, tests/test_deduplicate.py, tests/test_merge.py).
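
As a rough illustration of the per-text override path described above, the sketch below normalizes a few plausible return shapes from a custom callable and skips texts that are already cached. It is not the code in gabriel.utils.openai_utils: the names `normalize_embedding` and `embed_all`, the accepted shapes, and the plain-dict cache are all assumptions made for this example.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Vector = List[float]


def normalize_embedding(result: Any) -> Vector:
    """Coerce a few plausible return shapes into a flat list of floats.

    Hypothetical helper; the shapes handled here are assumptions.
    """
    # Shape 1: a mapping such as {"embedding": [...]}.
    if isinstance(result, dict) and "embedding" in result:
        result = result["embedding"]
    # Shape 2: array-like objects exposing tolist() (for example numpy arrays).
    if hasattr(result, "tolist"):
        result = result.tolist()
    # Shape 3: a singleton batch such as [[0.1, 0.2]] is unwrapped.
    if (
        isinstance(result, (list, tuple))
        and len(result) == 1
        and isinstance(result[0], (list, tuple))
    ):
        result = result[0]
    if not isinstance(result, (list, tuple)):
        raise TypeError(f"unsupported embedding result: {type(result).__name__}")
    return [float(x) for x in result]


def embed_all(
    texts: Iterable[str],
    embedding_fn: Callable[[str], Any],
    cache: Optional[Dict[str, Vector]] = None,
) -> Dict[str, Vector]:
    """Embed each text once, reusing vectors already present in `cache`."""
    if not callable(embedding_fn):
        raise TypeError("embedding_fn must be callable")
    out: Dict[str, Vector] = dict(cache or {})
    for text in texts:
        if text in out:
            continue  # already embedded or restored from a checkpoint
        out[text] = normalize_embedding(embedding_fn(text))
    return out
```

A deterministic stub such as `lambda text: {"embedding": (len(text), 1)}` can stand in for `embedding_fn` in tests, which is the kind of override the PR motivates.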

Testing

  • Ran the unit test suite with pytest; new and updated tests in tests/test_get_all_embeddings.py, tests/test_save_dataframe_fallback.py, tests/test_api_cached_loading.py, tests/test_basic.py, tests/test_deduplicate.py, and tests/test_merge.py were executed and passed.
  • Verified custom per-text embedding_fn and bulk get_all_embeddings_fn are accepted and routed correctly via targeted unit tests.
  • Verified checkpoint split fallback and CSV chunked fallback behaviours via simulated failure tests that assert split files are written and loaded as expected.

Codex Task

@github-actions

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@hemanth-asirvatham hemanth-asirvatham merged commit c0a942e into main Feb 26, 2026
1 check failed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 26, 2026