
[ENH]: Load HNSW index without disk intermediary #5159


Open

tanujnay112 wants to merge 1 commit into main

Conversation

tanujnay112 (Contributor) commented Jul 29, 2025

Description of changes

(WIP)

  • Improvements & Bug fixes
    • Loads the HNSW index from S3 without a disk intermediary by passing the memory buffer returned by S3 directly to the index constructor (see the sketch after this list).
  • New functionality
    • ...
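
In rough terms, the new flow looks like the following hedged sketch (names taken from the diff excerpts quoted in the review below; error mapping and cache bookkeeping are elided):

    // Before: S3 fetch -> temp files on disk -> HnswIndex::load(path, ...)
    // After:  S3 fetch -> in-memory HnswData -> HnswIndex::load_from_hnsw_data(...)
    let hnsw_data = self.fetch_hnsw_segment(&id, prefix_path).await?;
    let index = HnswIndex::load_from_hnsw_data(hnsw_data, &index_config, ef_search, id)?;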

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


This stack of pull requests is managed by Graphite. Learn more about stacking.


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have those cases been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have those paths been tested?
  • If appropriate, are there adequate property-based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?

@tanujnay112 tanujnay112 marked this pull request as ready for review July 29, 2025 14:20
propel-code-bot (bot) commented Jul 29, 2025

Load HNSW Index Directly from S3 Memory, Removing Disk Intermediary

This PR overhauls the way HNSW indices are loaded from remote storage (S3) into the service, removing the former step of always copying index files to a temporary disk directory before constructing the in-memory index. The new flow initializes the HNSW index directly from S3-fetched memory buffers in all relevant paths. This change involves new APIs, significant modifications in Rust HNSW index provider code, updates to the hnswlib dependency, and adjustments to loading and forking logic for the index lifecycle.

Key Changes

• Introduces in-memory loading: fetches index segment files into memory buffers and loads the HNSW index directly from these buffers (without disk I/O) via load_from_hnsw_data.
• Updates Cargo.toml and Cargo.lock to point at a new commit in the hnswlib dependency, supporting in-memory data loading.
• Modifies the main HnswIndexProvider (and related) logic so open, fork, and similar operations use the new in-memory approach; new helper methods fetch and assemble memory buffers from S3.
• Augments the PersistentIndex trait with a load_from_hnsw_data method to support generic in-memory loading for other index types (a hedged sketch follows this list).
• Adds a new workflow for handling file-backed operations (fork/copy) so that all S3 fetches are memory-based, refactoring code to support this throughout the stack.
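
The diff excerpts quoted later in this thread suggest a trait method along these lines. This is a minimal sketch only: IndexConfig, IndexUuid, HnswData, and the error type are placeholders, not the PR's verbatim definitions.

    pub struct IndexConfig;           // placeholder for chroma's index config
    pub struct IndexUuid(pub String); // placeholder for chroma's index id type

    // Placeholder for hnswlib::HnswData: owns the four buffers that were
    // previously read from files on disk (see the fork excerpt below).
    pub struct HnswData {
        pub header: Vec<u8>,
        pub data_level0: Vec<u8>,
        pub length: Vec<u8>,
        pub link_list: Vec<u8>,
    }

    pub trait PersistentIndex: Sized {
        /// Construct the index directly from S3-fetched buffers, with no
        /// temporary-directory round trip.
        fn load_from_hnsw_data(
            data: HnswData,
            config: &IndexConfig,
            ef_search: usize,
            id: IndexUuid,
        ) -> Result<Self, Box<dyn std::error::Error + Send + Sync>>;
    }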

Affected Areas

• rust/index/src/hnsw_provider.rs
• rust/index/src/hnsw.rs
• rust/index/src/types.rs
• Cargo.toml
• Cargo.lock

This summary was automatically generated by @propel-code-bot

blacksmith-sh (bot) commented Jul 29, 2025

9 Jobs Failed:

PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/distributed/test_repair_collection_log_offset.py)
Step "Test" from job "Python tests / test-cluster-rust-frontend (3.9, chromadb/test/distributed/test_repair_collection_log_offset.py)" is failing. The last 20 log lines are:

[...]
            response = log_service_stub.InspectLogState(request, timeout=60)
            if '''LogPosition { offset: 1001 }''' in response.debug:
                return
            time.sleep(1)
>       raise RuntimeError("Test timed out without repair")
E       RuntimeError: Test timed out without repair

chromadb/test/distributed/test_repair_collection_log_offset.py:73: RuntimeError
----------------------------- Captured stdout call -----------------------------
Generating data with seed  1755032916.1666057
collection_id = 72cd6a9f-7869-4f23-a32b-c5f974f17bf9
============================= slowest 10 durations =============================
256.46s call     chromadb/test/distributed/test_repair_collection_log_offset.py::test_repair_collection_log_offset[basic_http_client]
0.52s setup    chromadb/test/distributed/test_repair_collection_log_offset.py::test_repair_collection_log_offset[basic_http_client]

(1 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED chromadb/test/distributed/test_repair_collection_log_offset.py::test_repair_collection_log_offset[basic_http_client] - RuntimeError: Test timed out without repair
======================== 1 failed in 257.08s (0:04:17) =========================
Error: Process completed with exit code 1.
PR checks / Lint
Step "Clippy" from job "Lint" is failing. The last 20 log lines are:

[...]
   Compiling quote v0.6.13
   Compiling matrixmultiply v0.3.9
   Compiling tracing-test-macro v0.2.5
    Checking rawpointer v0.2.1
    Checking tracing-test v0.2.5
    Checking tokio-test v0.4.4
    Checking approx v0.5.1
    Checking ndarray v0.16.1
   Compiling convert_case v0.6.0
   Compiling proptest-derive v0.3.0
   Compiling napi-build v2.1.6
   Compiling chromadb-js-bindings v0.1.0 (/home/runner/_work/chroma/chroma/rust/js_bindings)
   Compiling napi-derive-backend v1.0.75
   Compiling ctor v0.2.9
    Checking napi-sys v2.4.0
    Checking napi v2.16.17
   Compiling napi-derive v2.16.13
error: could not compile `chroma-index` (lib test) due to 2 previous errors
    Checking chroma v0.1.0 (/home/runner/_work/chroma/chroma/rust/chroma)
Error: Process completed with exit code 101.
PR checks / Rust tests / test (blacksmith-8vcpu-ubuntu-2204)
Step "Test" from job "Rust tests / test (blacksmith-8vcpu-ubuntu-2204)" is failing. The last 20 log lines are:

[...]
      28: core::ops::function::FnOnce::call_once
                 at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
      29: core::ops::function::FnOnce::call_once
                 at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

  Cancelling due to test failure: 7 tests still running
        PASS [   7.724s] chroma-segment blockfile_metadata::test::test_simple_regex
        PASS [  10.006s] chroma-load tests::end_to_end
        PASS [  10.203s] chroma-segment blockfile_metadata::test::test_composite_regex
        PASS [  16.357s] chroma-log sqlite_log::tests::test_push_pull_logs
        PASS [  24.358s] chroma-blockstore::blockfile_writer_test tests::blockfile_writer_test
        PASS [  36.951s] chroma-frontend::test_collection test_collection_sqlite
        PASS [  58.629s] chroma-blockstore arrow::concurrency_test::tests::test_blockfile_shuttle
────────────
     Summary [  59.379s] 290/518 tests run: 289 passed, 1 failed, 78 skipped
        FAIL [   0.375s] chroma-segment distributed_spann::test::test_spann_segment_writer
warning: 228/518 tests were not run due to test failure (run with --no-fail-fast to run all tests, or run with --max-fail)
error: test run failed
Error: Process completed with exit code 100.
PR checks / Rust tests / Integration test ci_k8s_integration 2
Step "Run tests" from job "Rust tests / Integration test ci_k8s_integration 2" is failing. The last 20 log lines are:

[...]

    failures:

    failures:
        garbage_collector_component::tests::test_k8s_integration_ignores_forked_collections

    test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 37 filtered out; finished in 30.18s
    
  stderr ───
    thread 'garbage_collector_component::tests::test_k8s_integration_ignores_forked_collections' panicked at rust/garbage_collector/src/garbage_collector_component.rs:687:10:
    called `Result::unwrap()` on an `Err` value: "Timeout waiting for new version to be created"
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

  Cancelling due to test failure
────────────
     Summary [ 360.250s] 16/40 tests run: 15 passed (1 slow), 1 failed, 556 skipped
        FAIL [  30.207s] garbage_collector garbage_collector_component::tests::test_k8s_integration_ignores_forked_collections
warning: 24/40 tests were not run due to test failure (run with --no-fail-fast to run all tests, or run with --max-fail)
error: test run failed
Error: Process completed with exit code 100.
PR checks / Rust tests / test-long
Step "Test" from job "Rust tests / test-long" is failing. The last 20 log lines are:

[...]
      17: chroma_index::spann::types::tests::test_long_running_integrity_multiple_runs::{{closure}}
                 at ./src/spann/types.rs:4424:51
      18: core::ops::function::FnOnce::call_once
                 at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
      19: core::ops::function::FnOnce::call_once
                 at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

        PASS [  40.079s] chroma-index spann::types::tests::test_long_running_data_integrity_parallel
        SLOW [> 60.000s] chroma-index spann::types::tests::test_long_running_data_integrity
        SLOW [>120.000s] chroma-index spann::types::tests::test_long_running_data_integrity
        SLOW [>180.000s] chroma-index spann::types::tests::test_long_running_data_integrity
        PASS [ 184.354s] chroma-index spann::types::tests::test_long_running_data_integrity
────────────
     Summary [ 184.355s] 5 tests run: 2 passed (1 slow), 3 failed, 591 skipped
        FAIL [   3.172s] chroma-index spann::types::tests::test_long_running_data_integrity_multiple_parallel_runs
        FAIL [   2.887s] chroma-index spann::types::tests::test_long_running_data_integrity_multiple_parallel_runs_with_updates_deletes
        FAIL [   3.367s] chroma-index spann::types::tests::test_long_running_integrity_multiple_runs
error: test run failed
Error: Process completed with exit code 100.
PR checks / Rust tests / test-benches (blacksmith-16vcpu-ubuntu-2204, --bench get)
Step "Run benchmark" from job "Rust tests / test-benches (blacksmith-16vcpu-ubuntu-2204, --bench get)" is failing. The last 20 log lines are:

[...]
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/mod.rs:87:22
  13: tokio::runtime::context::runtime::enter_runtime
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/context/runtime.rs:65:16
  14: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/mod.rs:86:9
  15: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/runtime.rs:370:50
  16: tokio::runtime::runtime::Runtime::block_on
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/runtime.rs:340:13
  17: get::bench_get
             at ./benches/get.rs:125:25
  18: get::benches
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/criterion-0.5.1/src/macros.rs:71:17
  19: get::main
             at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/criterion-0.5.1/src/macros.rs:124:17
  20: core::ops::function::FnOnce::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: bench failed, to rerun pass `-p worker --bench get`
Error: Process completed with exit code 101.
PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_add.py)
Step "Test" from job "Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_add.py)" is failing. The last 20 log lines are:

[...]
chromadb/test/property/test_add.py::test_add_medium[basic_http_client]
  /opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/hypothesis/core.py:899: HypothesisWarning: Generating overly large repr. This is an expensive operation, and with a length of 9001 kB is unlikely to be useful. Use -Wignore to ignore the warning, or -Werror to get a traceback.
    text_repr = repr_call(test, args, kwargs)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= slowest 10 durations =============================
770.30s call     chromadb/test/property/test_add.py::test_add_medium[basic_http_client]
579.69s call     chromadb/test/property/test_add.py::test_add_small[basic_http_client]
47.61s call     chromadb/test/property/test_add.py::test_add_large[basic_http_client]
27.58s call     chromadb/test/property/test_add.py::test_add_miniscule[basic_http_client]
1.38s call     chromadb/test/property/test_add.py::test_add_large_exceeding[basic_http_client]
0.41s setup    chromadb/test/property/test_add.py::test_add_miniscule[basic_http_client]
0.41s setup    chromadb/test/property/test_add.py::test_add_large_exceeding[basic_http_client]
0.41s setup    chromadb/test/property/test_add.py::test_add_small[basic_http_client]
0.41s setup    chromadb/test/property/test_add.py::test_add_partial[basic_http_client]
0.41s setup    chromadb/test/property/test_add.py::test_add_medium[basic_http_client]
=========================== short test summary info ============================
FAILED chromadb/test/property/test_add.py::test_add_medium[basic_http_client] - AssertionError() [single exception in FlakyFailure]
======= 1 failed, 5 passed, 1 xfailed, 525 warnings in 778.69s (0:12:58) =======
Error: Process completed with exit code 1.
PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_embeddings.py)
Step "Test" from job "Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_embeddings.py)" is failing. The last 20 log lines are:

[...]
state.wait_for_compaction()
state.teardown()
Explanation:
    These lines were always and only run by failing examples:
        /home/runner/_work/chroma/chroma/chromadb/test/utils/wait_for_version_increase.py:26
        /home/runner/_work/chroma/chroma/chromadb/test/utils/wait_for_version_increase.py:27


You can reproduce this example by temporarily adding @reproduce_failure('6.112.2', b'AXicXZZJiONVEMarks7SW3q6k+7O2ukk00umO1t3kk46nbQoKCOIgoio4D4jLugIenMZQXE5eNCT6M2bXvSm4MHlMCgqevQggiAIiowK6m38Vf3T49gJyX9571V99dVX9Z6I1FRyKjGVhkpXpCTykPwaUhWRu2RCVPM60H2uA93if01juqxl4XZVa3qgCd3TQ11S0Tkd6QntSUjGn7mrD1Ub2tSknpSwqs5rSDd1i3lNm6Xng3mM7GtGwxrVnJuuYXNBV5h/Sic0bTNmdMikFk6ruuDGupp3I88lzuZ/PnPfWwZoXisaZ2EEYz2dxnlMdzBg8y58/9aZwNtusK775d/31yYuqgUgequN7X6tAIy4/W0P5MI3RxgXiH2XIHs69T/w9mc8pflNEdwyFgYg3dElWzbLbRL4ITiqw49FOYOhguhttvalT1VP6qo7e270/LOX+JjJkNES4qssCU+o2WqL9m3swhfG5QbuNnhBPLKz/eFv5EwrDn2GrITxUdc+vuegJCl6t8jmHS8uG3FxzQlsYemJWYXQddhf1UlmVsnAEDtbmtIsiOe0QAZShvZekfnv3g9brkoEF9Uy16SEx2EfYLfH+kMwWaQ56DpJ/mIAb5HRJrFv8C6vCVbM8aTM2mVtX4ugTKMR0W1tMRohjQNojGlH26xcM4zC7ZhwJYKemt8i2FOiizbyetRSmWd2zf2k8LuOla6usSTN8llP0hqxmpymeDPC0w4IwmCZ5BrieQOPu2MbbSJ3sRT0+fLpm950HZimmgQ8z11MZxz2Ht+aIStzk3XO4xZL0TF/bixXsNdhfEX0apdOhrcsieF3E7MhWNgBhZkTkBX4RRjLYbHNaJzVO2BP6DJ+kp5ptOXK+fiyTEOQMeImBfthDR2p84CvYO6AxG1gZOA1tUhyCyZPnqgB4GCT1JS49qkAI6sBFSFqap2Y6+K5a+q0szKu8qusSWTAw2AjoKt58Z4zf76HTDvA6FETu/gsYzNNOsqIBP+3+/q8YjmEl20PqCfWAUQe/sGozCKDBTDGwAiZLXdYsRKGbHdU+uDcdTc+vmyoSiSEzoKTjuc7zXMDI5gEQJIwTPLmZAVIe6h0CtdR5h+4rY/liMOIIFxZfvqfJSPJqGsBwjTXhdND13yX9mK07jNjzTtYAa0ITyMaExMSsN2k0KydnWBwAUBDRG4tMfgMHUzdnX9y7rJ3axsjfIVAtsbzCJxN9zkH3hmWb/rCGpoyF7OicTrAVy8kvClbek74dNVokJQoCFvOQRRtZQDlre1yQ6xRGfsAnOd3ghcL+IxjvUgQGy6UKmNtN7qCQVIoxGSLaZhta5iMbIA4z3eH5NpziVVJ7Ip7O5/6sfP2T+Ztj7e7oFmHy7mA+ew7/lFFqk+JWBM01ANw94PV0bP2ecCwiFvfR6yXd5r/PrRNG80Sb5VY533uKWJGwF6oiyRhCbEF8nWahkDJ2U0WVN53U+g2DJc9pMHztPfwDjyt8z+FHOuiD5q7d9/0ZhSD9QrELFGaQnoIBDqG3GchxDaykjVqzHSQTAMCO3jJYcYAFG23daIH3MePJachMGq3+9b0y5BcRxZXasbEOABaCZ8NqJ0wZTmuIfPjZCrugmng+xBE8yhklacMd9NQRO3a5ymz3xTbGdK0oYpvelZJrkkD9Un890cez377xxFxq7jagijb7nghU7ZZcZ30Cp1nhBbypBu/1so5F8ixDN4UuJbwHsdThMkZ3pxyPNt0DOtRCyQwg7iyQDhgLGF4KihyFGzgVygqROi9Y30wSsaqNLawFzSWSDlVmPNM2xaXcSZTXBcDEU6/+lN99tlLth1nkLTYpgmaOAmfhd0lwqyCpMi7WVJSGou+Bfo8oo4QFW1i3GAmael7oJ6B75rotUHnUj/m5FmRRcC2IVkDmfKtpuqHnQpxh307XBS9wRatknfLRRGLYVdR0MSmsQ/t3CWJKyp6p81uX6Pe2jZ9h7L+vXusgKzj72PPtLFsm6cfFqogTmNzFYb6Zm2g8thE67S6isuMFgNmPfGLrImCtwaCBGui2g1kGOJwZDN6TKkCI+e0d3A05E3JN6LoMYFvMY+qu8VEPvOy9bst4JsUIsC349iV2fb4xzAyfhwrYR8RS4hjrJtbRttt/lvkoUo1CjFnyFjPGZzwA8s20DqcaMPBfmeyvd4M/9ILjjNtYiIqLAWnzP/OgQVG+zDSsoONqdm38DwMnfIKT7t2+lRaAS9DbGct005KgaD68NYly1nR0yLnbn7lUcqLeX2vvZFOmpMkk+sY22KxddFJniNcc3aqCYr/nsfePj9W+iZdW29zfH/puE+3PLSNY5WSJK8JyDjgmkF6XR//rHkUnMl5EVnWfXURbyOSvOKntTYrZryFpUElhFIEYxgJeY3ZxsJmhATSMGgJtGOdbSEptd1iP9iwJ8++9sYzH1Xc379Do/qL') as a decorator on your test case
====== 3 failed, 20 passed, 1 xpassed, 369 warnings in 6729.98s (1:52:09) ======
Error: Process completed with exit code 1.
PR checks / all-required-pr-checks-passed
Step "Decide whether the needed jobs succeeded or failed" from job "all-required-pr-checks-passed" is failing. The last 20 log lines are:

[...]
}
EOM
)"
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
  GITHUB_REPO_NAME: chroma-core/chroma
  PYTHONPATH: /home/runner/_work/_actions/re-actors/alls-green/release/v1/src
# ❌ Some of the required to succeed jobs failed 😢😢😢

📝 Job statuses:
📝 python-tests → ❌ failure [required to succeed or be skipped]
📝 python-vulnerability-scan → ✓ success [required to succeed or be skipped]
📝 javascript-client-tests → ✓ success [required to succeed or be skipped]
📝 rust-tests → ❌ failure [required to succeed or be skipped]
📝 go-tests → ✓ success [required to succeed or be skipped]
📝 lint → ❌ failure [required to succeed]
📝 check-helm-version-bump → ⬜ skipped [required to succeed or be skipped]
📝 delete-helm-comment → ✓ success [required to succeed or be skipped]
Error: Process completed with exit code 1.

Summary: 1 successful workflow, 1 failed workflow

Last updated: 2025-08-12 23:00:48 UTC

Comment on lines +223 to +230
None => match HnswIndex::load_from_hnsw_data(
    self.fetch_hnsw_segment(&new_id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderForkError::FileError(*e)))?,
    &index_config,
    ef_search,
    new_id,
) {

[CriticalError]

The logic for loading the index within the fork method appears to be incorrect. It attempts to fetch the segment from remote storage using new_id, but the index for new_id doesn't exist in storage yet. The files for source_id have just been copied to a local directory.

The previous implementation using HnswIndex::load(storage_path_str, ...) correctly loaded the index from this new local directory. Since fork is intended to create a mutable, file-backed copy of an index, it seems the original approach of loading from the local path should be restored.
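
A minimal sketch of that suggestion, assuming the files for source_id are already in index_storage_path; only HnswIndex::load(storage_path_str, ...) is attested above, so the remaining arguments and error mapping are assumptions:

    // Load the forked index from the local directory the files were just
    // copied into, rather than fetching new_id from S3 (it is not there yet).
    let storage_path_str = index_storage_path.to_str().unwrap();
    match HnswIndex::load(storage_path_str, &index_config, ef_search, new_id) {
        Ok(_index) => { /* wrap and cache the forked, file-backed index */ }
        Err(_e) => { /* map into the fork error type, as before this PR */ }
    }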

Comment on lines +287 to +301
    let hnsw_data = self.fetch_hnsw_segment(source_id, prefix_path).await?;
    let getters = [
        |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.header_buffer())),
        |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.data_level0_buffer())),
        |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.length_buffer())),
        |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.link_list_buffer())),
    ];

    for (file, getter) in FILES.iter().zip(getters) {
        let file_path = index_storage_path.join(file);
        self.copy_bytes_to_local_file(&file_path, getter(&hnsw_data))
            .await?;
    }
    Ok(())
}

[PerformanceOptimization]

This function now fetches the entire HNSW segment into an in-memory HnswData object before writing the individual files to disk. The previous implementation streamed each file directly. For large indexes, this change could significantly increase memory usage during the fork operation. Was this change intentional? If fork still needs to write to disk, perhaps restoring the previous file-by-file download logic for this function would be more memory-efficient.
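
For comparison, a per-file download bounds peak memory to one file at a time. This sketch assumes a hypothetical stream_s3_object_to_file helper and an assumed key layout; neither is chroma's actual API:

    for file in FILES.iter() {
        let key = format!("{prefix_path}/{source_id}/{file}"); // assumed key layout
        let file_path = index_storage_path.join(file);
        // Hypothetical helper: chunked copy from S3 to disk, O(chunk) memory.
        stream_s3_object_to_file(&self.storage, &key, &file_path).await?;
    }
    Ok(())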

Comment on lines +408 to +420
None => match HnswIndex::load_from_hnsw_data(
    self.fetch_hnsw_segment(id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderOpenError::FileError(*e)))?,
    &index_config,
    ef_search,
    *id,
) {

[PerformanceOptimization]

This change successfully loads the index from memory, which aligns with the PR's goal. However, the open function still contains calls that write the index to a temporary directory on disk (create_dir_all, load_hnsw_segment_into_directory) before this memory-based loading occurs. These disk operations now seem redundant.

To fully load without a disk intermediary and improve efficiency, you could remove the calls to create_dir_all, load_hnsw_segment_into_directory, and purge_one_id from this function.
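
A hedged sketch of the simplified shape of open under that suggestion (cache lookup and error handling elided; whether purge_one_id has other callers here has not been verified):

    // Removed: create_dir_all(&index_storage_path) and
    // load_hnsw_segment_into_directory(...); the temp directory is unused.
    let hnsw_data = self
        .fetch_hnsw_segment(id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderOpenError::FileError(*e)))?;
    match HnswIndex::load_from_hnsw_data(hnsw_data, &index_config, ef_search, *id) {
        Ok(_index) => { /* cache and return the in-memory index */ }
        Err(_e) => { /* map to HnswIndexProviderOpenError, as in the diff */ }
    }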
