[ENH]: Load HNSW index without disk intermediary #5159
base: main
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Load HNSW Index Directly from S3 Memory, Removing Disk Intermediary

This PR overhauls the way HNSW indices are loaded from remote storage (S3) into the service, removing the former step of always copying index files to a temporary disk directory before constructing the in-memory index. The new flow initializes the HNSW index directly from S3-fetched memory buffers in all relevant paths. This change involves new APIs, significant modifications to the Rust HNSW index provider code, updates to the hnswlib dependency, and adjustments to the loading and forking logic for the index lifecycle.

Key Changes
• Introduces in-memory loading: fetches index segment files into memory buffers and loads the

Affected Areas
• rust/index/src/

This summary was automatically generated by @propel-code-bot
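The core idea of the new flow — constructing the index from buffers that are already in memory instead of round-tripping through files on disk — can be sketched in plain Rust. The `FetchedSegment` type and field names below are illustrative stand-ins for the PR's `HnswData` buffers, not the actual hnswlib API:

```rust
use std::io::{Cursor, Read};

// Hypothetical stand-in for the buffers fetched from S3; in the PR these
// come from an `HnswData` object (header, data_level0, length, link_list).
struct FetchedSegment {
    header: Vec<u8>,
    data_level0: Vec<u8>,
}

// New flow in miniature: read the index state straight out of the
// in-memory buffers via `Cursor`, with no temporary directory in between.
fn load_from_buffers(seg: &FetchedSegment) -> std::io::Result<usize> {
    let mut bytes_read = 0;
    for buf in [&seg.header, &seg.data_level0] {
        let mut reader = Cursor::new(buf.as_slice());
        let mut scratch = Vec::new();
        bytes_read += reader.read_to_end(&mut scratch)?;
    }
    Ok(bytes_read)
}

fn main() -> std::io::Result<()> {
    let seg = FetchedSegment {
        header: vec![0u8; 16],
        data_level0: vec![0u8; 64],
    };
    println!("{}", load_from_buffers(&seg)?); // 80
    Ok(())
}
```

Any reader-based deserializer can consume a `Cursor` over a fetched buffer exactly as it would a `File`, which is what makes the disk intermediary removable.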
9 Jobs Failed:
- PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/distributed/test_repair_collection_log_offset.py)
- PR checks / Lint
- PR checks / Rust tests / test (blacksmith-8vcpu-ubuntu-2204)
- PR checks / Rust tests / Integration test ci_k8s_integration 2
- PR checks / Rust tests / test-long
- PR checks / Rust tests / test-benches (blacksmith-16vcpu-ubuntu-2204, --bench get)
- PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_add.py)
- PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_embeddings.py)
- PR checks / all-required-pr-checks-passed

Summary: 1 successful workflow, 1 failed workflow
Last updated: 2025-08-12 23:00:48 UTC
None => match HnswIndex::load_from_hnsw_data(
    self.fetch_hnsw_segment(&new_id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderForkError::FileError(*e)))?,
    &index_config,
    ef_search,
    new_id,
) {
[CriticalError]
The logic for loading the index within the `fork` method appears to be incorrect. It attempts to fetch the segment from remote storage using `new_id`, but the index for `new_id` doesn't exist in storage yet. The files for `source_id` have just been copied to a local directory. The previous implementation using `HnswIndex::load(storage_path_str, ...)` correctly loaded the index from this new local directory. Since `fork` is intended to create a mutable, file-backed copy of an index, it seems the original approach of loading from the local path should be restored.
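The fork flow the reviewer describes can be sketched with plain `std::fs` operations. The directory layout, file names, and `fork_locally` helper below are illustrative, not Chroma's actual code: the source index files are copied into a fresh local directory for the fork, and the load must then read from that local path rather than re-fetching from remote storage under the new id (which has nothing stored for it yet):

```rust
use std::fs;
use std::path::Path;

// Illustrative sketch of the fork-then-load-locally flow; not the actual
// Chroma implementation.
fn fork_locally(src_dir: &Path, dst_dir: &Path, files: &[&str]) -> std::io::Result<()> {
    // Copy the source index files into a fresh local directory for the fork.
    fs::create_dir_all(dst_dir)?;
    for f in files {
        fs::copy(src_dir.join(f), dst_dir.join(f))?;
    }
    // The previous implementation then loaded from this local copy, e.g.
    // HnswIndex::load(dst_dir, ...) -- not by fetching the new id remotely.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("hnsw_fork_sketch");
    let (src, dst) = (base.join("src"), base.join("dst"));
    fs::create_dir_all(&src)?;
    fs::write(src.join("header.bin"), b"hdr")?;
    fork_locally(&src, &dst, &["header.bin"])?;
    println!("{}", dst.join("header.bin").exists()); // true
    Ok(())
}
```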
let hnsw_data = self.fetch_hnsw_segment(source_id, prefix_path).await?;
let getters = [
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.header_buffer())),
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.data_level0_buffer())),
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.length_buffer())),
    |hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.link_list_buffer())),
];

for (file, getter) in FILES.iter().zip(getters) {
    let file_path = index_storage_path.join(file);
    self.copy_bytes_to_local_file(&file_path, getter(&hnsw_data))
        .await?;
}
Ok(())
}
[PerformanceOptimization]
This function now fetches the entire HNSW segment into an in-memory `HnswData` object before writing the individual files to disk. The previous implementation streamed each file directly. For large indexes, this change could significantly increase memory usage during the `fork` operation. Was this change intentional? If `fork` still needs to write to disk, perhaps restoring the previous file-by-file download logic for this function would be more memory-efficient.
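A memory-bounded alternative along the lines the reviewer suggests — streaming each file in fixed-size chunks rather than materializing the whole segment — can be sketched with `std::io`. This is a generic illustration under the assumption that a per-file reader is available; it is not Chroma's actual download code:

```rust
use std::io::{self, Read, Write};

// Generic streaming copy: peak memory is one 8 KiB chunk regardless of how
// large the source file is. Illustrative, not Chroma's download path.
fn stream_copy<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<u64> {
    let mut chunk = [0u8; 8192]; // reused scratch buffer
    let mut total = 0u64;
    loop {
        let n = src.read(&mut chunk)?;
        if n == 0 {
            return Ok(total);
        }
        dst.write_all(&chunk[..n])?;
        total += n as u64;
    }
}

fn main() -> io::Result<()> {
    let src = vec![7u8; 20_000]; // pretend this is one index file from S3
    let mut dst = Vec::new();
    let n = stream_copy(src.as_slice(), &mut dst)?;
    println!("{}", n); // 20000
    Ok(())
}
```

The trade-off the reviewer raises is exactly this: buffering the whole segment costs O(index size) memory, while chunked streaming keeps the peak at one chunk per in-flight file.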
None => match HnswIndex::load_from_hnsw_data(
    self.fetch_hnsw_segment(id, prefix_path)
        .await
        .map_err(|e| Box::new(HnswIndexProviderOpenError::FileError(*e)))?,
    &index_config,
    ef_search,
    *id,
) {
[PerformanceOptimization]
This change successfully loads the index from memory, which aligns with the PR's goal. However, the `open` function still contains calls that write the index to a temporary directory on disk (`create_dir_all`, `load_hnsw_segment_into_directory`) before this memory-based loading occurs. These disk operations now seem redundant. To fully load without a disk intermediary and improve efficiency, you could remove the calls to `create_dir_all`, `load_hnsw_segment_into_directory`, and `purge_one_id` from this function.
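The simplified `open` path the reviewer proposes — check the cache, otherwise fetch into memory and construct the index directly, with no temp-directory step — can be sketched generically. The `Provider` type and the `String` stand-in for an index below are hypothetical, used only to show the cache-or-construct shape:

```rust
use std::collections::HashMap;

// Hypothetical provider: the String stands in for a loaded HNSW index.
struct Provider {
    cache: HashMap<u64, String>, // id -> loaded index
}

impl Provider {
    fn open(&mut self, id: u64) -> &String {
        self.cache.entry(id).or_insert_with(|| {
            // Proposed flow, with no disk intermediary:
            // 1. fetch_hnsw_segment(id) -> in-memory buffers
            // 2. HnswIndex::load_from_hnsw_data(buffers, ...) -> index
            format!("index-{}", id)
        })
    }
}

fn main() {
    let mut p = Provider { cache: HashMap::new() };
    println!("{}", p.open(42)); // constructs once, then served from cache
}
```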
e9564cf to 7558379
Description of changes
(WIP)

Test plan
How are these changes tested?
- pytest for python
- yarn test for js
- cargo test for rust

Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan
What is the plan to instrument and monitor this change?

Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?