IndexError Traceback (most recent call last)
<ipython-input-1-466544d3b9d7> in <cell line: 4>()
2 client = chromadb.PersistentClient(path="2912")
3 col = client.get_or_create_collection("test_collection")
----> 4 res= col.get(include=["documents", "embeddings"])
5 assert len(res['ids']) == 5202
5 frames
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in get_vectors(self, ids)
388 for label, vector in zip(hnsw_labels, vectors):
389 id = self._label_to_id[label]
--> 390 results[id_to_index[id]] = VectorEmbeddingRecord(
391 id=id, embedding=vector
392 )
IndexError: list assignment index out of range
Closes #2922
Closes #2912
It might be related to #2905
**Description:** Deprecated version of Chroma >=0.5.5 <0.5.12 due to a
serious correctness issue that caused some embeddings for deployments
with multiple collections to be lost (read more on the issue in Chroma
repo)
**Issue:** chroma-core/chroma#2922 (fixed by chroma-core/chroma#2923
and released in
[0.5.13](https://github.com/chroma-core/chroma/releases/tag/0.5.13))
**Dependencies:** N/A
**Twitter handle:** `@t_azarov`
What happened?
0.5.7
To reproduce in 0.5.7
pip install chromadb==0.5.7
Restart the notebook/server/session:
results in:
0.5.12
Here, the issue manifests somewhat differently (this is another bug that needs fixing in a separate Issue/PR). While one does not get the `IndexError: list assignment index out of range`, asserting the expected embedding count points to the same issue:
pip install chromadb==0.5.12
Restart the notebook/server/session:
Results in:
Analysis
The issue comes from how we purge the embeddings queue:
chroma/chromadb/db/mixins/embeddings_queue.py
Lines 162 to 171 in cb88db2
The above code results in an indiscriminate purge of the embeddings queue up to the given collection's min_seq_id (usually the HNSW seq_id). This is logically incorrect in multi-collection scenarios, where other collections may still have entries in the embeddings queue with sequence IDs lower than the purged collection's min_seq_id.
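To make the failure mode concrete, here is a minimal toy model of the purge. It is not Chroma's actual code: the queue is a plain list of `(seq_id, collection, record_id)` tuples, and the names `make_queue`, `purge_indiscriminate` and `purge_scoped`, the strict `<` comparison, and the per-collection scoping are all assumptions made for illustration. The actual fix in Chroma may take a different approach.

```python
# Toy model of an embeddings queue shared by several collections.
# Each entry is (seq_id, collection, record_id). All names are hypothetical.

def make_queue():
    """Interleaved writes to two collections sharing one sequence."""
    return [
        (1, "col_a", "a1"),
        (2, "col_b", "b1"),
        (3, "col_a", "a2"),
        (4, "col_b", "b2"),
        (5, "col_a", "a3"),
    ]


def purge_indiscriminate(queue, min_seq_id):
    """Buggy variant: drop every entry below min_seq_id, whatever its collection."""
    return [entry for entry in queue if entry[0] >= min_seq_id]


def purge_scoped(queue, collection, min_seq_id):
    """Illustrative alternative: only drop entries that belong to `collection`."""
    return [
        entry
        for entry in queue
        if not (entry[1] == collection and entry[0] < min_seq_id)
    ]


if __name__ == "__main__":
    queue = make_queue()
    # col_a has synced everything up to seq_id 5, so its min_seq_id is 6.
    # col_b has not reached its sync threshold and has persisted nothing.
    print(purge_indiscriminate(queue, 6))    # []
    print(purge_scoped(queue, "col_a", 6))   # [(2, 'col_b', 'b1'), (4, 'col_b', 'b2')]
```

In this sketch the indiscriminate purge empties the queue, dropping `b1` and `b2` before `col_b` ever persisted them, while the scoped variant keeps them.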
Impact
The impact is high and easily reproducible for deployments with more than one collection running versions 0.5.7 through 0.5.12.
The result is that, while documents and metadata are present, the embeddings that have not yet been committed to HNSW (i.e. `sync_threshold` not reached) are lost.
Versions
Chroma version 0.5.7-0.5.12 (single-node and persistent)
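A quick way to check a persisted collection on one of these versions for the symptom is to compare the number of ids with the number of embeddings returned. The sketch below assumes the result is a dict with `ids` and `embeddings` keys, as returned by the `col.get(include=[...])` call in the reproduction; `count_missing_embeddings` is a hypothetical helper, not part of chromadb. On 0.5.7 the `get` call itself may raise the `IndexError` from this report before the check can run.

```python
def count_missing_embeddings(result):
    """Return how many ids in a get() result have no embedding.

    `result` is assumed to be a dict with "ids" and, optionally,
    "embeddings" (a sequence aligned with "ids", or None).
    """
    ids = result.get("ids") or []
    embeddings = result.get("embeddings")
    if embeddings is None:
        # No embeddings returned at all: every id counts as missing.
        return len(ids)
    present = sum(1 for emb in embeddings if emb is not None)
    return len(ids) - present


# Usage against a real collection (requires chromadb and an existing store):
#   res = col.get(include=["documents", "embeddings"])
#   missing = count_missing_embeddings(res)
#   if missing:
#       print(f"{missing} of {len(res['ids'])} records have no embedding")
```

A non-zero count means documents and metadata exist for records whose embeddings never reached HNSW.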