[Bug]: Correctness bug in log purge - causes loss of embeddings data #2922

Closed
tazarov opened this issue Oct 10, 2024 · 0 comments · Fixed by #2923
tazarov commented Oct 10, 2024

What happened?

0.5.7

To reproduce in 0.5.7

pip install chromadb==0.5.7

import chromadb
import uuid


client = chromadb.PersistentClient(path="2912")

ids = [str(uuid.uuid4()) for _ in range(5202)]
docs = [f"Document {i}" for i in range(5202)]
embeddings = [[0.1, 0.2, 0.3] for _ in range(5202)]

col = client.get_or_create_collection("test_collection")

col.add(ids=ids, documents=docs, embeddings=embeddings)


ids1 = [str(uuid.uuid4()) for _ in range(2202)]
docs1 = [f"Document {i}" for i in range(2202)]
embeddings1 = [[0.1, 0.2, 0.3] for _ in range(2202)]

col1 = client.get_or_create_collection("test_collection1")

col1.add(ids=ids1, documents=docs1, embeddings=embeddings1)

Restart the notebook/server/session:

import chromadb
client = chromadb.PersistentClient(path="2912")
col = client.get_or_create_collection("test_collection")
res= col.get(include=["documents", "embeddings"])
assert len(res['embeddings']) == 5202

Results in:

IndexError                                Traceback (most recent call last)
<ipython-input-1-466544d3b9d7> in <cell line: 4>()
      2 client = chromadb.PersistentClient(path="2912")
      3 col = client.get_or_create_collection("test_collection")
----> 4 res= col.get(include=["documents", "embeddings"])
      5 assert len(res['ids']) == 5202

5 frames
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in get_vectors(self, ids)
    388             for label, vector in zip(hnsw_labels, vectors):
    389                 id = self._label_to_id[label]
--> 390                 results[id_to_index[id]] = VectorEmbeddingRecord(
    391                     id=id, embedding=vector
    392                 )

IndexError: list assignment index out of range

0.5.12

Here the issue manifests somewhat differently (the difference is itself a separate bug that needs its own issue/PR). There is no IndexError: list assignment index out of range, but asserting the expected embedding count reveals the same underlying data loss:

pip install chromadb==0.5.12

import chromadb
import uuid


client = chromadb.PersistentClient(path="2912")

ids = [str(uuid.uuid4()) for _ in range(5202)]
docs = [f"Document {i}" for i in range(5202)]
embeddings = [[0.1, 0.2, 0.3] for _ in range(5202)]

col = client.get_or_create_collection("test_collection")

col.add(ids=ids, documents=docs, embeddings=embeddings)


ids1 = [str(uuid.uuid4()) for _ in range(2202)]
docs1 = [f"Document {i}" for i in range(2202)]
embeddings1 = [[0.1, 0.2, 0.3] for _ in range(2202)]

col1 = client.get_or_create_collection("test_collection1")

col1.add(ids=ids1, documents=docs1, embeddings=embeddings1)

Restart the notebook/server/session:

import chromadb
client = chromadb.PersistentClient(path="2912")
col = client.get_or_create_collection("test_collection")
res= col.get(include=["documents", "embeddings"])
print("Actual len: ",len(res['embeddings']))
assert len(res['embeddings']) == 5202

Results in:

Actual len:  5000
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-7e030716345f> in <cell line: 6>()
      4 res= col.get(include=["documents", "embeddings"])
      5 print("Actual len: ",len(res['embeddings']))
----> 6 assert len(res['embeddings']) == 5202

AssertionError:

Analysis

The issue comes from how we purge the embeddings queue:

t = Table("embeddings_queue")
q = (
    self.querybuilder()
    .from_(t)
    .where(t.seq_id < ParameterValue(min_seq_id))
    .delete()
)
sql, params = get_sql(q, self.parameter_format())
cur.execute(sql, params)

The above code purges the embeddings queue indiscriminately, deleting every entry below the given collection's min_seq_id (usually the HNSW seq_id). This is logically incorrect in multi-collection scenarios, where other collections may still have entries in the embeddings queue with a seq_id lower than the purged collection's min_seq_id.
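To make the failure mode concrete, here is a minimal sketch using the standard-library sqlite3 module. The two-column table is a simplified stand-in for the real embeddings_queue, and the topic column, the helper names, and the scoped purge are assumptions made for illustration; this is not necessarily how the fix in #2923 is implemented.

```python
import sqlite3


def make_queue():
    """Create an in-memory stand-in for the queue holding two collections' entries."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema: one row per pending embedding.
    conn.execute(
        "CREATE TABLE embeddings_queue (seq_id INTEGER PRIMARY KEY, topic TEXT NOT NULL)"
    )
    rows = [(i, "collection_a") for i in (1, 2, 3)]
    rows += [(i, "collection_b") for i in (4, 5, 6)]
    conn.executemany(
        "INSERT INTO embeddings_queue (seq_id, topic) VALUES (?, ?)", rows
    )
    return conn


def purge_unscoped(conn, min_seq_id):
    """Same shape as the reported query: delete every row below min_seq_id."""
    conn.execute("DELETE FROM embeddings_queue WHERE seq_id < ?", (min_seq_id,))


def purge_scoped(conn, topic, min_seq_id):
    """Assumed alternative: only delete rows that belong to the purging collection."""
    conn.execute(
        "DELETE FROM embeddings_queue WHERE topic = ? AND seq_id < ?",
        (topic, min_seq_id),
    )


def pending(conn, topic):
    """Count the queue entries still waiting for the given collection."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM embeddings_queue WHERE topic = ?", (topic,)
    )
    return cur.fetchone()[0]


# collection_b has synced up to seq_id 6 and purges everything below it.
conn = make_queue()
purge_unscoped(conn, min_seq_id=6)
print(pending(conn, "collection_a"), pending(conn, "collection_b"))  # 0 1

conn = make_queue()
purge_scoped(conn, "collection_b", min_seq_id=6)
print(pending(conn, "collection_a"), pending(conn, "collection_b"))  # 3 1
```

With the unscoped delete, collection_a loses all three of its pending entries even though it never synced them; with the scoped delete they survive.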

Impact

The impact is high and easily reproducible for any deployment with more than one collection running version >= 0.5.7 (see Versions below).

As a result, documents and metadata remain present, but embeddings that had not yet been committed to HNSW (i.e. sync_threshold had not been reached) are lost.
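The numbers in the reproduction can be sanity-checked with a small back-of-the-envelope calculation. The sketch below assumes a sync threshold of 1000 and a simplified model in which the index is persisted exactly at each multiple of that threshold; the function name is hypothetical and the model ignores batching details.

```python
def split_by_sync_threshold(total_records, sync_threshold=1000):
    """Return (persisted, still_queued) under the simplified model described above."""
    persisted = (total_records // sync_threshold) * sync_threshold
    still_queued = total_records - persisted
    return persisted, still_queued


# First collection in the reproduction: 5202 records added.
print(split_by_sync_threshold(5202))  # (5000, 202)

# Second collection: 2202 records added.
print(split_by_sync_threshold(2202))  # (2000, 202)
```

Under these assumptions, 202 of the first collection's embeddings exist only in the queue, which matches the "Actual len: 5000" observed in the 0.5.12 reproduction once the second collection's purge removes them.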

Versions

Chroma version 0.5.7-0.5.12 (single-node and persistent)

Relevant log output

No response

@tazarov tazarov added the bug Something isn't working label Oct 10, 2024
@tazarov tazarov self-assigned this Oct 10, 2024
tazarov added a commit that referenced this issue Oct 10, 2024
Closes #2922
Closes #2912

It might be related to #2905

ccurme pushed a commit to langchain-ai/langchain that referenced this issue Oct 14, 2024
**Description:** Deprecated version of Chroma >=0.5.5 <0.5.12 due to a
serious correctness issue that caused some embeddings for deployments
with multiple collections to be lost (read more on the issue in Chroma
repo)
**Issue:** chroma-core/chroma#2922 (fixed by chroma-core/chroma#2923
and released in
[0.5.13](https://github.com/chroma-core/chroma/releases/tag/0.5.13))
**Dependencies:** N/A
**Twitter handle:** `@t_azarov`