[Bug]: My collection loses vectors for some IDs #2912

Closed
ChocoL0rd opened this issue Oct 8, 2024 · 7 comments · Fixed by #2923
Labels
bug Something isn't working

Comments

@ChocoL0rd

What happened?

My collection loses vectors for some IDs after some time (or after a few queries; I don't know the cause). For example, in a collection with 5,202 elements, 202 embeddings disappear, leaving 5,000 valid records and 202 records without embeddings. Similarly, in another collection with 70,634 elements, only 70,000 remain valid and the rest lose their embeddings.

I am certain that the data was written correctly into the collection, and I am sure that the code interacting with these collections does nothing other than queries and get operations. All embeddings are of the same size, and the collection functions normally until an unknown point in time.

Additionally, the command collection.get(include=["embeddings"]) stops working and throws an error:

{
	"name": "IndexError",
	"message": "list assignment index out of range",
	"stack": "---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 old_resp = old_collection.get(include=[\"metadatas\", \"embeddings\"])
      2 old_ids = old_resp[\"ids\"]
      3 old_metas = old_resp[\"metadatas\"]

File ~/someperson/retrieval_ann/.venv/lib/python3.10/site-packages/chromadb/api/models/Collection.py:117, in Collection.get(self, ids, where, limit, offset, where_document, include)
     95 \"\"\"Get embeddings and their associate data from the data store. If no ids or where filter is provided returns
     96 all embeddings up to limit starting at offset.
     97 
   (...)
    108 
    109 \"\"\"
    110 (
    111     valid_ids,
    112     valid_where,
    113     valid_where_document,
    114     valid_include,
    115 ) = self._validate_and_prepare_get_request(ids, where, where_document, include)
--> 117 get_results = self._client._get(
    118     self.id,
    119     valid_ids,
    120     valid_where,
    121     None,
    122     limit,
    123     offset,
    124     where_document=valid_where_document,
    125     include=valid_include,
    126 )
    128 return self._transform_get_response(get_results, include)

File ~/someperson/retrieval_ann/.venv/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    144 global tracer, granularity
    145 if trace_granularity < granularity:
--> 146     return f(*args, **kwargs)
    147 if not tracer:
    148     return f(*args, **kwargs)

File ~/someperson/retrieval_ann/.venv/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py:47, in rate_limit.<locals>.decorator.<locals>.wrapper(self, *args, **kwargs)
     42 @wraps(f)
     43 def wrapper(self, *args: Any, **kwargs: Dict[Any, Any]) -> Any:
     44     # If not rate limiting provider is present, just run and return the function.
     46     if self._system.settings.chroma_rate_limiting_provider_impl is None:
---> 47         return f(self, *args, **kwargs)
     49     if subject in kwargs:
     50         subject_value = kwargs[subject]

File ~/someperson/retrieval_ann/.venv/lib/python3.10/site-packages/chromadb/api/segment.py:513, in SegmentAPI._get(self, collection_id, ids, where, sort, limit, offset, page, page_size, where_document, include)
    511     vector_ids = [r[\"id\"] for r in records]
    512     vector_segment = self._manager.get_segment(collection_id, VectorReader)
--> 513     vectors = vector_segment.get_vectors(ids=vector_ids)
    515 # TODO: Fix type so we don't need to ignore
    516 # It is possible to have a set of records, some with metadata and some without
    517 # Same with documents
    519 metadatas = [r[\"metadata\"] for r in records]

File ~/someperson/retrieval_ann/.venv/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    144 global tracer, granularity
    145 if trace_granularity < granularity:
--> 146     return f(*args, **kwargs)
    147 if not tracer:
    148     return f(*args, **kwargs)

File ~/someperson/retrieval_ann/.venv/lib/python3.10/site-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py:390, in PersistentLocalHnswSegment.get_vectors(self, ids)
    388     for label, vector in zip(hnsw_labels, vectors):
    389         id = self._label_to_id[label]
--> 390         results[id_to_index[id]] = VectorEmbeddingRecord(
    391             id=id, embedding=vector
    392         )
    394 return results

IndexError: list assignment index out of range"
}

However, if I run collection.get(include=[anything but embeddings]), it returns everything as expected.
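To narrow down which records are affected, one option is to probe the collection in batches and fall back to single-ID lookups when a batch fails. The sketch below is only a diagnostic idea, not part of the fix: `find_ids_without_embeddings` and `_has_embedding` are hypothetical helper names, and it assumes a lost vector shows up either as an exception (like the `IndexError` above) or as fewer embeddings than IDs in the response. It relies only on `collection.get(ids=..., include=["embeddings"])`.

```python
from typing import Any, List, Sequence


def _has_embedding(collection: Any, record_id: str) -> bool:
    """Return True if a single-ID get() yields exactly one embedding."""
    try:
        resp = collection.get(ids=[record_id], include=["embeddings"])
    except Exception:
        # Assumption: any failure on a single-ID lookup means the vector is gone.
        return False
    embeddings = resp.get("embeddings")
    return embeddings is not None and len(embeddings) == 1


def find_ids_without_embeddings(
    collection: Any, ids: Sequence[str], batch_size: int = 500
) -> List[str]:
    """Return the IDs (in input order) whose embeddings cannot be fetched."""
    missing: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch = list(ids[start:start + batch_size])
        try:
            resp = collection.get(ids=batch, include=["embeddings"])
            embeddings = resp.get("embeddings")
            batch_ok = embeddings is not None and len(embeddings) == len(batch)
        except Exception:
            batch_ok = False
        if not batch_ok:
            # Probe one ID at a time only inside batches that look broken.
            missing.extend(i for i in batch if not _has_embedding(collection, i))
    return missing
```

The ID list can come from a call that does not touch the vector segment, for example `collection.get(include=[])["ids"]` (assuming an empty `include` is accepted by the installed version), since the report says such calls still work.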

Chroma version: chromadb==0.5.7
(If an update is necessary, how will this affect my collections? Will I need to recalculate them?)

Versions

chromadb==0.5.7
ubuntu linux 22.04

Relevant log output

No response

@ChocoL0rd ChocoL0rd added the bug Something isn't working label Oct 8, 2024
@tazarov
Contributor

tazarov commented Oct 10, 2024

Hey @ChocoL0rd, thanks for reporting this. It seems like a severe problem, so I'll prioritize investigating.

tazarov added a commit that referenced this issue Oct 10, 2024
Closes #2922
Closes #2912

It might be related to #2905

@tazarov
Contributor

tazarov commented Oct 10, 2024

@ChocoL0rd, I found the culprit; see #2922 if you want to dive deeper. The bottom line is that we're cutting a new release that fixes this, and we may deprecate the affected releases (>0.5.5, <=0.5.12) on PyPI.
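A quick way to check whether an installed version falls inside the range quoted above is to compare version tuples. This is a minimal sketch that takes the bounds (>0.5.5, <=0.5.12) from the comment as given; `parse_version` and `is_affected` are hypothetical helper names, and pre-release suffixes are ignored.

```python
import re
from typing import Tuple

# Bounds as quoted in the comment: affected releases are >0.5.5 and <=0.5.12.
AFFECTED_AFTER = (0, 5, 5)
AFFECTED_UP_TO = (0, 5, 12)


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading MAJOR.MINOR.PATCH numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version: str) -> bool:
    """Return True if the version lies in the affected range (exclusive, inclusive]."""
    return AFFECTED_AFTER < parse_version(version) <= AFFECTED_UP_TO
```

With chromadb installed, the version string can be read from the standard library via `importlib.metadata.version("chromadb")` and passed to `is_affected`.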

@ChocoL0rd
Author

ChocoL0rd commented Oct 10, 2024 via email

@tazarov
Contributor

tazarov commented Oct 10, 2024

@ChocoL0rd, shortly. Possibly in the next couple of hours.

@ChocoL0rd
Author

ChocoL0rd commented Oct 11, 2024 via email

@tazarov
Contributor

tazarov commented Oct 11, 2024

Hey @ChocoL0rd, yes. You don't necessarily need to recreate the whole DB. If you have a lot of data and don't want to re-embed it, let me know and I can help recreate the missing embeddings.

@ChocoL0rd
Author

ChocoL0rd commented Oct 11, 2024

@tazarov, first of all, I want to know which version is stable to use. As far as I can see, 0.4.24 doesn't have this problem. Would you recommend using that version or 0.5.13, and why? It's an important question because I also use 0.4.24, so I want to know what kinds of problems it might have.
I also have a question about differences in vector search. (If it's off topic, let me know and let's discuss it elsewhere, but I'll ask anyway.)

  1. If I have two DBs, the first with cosine similarity and the second with l2 or ip similarity (but with L2-normalized vectors), do they work the same?
  2. If I have two different versions of chromadb, 0.4.24 and 0.5.12, will the same DB with the same metric (cosine) work the same?

And yes, I need to recreate my DB.

Thanks for your answers.
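
On the first question, the math for unit-length vectors can be checked directly. The sketch below is an exact brute-force comparison, not Chroma's HNSW index, and it assumes the distance definitions given in Chroma's documentation: `l2` as squared Euclidean distance, `ip` as 1 minus the dot product, and `cosine` as 1 minus cosine similarity. Under those assumptions, for L2-normalized vectors `ip` equals `cosine` and `l2` equals twice `cosine`, so the exact ranking is the same; an approximate index could still return slightly different neighbors.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Vector, b: Vector) -> float:
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def ip_distance(a: Vector, b: Vector) -> float:
    return 1.0 - dot(a, b)


def squared_l2_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank(query: Vector, vectors: Sequence[Vector],
         distance: Callable[[Vector, Vector], float]) -> List[int]:
    """Return indices of `vectors` sorted from nearest to farthest."""
    return sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))


# Made-up example data, normalized before comparison.
query = normalize([1.0, 2.0, 3.0])
vectors = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                                  [3.0, 2.0, 1.0], [1.0, 2.0, 2.5])]

rank_cosine = rank(query, vectors, cosine_distance)
rank_ip = rank(query, vectors, ip_distance)
rank_l2 = rank(query, vectors, squared_l2_distance)
```

Here all three rankings come out identical, and for every vector the squared L2 distance is twice the cosine distance up to floating-point error.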
