-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] generate IDs during .add() if not provided #2582
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few open questions I think we should address:
- Should we let a user give us IDs only sometimes, e.g. by allowing
None
in theids
list? My gut says no, but worth thinking about. - How should we test the case where we do not pass any IDs? Do we want this to be a part of property testing?
chromadb/test/test_api.py
Outdated
@@ -225,6 +225,22 @@ def test_add(client): | |||
assert collection.count() == 2 | |||
|
|||
|
|||
def test_add_embeddings_without_ids(client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now but I think in general we want to stop adding tests to this file; it's a giant kitchen sink and mostly a leftover of the initial tests we had at launch. I think we should instead have a test-per-api surface (not necessarily in this PR). I'll create an issue for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened a new issue for this: #2586
@@ -261,6 +261,22 @@ def _validate_and_prepare_embedding_set( | |||
Optional[Documents], | |||
Optional[URIs], | |||
]: | |||
if ids is None: | |||
if embeddings: | |||
set_size = len(embeddings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this work correctly if embeddings is actually just one embedding? Since this is OneOrMany
and len
on a single embedding will give you the length of the vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, good argument for moving this to property tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will consume unnecessary entropy in the property tests, and should instead live on a better api test which looks for these kinds of exceptions. See #2586
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the new test a property test, exercises this path 379207d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I don't think this should be a property test. This is testing for a specific, known case that should throw an error. We only need a one-off test that should check that that error is thrown when we enter that case. Property testing is I think too heavyweight here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the property tests uncovered a few other edge cases that I didn't think about, so I would be in favor of keeping them, they only take a few seconds to run
@@ -134,12 +134,14 @@ def add_embeddings(self, record_set: strategies.RecordSet) -> MultipleResults[ID | |||
if normalized_record_set["embeddings"] | |||
else None, | |||
} | |||
self.collection.add(**normalized_record_set) # type: ignore[arg-type] | |||
result = self.collection.add(**normalized_record_set) # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't currently generate RecordSet
without IDs. Do we want to do that in property tests? This might get more complicated if we want to allow some IDs to be None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens in the tests when we first do an add
with IDs, then one without? The comparison here would fail I think.
I don't think so, I can't think of a scenario where this would be helpful. |
Superseded by #2699. |
Closes #2286.
Does not update the JS client since it's in the midst of a refactor.