-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[ENH] Add support for default space in create coll config #5293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
1afbb8b
to
118338e
Compare
Support Default Embedding Space in Collection Configuration, with Space Validation This PR introduces enhanced logic for handling default vector spaces in collection configurations when creating collections in ChromaDB. If a user does not explicitly provide a distance space (e.g., 'cosine', 'l2') in the configuration or metadata, the system now sets it to the embedding function's defined default. Furthermore, the changes add validation to ensure that any user-provided space is supported by the relevant embedding function/model; unsupported combinations produce warnings (not errors) to adhere to robustness principles. Accompanying this feature are major test expansions for both Python and JavaScript, and updates to client, serialization, server/database, and embedding function logic across the stack. Key Changes• Collection configuration logic now uses the embedding function Affected Areas• chromadb/api/ This summary was automatically generated by @propel-code-bot |
if hnsw_config is not None and hnsw_config.get("space") is None: | ||
hnsw_config["space"] = ef.default_space() | ||
if spann_config is not None and spann_config.get("space") is None: | ||
spann_config["space"] = ef.default_space() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The logic for handling hnsw_config
and spann_config
mutations can lead to unexpected side effects. When you modify these dictionaries with hnsw_config["space"] = ef.default_space()
, you're potentially mutating the original input parameters, which could affect the caller's data.
Consider creating copies before modification:
if hnsw_config is not None and hnsw_config.get("space") is None: | |
hnsw_config["space"] = ef.default_space() | |
if spann_config is not None and spann_config.get("space") is None: | |
spann_config["space"] = ef.default_space() | |
# if hnsw config or spann config exists but space is not provided, populate it from ef | |
if hnsw_config is not None and hnsw_config.get("space") is None: | |
hnsw_config = dict(hnsw_config) # Create a copy to avoid mutation | |
hnsw_config["space"] = ef.default_space() | |
if spann_config is not None and spann_config.get("space") is None: | |
spann_config = dict(spann_config) # Create a copy to avoid mutation | |
spann_config["space"] = ef.default_space() |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a valid point @jai, we shouldn't mutate callers. Suggest higher up in the code we copy or remove the mutation. Imagine I create many collections with the same config
118338e
to
dea4784
Compare
// Get the effective embedding function for space validation and defaults | ||
const effectiveEmbeddingFunction = embeddingFunction || configuration?.embeddingFunction; | ||
|
||
if (effectiveEmbeddingFunction && effectiveEmbeddingFunction.defaultSpace && effectiveEmbeddingFunction.supportedSpaces) { | ||
// If no vector index config exists but we have an embedding function, | ||
// create default HNSW config with space from embedding function | ||
if (configuration?.hnsw === undefined && configuration?.spann === undefined) { | ||
if (metadata === undefined || metadata?.["hnsw:space"] === undefined) { | ||
// this populates space from ef if not provided in either config or metadata | ||
if (!configuration) configuration = {}; | ||
configuration.hnsw = { space: effectiveEmbeddingFunction.defaultSpace() }; | ||
} | ||
} | ||
|
||
// If HNSW config exists but space is not provided, populate it from embedding function | ||
if (configuration?.hnsw && !configuration.hnsw.space && effectiveEmbeddingFunction.defaultSpace) { | ||
configuration.hnsw.space = effectiveEmbeddingFunction.defaultSpace(); | ||
} | ||
|
||
// If SPANN config exists but space is not provided, populate it from embedding function | ||
if (configuration?.spann && !configuration.spann.space && effectiveEmbeddingFunction.defaultSpace) { | ||
configuration.spann.space = effectiveEmbeddingFunction.defaultSpace(); | ||
} | ||
|
||
// Validate space compatibility with embedding function | ||
if (effectiveEmbeddingFunction.supportedSpaces) { | ||
const supportedSpaces = effectiveEmbeddingFunction.supportedSpaces(); | ||
|
||
if (configuration?.hnsw?.space && !supportedSpaces.includes(configuration.hnsw.space)) { | ||
throw new ChromaValueError( | ||
`Space '${configuration.hnsw.space}' is not supported by embedding function '${effectiveEmbeddingFunction.name || 'unknown'}'. ` + | ||
`Supported spaces: ${supportedSpaces.join(', ')}` | ||
); | ||
} | ||
|
||
if (configuration?.spann?.space && !supportedSpaces.includes(configuration.spann.space)) { | ||
throw new ChromaValueError( | ||
`Space '${configuration.spann.space}' is not supported by embedding function '${configuration.spann.space}'. ` + | ||
`Supported spaces: ${supportedSpaces.join(', ')}` | ||
); | ||
} | ||
|
||
// Validate space compatibility with metadata legacy space config | ||
if (metadata && metadata["hnsw:space"] && !supportedSpaces.includes(metadata["hnsw:space"])) { | ||
throw new ChromaValueError( | ||
`Space '${metadata["hnsw:space"]}' from metadata is not supported by embedding function '${effectiveEmbeddingFunction.name || 'unknown'}'. ` + | ||
`Supported spaces: ${supportedSpaces.join(', ')}` | ||
); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Similar to the Python implementation, the logic here for validating the distance space (space
) could lead to incorrect validation failures due to its handling of precedence. The code independently validates the space from the configuration
object and the legacy metadata
.
If a user provides a valid space in configuration.hnsw.space
but also has an outdated, invalid hnsw:space
in metadata
, this function will throw an error, even though the metadata
value would be correctly ignored in favor of the configuration
value.
To improve this, the logic should first determine the single effective space
value by checking configuration
, then metadata
, and finally the embedding function's default. Only this final, effective value should be validated. This would align with the principle that the configuration
object is the primary source of truth and metadata
is for backward compatibility.
dea4784
to
30be1d5
Compare
12bdb8b
to
5cdede2
Compare
5cdede2
to
77ade2c
Compare
clients/new-js/packages/chromadb/src/collection-configuration.ts
Outdated
Show resolved
Hide resolved
77ade2c
to
b147d4d
Compare
}: { | ||
configuration?: CreateCollectionConfiguration; | ||
embeddingFunction?: EmbeddingFunction | null; | ||
metadata?: Record<string, any>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be a Metadata
type
28c7e00
to
1ae0a5b
Compare
# Validate space compatibility with embedding function | ||
if hnsw_config is not None: | ||
if hnsw_config.get("space") not in ef.supported_spaces(): | ||
raise InvalidSpaceError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should error, if you override, we should just warn, you may know what you are doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In spirit of https://en.wikipedia.org/wiki/Robustness_principle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -383,12 +386,47 @@ def create_collection_configuration_to_json( | |||
if ef.is_legacy(): | |||
ef_config = {"type": "legacy"} | |||
else: | |||
if hnsw_config is None and spann_config is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment a the top level explaining the intended fallback strategy would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -383,12 +386,47 @@ def create_collection_configuration_to_json( | |||
if ef.is_legacy(): | |||
ef_config = {"type": "legacy"} | |||
else: | |||
if hnsw_config is None and spann_config is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you set both? Should we error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats handled earlier in the client, itll error saying you can only set one
if hnsw_config is not None and spann_config is not None: |
1ae0a5b
to
65ec09a
Compare
UserWarning, | ||
stacklevel=2, | ||
) | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this last branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a user couldve set the config with some values but also the space in the metadata. in that case, we've determined to ignore the metadata, so we should only validate and warn if we're going to use the space (which would only happen if neither hnsw nor spann were specified
65ec09a
to
1e04570
Compare
Description of changes
This PR adds support for using default space defined in the embedding function when populating collection config if not provided by the user. It also adds validation for supported space in ef in case the user defines a space that is not supported by the embedding function and/or model
Test plan
Added unit tests in js and
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
No
Observability plan
What is the plan to instrument and monitor this change?
N/A
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?
No changes needed