Skip to content

[ENH] Add support for default space in create coll config #5293

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 18, 2025

Conversation

jairad26
Copy link
Contributor

@jairad26 jairad26 commented Aug 16, 2025

Description of changes

This PR adds support for using default space defined in the embedding function when populating collection config if not provided by the user. It also adds validation for supported space in ef in case the user defines a space that is not supported by the embedding function and/or model

Test plan

Added unit tests in js and

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

No

Observability plan

What is the plan to instrument and monitor this change?

N/A

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

No changes needed

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@jairad26 jairad26 marked this pull request as ready for review August 16, 2025 20:20
Copy link
Contributor

propel-code-bot bot commented Aug 16, 2025

Support Default Embedding Space in Collection Configuration, with Space Validation

This PR introduces enhanced logic for handling default vector spaces in collection configurations when creating collections in ChromaDB. If a user does not explicitly provide a distance space (e.g., 'cosine', 'l2') in the configuration or metadata, the system now sets it to the embedding function's defined default. Furthermore, the changes add validation to ensure that any user-provided space is supported by the relevant embedding function/model; unsupported combinations produce warnings (not errors) to adhere to robustness principles. Accompanying this feature are major test expansions for both Python and JavaScript, and updates to client, serialization, server/database, and embedding function logic across the stack.

Key Changes

• Collection configuration logic now uses the embedding functions default space if space isnt set in the HNSW/SPANN config or legacy metadata.
• Validation warns (rather than errors) if the requested space (from config or metadata) is not supported by the embedding functions ``supported_spaces' list.
• Explicit handling and prioritization between configuration, legacy metadata, and default space for setting the collection's distance metric.
• Updates to test suites in Python and TypeScript/Jest to cover default space selection, valid and invalid space combinations, prioritization logic, and warnings.
• Minor corrections in embedding function implementations to ensure supported_spaces and default_space methods exist where needed.
• Client, server (HTTP, gRPC, Rust, Segment), and system/database layers updated to pass metadata where needed and ensure space population/validation logic is preserved across APIs.
• pnpm lockfile and dependency adjustments for JS client compatibility.

Affected Areas

• chromadb/api/collection_configuration.py (core config logic and validation)
• chromadb/db/mixins/sysdb.py, chromadb/db/impl/grpc/client.py, chromadb/api/fastapi.py, chromadb/api/async_fastapi.py, chromadb/api/rust.py, chromadb/api/segment.py (all updated to handle/propagate metadata and config)
• utils/embedding_functions (baseten and cohere EF: add/adjust default_space and supported_spaces)
• clients/new-js/packages/chromadb/src/collection-configuration.ts (JS config logic and validation)
• clients/new-js/packages/chromadb/src/chroma-client.ts (propagating metadata to processCreateCollectionConfig)
• clients/new-js/packages/chromadb/test/collection.config.client.test.ts (comprehensive new and updated tests)
• chromadb/test/configurations/test_collection_configuration.py (comprehensive new and updated tests)
• clients/new-js/pnpm-lock.yaml (dependency update)

This summary was automatically generated by @propel-code-bot

Comment on lines +392 to +404
if hnsw_config is not None and hnsw_config.get("space") is None:
hnsw_config["space"] = ef.default_space()
if spann_config is not None and spann_config.get("space") is None:
spann_config["space"] = ef.default_space()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The logic for handling hnsw_config and spann_config mutations can lead to unexpected side effects. When you modify these dictionaries with hnsw_config["space"] = ef.default_space(), you're potentially mutating the original input parameters, which could affect the caller's data.

Consider creating copies before modification:

Suggested change
if hnsw_config is not None and hnsw_config.get("space") is None:
hnsw_config["space"] = ef.default_space()
if spann_config is not None and spann_config.get("space") is None:
spann_config["space"] = ef.default_space()
# if hnsw config or spann config exists but space is not provided, populate it from ef
if hnsw_config is not None and hnsw_config.get("space") is None:
hnsw_config = dict(hnsw_config) # Create a copy to avoid mutation
hnsw_config["space"] = ef.default_space()
if spann_config is not None and spann_config.get("space") is None:
spann_config = dict(spann_config) # Create a copy to avoid mutation
spann_config["space"] = ef.default_space()

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid point @jai, we shouldn't mutate callers. Suggest higher up in the code we copy or remove the mutation. Imagine I create many collections with the same config

@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Aug 16, 2025
Comment on lines 78 to 128
// Get the effective embedding function for space validation and defaults
const effectiveEmbeddingFunction = embeddingFunction || configuration?.embeddingFunction;

if (effectiveEmbeddingFunction && effectiveEmbeddingFunction.defaultSpace && effectiveEmbeddingFunction.supportedSpaces) {
// If no vector index config exists but we have an embedding function,
// create default HNSW config with space from embedding function
if (configuration?.hnsw === undefined && configuration?.spann === undefined) {
if (metadata === undefined || metadata?.["hnsw:space"] === undefined) {
// this populates space from ef if not provided in either config or metadata
if (!configuration) configuration = {};
configuration.hnsw = { space: effectiveEmbeddingFunction.defaultSpace() };
}
}

// If HNSW config exists but space is not provided, populate it from embedding function
if (configuration?.hnsw && !configuration.hnsw.space && effectiveEmbeddingFunction.defaultSpace) {
configuration.hnsw.space = effectiveEmbeddingFunction.defaultSpace();
}

// If SPANN config exists but space is not provided, populate it from embedding function
if (configuration?.spann && !configuration.spann.space && effectiveEmbeddingFunction.defaultSpace) {
configuration.spann.space = effectiveEmbeddingFunction.defaultSpace();
}

// Validate space compatibility with embedding function
if (effectiveEmbeddingFunction.supportedSpaces) {
const supportedSpaces = effectiveEmbeddingFunction.supportedSpaces();

if (configuration?.hnsw?.space && !supportedSpaces.includes(configuration.hnsw.space)) {
throw new ChromaValueError(
`Space '${configuration.hnsw.space}' is not supported by embedding function '${effectiveEmbeddingFunction.name || 'unknown'}'. ` +
`Supported spaces: ${supportedSpaces.join(', ')}`
);
}

if (configuration?.spann?.space && !supportedSpaces.includes(configuration.spann.space)) {
throw new ChromaValueError(
`Space '${configuration.spann.space}' is not supported by embedding function '${configuration.spann.space}'. ` +
`Supported spaces: ${supportedSpaces.join(', ')}`
);
}

// Validate space compatibility with metadata legacy space config
if (metadata && metadata["hnsw:space"] && !supportedSpaces.includes(metadata["hnsw:space"])) {
throw new ChromaValueError(
`Space '${metadata["hnsw:space"]}' from metadata is not supported by embedding function '${effectiveEmbeddingFunction.name || 'unknown'}'. ` +
`Supported spaces: ${supportedSpaces.join(', ')}`
);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Similar to the Python implementation, the logic here for validating the distance space (space) could lead to incorrect validation failures due to its handling of precedence. The code independently validates the space from the configuration object and the legacy metadata.

If a user provides a valid space in configuration.hnsw.space but also has an outdated, invalid hnsw:space in metadata, this function will throw an error, even though the metadata value would be correctly ignored in favor of the configuration value.

To improve this, the logic should first determine the single effective space value by checking configuration, then metadata, and finally the embedding function's default. Only this final, effective value should be validated. This would align with the principle that the configuration object is the primary source of truth and metadata is for backward compatibility.

@jairad26 jairad26 force-pushed the jai/default-space branch 2 times, most recently from 12bdb8b to 5cdede2 Compare August 17, 2025 19:25
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Aug 17, 2025
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Aug 17, 2025
}: {
configuration?: CreateCollectionConfiguration;
embeddingFunction?: EmbeddingFunction | null;
metadata?: Record<string, any>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a Metadata type

@jairad26 jairad26 force-pushed the jai/default-space branch 4 times, most recently from 28c7e00 to 1ae0a5b Compare August 18, 2025 01:39
# Validate space compatibility with embedding function
if hnsw_config is not None:
if hnsw_config.get("space") not in ef.supported_spaces():
raise InvalidSpaceError(
Copy link
Collaborator

@HammadB HammadB Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should error, if you override, we should just warn, you may know what you are doing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -383,12 +386,47 @@ def create_collection_configuration_to_json(
if ef.is_legacy():
ef_config = {"type": "legacy"}
else:
if hnsw_config is None and spann_config is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment a the top level explaining the intended fallback strategy would be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -383,12 +386,47 @@ def create_collection_configuration_to_json(
if ef.is_legacy():
ef_config = {"type": "legacy"}
else:
if hnsw_config is None and spann_config is None:
Copy link
Collaborator

@HammadB HammadB Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you set both? Should we error?

Copy link
Contributor Author

@jairad26 jairad26 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats handled earlier in the client, itll error saying you can only set one

if hnsw_config is not None and spann_config is not None:

UserWarning,
stacklevel=2,
)
if (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this last branch

Copy link
Contributor Author

@jairad26 jairad26 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a user couldve set the config with some values but also the space in the metadata. in that case, we've determined to ignore the metadata, so we should only validate and warn if we're going to use the space (which would only happen if neither hnsw nor spann were specified

@jairad26 jairad26 enabled auto-merge (squash) August 18, 2025 04:24
@jairad26 jairad26 merged commit 51a7d16 into main Aug 18, 2025
61 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants