-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[DOC] add docs for collection forking #5229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Add Documentation for Chroma Cloud Collection Forking Feature This pull request introduces comprehensive documentation for the new collection forking feature in Chroma Cloud, describing its copy-on-write storage model, usage patterns, pricing, and quotas/limits. It adds a dedicated "Collection Forking" documentation page, updates relevant references in the pricing and quotas/limits docs, incorporates an explanatory diagram, and ensures discoverability via the sidebar navigation. Several rounds of feedback have been addressed to clarify copy semantics, quota behaviors, correct example code, and stabilize formatting and terminology throughout. Key Changes• Added docs/markdoc/content/cloud/collection-forking.md: new page detailing forking semantics, cost, quotas, and intended usage scenarios; includes example code and a workflow diagram. Affected Areas• Documentation content for Cloud features This summary was automatically generated by @propel-code-bot |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
# Create a forked collection. Name must be unique within the database. | ||
forked_collection = source_collection.fork(name="main-repo-index-pr-1234") | ||
|
||
# Forked collection is immediately queryable; changes are isolated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Documentation]
Add missing article for grammatical correctness: change "Forked collection is immediately queryable; changes are isolated" to "The forked collection is immediately queryable; changes are isolated."
|
||
Forking lets you create a new collection from an existing one instantly, using copy-on-write under the hood. The forked collection initially shares its data with the source and only incurs additional storage for incremental changes you make afterward. | ||
|
||
{% Banner type="info" %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"info" is not supported. It's "note" for yellow, "tip" for blue, and "warn" for red
|
||
{% /TabbedCodeBlock %} | ||
|
||
For a longer end-to-end demo, see the advanced forking example in the Chroma repo: [Forking notebook](https://github.com/chroma-core/chroma/blob/main/examples/advanced/forking.ipynb). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this notebook you can find a comprehensive demo, where we index a codebase in a Chroma collection, and use forking to efficiently create collections for new branches.
|
||
## Quotas and errors | ||
|
||
Forking is subject to a limit on the total number of fork edges from the root. This counts every edge in the fork graph from the root collection (e.g., A→B→C is 2; A→[B, C], B→D is 3). The current default limit is **4,096**. If you exceed it, the fork request returns a quota error for the `NUM_FORKS` rule — catch it and fall back to creating a new collection with a full copy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a diagram would be better here
## When to use forking | ||
|
||
- **Data versioning/checkpointing**: Maintain consistent snapshots as your data evolves. | ||
- **Git-like workflows**: For example, index a pull request by forking the main repository’s collection, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say "index a branch by forking from its divergence point". "pull request" is more specific to GitHub.
Also if this refers specifically to code, change "entire dataset" to "entire codebase"?
|
||
- **Data versioning/checkpointing**: Maintain consistent snapshots as your data evolves. | ||
- **Git-like workflows**: For example, index a pull request by forking the main repository’s collection, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire dataset. | ||
- **Experimentation**: Safely test schema changes, new embedding functions, or cleaning pipelines without touching production data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you'd be able to test a new embedding function since the forked collection should have the same configuration?
|
||
## Notes | ||
|
||
- Forking is within the same database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your forked collections will belong to the same db as the source
|
||
- **Data versioning/checkpointing**: Maintain consistent snapshots as your data evolves. | ||
- **Git-like workflows**: For example, index a pull request by forking the main repository’s collection, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire dataset. | ||
- **Git-like workflows**: For example, index a branch by forking from its divergence point, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire codebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Documentation]
Remove duplicate bullet and fix indentation: two consecutive "Git-like workflows" bullets (lines 68-69) repeat the same heading, and the second one has an extra leading space. Merge or delete one to avoid confusion and ensure proper Markdown rendering.
|
||
## Quotas and errors | ||
|
||
Chroma limits the number of fork edges in your fork tree. Every time you call "fork", a new edge is created from the parent to the child. The count includes edges created by forks on the root collection and on any of its descendants; see the diagram below. The current default limit is **4,096** edges per tree. If you delete a collection, its edge remains in the tree and still counts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Documentation]
Remove the leading space at the start of this paragraph to prevent unintended indentation.
@@ -0,0 +1,80 @@ | |||
# Collection Forking | |||
|
|||
**Collection forking enables instant, zero-copy collection branching in Chroma Cloud.** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the word branching is a bit odd, we called forking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i may not say zero-copy AND copy-on-write. I would just say "copy-on-write" everywehre
Forking lets you create a new collection from an existing one instantly, using copy-on-write under the hood. The forked collection initially shares its data with the source and only incurs additional storage for incremental changes you make afterward. | ||
|
||
{% Banner type="tip" %} | ||
**Forking is available in Chroma Cloud only.** The file system on single-node Chroma does not support forking — see [Single-Node Chroma: Performance and Limitations](../guides/deploy/performance). Chroma Cloud uses block storage that enables true copy-on-write semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Linking out to performance and limitations seems a bit odd.
- "Chroma Cloud uses block storage that enables true copy-on-write semantics." -> This doesn't quite make sense to me
1 Job Failed: PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_add.py)No logs available for this step. Summary: 1 successful workflow, 1 failed workflow
Last updated: 2025-08-08 21:59:20 UTC |
Documents forking feature for docs.trychroma.com
Preview: https://chroma-git-philipithomas-forking-docs-chromacore.vercel.app/cloud/collection-forking