[DOC] add docs for collection forking #5229

Merged: 5 commits, Aug 8, 2025

Conversation

philipithomas (Member) commented Aug 7, 2025

github-actions bot commented Aug 7, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Contributor

propel-code-bot bot commented Aug 7, 2025

Add Documentation for Chroma Cloud Collection Forking Feature

This pull request introduces comprehensive documentation for the new collection forking feature in Chroma Cloud, describing its copy-on-write storage model, usage patterns, pricing, and quotas/limits. It adds a dedicated "Collection Forking" documentation page, updates relevant references in the pricing and quotas/limits docs, incorporates an explanatory diagram, and ensures discoverability via the sidebar navigation. Several rounds of feedback have been addressed to clarify copy semantics, quota behaviors, correct example code, and stabilize formatting and terminology throughout.

Key Changes

• Added docs/markdoc/content/cloud/collection-forking.md: new page detailing forking semantics, cost, quotas, and intended usage scenarios; includes example code and a workflow diagram.
• Modified docs/markdoc/content/cloud/pricing.md: added a section about forking costs and linked to the forking documentation.
• Updated docs/markdoc/content/cloud/quotas-limits.md: added the fork edges quota (4,096) and a link to further details in the forking documentation.
• Updated docs/markdoc/content/sidebar-config.ts: added Collection Forking to the Cloud doc section navigation sidebar.
• Added/modified diagram assets (e.g., fork-edges-light.png, fork-edges-dark.png) to visually illustrate fork edge quota/structure.

Affected Areas

• Documentation content for Cloud features
• Pricing documentation
• Quota/limits documentation
• Sidebar navigation/configuration
• Image assets for Cloud documentation

This summary was automatically generated by @propel-code-bot

vercel bot commented Aug 7, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| chroma | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Aug 8, 2025 8:46pm |

# Create a forked collection. Name must be unique within the database.
forked_collection = source_collection.fork(name="main-repo-index-pr-1234")

# Forked collection is immediately queryable; changes are isolated
Contributor

[Documentation]

Add missing article for grammatical correctness: change "Forked collection is immediately queryable; changes are isolated" to "The forked collection is immediately queryable; changes are isolated."


Forking lets you create a new collection from an existing one instantly, using copy-on-write under the hood. The forked collection initially shares its data with the source and only incurs additional storage for incremental changes you make afterward.

{% Banner type="info" %}
Contributor

"info" is not supported. It's "note" for yellow, "tip" for blue, and "warn" for red
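
To make the copy-on-write description quoted above more concrete, here is a toy in-memory model in Python. It is only a sketch: `Layer`, `CowCollection`, and `delta_size` are hypothetical names invented for illustration, and nothing here reflects how Chroma Cloud actually stores data. It demonstrates the two properties the paragraph claims: a fork starts with no data of its own, and later writes on either side stay isolated.

```python
from typing import Dict, Optional, Set


class Layer:
    """A set of changes stacked on top of an optional parent layer."""

    def __init__(self, parent: Optional["Layer"] = None) -> None:
        self.parent = parent
        self.writes: Dict[str, str] = {}
        self.deleted: Set[str] = set()

    def lookup(self, doc_id: str) -> Optional[str]:
        # Walk from the newest layer to the oldest; the first hit wins.
        layer: Optional[Layer] = self
        while layer is not None:
            if doc_id in layer.deleted:
                return None
            if doc_id in layer.writes:
                return layer.writes[doc_id]
            layer = layer.parent
        return None


class CowCollection:
    """Toy copy-on-write collection (hypothetical, not Chroma's implementation)."""

    def __init__(self, base: Optional[Layer] = None) -> None:
        # The head layer is private to this collection and is the only one it mutates.
        self._head = Layer(parent=base)

    def upsert(self, doc_id: str, text: str) -> None:
        self._head.deleted.discard(doc_id)
        self._head.writes[doc_id] = text

    def delete(self, doc_id: str) -> None:
        self._head.writes.pop(doc_id, None)
        self._head.deleted.add(doc_id)

    def get(self, doc_id: str) -> Optional[str]:
        return self._head.lookup(doc_id)

    def fork(self) -> "CowCollection":
        # Freeze the current head so both sides can share it without copying,
        # then give the source and the fork a fresh private layer each.
        frozen = self._head
        self._head = Layer(parent=frozen)
        return CowCollection(base=frozen)

    def delta_size(self) -> int:
        # Number of records written to this collection's private layer.
        return len(self._head.writes)


source = CowCollection()
source.upsert("a", "alpha")

forked = source.fork()
print(forked.get("a"))      # alpha (read through the shared layer)
print(forked.delta_size())  # 0 (nothing stored by the fork yet)

forked.upsert("b", "beta")
print(source.get("b"))      # None (the fork's write is not visible to the source)
```

In this model, forking costs the same regardless of how much data the source holds, which is the intuition behind "instant" forks; the real system's cost model is described in the pricing docs referenced in this PR.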


{% /TabbedCodeBlock %}

For a longer end-to-end demo, see the advanced forking example in the Chroma repo: [Forking notebook](https://github.com/chroma-core/chroma/blob/main/examples/advanced/forking.ipynb).
Contributor

In this notebook you can find a comprehensive demo, where we index a codebase in a Chroma collection, and use forking to efficiently create collections for new branches.


## Quotas and errors

Forking is subject to a limit on the total number of fork edges from the root. This counts every edge in the fork graph from the root collection (e.g., A→B→C is 2; A→[B, C], B→D is 3). The current default limit is **4,096**. If you exceed it, the fork request returns a quota error for the `NUM_FORKS` rule — catch it and fall back to creating a new collection with a full copy.
Contributor

I think a diagram would be better here
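
The counting rule in the quoted paragraph can also be checked with a few lines of Python. This is a minimal sketch, assuming the fork graph is available as a plain parent-to-children mapping; `count_fork_edges` is a hypothetical helper, not part of any Chroma client. It reproduces the two examples from the text.

```python
from typing import Dict, List


def count_fork_edges(children: Dict[str, List[str]], root: str) -> int:
    """Count every parent-to-child edge reachable from `root`."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        total += len(kids)   # one edge per direct fork of this node
        stack.extend(kids)   # descendants' forks count as well
    return total


# A -> B -> C
chain = {"A": ["B"], "B": ["C"]}
# A -> [B, C], B -> D
branching = {"A": ["B", "C"], "B": ["D"]}

print(count_fork_edges(chain, "A"))      # 2
print(count_fork_edges(branching, "A"))  # 3
```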

## When to use forking

- **Data versioning/checkpointing**: Maintain consistent snapshots as your data evolves.
- **Git-like workflows**: For example, index a pull request by forking the main repository’s collection, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire dataset.
Contributor

I would say "index a branch by forking from its divergence point". "pull request" is more specific to GitHub.

Also if this refers specifically to code, change "entire dataset" to "entire codebase"?


- **Data versioning/checkpointing**: Maintain consistent snapshots as your data evolves.
- **Git-like workflows**: For example, index a pull request by forking the main repository’s collection, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire dataset.
- **Experimentation**: Safely test schema changes, new embedding functions, or cleaning pipelines without touching production data.
Contributor

I don't think you'd be able to test a new embedding function since the forked collection should have the same configuration?


## Notes

- Forking is within the same database.
Contributor

Your forked collections will belong to the same db as the source


- **Data versioning/checkpointing**: Maintain consistent snapshots as your data evolves.
- **Git-like workflows**: For example, index a pull request by forking the main repository’s collection, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire dataset.
- **Git-like workflows**: For example, index a branch by forking from its divergence point, then apply the diff to the fork. This saves both write and storage costs compared to re-ingesting the entire codebase.
Contributor

[Documentation]

Remove duplicate bullet and fix indentation: two consecutive "Git-like workflows" bullets (lines 68-69) repeat the same heading, and the second one has an extra leading space. Merge or delete one to avoid confusion and ensure proper Markdown rendering.

@philipithomas philipithomas requested a review from itaismith August 8, 2025 20:42

## Quotas and errors

Chroma limits the number of fork edges in your fork tree. Every time you call "fork", a new edge is created from the parent to the child. The count includes edges created by forks on the root collection and on any of its descendants; see the diagram below. The current default limit is **4,096** edges per tree. If you delete a collection, its edge remains in the tree and still counts.
Contributor

[Documentation]

Remove the leading space at the start of this paragraph to prevent unintended indentation.
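
The revised quota paragraph adds one detail that is easy to miss: deleting a collection does not free its edge. The sketch below models that behavior in plain Python. `ForkTree` and `ForkQuotaExceeded` are hypothetical names used only for illustration; the real service enforces the quota server-side, and its error type is not shown in this thread.

```python
from typing import Set


class ForkQuotaExceeded(Exception):
    """Hypothetical stand-in for the quota error a fork request would return."""


class ForkTree:
    """Toy model of one fork tree and its edge quota."""

    def __init__(self, root: str, max_edges: int = 4096) -> None:
        self.max_edges = max_edges
        self.edge_count = 0
        self.live: Set[str] = {root}

    def fork(self, parent: str, child: str) -> None:
        if parent not in self.live:
            raise KeyError(f"unknown or deleted collection: {parent}")
        if self.edge_count >= self.max_edges:
            raise ForkQuotaExceeded(
                f"fork tree already has {self.edge_count} edges"
            )
        self.edge_count += 1  # every fork adds one parent-to-child edge
        self.live.add(child)

    def delete(self, name: str) -> None:
        # The collection goes away, but edge_count is deliberately unchanged:
        # per the quoted text, a deleted collection's edge still counts.
        self.live.remove(name)


tree = ForkTree("A", max_edges=2)  # tiny limit so the quota is easy to hit
tree.fork("A", "B")
tree.fork("B", "C")
tree.delete("C")
print(tree.edge_count)  # 2

try:
    tree.fork("A", "D")
except ForkQuotaExceeded as err:
    print("quota hit:", err)
```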

@@ -0,0 +1,80 @@
# Collection Forking

**Collection forking enables instant, zero-copy collection branching in Chroma Cloud.**
Collaborator

Using the word "branching" is a bit odd; we called it forking.

Collaborator

I might not say zero-copy AND copy-on-write. I would just say "copy-on-write" everywhere.

Forking lets you create a new collection from an existing one instantly, using copy-on-write under the hood. The forked collection initially shares its data with the source and only incurs additional storage for incremental changes you make afterward.

{% Banner type="tip" %}
**Forking is available in Chroma Cloud only.** The file system on single-node Chroma does not support forking — see [Single-Node Chroma: Performance and Limitations](../guides/deploy/performance). Chroma Cloud uses block storage that enables true copy-on-write semantics.
Collaborator

  1. Linking out to performance and limitations seems a bit odd.
  2. "Chroma Cloud uses block storage that enables true copy-on-write semantics." -> This doesn't quite make sense to me

Member Author

1 Job Failed:

PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_add.py)

No logs available for this step.


Summary: 1 successful workflow, 1 failed workflow

Last updated: 2025-08-08 21:59:20 UTC

@philipithomas philipithomas merged commit fcf8d68 into main Aug 8, 2025
57 of 59 checks passed
3 participants