Skip to content

[ENH]: add garbage collection CLI for manual garbage collection #5250

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 13, 2025

Conversation

codetheweb
Copy link
Contributor

@codetheweb codetheweb commented Aug 12, 2025

Description of changes

Adds a CLI for garbage collection utilities. Most utilities are not directly related to garbage collection so I'm happy to move them if there's a better place to put them. The new tools:

  • garbage-collect: manually run GC on a specific collection
  • export-sysdb-collections: connect to the SysDb Postgres database and export collection rows to a .jsonl file
  • import-sysdb-collections: connect to the SysDb Postgres database and import collection rows from a .jsonl file
  • download-collections: download all files from object storage associated with a set of collections

Test plan

How are these changes tested?

Used new tools extensively to debug GC.

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@codetheweb codetheweb marked this pull request as ready for review August 13, 2025 21:11
Copy link
Contributor

propel-code-bot bot commented Aug 13, 2025

Add GC CLI Tool: Manual Garbage Collection, SysDb Import/Export, and Collection File Download

This PR introduces a new CLI tool (garbage_collector_tool) to the Chroma garbage collector Rust crate. The tool provides multiple subcommands: running manual garbage collection on a single collection, exporting collections from SysDb (Postgres) to a JSONL file, importing collections from a JSONL file into SysDb, and downloading all files from object storage associated with specified collections. These commands are intended for manual maintenance, migration, debugging, and data recovery tasks, and interact directly with the system's database and storage layers. Supporting changes include updated dependency management and slight refactoring to make configuration types accessible across binaries.

Key Changes

• Added garbage_collector_tool`.rs` binary implementing a CLItool using `clap`, with subcommands for manualGC, export/import of collections, and downloading collection artifacts from object storage. • Manual GCruns v2 orchestrator logic for a single collection with configurable cleanup mode and cutoffs; retains min-version logic. • Export collections: connects to `SysDb` Postgres, supports exporting a collection (and optionally its children) to .jsonl; import supports restoring from such a file while ensuring referential integrity for tenants/databases. • Download collections: given exported collection metadata, downloads version/lineage files and sparse indices (but not blocks, which is unimplemented) from object storage with concurrent progress feedback. • Extended dependencies: adds `clap`, `indicatif`, `sqlx` (with Postgres and chrono support), and exposes config struct as public forCLI`` consumption.
• Minor refactor: makes config struct fields public and moves module boundary (lib, bin) to support shared code.
• Cargo and lockfile updated for new dependencies.

Affected Areas

garbage_collector_tool CLI (new binary)
garbage_collector config (broader visibility, public fields)
• build scripts and dependency management (Cargo.toml, Cargo.lock)
• manual operation/maintenance/documentation

This summary was automatically generated by @propel-code-bot

Comment on lines +316 to +317
for version in decoded_version_file.version_history.unwrap().versions {
for segment in version.segment_info.unwrap().segment_compaction_info {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Using .unwrap() here could cause the tool to panic if version_history or segment_info are None in the decoded protobuf file. This can make debugging harder if you encounter a malformed or unexpected version file.

Consider using a more robust way to handle these Options, for example by using into_iter().flat_map() to gracefully handle None cases and make the code more concise.

for version in decoded_version_file
    .version_history
    .into_iter()
    .flat_map(|vh| vh.versions)
{
    for segment in version
        .segment_info
        .into_iter()
        .flat_map(|si| si.segment_compaction_info)
    {
        // ... rest of the loop
    }
}

}

bar.inc(1);
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PerformanceOptimization]

This sleep seems intended to help with the progress bar animation, but it will unnecessarily slow down the concurrent downloads. The bar.enable_steady_tick(...) call should be sufficient to keep the spinner animated without introducing a delay for every downloaded file. I recommend removing this line, and the similar one on line 378, to improve performance.

Comment on lines 445 to 450
root_collection_id: row
.get::<Option<String>, _>("root_collection_id")
.map(|s| s.to_string()),
lineage_file_name: row
.get::<Option<String>, _>("lineage_file_name")
.map(|s| s.to_string()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NitPick]

The .map(|s| s.to_string()) calls here are redundant. row.get::<Option<String>, _>(...) already returns an Option<String>, so there's no need to map over it to convert the inner String to a String.

Suggested change
root_collection_id: row
.get::<Option<String>, _>("root_collection_id")
.map(|s| s.to_string()),
lineage_file_name: row
.get::<Option<String>, _>("lineage_file_name")
.map(|s| s.to_string()),
root_collection_id: row.get::<Option<String>, _>("root_collection_id"),
lineage_file_name: row.get::<Option<String>, _>("lineage_file_name"),

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

@@ -1,5 +1,558 @@
use chroma_blockstore::RootManager;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

This file has grown quite large and contains the logic for multiple distinct CLI commands. For better maintainability and separation of concerns, consider refactoring each command into its own module.

This would involve creating a commands module within the binary's source directory and moving the logic for each subcommand into its own file (e.g., garbage_collect.rs, download_collections.rs). The SerializedCollection struct could also be moved to a shared location within this new module.

Example structure:

src/bin/garbage_collector_tool.rs:

mod commands;
// ... clap setup ...

#[tokio::main]
async fn main() {
    // ... arg parsing ...
    match args.command {
        GarbageCollectorCommand::GarbageCollect { ... } => commands::garbage_collect::run(...).await,
        GarbageCollectorCommand::DownloadCollections { ... } => commands::download_collections::run(...).await,
        // ... etc
    }
}

src/bin/commands/mod.rs:

pub mod garbage_collect;
pub mod download_collections;
// ... other command modules

// Shared structs like SerializedCollection could live here or in a types.rs
#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub(crate) struct SerializedCollection { /* ... */ }

src/bin/commands/garbage_collect.rs:

// ... imports ...

pub(crate) async fn run(/*...args...*/) {
    // Logic for the garbage-collect command
}

This approach will make the tool easier to extend and maintain in the future.

@codetheweb codetheweb enabled auto-merge (squash) August 13, 2025 21:48
@codetheweb codetheweb disabled auto-merge August 13, 2025 21:48
@codetheweb codetheweb enabled auto-merge (squash) August 13, 2025 21:48
@codetheweb codetheweb merged commit 9e9c91a into main Aug 13, 2025
59 checks passed
@blacksmith-sh blacksmith-sh bot deleted a comment from codetheweb Aug 13, 2025
@codetheweb codetheweb deleted the feat-gc-cli-manual-gc branch August 13, 2025 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants