# [ENH]: add garbage collection CLI for manual garbage collection #5250
Conversation
**Reviewer Checklist** — please leverage this checklist to ensure your code review is thorough before approving.

- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
**Add GC CLI Tool: Manual Garbage Collection, SysDb Import/Export, and Collection File Download**

This PR introduces a new CLI tool.

*This summary was automatically generated by @propel-code-bot*
```rust
for version in decoded_version_file.version_history.unwrap().versions {
    for segment in version.segment_info.unwrap().segment_compaction_info {
```
**[BestPractice]** Using `.unwrap()` here could cause the tool to panic if `version_history` or `segment_info` are `None` in the decoded protobuf file. This can make debugging harder if you encounter a malformed or unexpected version file. Consider handling these `Option`s more robustly, for example by using `into_iter().flat_map()` to gracefully handle `None` cases and make the code more concise:
```rust
for version in decoded_version_file
    .version_history
    .into_iter()
    .flat_map(|vh| vh.versions)
{
    for segment in version
        .segment_info
        .into_iter()
        .flat_map(|si| si.segment_compaction_info)
    {
        // ... rest of the loop
    }
}
```
```rust
bar.inc(1);
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
```
**[PerformanceOptimization]** This `sleep` seems intended to help with the progress bar animation, but it will unnecessarily slow down the concurrent downloads. The `bar.enable_steady_tick(...)` call should be sufficient to keep the spinner animated without introducing a delay for every downloaded file. I recommend removing this line, and the similar one on line 378, to improve performance.
```rust
root_collection_id: row
    .get::<Option<String>, _>("root_collection_id")
    .map(|s| s.to_string()),
lineage_file_name: row
    .get::<Option<String>, _>("lineage_file_name")
    .map(|s| s.to_string()),
```
**[NitPick]** The `.map(|s| s.to_string())` calls here are redundant. `row.get::<Option<String>, _>(...)` already returns an `Option<String>`, so there's no need to map over it to convert the inner `String` to a `String`.
⚡ Committable suggestion:

```rust
root_collection_id: row.get::<Option<String>, _>("root_collection_id"),
lineage_file_name: row.get::<Option<String>, _>("lineage_file_name"),
```
```
@@ -1,5 +1,558 @@
use chroma_blockstore::RootManager;
```
**[BestPractice]** This file has grown quite large and contains the logic for multiple distinct CLI commands. For better maintainability and separation of concerns, consider refactoring each command into its own module. This would involve creating a `commands` module within the binary's source directory and moving the logic for each subcommand into its own file (e.g., `garbage_collect.rs`, `download_collections.rs`). The `SerializedCollection` struct could also be moved to a shared location within this new module.
Example structure:

`src/bin/garbage_collector_tool.rs`:

```rust
mod commands;

// ... clap setup ...

#[tokio::main]
async fn main() {
    // ... arg parsing ...
    match args.command {
        GarbageCollectorCommand::GarbageCollect { ... } => commands::garbage_collect::run(...).await,
        GarbageCollectorCommand::DownloadCollections { ... } => commands::download_collections::run(...).await,
        // ... etc
    }
}
```

`src/bin/commands/mod.rs`:

```rust
pub mod garbage_collect;
pub mod download_collections;
// ... other command modules

// Shared structs like SerializedCollection could live here or in a types.rs
#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub(crate) struct SerializedCollection { /* ... */ }
```

`src/bin/commands/garbage_collect.rs`:

```rust
// ... imports ...
pub(crate) async fn run(/*...args...*/) {
    // Logic for the garbage-collect command
}
```
This approach will make the tool easier to extend and maintain in the future.
## Description of changes

Adds a CLI for garbage collection utilities. Most utilities are not directly related to garbage collection, so I'm happy to move them if there's a better place to put them. The new tools:

- `garbage-collect`: manually run GC on a specific collection
- `export-sysdb-collections`: connect to the SysDb Postgres database and export collection rows to a `.jsonl` file
- `import-sysdb-collections`: connect to the SysDb Postgres database and import collection rows from a `.jsonl` file
- `download-collections`: download all files from object storage associated with a set of collections

## Test plan
How are these changes tested?
Used new tools extensively to debug GC.
## Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

## Observability plan

What is the plan to instrument and monitor this change?

## Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?