Support bundle: automatic deletion to maintain free dataset buffer#9662
Support bundle: automatic deletion to maintain free dataset buffer#9662
Conversation
5bc38cc to
c78018e
Compare
c78018e to
1c94bc0
Compare
- Use COUNT queries instead of loading all dataset/bundle rows - Use GROUP BY to get used_datasets and active_bundles in a single query - Add LIMIT to the deletion candidates query - Exclude Failed bundles from used_datasets (their dataset was expunged) These changes allow the auto-deletion query to scale to systems with thousands of datasets without loading all rows into memory.
Replace the two-step find-then-update approach with a single atomic CTE query that calculates how many deletions are needed AND performs them in one database operation. The previous approach had a time-of-check to time-of-use (TOCTTOU) issue where multiple Nexuses running concurrently could over-delete bundles: 1. Nexus A queries: free=2, needs 1 deletion -> gets B1 2. Nexus A transitions B1: Active -> Destroying 3. Nexus B queries: free=2 (unchanged, Destroying still occupies), needs 1 -> gets B2 4. Nexus B transitions B2: Active -> Destroying 5. Result: 2 bundles deleted when only 1 was needed The new atomic query: - Calculates free datasets and needed deletions - Respects min_bundles_to_keep constraint - Finds the N oldest Active bundles - Transitions them to Destroying state atomically - Returns the IDs of deleted bundles Also adds tests verifying: - Bundles are actually transitioned to Destroying state - Failed bundles don't count as occupying datasets - Explain test for SQL validity - Expectorate test for SQL output
hawkw
left a comment
There was a problem hiding this comment.
I had a bunch of very insignificant style nits, but overall, I like the approach used here.
| println!(" Total debug datasets: {total_datasets}"); | ||
| println!(" Active bundles: {active_bundles}"); | ||
| println!(" Free datasets: {free_datasets}"); | ||
| println!( | ||
| " Bundles marked for deletion: {bundles_marked_for_deletion}" |
There was a problem hiding this comment.
super unimportant nit: this will currently produce output formatted like:
Support Bundle Auto-Deletion Report:
Total debug datasets: 16
Active bundles: 2
Free datasets: 14
Bundles marked for deletion: 0
meanwhile, a lot of the existing background task status commands will align all the numbers in output like this into a column. it might be nice if this output also did that, like this:
Support Bundle Auto-Deletion Report:
Total debug datasets: 16
Active bundles: 2
Free datasets: 14
Bundles marked for deletion: 0
There was a problem hiding this comment.
also, I might consider changing this to nest "Free datasets" under "Total debug datasets", since that's a subset, and similarly, put "Bundles marked for deletion" under "Active bundles"?
| // License, v. 2.0. If a copy of the MPL was not distributed with this | ||
| // file, You can obtain one at https://mozilla.org/MPL/2.0/. | ||
|
|
||
| //! omdb commands for support bundle auto-deletion configuration |
There was a problem hiding this comment.
turbo nit: mayhaps this should just be
| //! omdb commands for support bundle auto-deletion configuration | |
| //! omdb commands for support bundle configuration |
since it's currently only configuring auto-completion but it's entirely possible that other configs might be set through this later?
| .support_bundle_config_get(opctx) | ||
| .await | ||
| .context("failed to get current support bundle config")?; | ||
|
|
There was a problem hiding this comment.
i wondered if, perhaps, we ought to have some code in here that validates whether the new values are in the range 0-100, but i see that there are CHECK constraints for that in the db schema, and those will fail when we try to set the new values. do we think that the error that bubbles up from the database query will be clear enough to the user in that case, or should we be pre-validating the values here?
if we decide to just rely on the CHECK constraints for this, maybe we should leave a comment here saying that we don't need to validate the desired percentages now since they get CHECKed when we try to go set them?
| // Return type: (total_datasets, used_datasets, active_bundles, deleted_ids) | ||
| let result: (i64, i64, i64, Vec<Uuid>) = query | ||
| .get_result_async(&*conn) | ||
| .await | ||
| .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?; | ||
|
|
||
| let total_datasets = result.0 as usize; | ||
| let used_datasets = result.1 as usize; | ||
| let active_bundles = result.2 as usize; | ||
| let deleted_ids: Vec<SupportBundleUuid> = result | ||
| .3 | ||
| .into_iter() | ||
| .map(SupportBundleUuid::from_untyped_uuid) | ||
| .collect(); | ||
|
|
||
| Ok(AutoDeletionResult { | ||
| deleted_ids, | ||
| total_datasets, | ||
| active_bundles, |
There was a problem hiding this comment.
personally, i feel like this would read a bit better if it destructured the tuple rather than using indices, like:
| // Return type: (total_datasets, used_datasets, active_bundles, deleted_ids) | |
| let result: (i64, i64, i64, Vec<Uuid>) = query | |
| .get_result_async(&*conn) | |
| .await | |
| .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?; | |
| let total_datasets = result.0 as usize; | |
| let used_datasets = result.1 as usize; | |
| let active_bundles = result.2 as usize; | |
| let deleted_ids: Vec<SupportBundleUuid> = result | |
| .3 | |
| .into_iter() | |
| .map(SupportBundleUuid::from_untyped_uuid) | |
| .collect(); | |
| Ok(AutoDeletionResult { | |
| deleted_ids, | |
| total_datasets, | |
| active_bundles, | |
| let (total_datasets, used_datasets, active_bundles, deleted_ids): (i64, i64, i64, Vec<Uuid>) = query | |
| .get_result_async(&*conn) | |
| .await | |
| .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?; | |
| let deleted_ids: Vec<SupportBundleUuid> = deleted_ids | |
| .into_iter() | |
| .map(SupportBundleUuid::from_untyped_uuid) | |
| .collect(); | |
| Ok(AutoDeletionResult { | |
| deleted_ids, | |
| total_datasets: total_datasets as usize, | |
| active_bundles: active_bundles as usize, |
| // Validate percentages are in range | ||
| if target_free_percent > 100 { | ||
| return Err(Error::invalid_request( | ||
| "target_free_percent must be between 0 and 100", | ||
| )); | ||
| } | ||
| if min_keep_percent > 100 { | ||
| return Err(Error::invalid_request( | ||
| "min_keep_percent must be between 0 and 100", | ||
| )); | ||
| } |
There was a problem hiding this comment.
won't the CHECK constraint also do this for us? or maybe it gives us a nicer looking error message...
| // Test automatic deletion of support bundles to maintain free dataset capacity. | ||
| // | ||
| // This test verifies that: | ||
| // 1. Auto-deletion kicks in when free_datasets < target threshold | ||
| // 2. The oldest bundles are marked for deletion first | ||
| // 3. min_keep percentage is respected | ||
| // | ||
| // Configuration: 20% target_free (CEIL(5*20/100)=1), 20% min_keep (CEIL(5*20/100)=1), 5 datasets | ||
| // Create 5 bundles (filling all datasets), and verify auto-deletion happens | ||
| // when we run out of free datasets. |
There was a problem hiding this comment.
would really like it if this comment was wrapped at column 80 (sorry)
| // - When creating bundle 5: | ||
| // - Before collection: 4 Active + 1 Collecting = 5 bundles, free = 0 | ||
| // - Auto-delete triggers: want 1 free, have 0, delete 1 oldest | ||
| // - 20% min_keep (CEIL(5*20/100)=1), active=4, max_deletable=3, so we CAN delete |
There was a problem hiding this comment.
another line that i think ought to be wrapped
| CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle ( | ||
| state, | ||
| time_created | ||
| ); | ||
|
|
||
| /* | ||
| * Support Bundle Config | ||
| * | ||
| * Configuration for automatic support bundle deletion. This table uses a | ||
| * singleton pattern (exactly one row) to store cluster-wide configuration. | ||
| */ | ||
| CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config ( | ||
| -- Singleton pattern: only one row allowed | ||
| singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE), | ||
|
|
||
| -- Percentage (0-100) of total datasets to keep free for new allocations. | ||
| -- Calculated as CEIL(total_datasets * target_free_percent / 100). | ||
| -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free. | ||
| target_free_percent INT8 NOT NULL | ||
| CHECK (target_free_percent >= 0 AND target_free_percent <= 100), | ||
|
|
||
| -- Percentage (0-100) of total datasets to retain as bundles (minimum). | ||
| -- Calculated as CEIL(total_datasets * min_keep_percent / 100). | ||
| -- Prevents aggressive cleanup on small systems. | ||
| min_keep_percent INT8 NOT NULL | ||
| CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100), | ||
|
|
||
| time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| ); | ||
|
|
||
| -- Default: 10% free datasets, keep at least 10% worth of bundles | ||
| INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified) | ||
| VALUES (TRUE, 10, 10, NOW()) | ||
| ON CONFLICT (singleton) DO NOTHING; |
There was a problem hiding this comment.
nitpick: can we wrap some of these lines? perhaps:
| CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle ( | |
| state, | |
| time_created | |
| ); | |
| /* | |
| * Support Bundle Config | |
| * | |
| * Configuration for automatic support bundle deletion. This table uses a | |
| * singleton pattern (exactly one row) to store cluster-wide configuration. | |
| */ | |
| CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config ( | |
| -- Singleton pattern: only one row allowed | |
| singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE), | |
| -- Percentage (0-100) of total datasets to keep free for new allocations. | |
| -- Calculated as CEIL(total_datasets * target_free_percent / 100). | |
| -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free. | |
| target_free_percent INT8 NOT NULL | |
| CHECK (target_free_percent >= 0 AND target_free_percent <= 100), | |
| -- Percentage (0-100) of total datasets to retain as bundles (minimum). | |
| -- Calculated as CEIL(total_datasets * min_keep_percent / 100). | |
| -- Prevents aggressive cleanup on small systems. | |
| min_keep_percent INT8 NOT NULL | |
| CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100), | |
| time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW() | |
| ); | |
| -- Default: 10% free datasets, keep at least 10% worth of bundles | |
| INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified) | |
| VALUES (TRUE, 10, 10, NOW()) | |
| ON CONFLICT (singleton) DO NOTHING; | |
| CREATE INDEX IF NOT EXISTS | |
| lookup_bundle_by_state_and_creation | |
| ON omicron.public.support_bundle ( | |
| state, | |
| time_created | |
| ); | |
| /* | |
| * Support Bundle Config | |
| * | |
| * Configuration for automatic support bundle deletion. This table uses a | |
| * singleton pattern (exactly one row) to store cluster-wide configuration. | |
| */ | |
| CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config ( | |
| -- Singleton pattern: only one row allowed | |
| singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE), | |
| -- Percentage (0-100) of total datasets to keep free for new allocations. | |
| -- Calculated as CEIL(total_datasets * target_free_percent / 100). | |
| -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free. | |
| target_free_percent INT8 NOT NULL | |
| CHECK (target_free_percent >= 0 AND target_free_percent <= 100), | |
| -- Percentage (0-100) of total datasets to retain as bundles (minimum). | |
| -- Calculated as CEIL(total_datasets * min_keep_percent / 100). | |
| -- Prevents aggressive cleanup on small systems. | |
| min_keep_percent INT8 NOT NULL | |
| CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100), | |
| time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW() | |
| ); | |
| -- Default: 10% free datasets, keep at least 10% worth of bundles | |
| INSERT INTO omicron.public.support_bundle_config ( | |
| singleton, | |
| target_free_percent, | |
| min_keep_percent, | |
| time_modified | |
| ) | |
| VALUES (TRUE, 10, 10, NOW()) | |
| ON CONFLICT (singleton) DO NOTHING; |
Summary
Implements automatic deletion of support bundles to maintain a buffer of free debug datasets for new allocations.
support_bundle_configtable withtarget_free_percentandmin_keep_percentsupport_bundle_auto_delete()CTE + unit testsauto_delete_bundles()phase before existing phasesHow it works
The
SupportBundleCollectorbackground task now runs three phases:Configuration uses percentages of total datasets (calculated with CEIL rounding)
target_free_percent: Target % of datasets to keep free for future bundles (default: 10%)min_keep_percent: Minimum % of datasets to retain (default: 10%)Tests
min_keep_percentconstraint limits maximum deletionstest_support_bundle_auto_deletion()Fixes #9660