Skip to content

Support bundle: automatic deletion to maintain free dataset buffer#9662

Open
smklein wants to merge 9 commits intomainfrom
support-bundle-auto-delete
Open

Support bundle: automatic deletion to maintain free dataset buffer#9662
smklein wants to merge 9 commits intomainfrom
support-bundle-auto-delete

Conversation

@smklein
Copy link
Collaborator

@smklein smklein commented Jan 15, 2026

Summary

Implements automatic deletion of support bundles to maintain a buffer of free debug datasets for new allocations.

  • Config: Added support_bundle_config table with target_free_percent and min_keep_percent
  • DB Query: Added support_bundle_auto_delete() CTE + unit tests
  • Background Task: Added auto_delete_bundles() phase before existing phases
  • Schema: Added index on (state, time_created) for efficient queries
  • omdb: Added nexus support-bundle-config show/set commands

How it works

The SupportBundleCollector background task now runs three phases:

  1. Auto-delete (new): Marks oldest Active bundles as Destroying when free datasets < target
  2. Cleanup (existing): Cleans up storage and DB for Destroying bundles
  3. Collect (existing): Collects pending bundles

Configuration uses percentages of total datasets (calculated with CEIL rounding)

  • target_free_percent: Target % of datasets to keep free for future bundles (default: 10%)
  • min_keep_percent: Minimum % of datasets to retain (default: 10%)

Tests

  • Deletion triggers only when free_datasets < target threshold
  • Oldest bundles (by time_created) are selected first
  • min_keep_percent constraint limits maximum deletions
  • Only Active bundles are candidates; Collecting/Destroying still occupy datasets
  • Failed bundles don't count toward used dataset count
  • SQL validity via EXPLAIN + expectorate for change detection
  • Integration test in test_support_bundle_auto_deletion()

Fixes #9660

@smklein smklein marked this pull request as draft January 15, 2026 21:19
@smklein smklein force-pushed the support-bundle-auto-delete branch from 5bc38cc to c78018e Compare January 15, 2026 21:26
@smklein smklein force-pushed the support-bundle-auto-delete branch from c78018e to 1c94bc0 Compare January 15, 2026 21:38
- Use COUNT queries instead of loading all dataset/bundle rows
- Use GROUP BY to get used_datasets and active_bundles in a single query
- Add LIMIT to the deletion candidates query
- Exclude Failed bundles from used_datasets (their dataset was expunged)

These changes allow the auto-deletion query to scale to systems with
thousands of datasets without loading all rows into memory.
Replace the two-step find-then-update approach with a single atomic CTE
query that calculates how many deletions are needed AND performs them
in one database operation.

The previous approach had a time-of-check to time-of-use (TOCTTOU) issue
where multiple Nexuses running concurrently could over-delete bundles:
1. Nexus A queries: free=2, needs 1 deletion -> gets B1
2. Nexus A transitions B1: Active -> Destroying
3. Nexus B queries: free=2 (unchanged, Destroying still occupies), needs 1 -> gets B2
4. Nexus B transitions B2: Active -> Destroying
5. Result: 2 bundles deleted when only 1 was needed

The new atomic query:
- Calculates free datasets and needed deletions
- Respects min_bundles_to_keep constraint
- Finds the N oldest Active bundles
- Transitions them to Destroying state atomically
- Returns the IDs of deleted bundles

Also adds tests verifying:
- Bundles are actually transitioned to Destroying state
- Failed bundles don't count as occupying datasets
- Explain test for SQL validity
- Expectorate test for SQL output
@smklein smklein marked this pull request as ready for review January 22, 2026 20:06
@smklein smklein requested review from hawkw and papertigers January 22, 2026 20:06
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a bunch of very insignificant style nits, but overall, I like the approach used here.

Comment on lines +2605 to +2609
println!(" Total debug datasets: {total_datasets}");
println!(" Active bundles: {active_bundles}");
println!(" Free datasets: {free_datasets}");
println!(
" Bundles marked for deletion: {bundles_marked_for_deletion}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super unimportant nit: this will currently produce output formatted like:

    Support Bundle Auto-Deletion Report:
      Total debug datasets: 16
      Active bundles: 2
      Free datasets: 14
      Bundles marked for deletion: 0

meanwhile, a lot of the existing background task status commands will align all the numbers in output like this into a column. it might be nice if this output also did that, like this:

    Support Bundle Auto-Deletion Report:
      Total debug datasets:        16
      Active bundles:              2
      Free datasets:               14
      Bundles marked for deletion: 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, I might consider changing this to nest "Free datasets" under "Total debug datasets", since that's a subset, and similarly, put "Bundles marked for deletion" under "Active bundles"?

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! omdb commands for support bundle auto-deletion configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turbo nit: mayhaps this should just be

Suggested change
//! omdb commands for support bundle auto-deletion configuration
//! omdb commands for support bundle configuration

since it's currently only configuring auto-completion but it's entirely possible that other configs might be set through this later?

.support_bundle_config_get(opctx)
.await
.context("failed to get current support bundle config")?;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wondered if, perhaps, we ought to have some code in here that validates whether the new values are in the range 0-100, but i see that there are CHECK constraints for that in the db schema, and those will fail when we try to set the new values. do we think that the error that bubbles up from the database query will be clear enough to the user in that case, or should we be pre-validating the values here?

if we decide to just rely on the CHECK constraints for this, maybe we should leave a comment here saying that we don't need to validate the desired percentages now since they get CHECKed when we try to go set them?

Comment on lines +569 to +587
// Return type: (total_datasets, used_datasets, active_bundles, deleted_ids)
let result: (i64, i64, i64, Vec<Uuid>) = query
.get_result_async(&*conn)
.await
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;

let total_datasets = result.0 as usize;
let used_datasets = result.1 as usize;
let active_bundles = result.2 as usize;
let deleted_ids: Vec<SupportBundleUuid> = result
.3
.into_iter()
.map(SupportBundleUuid::from_untyped_uuid)
.collect();

Ok(AutoDeletionResult {
deleted_ids,
total_datasets,
active_bundles,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally, i feel like this would read a bit better if it destructured the tuple rather than using indices, like:

Suggested change
// Return type: (total_datasets, used_datasets, active_bundles, deleted_ids)
let result: (i64, i64, i64, Vec<Uuid>) = query
.get_result_async(&*conn)
.await
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;
let total_datasets = result.0 as usize;
let used_datasets = result.1 as usize;
let active_bundles = result.2 as usize;
let deleted_ids: Vec<SupportBundleUuid> = result
.3
.into_iter()
.map(SupportBundleUuid::from_untyped_uuid)
.collect();
Ok(AutoDeletionResult {
deleted_ids,
total_datasets,
active_bundles,
let (total_datasets, used_datasets, active_bundles, deleted_ids): (i64, i64, i64, Vec<Uuid>) = query
.get_result_async(&*conn)
.await
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;
let deleted_ids: Vec<SupportBundleUuid> = deleted_ids
.into_iter()
.map(SupportBundleUuid::from_untyped_uuid)
.collect();
Ok(AutoDeletionResult {
deleted_ids,
total_datasets: total_datasets as usize,
active_bundles: active_bundles as usize,

Comment on lines +623 to +633
// Validate percentages are in range
if target_free_percent > 100 {
return Err(Error::invalid_request(
"target_free_percent must be between 0 and 100",
));
}
if min_keep_percent > 100 {
return Err(Error::invalid_request(
"min_keep_percent must be between 0 and 100",
));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't the CHECK constraint also do this for us? or maybe it gives us a nicer looking error message...

Comment on lines +937 to +946
// Test automatic deletion of support bundles to maintain free dataset capacity.
//
// This test verifies that:
// 1. Auto-deletion kicks in when free_datasets < target threshold
// 2. The oldest bundles are marked for deletion first
// 3. min_keep percentage is respected
//
// Configuration: 20% target_free (CEIL(5*20/100)=1), 20% min_keep (CEIL(5*20/100)=1), 5 datasets
// Create 5 bundles (filling all datasets), and verify auto-deletion happens
// when we run out of free datasets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would really like it if this comment was wrapped at column 80 (sorry)

// - When creating bundle 5:
// - Before collection: 4 Active + 1 Collecting = 5 bundles, free = 0
// - Auto-delete triggers: want 1 free, have 0, delete 1 oldest
// - 20% min_keep (CEIL(5*20/100)=1), active=4, max_deletable=3, so we CAN delete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another line that i think ought to be wrapped

Comment on lines +3174 to +3207
CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle (
state,
time_created
);

/*
* Support Bundle Config
*
* Configuration for automatic support bundle deletion. This table uses a
* singleton pattern (exactly one row) to store cluster-wide configuration.
*/
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
-- Singleton pattern: only one row allowed
singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),

-- Percentage (0-100) of total datasets to keep free for new allocations.
-- Calculated as CEIL(total_datasets * target_free_percent / 100).
-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
target_free_percent INT8 NOT NULL
CHECK (target_free_percent >= 0 AND target_free_percent <= 100),

-- Percentage (0-100) of total datasets to retain as bundles (minimum).
-- Calculated as CEIL(total_datasets * min_keep_percent / 100).
-- Prevents aggressive cleanup on small systems.
min_keep_percent INT8 NOT NULL
CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),

time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Default: 10% free datasets, keep at least 10% worth of bundles
INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified)
VALUES (TRUE, 10, 10, NOW())
ON CONFLICT (singleton) DO NOTHING;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can we wrap some of these lines? perhaps:

Suggested change
CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle (
state,
time_created
);
/*
* Support Bundle Config
*
* Configuration for automatic support bundle deletion. This table uses a
* singleton pattern (exactly one row) to store cluster-wide configuration.
*/
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
-- Singleton pattern: only one row allowed
singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),
-- Percentage (0-100) of total datasets to keep free for new allocations.
-- Calculated as CEIL(total_datasets * target_free_percent / 100).
-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
target_free_percent INT8 NOT NULL
CHECK (target_free_percent >= 0 AND target_free_percent <= 100),
-- Percentage (0-100) of total datasets to retain as bundles (minimum).
-- Calculated as CEIL(total_datasets * min_keep_percent / 100).
-- Prevents aggressive cleanup on small systems.
min_keep_percent INT8 NOT NULL
CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),
time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Default: 10% free datasets, keep at least 10% worth of bundles
INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified)
VALUES (TRUE, 10, 10, NOW())
ON CONFLICT (singleton) DO NOTHING;
CREATE INDEX IF NOT EXISTS
lookup_bundle_by_state_and_creation
ON omicron.public.support_bundle (
state,
time_created
);
/*
* Support Bundle Config
*
* Configuration for automatic support bundle deletion. This table uses a
* singleton pattern (exactly one row) to store cluster-wide configuration.
*/
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
-- Singleton pattern: only one row allowed
singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),
-- Percentage (0-100) of total datasets to keep free for new allocations.
-- Calculated as CEIL(total_datasets * target_free_percent / 100).
-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
target_free_percent INT8 NOT NULL
CHECK (target_free_percent >= 0 AND target_free_percent <= 100),
-- Percentage (0-100) of total datasets to retain as bundles (minimum).
-- Calculated as CEIL(total_datasets * min_keep_percent / 100).
-- Prevents aggressive cleanup on small systems.
min_keep_percent INT8 NOT NULL
CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),
time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Default: 10% free datasets, keep at least 10% worth of bundles
INSERT INTO omicron.public.support_bundle_config (
singleton,
target_free_percent,
min_keep_percent,
time_modified
)
VALUES (TRUE, 10, 10, NOW())
ON CONFLICT (singleton) DO NOTHING;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind line-wrapping this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and wrapping this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Bundle: Automatic Deletion

2 participants