Support bundle: automatic deletion to maintain free dataset buffer by smklein · Pull Request #9662 · oxidecomputer/omicron

smklein · 2026-01-15T21:04:49Z

Summary

Implements automatic deletion of support bundles to maintain a buffer of free debug datasets for new allocations.

Config: Added support_bundle_config table with target_free_percent and min_keep_percent
DB Query: Added support_bundle_auto_delete() CTE + unit tests
Background Task: Added auto_delete_bundles() phase before existing phases
Schema: Added index on (state, time_created) for efficient queries
omdb: Added nexus support-bundle-config show/set commands

How it works

The SupportBundleCollector background task now runs three phases:

Auto-delete (new): Marks oldest Active bundles as Destroying when free datasets < target
Cleanup (existing): Cleans up storage and DB for Destroying bundles
Collect (existing): Collects pending bundles

Configuration uses percentages of total datasets (calculated with CEIL rounding)

target_free_percent: Target % of datasets to keep free for future bundles (default: 10%)
min_keep_percent: Minimum % of datasets to retain (default: 10%)

Tests

Deletion triggers only when free_datasets < target threshold
Oldest bundles (by time_created) are selected first
min_keep_percent constraint limits maximum deletions
Only Active bundles are candidates; Collecting/Destroying still occupy datasets
Failed bundles don't count toward used dataset count
SQL validity via EXPLAIN + expectorate for change detection
Integration test in test_support_bundle_auto_deletion()

Fixes #9660

- Use COUNT queries instead of loading all dataset/bundle rows - Use GROUP BY to get used_datasets and active_bundles in a single query - Add LIMIT to the deletion candidates query - Exclude Failed bundles from used_datasets (their dataset was expunged) These changes allow the auto-deletion query to scale to systems with thousands of datasets without loading all rows into memory.

Replace the two-step find-then-update approach with a single atomic CTE query that calculates how many deletions are needed AND performs them in one database operation. The previous approach had a time-of-check to time-of-use (TOCTTOU) issue where multiple Nexuses running concurrently could over-delete bundles: 1. Nexus A queries: free=2, needs 1 deletion -> gets B1 2. Nexus A transitions B1: Active -> Destroying 3. Nexus B queries: free=2 (unchanged, Destroying still occupies), needs 1 -> gets B2 4. Nexus B transitions B2: Active -> Destroying 5. Result: 2 bundles deleted when only 1 was needed The new atomic query: - Calculates free datasets and needed deletions - Respects min_bundles_to_keep constraint - Finds the N oldest Active bundles - Transitions them to Destroying state atomically - Returns the IDs of deleted bundles Also adds tests verifying: - Bundles are actually transitioned to Destroying state - Failed bundles don't count as occupying datasets - Explain test for SQL validity - Expectorate test for SQL output

hawkw

I had a bunch of very insignificant style nits, but overall, I like the approach used here.

hawkw · 2026-02-03T19:55:07Z

dev-tools/omdb/src/bin/omdb/nexus.rs

+                println!("      Total debug datasets: {total_datasets}");
+                println!("      Active bundles: {active_bundles}");
+                println!("      Free datasets: {free_datasets}");
+                println!(
+                    "      Bundles marked for deletion: {bundles_marked_for_deletion}"


super unimportant nit: this will currently produce output formatted like:

Support Bundle Auto-Deletion Report: Total debug datasets: 16 Active bundles: 2 Free datasets: 14 Bundles marked for deletion: 0

meanwhile, a lot of the existing background task status commands will align all the numbers in output like this into a column. it might be nice if this output also did that, like this:

Support Bundle Auto-Deletion Report: Total debug datasets: 16 Active bundles: 2 Free datasets: 14 Bundles marked for deletion: 0

also, I might consider changing this to nest "Free datasets" under "Total debug datasets", since that's a subset, and similarly, put "Bundles marked for deletion" under "Active bundles"?

hawkw · 2026-02-03T19:58:12Z

dev-tools/omdb/src/bin/omdb/nexus/support_bundle_config.rs

+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+//! omdb commands for support bundle auto-deletion configuration


turbo nit: mayhaps this should just be

Suggested change

//! omdb commands for support bundle auto-deletion configuration

//! omdb commands for support bundle configuration

since it's currently only configuring auto-completion but it's entirely possible that other configs might be set through this later?

hawkw · 2026-02-03T20:01:13Z

dev-tools/omdb/src/bin/omdb/nexus/support_bundle_config.rs

+        .support_bundle_config_get(opctx)
+        .await
+        .context("failed to get current support bundle config")?;
+


i wondered if, perhaps, we ought to have some code in here that validates whether the new values are in the range 0-100, but i see that there are CHECK constraints for that in the db schema, and those will fail when we try to set the new values. do we think that the error that bubbles up from the database query will be clear enough to the user in that case, or should we be pre-validating the values here?

if we decide to just rely on the CHECK constraints for this, maybe we should leave a comment here saying that we don't need to validate the desired percentages now since they get CHECKed when we try to go set them?

hawkw · 2026-02-03T20:04:54Z

nexus/db-queries/src/db/datastore/support_bundle.rs

+        // Return type: (total_datasets, used_datasets, active_bundles, deleted_ids)
+        let result: (i64, i64, i64, Vec<Uuid>) = query
+            .get_result_async(&*conn)
+            .await
+            .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;
+
+        let total_datasets = result.0 as usize;
+        let used_datasets = result.1 as usize;
+        let active_bundles = result.2 as usize;
+        let deleted_ids: Vec<SupportBundleUuid> = result
+            .3
+            .into_iter()
+            .map(SupportBundleUuid::from_untyped_uuid)
+            .collect();
+
+        Ok(AutoDeletionResult {
+            deleted_ids,
+            total_datasets,
+            active_bundles,


personally, i feel like this would read a bit better if it destructured the tuple rather than using indices, like:

Suggested change

// Return type: (total_datasets, used_datasets, active_bundles, deleted_ids)

let result: (i64, i64, i64, Vec<Uuid>) = query

.get_result_async(&*conn)

.await

.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;

let total_datasets = result.0 as usize;

let used_datasets = result.1 as usize;

let active_bundles = result.2 as usize;

let deleted_ids: Vec<SupportBundleUuid> = result

.3

.into_iter()

.map(SupportBundleUuid::from_untyped_uuid)

.collect();

Ok(AutoDeletionResult {

deleted_ids,

total_datasets,

active_bundles,

let (total_datasets, used_datasets, active_bundles, deleted_ids): (i64, i64, i64, Vec<Uuid>) = query

.get_result_async(&*conn)

.await

.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;

let deleted_ids: Vec<SupportBundleUuid> = deleted_ids

.into_iter()

.map(SupportBundleUuid::from_untyped_uuid)

.collect();

Ok(AutoDeletionResult {

deleted_ids,

total_datasets: total_datasets as usize,

active_bundles: active_bundles as usize,

hawkw · 2026-02-03T20:05:34Z

nexus/db-queries/src/db/datastore/support_bundle.rs

+        // Validate percentages are in range
+        if target_free_percent > 100 {
+            return Err(Error::invalid_request(
+                "target_free_percent must be between 0 and 100",
+            ));
+        }
+        if min_keep_percent > 100 {
+            return Err(Error::invalid_request(
+                "min_keep_percent must be between 0 and 100",
+            ));
+        }


won't the CHECK constraint also do this for us? or maybe it gives us a nicer looking error message...

hawkw · 2026-02-03T20:13:59Z

nexus/tests/integration_tests/support_bundles.rs

+// Test automatic deletion of support bundles to maintain free dataset capacity.
+//
+// This test verifies that:
+// 1. Auto-deletion kicks in when free_datasets < target threshold
+// 2. The oldest bundles are marked for deletion first
+// 3. min_keep percentage is respected
+//
+// Configuration: 20% target_free (CEIL(5*20/100)=1), 20% min_keep (CEIL(5*20/100)=1), 5 datasets
+// Create 5 bundles (filling all datasets), and verify auto-deletion happens
+// when we run out of free datasets.


would really like it if this comment was wrapped at column 80 (sorry)

hawkw · 2026-02-03T20:14:13Z

nexus/tests/integration_tests/support_bundles.rs

+    // - When creating bundle 5:
+    //   - Before collection: 4 Active + 1 Collecting = 5 bundles, free = 0
+    //   - Auto-delete triggers: want 1 free, have 0, delete 1 oldest
+    //   - 20% min_keep (CEIL(5*20/100)=1), active=4, max_deletable=3, so we CAN delete


another line that i think ought to be wrapped

hawkw · 2026-02-03T20:15:39Z

schema/crdb/dbinit.sql

+CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle (
+    state,
+    time_created
+);
+
+/*
+ * Support Bundle Config
+ *
+ * Configuration for automatic support bundle deletion. This table uses a
+ * singleton pattern (exactly one row) to store cluster-wide configuration.
+ */
+CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (
+    -- Singleton pattern: only one row allowed
+    singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),
+
+    -- Percentage (0-100) of total datasets to keep free for new allocations.
+    -- Calculated as CEIL(total_datasets * target_free_percent / 100).
+    -- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.
+    target_free_percent INT8 NOT NULL
+        CHECK (target_free_percent >= 0 AND target_free_percent <= 100),
+
+    -- Percentage (0-100) of total datasets to retain as bundles (minimum).
+    -- Calculated as CEIL(total_datasets * min_keep_percent / 100).
+    -- Prevents aggressive cleanup on small systems.
+    min_keep_percent INT8 NOT NULL
+        CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),
+
+    time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+
+-- Default: 10% free datasets, keep at least 10% worth of bundles
+INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified)
+VALUES (TRUE, 10, 10, NOW())
+ON CONFLICT (singleton) DO NOTHING;


nitpick: can we wrap some of these lines? perhaps:

Suggested change

CREATE INDEX IF NOT EXISTS lookup_bundle_by_state_and_creation ON omicron.public.support_bundle (

state,

time_created

);

/*

* Support Bundle Config

*

* Configuration for automatic support bundle deletion. This table uses a

* singleton pattern (exactly one row) to store cluster-wide configuration.

*/

CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (

-- Singleton pattern: only one row allowed

singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),

-- Percentage (0-100) of total datasets to keep free for new allocations.

-- Calculated as CEIL(total_datasets * target_free_percent / 100).

-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.

target_free_percent INT8 NOT NULL

CHECK (target_free_percent >= 0 AND target_free_percent <= 100),

-- Percentage (0-100) of total datasets to retain as bundles (minimum).

-- Calculated as CEIL(total_datasets * min_keep_percent / 100).

-- Prevents aggressive cleanup on small systems.

min_keep_percent INT8 NOT NULL

CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),

time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()

);

-- Default: 10% free datasets, keep at least 10% worth of bundles

INSERT INTO omicron.public.support_bundle_config (singleton, target_free_percent, min_keep_percent, time_modified)

VALUES (TRUE, 10, 10, NOW())

ON CONFLICT (singleton) DO NOTHING;

CREATE INDEX IF NOT EXISTS

lookup_bundle_by_state_and_creation

ON omicron.public.support_bundle (

state,

time_created

);

/*

* Support Bundle Config

*

* Configuration for automatic support bundle deletion. This table uses a

* singleton pattern (exactly one row) to store cluster-wide configuration.

*/

CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_config (

-- Singleton pattern: only one row allowed

singleton BOOL PRIMARY KEY DEFAULT TRUE CHECK (singleton = TRUE),

-- Percentage (0-100) of total datasets to keep free for new allocations.

-- Calculated as CEIL(total_datasets * target_free_percent / 100).

-- Example: 10% of 100 datasets = 10 free, 10% of 5 datasets = 1 free.

target_free_percent INT8 NOT NULL

CHECK (target_free_percent >= 0 AND target_free_percent <= 100),

-- Percentage (0-100) of total datasets to retain as bundles (minimum).

-- Calculated as CEIL(total_datasets * min_keep_percent / 100).

-- Prevents aggressive cleanup on small systems.

min_keep_percent INT8 NOT NULL

CHECK (min_keep_percent >= 0 AND min_keep_percent <= 100),

time_modified TIMESTAMPTZ NOT NULL DEFAULT NOW()

);

-- Default: 10% free datasets, keep at least 10% worth of bundles

INSERT INTO omicron.public.support_bundle_config (

singleton,

target_free_percent,

min_keep_percent,

time_modified

)

VALUES (TRUE, 10, 10, NOW())

ON CONFLICT (singleton) DO NOTHING;

hawkw · 2026-02-03T20:15:49Z

schema/crdb/bundle-state-index/up1.sql

mind line-wrapping this?

hawkw · 2026-02-03T20:15:58Z

schema/crdb/bundle-state-index/up3.sql

...and wrapping this?

smklein marked this pull request as draft January 15, 2026 21:19

smklein force-pushed the support-bundle-auto-delete branch from 5bc38cc to c78018e Compare January 15, 2026 21:26

Support bundle: automatic deletion to maintain free dataset buffer

1c94bc0

smklein force-pushed the support-bundle-auto-delete branch from c78018e to 1c94bc0 Compare January 15, 2026 21:38

smklein added 7 commits January 15, 2026 14:18

merge

dab1d98

Switch to percentage-based, DB encoded configs

8574529

Add more tests, fix accounting for destroying/failing bundles

564395c

merge

88e1d4e

self-review

73d2699

smklein marked this pull request as ready for review January 22, 2026 20:06

smklein requested review from hawkw and papertigers January 22, 2026 20:06

smklein mentioned this pull request Jan 22, 2026

Support Bundle: Automatic Deletion #9660

Open

merge

d33c3ec

hawkw reviewed Feb 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support bundle: automatic deletion to maintain free dataset buffer#9662

Support bundle: automatic deletion to maintain free dataset buffer#9662
smklein wants to merge 9 commits intomainfrom
support-bundle-auto-delete

smklein commented Jan 15, 2026 •

edited

Loading

Uh oh!

hawkw left a comment

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

hawkw Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	//! omdb commands for support bundle auto-deletion configuration
	//! omdb commands for support bundle configuration

Conversation

smklein commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works

Tests

Uh oh!

hawkw left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

smklein commented Jan 15, 2026 •

edited

Loading