Skip to content

[3/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order#10099

Merged
sunshowers merged 13 commits into
mainfrom
sunshowers/spr/reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order
Apr 30, 2026
Merged

[3/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order#10099
sunshowers merged 13 commits into
mainfrom
sunshowers/spr/reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order

Conversation

@sunshowers
Copy link
Copy Markdown
Contributor

@sunshowers sunshowers commented Mar 19, 2026

Add a test which makes sure that when Reconfigurator runs an update, it is in topological order as defined by omicron-ls-apis (only considering server-side versioned APIs): for a particular producer/consumer pair, all instances of producer must be updated before the first instance of consumer.

The point of coordination between the two is a new file, dev-tools/ls-apis/tests/output/deployment-unit-dag.toml.

  • The ls-apis tool generates this file as a condensed form of api-manifest.toml with just the edges (and there's an expectorate test for it).
  • The test in nexus-reconfigurator-planning reads this file, simulates an update and builds a trace, then asserts that the update happened in topo order (and various other related properties).

This exercise found three violations:

  1. Host OS (including omicron-sled-agent) is updated before NTP, but omicron-sled-agent is a consumer of NTP. Addressed this by turning NTP into a client-side-versioned API (as discussed in an update watercooler).
  2. omicron-sled-agent depends on cockroach-admin-client during RSS. At RSS time, the system is known to be at a single, consistent version, and we mark this fact through a new rss-only dependency filter rule. (rss-only is similar to non-dag, except it doesn't check that versioned_how is client; there is a larger discussion to be had about how we may wish to define client-side versioning as applying to edges rather than nodes, but that is outside the scope of this work.)
  3. omicron-sled-agent depends on dns-service-client during RSS. We treat this the same as 2.

Depends on:

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the base branch from main to sunshowers/spr/main.reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order March 31, 2026 00:54
@sunshowers sunshowers changed the title [reconfigurator] test that Reconfigurator updates are in server-side topological order [2/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order Mar 31, 2026
sunshowers added a commit that referenced this pull request Mar 31, 2026
Previously, the DNS server depended on its own client
(dns-service-client) for two things:

* `TransientServer` for a transient in-memory server.
* The `dnsadm` binary.

This isn't a real dependency in the sense that the DNS server doesn't
use the client to call other instances of itself in normal use, so there
was an exclusion for this in `api-manifest.toml`. This exclusion was
causing issues for #10099.

Move both of these out into their own crates, and drop the exclusion
from `api-manifest.toml`. There is a small, benign change to the output
of `ls-apis apis` (captured by
`dev-tools/ls-apis/tests/api_dependencies.out`). With `--show-deps`:

```
DNS Server (client: dns-service-client)
    [...]
    consumed by: omicron-sled-agent (omicron/sled-agent) via 2 paths
        via path 1: path+file:///home/rain/dev/oxide/omicron/sled-agent#omicron-sled-agent@0.1.0
        via path 2: path+file:///home/rain/dev/oxide/omicron/dns-server/transient#transient-dns-server@0.1.0
        via path 2: path+file:///home/rain/dev/oxide/omicron/sled-agent#omicron-sled-agent@0.1.0
```
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order to main March 31, 2026 17:14
Created using spr 1.3.6-beta.1
@sunshowers sunshowers marked this pull request as ready for review March 31, 2026 19:44
@davepacheco
Copy link
Copy Markdown
Collaborator

The NTP issue is covered by #8769, which I believe will be resolved with this change.

Copy link
Copy Markdown
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I've reviewed everything except the new test itself here.

// The NTP Admin API is client-side-versioned and currently frozen. We
// allow trivial changes to go through. If we did not, we would need to
// unfreeze the API and bump the version number for trivial changes.
.allow_trivial_changes_for_latest(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that:

  • we allow the user to update the OpenAPI document for a blessed version as long as it's trivial, or
  • we allow the generated OpenAPI document to diverge from the corresponding blessed one as long as the changes are trivial?

The first one seems better for the reasons we've previously discussed not wanting to accumulate delta (it makes it harder for future people to see what their changes were) but I guess either is okay right now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the latter, matching the other client-side API versions. I would really like to preserve the property that blessed versions are immutable if at all possible.

One possibility here for future work is introducing something like a minor version (1.1, 1.2, etc), where the minor versions are always wire compatible with the corresponding major version (1.0, etc). Then, we clients can be instructed to send 1.0 in their api-version header rather than 1.1, etc. I haven't fully thought this through so it's possible there are issues with this approach.

Comment thread ntp-admin/api/src/lib.rs
Comment thread nexus/inventory/src/builder.rs Outdated
Comment thread dev-tools/ls-apis-shared/src/lib.rs Outdated
Comment thread dev-tools/ls-apis/src/bin/ls-apis.rs
Comment thread dev-tools/ls-apis/api-manifest.toml
Comment thread dev-tools/ls-apis/api-manifest.toml
Comment thread dev-tools/ls-apis/api-manifest.toml Outdated
Comment thread dev-tools/ls-apis/src/api_metadata.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Comment thread dev-tools/ls-apis/src/api_metadata.rs Outdated
Comment thread dev-tools/ls-apis/tests/test_dependencies.rs
Comment thread dev-tools/ls-apis/api-manifest.toml

fn create_zone_artifacts_at_version(
version: &ArtifactVersion,
trigger_host_os_update: bool,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a big deal but I spent a while working out how this function gets used. It's a lot simpler than I read at first. I think it could be clearer with some tweaks to the arguments. (Unrelated to this change, it seems like it should be called create_artifacts_for_version, since it's not specific to zone artifacts.) Specifically, the two arguments seem like very different kinds of things, but they're both just configuring which version/hash to use for different sets of artifacts. But even though they're both just specifying version/hashes, they do thatit in very different ways (an explicit value vs. picking from one of a pair of canned values).

I can think of several different ways to improve it but I'm not sure which is best. A simple improvement would be to lean into the fact that this function is really always used to pick between one of two configurations and have it accept one argument that's like:

enum WhichVersion {
    InitialSystemVersion,
    UpdatedVersion
}

And it picks a canned set of version strings and hashes for those two cases.

Another option would be to have it accept a version string and host OS info like it does now, but have the host OS info come in as an enum like that, or even as explicit hashes (which could be constants, like INITIAL_HOST_OS_HASHES and UPDATED_HOST_OS_HASHES).

Yet another would be to accept a struct with named fields for each configurable artifact that contains the version and hash for it, and then have global statics/constants for INITIAL_SYSTEM and UPDATED_SYSTEM. This is most general-purpose and clear, but probably the biggest code change.

What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of the enum with the canned version set.

Copy link
Copy Markdown
Contributor Author

@sunshowers sunshowers Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split this up into a separate PR (upcoming, will update this comment when I put it up)

edit: #10338

Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment on lines +5316 to +5330
// Verify the host_os deployment unit was tracked and completed.
{
let p = progress.get(&host_os_unit).unwrap_or_else(|| {
panic!(
"host_os deployment unit was never updated — did the \
planner not issue any host OS MGS updates?"
)
});
assert!(
p.all_at_target.is_some(),
"host_os deployment unit was updated (first activity at \
iteration {}) but never completed all host OS updates",
p.first_activity,
);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we are checking these conditions for host OS... wouldn't it be simpler to have one loop that goes over all of progress and checks all_at_target.is_some()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but fwiw I definitely prefer checking against the expected set rather than the actual set. (As shown in a video call I ended up completely restructuring this in a way that I hope is clearer in general.)

Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment on lines +5349 to +5351
"DAG edge consumer {:?} is not in the progress map \
and is not a known non-zone unit. Is this a new \
deployment unit that needs tracking?",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These messages seem kind of obscure for someone who doesn't know why they're here. What exactly are they saying? Is it something like: "found a dependency between deployment units in api-manifest.toml, but a complete system update did not update deployment unit {unit}". The only ways I can see this happening are (1) we add some wholly new kind of thing and this test needs to be updated to do something different to make sure it gets updated (unlikely) or (2) someone adds a new deployment unit that doesn't get updated, like installinator (also unlikely I think?). I'm not sure what more specific guidance we can give them (but I'm not clear how "Is this a new deployment unit that needs tracking?" maps to either of these).

Copy link
Copy Markdown
Contributor Author

@sunshowers sunshowers Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main case I think is someone adds a new deployment unit, but doesn't update the example system to produce such a unit. (Though the earlier checks should catch this better.)

I agree that it's a bit hard to hit this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to something like:

                    panic!(
                        "found a dependency between deployment units in \
                         api-manifest.toml, but a complete system update did \
                         not update deployment unit {:?}. This is surprising: \
                         is this a completely new kind of unit that isn't \
                         either a zone kind or the host OS unit?",
                        edge.consumer,
                    );

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the base branch from main to sunshowers/spr/main.reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order-1 April 28, 2026 23:05
@sunshowers sunshowers changed the title [2/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order [3/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order Apr 28, 2026
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Copy link
Copy Markdown
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes!

Comment thread dev-tools/omicron-deployment-graph/src/lib.rs Outdated
Comment thread dev-tools/omicron-deployment-graph/src/lib.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
Comment thread nexus/reconfigurator/planning/tests/integration_tests/planner.rs Outdated
sunshowers added a commit that referenced this pull request Apr 29, 2026
…duce WhichVersion (#10338)

Review feedback from #10099, split into a separate commit.

A couple of changes in this PR:

1. Rename `create_zone_artifacts_at_version` to
`create_artifacts_for_version`, since
   the function is not specific to zone artifacts.
2. Switch to a canned-version model using a WhichVersion enum.
Currently, this enum
   has just one variant, though in #10099 it'll gain a second variant.

Most tests don't see a behavior change, though for some tests we'll use
a different version string. (Note that the version string is completely
freeform and is not used as part of any reconfigurator logic.)
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order-1 to main April 30, 2026 00:05
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) April 30, 2026 00:06
@sunshowers sunshowers merged commit 079a2ca into main Apr 30, 2026
18 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order branch April 30, 2026 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants