
Add Nexus sagas for external subnet attach / detach#9779

Merged
bnaecker merged 6 commits into main from subnet-attach-detach-sagas on Feb 5, 2026
@bnaecker
Collaborator

@bnaecker bnaecker commented Feb 3, 2026

  • Adds a subnet attach and detach saga in Nexus, modeled after the existing floating IP attachment sagas.
  • Update the common code for sagas to include passing or removing the attached subnets to Dendrite and / or OPTE. This is to pick up those changes during the existing instance sagas, e.g. instance update.
  • Fixes Saga for attaching external subnets to instances #9685
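The attach flow described above follows the usual saga shape: each forward action is paired with an undo, and a failure unwinds the completed steps in reverse. The following is only a minimal sketch of that pattern, not the Nexus implementation. Every name in it (`Step`, `run_saga`, `World`, and the step functions) is hypothetical, and the step order is an assumption made for illustration.

```rust
// Hypothetical sketch: none of these names come from Nexus or its saga library.

/// One saga step: a forward action paired with the undo that reverses it.
struct Step<S> {
    name: &'static str,
    action: fn(&mut S) -> Result<(), String>,
    undo: fn(&mut S),
}

/// Run the steps in order. If one fails, undo the already-completed steps
/// in reverse order and return the error.
fn run_saga<S>(state: &mut S, steps: &[Step<S>]) -> Result<(), String> {
    for (i, step) in steps.iter().enumerate() {
        if let Err(e) = (step.action)(state) {
            for done in steps[..i].iter().rev() {
                (done.undo)(state);
            }
            return Err(format!("step `{}` failed: {}", step.name, e));
        }
    }
    Ok(())
}

/// Toy state standing in for the database record, the switch (Dendrite),
/// and the sled (OPTE).
#[derive(Debug, Default)]
struct World {
    db_attached: bool,
    switch_has_route: bool,
    opte_has_subnet: bool,
    opte_should_fail: bool, // test knob: make the last step fail
}

fn db_attach(w: &mut World) -> Result<(), String> {
    w.db_attached = true;
    Ok(())
}
fn db_detach(w: &mut World) {
    w.db_attached = false;
}
fn switch_add(w: &mut World) -> Result<(), String> {
    w.switch_has_route = true;
    Ok(())
}
fn switch_remove(w: &mut World) {
    w.switch_has_route = false;
}
fn opte_add(w: &mut World) -> Result<(), String> {
    if w.opte_should_fail {
        return Err("sled unreachable".to_string());
    }
    w.opte_has_subnet = true;
    Ok(())
}
fn opte_remove(w: &mut World) {
    w.opte_has_subnet = false;
}

/// Assumed ordering for an attach: record in the DB, then program the
/// switch, then program the sled.
fn attach_steps() -> Vec<Step<World>> {
    vec![
        Step { name: "db_attach", action: db_attach, undo: db_detach },
        Step { name: "switch_add", action: switch_add, undo: switch_remove },
        Step { name: "opte_add", action: opte_add, undo: opte_remove },
    ]
}
```

Calling `run_saga(&mut world, &attach_steps())` with `opte_should_fail` set leaves `world` as it started, because the two earlier steps are undone. A detach saga in this sketch would be the same list with the action and undo roles swapped.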

- Add APIs to the sled agent for attaching and detaching either a single
  subnet on an instance, or setting / clearing the entire set for an
  instance.
- Add list of attached subnets in the instance-creation request body,
  and fill that in from Nexus with the (currently-empty) set of attached
  subnets for the target instance.
- Plumb attachment requests all the way through the sled-agent internals
  to the new APIs in OPTE.
- Add mapping of attached subnets per-instance to the simulated sled
  agent for testing.
- Fixes #9702
- Adds a subnet attach and detach saga in Nexus, modeled after the
  existing floating IP attachment sagas.
- Update the common code for sagas to include passing or removing the
  attached subnets to Dendrite and / or OPTE. This is to pick up those
  changes during the existing instance sagas, e.g. instance update.
- Fixes #9685
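The commit notes above mention four sled-agent operations (attach one subnet, detach one, set the whole set, clear the whole set) and a per-instance mapping in the simulated sled agent. A rough sketch of what such a mapping could look like follows. The type and method names are hypothetical, and instance IDs and subnets are simplified to `u32` and string literals, so this is not the real sled-agent API.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins: the real sled-agent types are not shown in this PR text.
type InstanceId = u32;
type Subnet = &'static str; // e.g. "203.0.113.0/24"

/// Per-instance record of attached subnets, as a simulated sled agent
/// might keep it for tests.
#[derive(Default)]
struct AttachedSubnets {
    by_instance: HashMap<InstanceId, BTreeSet<Subnet>>,
}

impl AttachedSubnets {
    /// Attach a single subnet. Returns false if it was already attached.
    fn attach(&mut self, instance: InstanceId, subnet: Subnet) -> bool {
        self.by_instance.entry(instance).or_default().insert(subnet)
    }

    /// Detach a single subnet. Returns false if it was not attached.
    fn detach(&mut self, instance: InstanceId, subnet: Subnet) -> bool {
        self.by_instance
            .get_mut(&instance)
            .map_or(false, |set| set.remove(&subnet))
    }

    /// Replace the entire set of attached subnets for an instance.
    fn set(&mut self, instance: InstanceId, subnets: Vec<Subnet>) {
        self.by_instance
            .insert(instance, subnets.into_iter().collect());
    }

    /// Clear the entire set of attached subnets for an instance.
    fn clear(&mut self, instance: InstanceId) {
        self.by_instance.remove(&instance);
    }

    /// List the attached subnets for an instance in sorted order.
    fn list(&self, instance: InstanceId) -> Vec<Subnet> {
        self.by_instance
            .get(&instance)
            .map(|set| set.iter().copied().collect::<Vec<_>>())
            .unwrap_or_default()
    }
}
```

Using a set makes repeated attach and detach calls idempotent, which matters if a saga step is retried. Whether the real APIs report "already attached" as a boolean or as an error is not stated in this thread.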
@bnaecker
Collaborator Author

bnaecker commented Feb 3, 2026

Stacked on #9778

@bnaecker
Collaborator Author

bnaecker commented Feb 3, 2026

So the failure on the helios-deploy job looks like a race. We can see that it attempts to detect the switch zone as being up by curling against the Dendrite API server. It last does that at 2026-02-03T02:20:45.222Z and reports a failure. When collecting evidence, we try to log in to the switch zone for various things, such as collecting the link state. There we see this:

2026-02-03T02:20:45.634Z	+ pfexec zlogin oxz_switch /opt/oxide/dendrite/bin/swadm link ls
2026-02-03T02:20:45.634Z	Error: failed to list all links
2026-02-03T02:20:45.634Z	
2026-02-03T02:20:45.634Z	Caused by:
2026-02-03T02:20:45.634Z	    0: Communication Error: error sending request for url (http://localhost:12224/links)
2026-02-03T02:20:45.634Z	    1: error sending request for url (http://localhost:12224/links)
2026-02-03T02:20:45.634Z	    2: client error (Connect)
2026-02-03T02:20:45.644Z	    3: tcp connect error
2026-02-03T02:20:45.644Z	    4: Connection refused (os error 146)

That means swadm can't even talk to Dendrite over localhost! But looking in the saved Dendrite logs, we see this at the top:

2026-02-03T02:20:46.666Z	INFO	dpd: dpd config: Config 

So it seems like Dendrite had not started by the time we started, or even failed, the check for the switch zone being up. I'm guessing this is fallout from #9767, and we were succeeding before by accident. We probably need to increase that timeout, but it would also be good to know why the Dendrite service is taking so long to start here. The SMF log file shows the service was enabled at 02:20:15, but the start method didn't run for another 30s. We might want to look at how the sled-agent initializes this zone, and whether this lag is expected.
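For illustration, the kind of readiness check discussed here can be sketched as a poll-until-deadline loop. This is not the deploy job's actual check, which curls the Dendrite API. The sketch only tests whether a TCP port accepts connections, and `wait_for_port` and its parameters are hypothetical names.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll until `addr` accepts a TCP connection or `timeout` elapses.
/// On success, returns how long the wait took, which is useful for
/// spotting slow service startup.
fn wait_for_port(
    addr: SocketAddr,
    timeout: Duration,
    interval: Duration,
) -> Result<Duration, String> {
    let start = Instant::now();
    // `connect_timeout` rejects a zero duration, so clamp to at least 1 ms.
    let per_try = interval.max(Duration::from_millis(1));
    loop {
        match TcpStream::connect_timeout(&addr, per_try) {
            Ok(_) => return Ok(start.elapsed()),
            Err(e) if start.elapsed() >= timeout => {
                return Err(format!(
                    "{} not ready after {:?}: {}",
                    addr,
                    start.elapsed(),
                    e
                ));
            }
            // Not up yet (for example, connection refused): wait and retry.
            Err(_) => sleep(interval),
        }
    }
}
```

With the address from the log above, a caller would parse `"127.0.0.1:12224"` into a `SocketAddr` and pick a timeout comfortably longer than the roughly 30s start delay seen in the SMF log. Logging the returned elapsed time would also show how close the check comes to its deadline.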

@bnaecker bnaecker requested a review from FelixMcFelix February 4, 2026 04:17
- Delete entries from all switches, not just boundary
- Keep `rack_id` name for future API compat
- Respect the `do_saga` flag
Base automatically changed from sled-agent-attached-subnet-api to main February 4, 2026 17:24
Contributor

@FelixMcFelix FelixMcFelix left a comment

Thanks for being patient, mainly docs nits etc. outside of the deletion comment I left earlier.

- Improved naming of consts and methods
- Many comment improvements in subnet {at,de}tach methods, point to
  locking notes in instance IP attach saga, improve handling of deleted
  instances during subnet sagas
- Add test that detaching instance subnets doesn't delete the subnets.
Contributor

@FelixMcFelix FelixMcFelix left a comment

Fab, thanks for putting this together!

@bnaecker
Collaborator Author

bnaecker commented Feb 4, 2026

Going to merge in main, and then set this to automerge. Thanks for the careful reviews everyone!

@bnaecker bnaecker enabled auto-merge (squash) February 4, 2026 23:14
@bnaecker bnaecker merged commit 08c0d9b into main Feb 5, 2026
16 checks passed
@bnaecker bnaecker deleted the subnet-attach-detach-sagas branch February 5, 2026 03:09

Development

Successfully merging this pull request may close these issues.

Saga for attaching external subnets to instances

6 participants