Skip to content

Report reason for power state change#2515

Draft
evan-oxide wants to merge 7 commits into
masterfrom
evan/report-power-state-change-reason
Draft

Report reason for power state change#2515
evan-oxide wants to merge 7 commits into
masterfrom
evan/report-power-state-change-reason

Conversation

@evan-oxide
Copy link
Copy Markdown
Contributor

This addresses #2335 and requires oxidecomputer/management-gateway-service#492. Compute sleds were already tracking this reason for the last power state change but not exposing it to MGS. I added reason-tracking to sidecar as well, and now all boards using the control-plane-agent task can report a reason along with their current power state.

For the sake of backwards compatibility, I added new IPC calls and MGS messages instead of modifying the existing ones. It sounds like this is unfortunately the best approach for now.

The sidecar-seq-server needs extra attention in the review. I'm not confident that it's recording the correct reasons or making the right distinction between "policy changes" and "state changes". sidecar-seq-server tracks policy changes internally but then control-plane-agent renames them to state changes, to be consistent with the compute sleds.

Todo:

Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes, which I hope are helpful!

Comment on lines +630 to +633
// TODO what's the right reason?
self.update_state_internal(
PowerState::A0PlusHP,
StateChangeReason::Unknown,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be whatever the reason for the overall state transition we are currently undergoing. A0 to A0PlusHp means we have finished the ongoing transition to A0.

Suggested change
// TODO what's the right reason?
self.update_state_internal(
PowerState::A0PlusHP,
StateChangeReason::Unknown,
self.update_state_internal(
PowerState::A0PlusHP,
self.reason,

Comment on lines +667 to +670
// TODO what's the right reason?
self.update_state_internal(
PowerState::A0,
StateChangeReason::Unknown,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one, on the other hand, should maybe be a new thing? It's saying that --- for whatever reason --- the NIC was powered on and is now being powered off. On Cosmo we have a "nic mapo" reason, but I'm not sure if this is necessarily a MAPO, especially since on Gimlet, the NIC is actually in the A0 MAPO domain due to "reasons". So I think this means the host has decided to PERST the NIC? But don't take my word on it.

Comment on lines +1158 to +1162
// TODO what's the right reason?
server.tofino.set_policy(
TofinoSequencerPolicy::LatchOffOnFault,
PolicyChangeReason::InitialPowerOn,
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that InitialPowerOn feels correct to me here!

Comment on lines +102 to +105
pub struct PowerStateWithReason {
pub state: PowerState,
pub reason: StateChangeReason,
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are adding a new message for this anyway, i wonder if it's worth also including the since timestamp (the hubris uptime, in milliseconds, at which we transitioned to this state), and maybe also the delta between that and the current uptime? it's not strictly necessary just to solve the problem of "want to know why power state changed" but feels like a handy thing to throw in, and it's just 8 bytes in what's going to be a pretty tiny UDP datagram...

@evan-oxide
Copy link
Copy Markdown
Contributor Author

evan-oxide commented May 14, 2026

I removed SpStateV4 as discussed offline, to avoid messing with an update-critical interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants