Report reason for power state change#2515
Conversation
hawkw
left a comment
There was a problem hiding this comment.
Some notes, which I hope are helpful!
| // TODO what's the right reason? | ||
| self.update_state_internal( | ||
| PowerState::A0PlusHP, | ||
| StateChangeReason::Unknown, |
There was a problem hiding this comment.
I think this would be whatever the reason for the overall state transition we are currently undergoing. A0 to A0PlusHp means we have finished the ongoing transition to A0.
| // TODO what's the right reason? | |
| self.update_state_internal( | |
| PowerState::A0PlusHP, | |
| StateChangeReason::Unknown, | |
| self.update_state_internal( | |
| PowerState::A0PlusHP, | |
| self.reason, |
| // TODO what's the right reason? | ||
| self.update_state_internal( | ||
| PowerState::A0, | ||
| StateChangeReason::Unknown, |
There was a problem hiding this comment.
This one, on the other hand, should maybe be a new thing? It's saying that --- for whatever reason --- the NIC was powered on and is now being powered off. On Cosmo we have a "nic mapo" reason, but I'm not sure if this is necessarily a MAPO, especially since on Gimlet, the NIC is actually in the A0 MAPO domain due to "reasons". So I think this means the host has decided to PERST the NIC? But don't take my word on it.
| // TODO what's the right reason? | ||
| server.tofino.set_policy( | ||
| TofinoSequencerPolicy::LatchOffOnFault, | ||
| PolicyChangeReason::InitialPowerOn, | ||
| ); |
There was a problem hiding this comment.
I think that InitialPowerOn feels correct to me here!
| pub struct PowerStateWithReason { | ||
| pub state: PowerState, | ||
| pub reason: StateChangeReason, | ||
| } |
There was a problem hiding this comment.
since we are adding a new message for this anyway, i wonder if it's worth also including the since timestamp (the hubris uptime, in milliseconds, at which we transitioned to this state), and maybe also the delta between that and the current uptime? it's not strictly necessary just to solve the problem of "want to know why power state changed" but feels like a handy thing to throw in, and it's just 8 bytes in what's going to be a pretty tiny UDP datagram...
|
I removed |
This addresses #2335 and requires oxidecomputer/management-gateway-service#492. Compute sleds were already tracking this reason for the last power state change but not exposing it to MGS. I added reason-tracking to sidecar as well, and now all boards using the control-plane-agent task can report a reason along with their current power state.
For the sake of backwards compatibility, I added new IPC calls and MGS messages instead of modifying the existing ones. It sounds like this is unfortunately the best approach for now.
The sidecar-seq-server needs extra attention in the review. I'm not confident that it's recording the correct reasons or making the right distinction between "policy changes" and "state changes". sidecar-seq-server tracks policy changes internally but then control-plane-agent renames them to state changes, to be consistent with the compute sleds.
Todo: