Opened on Jul 20, 2023
`instance_ensure_common` blocks on initialization of the VM controller, which includes steps that synchronously allocate all guest resources and set up all guest entities:
propolis/bin/propolis-server/src/lib/server.rs
Lines 486 to 522 in fbd701c
If this takes a long time (e.g. because Crucible activation takes a while, because it takes a while to set up guest memory, etc.), the client might time out even though the VM will eventually be able to start. This can cause issues like oxidecomputer/omicron#3417, which appears to be a case where Nexus gave up on a sled agent call to start an instance that took longer than Nexus wanted.
We should certainly optimize VM start times as much as possible, but it would also be great to deal with the long tail of slow instance startup cases by making startup fully asynchronous:
- Calling `instance_ensure` creates a VM controller but then returns immediately
- The VM controller sets up entities and starts vCPUs asynchronously
- Calls to `instance_get` that occur while this is happening report that the VM is "Starting"
- Requests to change the VM state while it's being created are handled appropriately
  - Calls to stop the VM abort creation and destroy the controller
  - Other requests (e.g. reboot, migrate) fail with a "wrong state for operation" error status
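
To make the proposed behavior concrete, here is a minimal sketch of the state handling described in the list above. It is not Propolis code: `VmController`, `VmState`, `RequestError`, and every method name are hypothetical stand-ins, a plain thread stands in for whatever async task the server would really use, and "abort creation" is reduced to refusing the transition to `Running` rather than actually cancelling in-flight setup or tearing down the controller.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical externally visible VM states (not the real Propolis enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VmState {
    Starting,
    Running,
    Stopped,
}

/// Hypothetical "wrong state for operation" error.
#[derive(Debug, PartialEq, Eq)]
pub enum RequestError {
    WrongState(VmState),
}

/// Hypothetical controller that is created immediately and set up in the background.
pub struct VmController {
    state: Arc<Mutex<VmState>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl VmController {
    /// Analogue of `instance_ensure`: record "Starting", hand the slow setup
    /// work to a background thread, and return without waiting for it.
    pub fn ensure<F>(setup: F) -> Self
    where
        F: FnOnce() + Send + 'static,
    {
        let state = Arc::new(Mutex::new(VmState::Starting));
        let worker_state = Arc::clone(&state);
        let worker = thread::spawn(move || {
            // Stand-in for entity setup, guest memory allocation, etc.
            setup();
            let mut s = worker_state.lock().unwrap();
            // Only move to Running if nobody asked to stop in the meantime.
            if *s == VmState::Starting {
                *s = VmState::Running;
            }
        });
        Self { state, worker: Some(worker) }
    }

    /// Analogue of `instance_get`: report the current state without blocking on setup.
    pub fn get(&self) -> VmState {
        *self.state.lock().unwrap()
    }

    /// A stop request is accepted in any state; during startup it prevents
    /// the worker from ever publishing `Running`.
    pub fn stop(&self) {
        *self.state.lock().unwrap() = VmState::Stopped;
    }

    /// Other state changes (reboot here) are rejected unless the VM is running.
    pub fn reboot(&self) -> Result<(), RequestError> {
        match self.get() {
            VmState::Running => Ok(()),
            other => Err(RequestError::WrongState(other)),
        }
    }

    /// Helper for demos and tests: wait until the background setup has finished.
    pub fn wait_for_setup(&mut self) {
        if let Some(handle) = self.worker.take() {
            handle.join().unwrap();
        }
    }
}
```

With this shape, a client that calls the `ensure` analogue gets a response right away and can poll the `get` analogue until the state leaves `Starting`, instead of holding one long request open and risking a timeout.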
Triage: marking as Unscheduled for now; many of these issues should be mitigated in practice by using the reservoir to allocate guest memory, and there are probably still opportunities to improve startup times further that we might want to pursue first.