instance ensure could be fully asynchronous #471

Open

Description

instance_ensure_common blocks on initialization of the VM controller, which includes steps that synchronously allocate all guest resources and set up all guest entities:

// Parts of VM initialization (namely Crucible volume attachment) make use
// of async processing, which itself is turned synchronous with `block_on`
// calls to the Tokio runtime.
//
// Since `block_on` will panic if called from an async context, as we are in
// now, the whole process is wrapped up in `spawn_blocking`. It is
// admittedly a big kludge until this can be better refactored.
let vm = {
    let properties = properties.clone();
    let use_reservoir = server_context.static_config.use_reservoir;
    let bootrom = server_context.static_config.vm.bootrom.clone();
    let log = server_context.log.clone();
    let hdl = tokio::runtime::Handle::current();
    let ctrl_hdl = hdl.clone();
    let vm_hdl = hdl.spawn_blocking(move || {
        VmController::new(
            instance_spec,
            properties,
            use_reservoir,
            bootrom,
            producer_registry,
            nexus_client,
            log,
            ctrl_hdl,
            stop_ch,
        )
    });
    vm_hdl.await.unwrap()
}
.map_err(|e| {
    HttpError::for_internal_error(format!(
        "failed to create instance: {}",
        e
    ))
})?;

If this takes a long time (e.g. because Crucible activation or guest memory setup is slow), the client might time out even though the VM would eventually be able to start. This can cause issues like oxidecomputer/omicron#3417, which appears to be a case where Nexus gave up on a sled agent call to start an instance because the call took longer than Nexus was willing to wait.

We should certainly optimize VM start times as much as possible, but it would also be great to deal with the long tail of slow instance startup cases by making startup fully asynchronous:

  • Calling instance_ensure creates a VM controller but then returns immediately
  • The VM controller sets up entities and starts vCPUs asynchronously
    • Calls to instance_get that occur while this is happening report that the VM is "Starting"
  • Requests to change the VM state while it's being created are handled appropriately
    • Calls to stop the VM abort creation and destroy the controller
    • Other requests (e.g. reboot, migrate) fail with a "wrong state for operation" error status
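
The behavior in the list above can be sketched as a small state machine. This is only an illustration, not Propolis code: `Controller`, `VmState`, `RequestError`, and every method name here are hypothetical, and a plain `std::thread` with an `mpsc` cancellation channel stands in for whatever Tokio task the real implementation would use. Setup work is simulated by timed steps.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical externally visible states (not the real Propolis state set).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmState {
    Starting,
    Running,
    Destroyed,
}

/// Hypothetical "wrong state for operation" error.
#[derive(Debug, PartialEq, Eq)]
enum RequestError {
    WrongState(VmState),
}

struct Controller {
    state: Arc<Mutex<VmState>>,
    worker: Option<thread::JoinHandle<()>>,
    cancel_tx: mpsc::Sender<()>,
}

impl Controller {
    /// Analogue of instance_ensure: create the controller and return at once.
    /// Entity setup is simulated by `setup_steps` steps of `step_time` each.
    fn ensure(setup_steps: u32, step_time: Duration) -> Controller {
        let state = Arc::new(Mutex::new(VmState::Starting));
        let (cancel_tx, cancel_rx) = mpsc::channel::<()>();
        let worker_state = Arc::clone(&state);
        let worker = thread::spawn(move || {
            for _ in 0..setup_steps {
                match cancel_rx.recv_timeout(step_time) {
                    // No cancellation arrived: this setup step is "done".
                    Err(mpsc::RecvTimeoutError::Timeout) => {}
                    // Stop requested, or the controller was dropped: abort.
                    _ => return,
                }
            }
            // Only publish Running if nobody destroyed the VM meanwhile.
            let mut s = worker_state.lock().unwrap();
            if *s == VmState::Starting {
                *s = VmState::Running;
            }
        });
        Controller { state, worker: Some(worker), cancel_tx }
    }

    /// Analogue of instance_get: never blocks on setup.
    fn get(&self) -> VmState {
        *self.state.lock().unwrap()
    }

    /// Stand-in for reboot/migrate: only valid once the VM is running.
    fn reboot(&self) -> Result<(), RequestError> {
        match self.get() {
            VmState::Running => Ok(()),
            other => Err(RequestError::WrongState(other)),
        }
    }

    /// Stop: abort any in-progress creation and tear the controller down.
    fn stop(&mut self) {
        *self.state.lock().unwrap() = VmState::Destroyed;
        let _ = self.cancel_tx.send(());
        self.wait();
    }

    /// Block until the background setup thread has exited.
    fn wait(&mut self) {
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

In this sketch, `Controller::ensure(3, Duration::from_millis(200))` returns immediately; `get()` then reports `Starting`, `reboot()` fails with `WrongState(Starting)`, and `stop()` cancels the remaining setup steps and leaves the controller `Destroyed`. The worker re-checks the state under the lock before publishing `Running`, so a stop that races with the end of setup still wins.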

Triage: marking as Unscheduled for now; many of these issues should be mitigated in practice by using the reservoir to allocate guest memory, and there are probably still opportunities to improve startup times further that we might want to pursue first.

Labels: enhancement (New feature or request)