instance ensure could be fully asynchronous #471

`instance_ensure_common` blocks on initialization of the VM controller, which includes steps that synchronously allocate all guest resources and set up all guest entities: https://github.com/oxidecomputer/propolis/blob/fbd701c0a54f25208712b0b3b2dc7931c875d347/bin/propolis-server/src/lib/server.rs#L486-L522

If this takes a long time (e.g. because Crucible activation takes a while, because it takes a while to set up guest memory, etc.), the client might time out even though the VM will eventually be able to start. This can cause issues like https://github.com/oxidecomputer/omicron/issues/3417, which appears to be a case where Nexus gave up on a sled agent call to start an instance that took longer than Nexus wanted.

We should certainly optimize VM start times as much as possible, but it would also be great to deal with the long tail of slow instance startup cases by making startup fully asynchronous:

- Calling `instance_ensure` creates a VM controller but then returns immediately
- The VM controller sets up entities and starts vCPUs asynchronously
  - Calls to `instance_get` that occur while this is happening report that the VM is "Starting"
- Requests to change the VM state while it's being created are handled appropriately
  - Calls to stop the VM abort creation and destroy the controller
  - Other requests (e.g. reboot, migrate) fail with a "wrong state for operation" error status
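
The behavior listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the Propolis implementation: `Controller`, `VmState`, `RequestError`, and the simulated setup loop are all hypothetical names invented for this example, and a thread stands in for whatever async task the real server would use.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Hypothetical externally visible states; not the real Propolis state enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmState {
    Starting,
    Running,
    Destroyed,
}

// Hypothetical "wrong state for operation" error.
#[derive(Debug, PartialEq, Eq)]
enum RequestError {
    WrongState(VmState),
}

struct Controller {
    state: Arc<Mutex<VmState>>,
    cancel: Arc<AtomicBool>,
    worker: Option<JoinHandle<()>>,
}

impl Controller {
    // Creates the controller and returns immediately; the (simulated) slow
    // setup work runs on a background thread.
    fn ensure(setup_steps: u32, step_time: Duration) -> Controller {
        let state = Arc::new(Mutex::new(VmState::Starting));
        let cancel = Arc::new(AtomicBool::new(false));
        let (worker_state, worker_cancel) = (Arc::clone(&state), Arc::clone(&cancel));

        let worker = thread::spawn(move || {
            for _ in 0..setup_steps {
                // Check for an abort request between setup steps.
                if worker_cancel.load(Ordering::SeqCst) {
                    return;
                }
                thread::sleep(step_time); // stand-in for one slow setup step
            }
            // Only publish Running if nobody destroyed the VM in the meantime.
            let mut st = worker_state.lock().unwrap();
            if *st == VmState::Starting {
                *st = VmState::Running;
            }
        });

        Controller { state, cancel, worker: Some(worker) }
    }

    // Analogue of `instance_get`: reports Starting while setup is in flight.
    fn get(&self) -> VmState {
        *self.state.lock().unwrap()
    }

    // Requests other than stop are rejected unless the VM is running.
    fn reboot(&self) -> Result<(), RequestError> {
        match self.get() {
            VmState::Running => Ok(()),
            other => Err(RequestError::WrongState(other)),
        }
    }

    // Stop aborts any in-progress creation and tears the controller down.
    fn stop(&mut self) {
        self.cancel.store(true, Ordering::SeqCst);
        *self.state.lock().unwrap() = VmState::Destroyed;
        self.wait_for_setup();
    }

    // Blocks until the background setup thread has exited (helper for demos).
    fn wait_for_setup(&mut self) {
        if let Some(worker) = self.worker.take() {
            worker.join().unwrap();
        }
    }
}
```

With this shape, a caller that invokes `Controller::ensure` gets control back right away and polls `get` to learn when the VM leaves `Starting`, so a slow setup no longer holds the original request open. Checking the cancel flag between steps is one possible way to make stop abort creation; how finely real setup work could be interrupted is left open here.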

---

Triage: marking as Unscheduled for now; many of these issues should be mitigated in practice by using the reservoir to allocate guest memory, and there are probably still opportunities to improve startup times further that we might want to pursue first.

	// Parts of VM initialization (namely Crucible volume attachment) make use
	// of async processing, which itself is turned synchronous with `block_on`
	// calls to the Tokio runtime.
	//
	// Since `block_on` will panic if called from an async context, as we are in
	// now, the whole process is wrapped up in `spawn_blocking`. It is
	// admittedly a big kludge until this can be better refactored.
	let vm = {
	    let properties = properties.clone();
	    let use_reservoir = server_context.static_config.use_reservoir;
	    let bootrom = server_context.static_config.vm.bootrom.clone();
	    let log = server_context.log.clone();
	    let hdl = tokio::runtime::Handle::current();
	    let ctrl_hdl = hdl.clone();

	    let vm_hdl = hdl.spawn_blocking(move || {
	        VmController::new(
	            instance_spec,
	            properties,
	            use_reservoir,
	            bootrom,
	            producer_registry,
	            nexus_client,
	            log,
	            ctrl_hdl,
	            stop_ch,
	        )
	    });

	    vm_hdl.await.unwrap()
	}
	.map_err(|e| {
	    HttpError::for_internal_error(format!(
	        "failed to create instance: {}",
	        e
	    ))
	})?;
