Closed
Description
We've been seeing a particularly nasty error at Vates, with symptoms like:
$ xe vm-resume uuid=XXX
xenopsd internal error: File "ocaml/xenopsd/xc/domain.ml",
line 1585, characters 2-8: Assertion failed
The assertion is this in restore
:
let restore (task : Xenops_task.task_handle) ~xc ~xs ~dm ~store_domid
~console_domid ~no_incr_generationid ~timeoffset ~extras info ~manager_path
~vtpm domid fd vgpu_fd =
let static_max_kib = info.memory_max in
let target_kib = info.memory_target in
...
assert (target_mib <= static_max_mib) ;
(We do not have any patches against any of the code involved, nor have I found anything that'd make this less likely on master - I'm testing on 25.6.0)
Suspending and resuming a VM on a supporter host makes this a lot more likely to occur (but doesn't always occur even there).
I've boiled the cause down to this:
On a good attempt, the sequence is as follows:
09:55:26 xenopsd-xc: [debug||32 |Async.VM.suspend|xenops]
Final memory usage of the domain = 1049083 pages
09:55:27 xenopsd-xc: [debug||32 |Async.VM.suspend|xenops]
Storing final memory usage, suspend_memory_bytes: 4297043968
09:55:35 xapi: [debug||103 |org.xen.xapi.xenops.classic events|xenops]
Will update VM.allowed_operations because power_state has changed.
09:55:35 xapi: [debug||103 |org.xen.xapi.xenops.classic events|xenops]
xenopsd event: Updating VM XXX power_state <- Suspended
< not interrupted by anything >
09:55:36 xapi: [ info||103 |org.xen.xapi.xenops.classic events|xenops]
xenops: VM.export_metadata
09:55:37 xapi: [debug||103 |org.xen.xapi.xenops.classic events|xenops]
last_booted_record set to "suspend_memory_bytes":4297043968,
"memory_target":4194304,
"memory_max":4194304 <--- right numbers
all of data sent after has right suspend_memory_bytes, e.g.:
09:55:39 xapi: [debug||17 HTTPS 2a01:240:ab08:5:15::23->:::80|Async.VM.resume|xenops]
Sending VM XXX configuration to xenopsd
09:55:39 xapi: [ info||17 HTTPS 2a01:240:ab08:5:15::23->:::80|Async.VM.resume|xenops]
xenops: VM.import_metadata "suspend_memory_bytes\\":4297043968 <--- right numbers
A bad sequence looks like this:
10:46:36 xenopsd-xc: [debug||40 |Async.VM.suspend|xenops]
Storing final memory usage, suspend_memory_bytes: 4297043968
10:46:40 xapi: [debug||101 |org.xen.xapi.xenops.classic events|xenops]
Will update VM.allowed_operations because power_state has changed.
10:46:40 xapi: [debug||101 |org.xen.xapi.xenops.classic events|xenops]
xenopsd event: Updating VM XXX power_state <- Suspended
< interrupted by event that overwrites data >
10:46:41 xapi: [debug||103 |xapi events|xenops]
VM XXX metadata has changed: updating xenopsd
10:46:41 xapi: [ info||103 |xapi events|xenops]
xenops: VM.import_metadata "suspend_memory_bytes\\":2149568512 <-- wrong number
<>
10:46:42 xapi: [ info||101 |org.xen.xapi.xenops.classic events|xenops]
xenops: VM.export_metadata
10:46:42 - last_booted_record set to wrong numbers now
xenopsd on restore picks up wrong numbers and raises the assert, rightfully.
10:46:48 xenopsd-xc: [ warn||18 |Async.VM.resume|xenops]
static_max_kib: 2097152, target_kib: 2099188, static_max_mib: 2048, target_mib: 2050
I'm looking into how this could be fixed, but posting it here as a warning - maybe someone already knows this is an issue?
Metadata
Metadata
Assignees
Labels
No labels