[BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 due to virError #5755

Closed
albinsun opened this issue May 7, 2024 · 5 comments
Labels
area/os area/upgrade kind/bug Issues that are defects reported by users or that we know have reached a real release not-require/test-plan Skip to create a e2e automation test issue reproduce/always Reproducible 100% of the time severity/3 Function working but has a major issue w/ workaround

Comments


albinsun commented May 7, 2024

Describe the bug
During the v1.2.1 to v1.2.2-rc2 upgrade, live migration fails in the pre-drain phase and keeps retrying with the following error:

VirtualMachineInstance migration uid 46c9f7af-1f77-4f94-bf78-e8320c9a9b24 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=31, Message='operation failed: guest CPU doesn't match specification: missing features: waitpkg')
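When triaging several failed migrations, it can help to pull the missing feature names out of the error text. The following is a minimal sketch, not part of Harvester or KubeVirt; the helper name `missing_cpu_features` is hypothetical, and the assumption that multiple features are separated by commas or spaces is mine, since the message above lists only one.

```python
import re
from typing import List

# Hypothetical helper: matches the "missing features: ..." part of the
# libvirt error message quoted in this issue.
_MISSING = re.compile(r"missing features: ([\w.,\- ]+)")


def missing_cpu_features(message: str) -> List[str]:
    """Return the CPU feature names reported as missing, or [] if none."""
    match = _MISSING.search(message)
    if match is None:
        return []
    # Assumption: features are separated by commas and/or whitespace.
    return [name for name in re.split(r"[,\s]+", match.group(1)) if name]


error = (
    "virError(Code=9, Domain=31, Message='operation failed: guest CPU "
    "doesn't match specification: missing features: waitpkg')"
)
print(missing_cpu_features(error))  # prints ['waitpkg']
```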

To Reproduce

  1. 🟢 Create a 3-node v1.2.1 cluster
  2. 🟢 Create a VM (here located on node-2)
  3. 🔴 Upgrade to v1.2.2-rc2
    • node-0 and node-1 reach Succeeded
    • node-2 is stuck in Pre-draining because live migration fails

Expected behavior
The upgrade completes successfully.

Support bundle
supportbundle_stuck_predraining.zip

Upgrade log
hvst-upgrade-gvmcm-upgradelog-archive-stuck_predraining.zip

Environment

  • Harvester
    • Version: v1.2.1 -> v1.2.2-rc2
    • Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
    • ui-source: Auto
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/1 Function broken (a critical incident with very high impact) area/upgrade reproduce/needed Reminder to add a reproduce label and to remove this one labels May 7, 2024
@albinsun albinsun added this to the v1.2.2 milestone May 7, 2024
Author

albinsun commented May 7, 2024

FYI
This can also be reproduced in a 2-node cluster.
Note that no migration happens on a single node; the VM is forcibly shut down instead.

Two nodes (8C/20G per node)

  1. 🟢 Set up a 2-node Harvester v1.2.1 cluster

  2. 🟢 Create a VM (the VM is located on node-1)

  3. 🔴 Upgrade to v1.2.2-rc2

    node-1 is Cordoned and stuck in Pre-draining.
    VM live migration fails:

    VirtualMachineInstance migration uid 5de2134c-25e2-404e-88b2-9307f54866c8 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=31, Message='operation failed: guest CPU doesn't match specification: missing features: waitpkg')

Single node (16C/32G)

Note

No migration occurs because there is only one node.

The upgrade succeeds, but the VM is forced off.

@albinsun albinsun added reproduce/always Reproducible 100% of the time and removed reproduce/needed Reminder to add a reproduce label and to remove this one labels May 7, 2024
@albinsun albinsun changed the title [BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 [BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 due to guest CPU doesn't match specification: missing features: waitpkg May 7, 2024
@albinsun albinsun changed the title [BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 due to guest CPU doesn't match specification: missing features: waitpkg [BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 due to virError(guest CPU doesn't match specification: missing features: waitpkg) May 7, 2024
@albinsun albinsun changed the title [BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 due to virError(guest CPU doesn't match specification: missing features: waitpkg) [BUG] Live migration fail when upgrade v1.2.1 to v1.2.2-rc2 due to virError May 7, 2024
@w13915984028
Member

Could this have the same root cause as #5756? In both cases an (rke2) VM is stuck migrating, which in turn blocks the upgrade.

Member

bk201 commented May 8, 2024

The cause is a QEMU change between SLES SP4 and SP5. The issue happens when the Harvester nodes are themselves VMs and the guests run as nested VMs. Here are the words from the virtualization team:

but the bug is rather that you see the waitpkg flag in SP4, more than the fact that you don't see it in SP5

yes, SP5's QEMU behavior is correct, i.e., on your particular hardware, it's ok to not advertise that flag in a nested VM. It's actually SP4's QEMU that is at fault, i.e., it shouldn't advertise it in the first place, while instead it did. As I said, I can backport the fix to SP's QEMU, but this won't probably help you for that particular VM (or it would break it in even worse way, when/if the updated QEMU would reach SP4's KubeVirt)

Switching the Harvester VM's CPU mode from host-passthrough to host-model works around the issue.
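To see which CPU flags one environment advertises and another does not, the flag lists can be compared directly. This is only a sketch: the function names are hypothetical, the sample data is made up, and where to capture the `/proc/cpuinfo` text (node, nested guest, before or after the upgrade) is left open here.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def lost_flags(source_cpuinfo: str, target_cpuinfo: str) -> list:
    """Flags present in the source text but absent from the target text."""
    return sorted(cpu_flags(source_cpuinfo) - cpu_flags(target_cpuinfo))


# Made-up, heavily shortened sample data for illustration only.
source = "processor\t: 0\nflags\t\t: fpu vme sse2 waitpkg\n"
target = "processor\t: 0\nflags\t\t: fpu vme sse2\n"
print(lost_flags(source, target))  # prints ['waitpkg']
```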

Author

albinsun commented May 8, 2024

The test passes after letting QEMU use the default cpu_mode, host-model.
Ref. harvester/ipxe-examples#82

In short,

  • host-passthrough exposes the host CPU to the guest as-is, including features that libvirt does not understand.
  • host-model picks the closest match to the host CPU from the list of models libvirt supports.

So
For Performance: host-passthrough > host-model
For Reliability: host-model > host-passthrough

Ref. https://libvirt.org/formatdomain.html#cpu-model-and-topology
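In libvirt domain XML the choice is expressed by the `mode` attribute of the `<cpu>` element. As a rough sketch of switching a definition to host-model, the snippet below rewrites that attribute with Python's standard library. The function name and the sample domain are hypothetical, and the workaround in harvester/ipxe-examples#82 may well be implemented differently (its contents are not shown in this thread).

```python
import xml.etree.ElementTree as ET


def set_cpu_mode(domain_xml: str, mode: str = "host-model") -> str:
    """Return the domain XML with <cpu mode=...> set to the given mode."""
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is None:
        # No <cpu> element yet: add one so the mode is explicit.
        cpu = ET.SubElement(root, "cpu")
    # Only the attribute is changed; any child elements are left alone.
    cpu.set("mode", mode)
    return ET.tostring(root, encoding="unicode")


# Made-up minimal domain definition, for illustration only.
sample = (
    "<domain type='kvm'><name>harvester-node-0</name>"
    "<cpu mode='host-passthrough'/></domain>"
)
print(set_cpu_mode(sample))
```

A running domain would still need to be redefined and restarted for the new CPU mode to take effect; the sketch covers only the XML edit.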

@albinsun albinsun added severity/3 Function working but has a major issue w/ workaround area/os and removed severity/1 Function broken (a critical incident with very high impact) labels May 8, 2024
@albinsun albinsun added the not-require/test-plan Skip to create a e2e automation test issue label May 9, 2024
@albinsun
Author

Live migration no longer fails after applying the workaround from harvester/ipxe-examples#82.
Let's close this issue for now and discuss separately whether to commit the workaround.


3 participants