Skip to content

Conversation

@yadvr
Copy link
Member

@yadvr yadvr commented Aug 3, 2023

This makes the KVM guru pass the max memory to KVM agent by picking the max. memory from the VM transfer object. This lets the cluster/agent decide if VM should start with the min or max memory depending on whether vm.memballoon.disable is false (default) or true.

When vm.memballoon.disable is true, VMs start with max. memory. Right now the max memory isn't sent to the KVM agent causing the regression. This feature is defined at https://github.com/apache/cloudstack/blob/main/agent/conf/agent.properties#L220

Fixes: #7794

Also
Fixes #7430 if user sets vm.memballoon.disable to true in agent.properties.

Doc PR to clarify this further at apache/cloudstack-documentation#337

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

yadvr added 2 commits August 3, 2023 16:21
This makes the KVM guru pass the max memory to KVM agent by picking the
max. memory from the VM transfer object. This lets the cluster/agent
decide if VM should start with the min or max memory depending on
whether vm.memballoon.disable is false (default) or true.

When `vm.memballoon.disable` is true, VMs start with max. memory. Right
now the max memory isn't sent to the KVM agent causing the regression.

Also fixes apache#7430

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@yadvr yadvr changed the title Issue workaround Allow KVM overcommit to work without reducing minimum VM memory when vm ballooning is disabled Aug 3, 2023
@yadvr yadvr added this to the 4.18.1.0 milestone Aug 3, 2023
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@yadvr
Copy link
Member Author

yadvr commented Aug 3, 2023

I think here's how we deal with three issues:

  1. the centos7 blocker hopefully will be fixed if the revert committed caused that regression
  2. by default it passes the VM max memory via KVM guru, so libvirt computing resource uses the max VM memory when VM ballooning is disabled; and in env where VM ballooning somehow works the setting is not disabled and VM start with min. memory
  3. we change the default behaviour to have VM ballooning feature disable by default when the setting isn't defined. Existing env when they upgrade will continue to honour the setting in agent.properties; but users not reading docs will not get affected if they use commit memory in KVM env.

please review and advise - @DaanHoogland @weizhouapache @harikrishna-patnala

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tnx @rohityadavcloud , I'll test against all issues related

@weizhouapache
Copy link
Member

I would suggest not to change the default value. @rohityadavcloud

yadvr and others added 8 commits August 3, 2023 18:18
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
…/resource/LibvirtVMDefTest.java

Co-authored-by: dahn <daan.hoogland@gmail.com>
@yadvr
Copy link
Member Author

yadvr commented Aug 3, 2023

@weizhouapache in that case let me grab Daan's workaround to not apply min memory for systemvms

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6648


grd.setMemorySize(maxRam);
grd.setCurrentMem(currRam);
grd.setCurrentMem(getCurrentMemAccordingToMemBallooning(vmTO, maxRam));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised the maxRam seetting is fixed as in the KVMGuru the correct maxRam is passed now.
cc @DaanHoogland @weizhouapache let's test

@yadvr
Copy link
Member Author

yadvr commented Aug 3, 2023

Okay @weizhouapache @DaanHoogland I've address code comments and revert the change of balloon behaviour; if memory ballooning feature works then I think it might just work as earlier we weren't passing the right max ram value;
but let's test

@blueorangutan package

@DaanHoogland
Copy link
Contributor

@blueorangutan test matrix keepEnv

@blueorangutan
Copy link

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@codecov
Copy link

codecov bot commented Aug 3, 2023

Codecov Report

Merging #7810 (14c83ab) into 4.18 (c86684f) will decrease coverage by 0.01%.
Report is 3 commits behind head on 4.18.
The diff coverage is 90.90%.

@@             Coverage Diff              @@
##               4.18    #7810      +/-   ##
============================================
- Coverage     13.02%   13.02%   -0.01%     
+ Complexity     9031     9029       -2     
============================================
  Files          2720     2720              
  Lines        257010   257010              
  Branches      40083    40081       -2     
============================================
- Hits          33472    33471       -1     
- Misses       219336   219338       +2     
+ Partials       4202     4201       -1     
Files Changed Coverage Δ
...ervisor/kvm/resource/LibvirtComputingResource.java 18.45% <88.88%> (+0.13%) ⬆️
...om/cloud/hypervisor/kvm/resource/LibvirtVMDef.java 67.13% <100.00%> (ø)

... and 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@DaanHoogland DaanHoogland linked an issue Aug 4, 2023 that may be closed by this pull request
Copy link
Member

@harikrishna-patnala harikrishna-patnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

@yadvr
Copy link
Member Author

yadvr commented Aug 4, 2023

Thanks @DaanHoogland if you see my comment, we need not re-open the previous issue. Ask @weizhouapache to review and lgtm/advise;

This PR handles the following cases:

  1. Doc PR addresses what to expect, what/how to configure KVM host env when overcommiting memory - hosts: make it clear what to expect and how to configure balloon feature on KVM cloudstack-documentation#337
  2. For non-User VMs, we use max memory and in KVM guru the correct max memory is passed (fixes Smoketests are failing if run on Ubuntu 22.04 and python version (3.10.6) #7443)
  3. Fixes System VMs not starting with CentOS 7 hosts #7794 as centos7/el7 smoketests passing now Allow KVM overcommit to work without reducing minimum VM memory when vm ballooning is disabled #7810 (comment)
  4. Reverts/ensures we don't disable memory balloon feature for KVM

I think this covers most of the cases and doc PR handles advising users the best practice they deem fit for what they're doing.

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6666

@blueorangutan
Copy link

[SF] Trillian test result (tid-7258)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 36963 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7258-vmware-67u3.zip
Smoke tests completed. 99 look OK, 9 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_deploy_vm_on_specific_host Error 13.63 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 6.42 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 4.41 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 7.53 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 3.35 test_vm_deployment_planner.py
ContextSuite context=TestDeployVM>:setup Error 0.00 test_vm_life_cycle.py
test_01_offline_migrate_VM_and_root_volume Error 10.69 test_vm_life_cycle.py
test_02_offline_migrate_VM_with_two_data_disks Error 12.63 test_vm_life_cycle.py
test_03_live_migrate_VM_with_two_data_disks Error 8.53 test_vm_life_cycle.py
test_04_migrate_detached_volume Error 7.49 test_vm_life_cycle.py
ContextSuite context=TestVAppsVM>:setup Error 2.71 test_vm_life_cycle.py
ContextSuite context=TestVMLifeCycle>:setup Error 15.84 test_vm_life_cycle.py
test_01_unmanage_vm_cycle Error 10.44 test_vm_lifecycle_unmanage_import.py
ContextSuite context=TestUnmanageVM>:teardown Error 10.47 test_vm_lifecycle_unmanage_import.py
test_change_service_offering_for_vm_with_snapshots Error 13.65 test_vm_snapshots.py
ContextSuite context=TestVmSnapshot>:setup Error 35.05 test_vm_snapshots.py
ContextSuite context=TestCreateVolume>:setup Error 0.00 test_volumes.py
ContextSuite context=TestVolumes>:setup Error 0.22 test_volumes.py
ContextSuite context=TestIpv6Vpc>:setup Error 0.00 test_vpc_ipv6.py
ContextSuite context=TestVPCRedundancy>:setup Error 0.00 test_vpc_redundant.py
ContextSuite context=TestVPCNics>:setup Error 0.00 test_vpc_router_nics.py
ContextSuite context=TestRVPCSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcRemoteAccessVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py

@yadvr
Copy link
Member Author

yadvr commented Aug 4, 2023

@blueorangutan test

@blueorangutan
Copy link

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-7257)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38977 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7257-xenserver-71.zip
Smoke tests completed. 101 look OK, 7 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_05_stop_ssvm Error 941.73 test_ssvm.py
test_07_reboot_ssvm Error 914.43 test_ssvm.py
test_09_reboot_ssvm_forced Error 949.60 test_ssvm.py
ContextSuite context=TestIsolatedNetworksPasswdServer>:setup Error 0.00 test_password_server.py
ContextSuite context=TestUnmanageVM>:setup Error 0.00 test_vm_lifecycle_unmanage_import.py
ContextSuite context=TestCreateVolume>:setup Error 0.00 test_volumes.py
ContextSuite context=TestVPCRedundancy>:setup Error 0.00 test_vpc_redundant.py
ContextSuite context=TestVPCNics>:setup Error 0.00 test_vpc_router_nics.py
ContextSuite context=TestRVPCSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcRemoteAccessVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py

@blueorangutan
Copy link

[SF] Trillian test result (tid-7259)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39820 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7259-kvm-centos7.zip
Smoke tests completed. 102 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 82.20 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 54.75 test_vm_life_cycle.py
test_06_download_detached_volume Error 299.76 test_volumes.py
test_13_migrate_volume_and_change_offering Error 121.89 test_volumes.py
ContextSuite context=TestIpv6Vpc>:setup Error 0.00 test_vpc_ipv6.py
ContextSuite context=TestVPCRedundancy>:setup Error 0.00 test_vpc_redundant.py
ContextSuite context=TestVPCNics>:setup Error 0.00 test_vpc_router_nics.py
ContextSuite context=TestRVPCSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcRemoteAccessVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py

@blueorangutan
Copy link

[SF] Trillian test result (tid-7269)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 45398 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7269-kvm-centos7.zip
Smoke tests completed. 104 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestCpuCapServiceOfferings>:setup Error 0.00 test_service_offerings.py
ContextSuite context=TestServiceOfferings>:setup Error 0.20 test_service_offerings.py
ContextSuite context=TestRouterIpTablesPolicies>:setup Error 0.00 test_routers_iptables_default_policy.py
ContextSuite context=TestVPCIpTablesPolicies>:setup Error 0.00 test_routers_iptables_default_policy.py
ContextSuite context=TestVMWareStoragePolicies>:setup Error 0.00 test_storage_policy.py
test_01_migrate_VM_and_root_volume Error 82.84 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 56.55 test_vm_life_cycle.py

@blueorangutan
Copy link

[SF] Trillian test result (tid-7285)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 45691 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7285-kvm-centos7.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 85.85 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 59.47 test_vm_life_cycle.py

@DaanHoogland
Copy link
Contributor

DaanHoogland commented Aug 5, 2023

@rohityadavcloud, #7430 is fixed in this PR. When memory overprovisioning is set to 3 on the cluster:

  <maxMemory slots='16' unit='KiB'>524288</maxMemory>
  <memory unit='KiB'>174080</memory>
  <currentMemory unit='KiB'>174080</currentMemory>

when set to 5

  <maxMemory slots='16' unit='KiB'>524288</maxMemory>
  <memory unit='KiB'>104448</memory>
  <currentMemory unit='KiB'>104448</currentMemory>

with overprovisioning set to 1

  <memory unit='KiB'>524288</memory>
  <currentMemory unit='KiB'>524288</currentMemory>

Looks good

@DaanHoogland
Copy link
Contributor

@blueorangutan test matrix

@blueorangutan
Copy link

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@yadvr
Copy link
Member Author

yadvr commented Aug 5, 2023

Thanks for testing @DaanHoogland I suppose that's true if the vm divide setting is enabled for user-vms; for systemvms the memory would be consistently the same (max memory).

@blueorangutan
Copy link

[SF] Trillian test result (tid-7287)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37829 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7287-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-7288)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 43166 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7288-vmware-67u3.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-7289)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48809 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7810-t7289-kvm-centos7.zip
Smoke tests completed. 106 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_list_vms_metrics_admin Error 3607.61 test_metrics_api.py
test_list_vms_metrics_history Error 5.42 test_metrics_api.py
test_list_volumes_metrics_history Error 5.38 test_metrics_api.py
test_01_migrate_VM_and_root_volume Error 76.77 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 53.67 test_vm_life_cycle.py

@DaanHoogland
Copy link
Contributor

Thanks for testing @DaanHoogland I suppose that's true if the vm divide setting is enabled for user-vms; for systemvms the memory would be consistently the same (max memory).

Yes @rohityadavcloud , I can test that as well, but we already showed that SVMs start which is what is what matters. Errors in the kvm version seem unrelated, merging.

@DaanHoogland DaanHoogland merged commit dc5e4f3 into apache:4.18 Aug 6, 2023
@DaanHoogland DaanHoogland deleted the issue-workaround branch August 6, 2023 08:39
DaanHoogland added a commit that referenced this pull request Aug 6, 2023
* 4.18:
  Allow KVM overcommit to work without reducing minimum VM memory when vm ballooning is disabled (#7810)
@yadvr
Copy link
Member Author

yadvr commented Aug 6, 2023

Cheers thanks @DaanHoogland

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

System VMs not starting with CentOS 7 hosts

5 participants