Skip to content

Conversation

@winterhazel
Copy link
Member

Description

In order to copy an object from a secondary storage to another, the MS sends a copy command to the least busy SSVM; however, removed SSVMs are considered when listing the SSVMs ordered by the quantity of commands being executed. This way, if there are commands that have not finished properly in a removed SSVM, a NPE occurs when sending the copy command if the "least busy" SSVM has been removed.

This PRs fixes the NPE.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

  1. I initiated a data migration between secondary storages;
  2. I verified that the cmd_exec_log table had one entry for a command sent to the SSVM;
  3. Before the data migration finished, I deleted the SSVM;
  4. I verified that the cmd_exec_log still had one entry for the removed SSVM;
  5. I waited until another SSVM was created;
  6. I initiated another data migration between secondary storages.

Before the changes, step 6 would result in a NPE, as the removed SSVM was sometimes chosen as the least busy SSVM.

2024-05-24T12:41:47,662 DEBUG [o.a.c.s.i.SecondaryStorageServiceImpl] (pool-19-thread-2:[]) (logid:) Failed to copy Data java.lang.NullPointerException: Cannot invoke "com.cloud.host.Host.getId()" because "host" is null
	at org.apache.cloudstack.storage.RemoteHostEndPoint.configure(RemoteHostEndPoint.java:76)
	at org.apache.cloudstack.storage.RemoteHostEndPoint.getHypervisorHostEndPoint(RemoteHostEndPoint.java:90)
	at org.apache.cloudstack.storage.endpoint.DefaultEndPointSelector.getEndPointFromHostId(DefaultEndPointSelector.java:454)
	at org.apache.cloudstack.storage.image.BaseImageStoreDriverImpl.sendToLeastBusyEndpoint(BaseImageStoreDriverImpl.java:439)
	at org.apache.cloudstack.storage.image.BaseImageStoreDriverImpl.copyAsync(BaseImageStoreDriverImpl.java:413)
	at org.apache.cloudstack.storage.image.BaseImageStoreDriverImpl.copyAsync(BaseImageStoreDriverImpl.java:422)
	at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
	at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:117)
	at org.apache.cloudstack.storage.image.SecondaryStorageServiceImpl.migrateJob(SecondaryStorageServiceImpl.java:167)
	at org.apache.cloudstack.storage.image.SecondaryStorageServiceImpl.migrateData(SecondaryStorageServiceImpl.java:124)
	at org.apache.cloudstack.engine.orchestration.StorageOrchestrator$MigrateDataTask.call(StorageOrchestrator.java:560)
	at org.apache.cloudstack.engine.orchestration.StorageOrchestrator$MigrateDataTask.call(StorageOrchestrator.java:529)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

After applying the changes, step 6 does not throw a NPE, as the removed SSVM is not considered when listing the SSVMs anymore.

@winterhazel
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link

codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.

Project coverage is 15.55%. Comparing base (1e12a80) to head (34e3c16).
Report is 15 commits behind head on main.

Files Patch % Lines
.../src/main/java/com/cloud/host/dao/HostDaoImpl.java 0.00% 15 Missing ⚠️
...dstack/storage/image/BaseImageStoreDriverImpl.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##               main    #9125     +/-   ##
===========================================
  Coverage     15.55%   15.55%             
  Complexity    12011    12011             
===========================================
  Files          5500     5500             
  Lines        481889   481889             
  Branches      62240    58646   -3594     
===========================================
+ Hits          74966    74970      +4     
+ Misses       398637   398633      -4     
  Partials       8286     8286             
Flag Coverage Δ
uitests 4.17% <ø> (ø)
unittests 16.33% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9700

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good, but i think we must consider moving this to the DAO layer.. maybe not in the scope of this pr and we'll need a standard way/place to deal with joins from the code as opposed to views.

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-10297)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 50862 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9125-t10297-kvm-centos7.zip
Smoke tests completed. 130 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_trigger_shutdown Failure 351.82 test_safe_shutdown.py

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

Not tested

@borisstoyanov borisstoyanov self-assigned this Jun 10, 2024
Copy link
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested it, on the second try with new SSVM running I did not had the NPE with a record still in the DB. @DaanHoogland I think your comment is still standing, assigning to you and removing the needs-testing label

@DaanHoogland
Copy link
Contributor

@winterhazel , do you think my comment is worth your effort now? If not I'll add a suggestion for a comment for later work.

@winterhazel
Copy link
Member Author

@winterhazel , do you think my comment is worth your effort now? If not I'll add a suggestion for a comment for later work.

I'll move it to the DAO layer in this PR already ;)

@winterhazel
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9870

@BryanMLima
Copy link
Contributor

@DaanHoogland, can we run the CI one last time here?

@DaanHoogland
Copy link
Contributor

Yes @BryanMLima but our lab is reserved for 4.19 work this week. main will have to wait.

@JoaoJandre
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10453

@JoaoJandre
Copy link
Contributor

@DaanHoogland could we run the CI here?

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10615

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian Build Failed (tid-11062)

@winterhazel
Copy link
Member Author

@DaanHoogland could we run the CI again?

@DaanHoogland
Copy link
Contributor

@blueorangutan LLtest

@blueorangutan
Copy link

@DaanHoogland a [LL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[LL]Trillian test result (tid-6967)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 51941 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9125-t6967-kvm-centos7.zip
Smoke tests completed. 135 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_domain_admin_access Failure 0.22 test_account_access.py
test_01_invalid_upgrade_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 0.00 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle Error 0.00 test_kubernetes_clusters.py
ContextSuite context=TestResourceNames>:setup Error 0.00 test_resource_names.py
test_02_unsecure_vm_migration Error 422.55 test_vm_life_cycle.py

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10775

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-11172)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 56183 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9125-t11172-kvm-ol8.zip
Smoke tests completed. 137 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_unsecure_vm_migration Error 446.67 test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration Error 375.85 test_vm_life_cycle.py
test_04_nonsecured_to_secured_vm_migration Error 356.68 test_vm_life_cycle.py
test_01_redundant_vpc_site2site_vpn Failure 444.73 test_vpc_vpn.py

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-11203)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 47811 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9125-t11203-kvm-ol8.zip
Smoke tests completed. 139 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland merged commit 7692b74 into apache:main Aug 28, 2024
@yadvr yadvr added this to the 4.20.0.0 milestone Aug 29, 2024
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 30, 2024
@DaanHoogland DaanHoogland removed their assignment Dec 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants