[Perf] MCAD takes a significant time to process and schedule AppWrappers #510

During the MCAD load test that we performed, we observed a significant difference between how the default scheduler and MCAD schedule the Pods of a workload.

[This plot](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-psap_ci-artifacts/846/pull-ci-openshift-psap-ci-artifacts-main-codeflare-e2e/1681197351064047616/artifacts/e2e/test/artifacts/000__test-case_cpu_light_all_schedulable/000__mcad_load_test_multiple_values/expe/aw.count=150_aw.job.job_mode=False_20230718_0836.2b2d/002__plots/report_00_report:_error_report.html) shows how MCAD scheduled 150 Pods with a low CPU request (all the Pods could fit on the available nodes at the same time):

![image](https://github.com/project-codeflare/multi-cluster-app-dispatcher/assets/7559202/5810adb5-fa59-4401-9907-9d8a04eb0151)

The test ran in `14.2 minutes`.

[This plot](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-psap_ci-artifacts/846/pull-ci-openshift-psap-ci-artifacts-main-codeflare-e2e/1681197351064047616/artifacts/e2e/test/artifacts/000__test-case_cpu_light_all_schedulable/000__mcad_load_test_multiple_values/expe/aw.count=150_aw.job.job_mode=True_20230718_0825.b21c/002__plots/report_00_report:_error_report.html) shows the result of the same test, but with `Job` resources instead of `AppWrappers`.

![image](https://github.com/project-codeflare/multi-cluster-app-dispatcher/assets/7559202/0c3e7c81-1f73-40b9-b8e4-aaf7015e4391)

The test ran in `6.3 minutes`. 

Note that in both cases the Pods ran for `5 minutes`, so the default scheduler's result confirms the expectation that all the Pods fit on the cluster simultaneously.
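To make the gap easier to read, here is a small sketch that subtracts the `5 minutes` Pod runtime from each test duration reported above. It assumes that everything beyond the Pod runtime can be treated as creation, dispatching and scheduling overhead, which the test itself does not break down; the variable and function names are only illustrative.

```python
# Durations in minutes, copied from the CPU test results above.
POD_RUNTIME_MIN = 5.0
total_runtime_min = {
    "AppWrapper (MCAD)": 14.2,
    "Job (default scheduler)": 6.3,
}


def overhead(total_min, pod_runtime_min=POD_RUNTIME_MIN):
    """Return the part of the test duration not covered by one Pod runtime.

    Assumption: all Pods could run in parallel, so a single Pod runtime
    is the lower bound for the whole test.
    """
    return total_min - pod_runtime_min


overheads = {name: overhead(t) for name, t in total_runtime_min.items()}
for name, value in overheads.items():
    print(f"{name}: {value:.1f} min beyond the Pod runtime")
```

Under that assumption, the time spent outside of running the Pods is several times larger with AppWrappers than with plain Jobs.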

---

![image](https://github.com/project-codeflare/multi-cluster-app-dispatcher/assets/7559202/a8fbd94f-a1ff-4bbd-9177-d0cb73e8c441)

[This plot](https://rhods-baremetal-results.s3.amazonaws.com/local-ci/codeflare/codeflare-light/20230718_1613/000__test/000__test-case_gpu_all_schedulable/000__mcad_load_test_multiple_values/expe/aw.count=200_aw.job.job_mode=False_node.count=2_20230718_1606.0431/002__plots/report_00_report:_error_report.html) shows a similar result, with 200 Pods each requesting 1 GPU.
There are 200 GPU resources available in the system in total (2 physical GPUs, each time-sliced into 100 GPU resources).
The test took `21.6 minutes` to run.

![image](https://github.com/project-codeflare/multi-cluster-app-dispatcher/assets/7559202/075b18f2-7046-4ecc-903b-c9157491adb1)
[This plot](https://rhods-baremetal-results.s3.amazonaws.com/local-ci/codeflare/codeflare-light/20230718_1613/000__test/000__test-case_gpu_all_schedulable/000__mcad_load_test_multiple_values/expe/aw.count=200_aw.job.job_mode=True_node.count=2_20230718_1548.3714/002__plots/report_00_report:_error_report.html) shows how the default scheduler performed.
The test took `13.9 minutes` to run.
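
For the GPU test the Pod runtime is not stated here, so the sketch below only compares the two reported durations directly. The dictionary keys are illustrative names, not identifiers from the test suite.

```python
# Durations in minutes, copied from the GPU test results above.
gpu_runtime_min = {"mcad": 21.6, "default": 13.9}

# How much longer the AppWrapper run took, as a ratio and as a difference.
slowdown = gpu_runtime_min["mcad"] / gpu_runtime_min["default"]
extra_min = gpu_runtime_min["mcad"] - gpu_runtime_min["default"]

print(f"MCAD run: {extra_min:.1f} min longer, {slowdown:.2f}x the default scheduler run")
```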
