
WIP: capi azure async image upload #8782

Open · wants to merge 1 commit into base: main
Conversation

patrickdillon (Contributor)

We're still working on getting free marketplace images available, so let's see if we can improve managed image creation by making it asynchronous: the upload runs in the background while the non-machine resources are created.
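The idea can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: function names and the error-channel plumbing are assumptions. The upload starts in a goroutine, the installer creates the non-machine resources concurrently, and it blocks on the WaitGroup only before machine provisioning, which needs the image.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadManagedImage stands in for the long-running VHD upload and
// managed-image creation; name and signature are illustrative.
func uploadManagedImage(errCh chan<- error) {
	// ... page-blob upload and image creation would go here ...
	errCh <- nil
}

// provision starts the image upload in the background, creates the
// non-machine resources meanwhile, then waits for the upload to finish
// before machine provisioning can use the image.
func provision() error {
	var wg sync.WaitGroup
	errCh := make(chan error, 1)

	wg.Add(1)
	go func() {
		defer wg.Done()
		uploadManagedImage(errCh)
	}()

	fmt.Println("creating non-machine resources concurrently")

	wg.Wait() // block only where the image is actually needed
	if err := <-errCh; err != nil {
		return fmt.Errorf("image upload failed: %w", err)
	}
	return nil
}

func main() {
	if err := provision(); err != nil {
		panic(err)
	}
	fmt.Println("ready for machine provisioning")
}
```

The buffered error channel lets the goroutine report failure without blocking; checking it after `Wait` surfaces upload errors before any machine depends on the image.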

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 29, 2024
openshift-ci bot (Contributor) commented Jul 29, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from patrickdillon. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@patrickdillon (Contributor Author)

/test e2e-azure-ovn-techpreview

1 similar comment
@patrickdillon (Contributor Author)

/test e2e-azure-ovn-techpreview

Comment on lines +53 to +64
stream, err := rhcos.FetchCoreOSBuild(ctx)
if err != nil {
return fmt.Errorf("failed to get rhcos stream: %w", err)
}
archName := arch.RpmArch(string(installConfig.ControlPlane.Architecture))
streamArch, err := stream.GetArchitecture(archName)
if err != nil {
return fmt.Errorf("failed to get rhcos architecture: %w", err)
}

azureDisk := streamArch.RHELCoreOSExtensions.AzureDisk
imageURL := azureDisk.URL
Contributor

Isn't this already done in the rhcos/image asset?

Suggested change
- stream, err := rhcos.FetchCoreOSBuild(ctx)
- if err != nil {
- 	return fmt.Errorf("failed to get rhcos stream: %w", err)
- }
- archName := arch.RpmArch(string(installConfig.ControlPlane.Architecture))
- streamArch, err := stream.GetArchitecture(archName)
- if err != nil {
- 	return fmt.Errorf("failed to get rhcos architecture: %w", err)
- }
- azureDisk := streamArch.RHELCoreOSExtensions.AzureDisk
- imageURL := azureDisk.URL
+ imageURL := in.RhcosImage.ControlPlane

Member

@r4f4 yes ;-) I complicated things. This can all go away.


imageLength := headResponse.ContentLength
if imageLength%512 != 0 {
return fmt.Errorf("image length is not alisnged on a 512 byte boundary")
Contributor

Suggested change
- return fmt.Errorf("image length is not alisnged on a 512 byte boundary")
+ return fmt.Errorf("image length is not aligned on a 512 byte boundary")

galleryGen2ImageName := fmt.Sprintf("%s-gen2", in.InfraID)
galleryGen2ImageVersionName := imageVersion

headResponse, err := http.Head(imageURL) // nolint:gosec
Contributor

It'd be nice to add a reason why the linter can be silenced in this case.

Member

IIRC this was to get the size. Go was complaining. We can ditch all of this.
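For reference, the pattern under discussion can be reconstructed as the sketch below (not the PR's exact code; the stand-in test server and function names are assumptions). A HEAD request fetches only Content-Length, not the multi-gigabyte image body, and the length is validated against the 512-byte alignment that Azure page blobs require for VHD uploads. A reason comment on the nolint directive, as requested above, is also shown.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// aligned512 reports whether a byte length sits on a 512-byte boundary,
// as Azure page blobs (the backing store for VHD uploads) require.
func aligned512(n int64) bool {
	return n%512 == 0
}

// imageLength issues a HEAD request so only the Content-Length header is
// fetched, not the image body itself.
func imageLength(url string) (int64, error) {
	resp, err := http.Head(url) // nolint:gosec // URL comes from trusted RHCOS stream metadata
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.ContentLength, nil
}

func main() {
	// Stand-in server so the sketch runs without touching a real image URL.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Length", "1024")
	}))
	defer srv.Close()

	n, err := imageLength(srv.URL)
	if err != nil {
		panic(err)
	}
	if !aligned512(n) {
		panic(fmt.Sprintf("image length %d is not aligned on a 512 byte boundary", n))
	}
	fmt.Println("image length is 512-byte aligned:", n)
}
```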

@r4f4 r4f4 (Contributor) left a comment

This linter report:

pkg/infrastructure/azure/azure.go:573:9: copylocks: Ignition passes lock by value: github.com/openshift/installer/pkg/infrastructure/azure.Provider contains sync.WaitGroup contains sync.noCopy (govet)
func (p Provider) Ignition(ctx context.Context, in clusterapi.IgnitionInput) ([]byte, error) { 

should be looked into. Using copies of locks can result in deadlocks.
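The usual fix for that report is a pointer receiver, so the sync.WaitGroup inside the provider is shared rather than copied per call. The sketch below is illustrative (field and method bodies are assumptions, not the installer's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// Provider holds a WaitGroup used to wait for the background image upload.
// sync.WaitGroup must not be copied after first use, so methods need
// pointer receivers; a value receiver like `func (p Provider) Ignition`
// copies the lock, which is exactly what govet's copylocks flags.
type Provider struct {
	imageUpload sync.WaitGroup
}

// Ignition uses a pointer receiver, so the WaitGroup it waits on is the
// same one the uploader goroutine signals.
func (p *Provider) Ignition() string {
	p.imageUpload.Wait()
	return "ignition generated"
}

func main() {
	p := &Provider{}
	p.imageUpload.Add(1)
	go func() {
		// background upload finishing
		p.imageUpload.Done()
	}()
	fmt.Println(p.Ignition())
}
```

With a value receiver, `Wait` would be called on a private copy whose counter the `Done` call never decrements, so the method could block forever once the copy diverges.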

@r4f4 (Contributor) commented Jul 30, 2024

time="2024-07-29T21:43:04Z" level=debug msg="Time elapsed per stage:"
time="2024-07-29T21:43:04Z" level=debug msg="                       vnet: 12m12s"
time="2024-07-29T21:43:04Z" level=debug msg="                  bootstrap: 1m55s"
time="2024-07-29T21:43:04Z" level=debug msg="                    cluster: 5m45s"
time="2024-07-29T21:43:04Z" level=debug msg="         Bootstrap Complete: 19m53s"
time="2024-07-29T21:43:04Z" level=debug msg="                        API: 34s"
time="2024-07-29T21:43:04Z" level=debug msg="          Bootstrap Destroy: 2m25s"
time="2024-07-29T21:43:04Z" level=debug msg="Cluster Operators Available: 14m43s"
time="2024-07-29T21:43:04Z" level=debug msg="   Cluster Operators Stable: 48s"
time="2024-07-29T21:43:04Z" level=info msg="Time elapsed: 57m44s"
time="2024-07-29T16:41:31Z" level=debug msg="Time elapsed per stage:"
time="2024-07-29T16:41:31Z" level=debug msg="       Infrastructure Pre-provisioning: 17s"
time="2024-07-29T16:41:31Z" level=debug msg="   Network-infrastructure Provisioning: 3m33s"
time="2024-07-29T16:41:31Z" level=debug msg="Post-network, pre-machine Provisioning: 18m44s"
time="2024-07-29T16:41:31Z" level=debug msg="       Bootstrap Ignition Provisioning: 1s"
time="2024-07-29T16:41:31Z" level=debug msg="                  Machine Provisioning: 7m45s"
time="2024-07-29T16:41:31Z" level=debug msg="      Infrastructure Post-provisioning: 23s"
time="2024-07-29T16:41:31Z" level=debug msg="                    Bootstrap Complete: 23m18s"
time="2024-07-29T16:41:31Z" level=debug msg="                                   API: 13m0s"
time="2024-07-29T16:41:31Z" level=debug msg="                     Bootstrap Destroy: 1m24s"
time="2024-07-29T16:41:31Z" level=debug msg="           Cluster Operators Available: 7m51s"
time="2024-07-29T16:41:31Z" level=debug msg="              Cluster Operators Stable: 51s"
time="2024-07-29T16:41:31Z" level=info msg="Time elapsed: 1h4m10s"
time="2024-07-30T00:57:14Z" level=debug msg="Time elapsed per stage:"
time="2024-07-30T00:57:14Z" level=debug msg="       Infrastructure Pre-provisioning: 43s"
time="2024-07-30T00:57:14Z" level=debug msg="   Network-infrastructure Provisioning: 3m21s"
time="2024-07-30T00:57:14Z" level=debug msg="Post-network, pre-machine Provisioning: 7m8s"
time="2024-07-30T00:57:14Z" level=debug msg="       Bootstrap Ignition Provisioning: 1s"
time="2024-07-30T00:57:14Z" level=debug msg="                  Machine Provisioning: 9m46s"
time="2024-07-30T00:57:14Z" level=debug msg="      Infrastructure Post-provisioning: 22s"
time="2024-07-30T00:57:14Z" level=debug msg="                    Bootstrap Complete: 25m13s"
time="2024-07-30T00:57:14Z" level=debug msg="                                   API: 15m38s"
time="2024-07-30T00:57:14Z" level=debug msg="                     Bootstrap Destroy: 1m28s"
time="2024-07-30T00:57:14Z" level=debug msg="           Cluster Operators Available: 10m41s"
time="2024-07-30T00:57:14Z" level=debug msg="              Cluster Operators Stable: 56s"
time="2024-07-30T00:57:14Z" level=info msg="Time elapsed: 59m41s"

So total up to pre-machine provisioning:

  • before: 22m34s
  • after: 11m12s

But overall not much of a difference? We'd need more runs to build a statistically relevant sample.

@patrickdillon (Contributor Author)

@r4f4 thanks for the analysis

Yes, let's try to build up at least a small sample.

/test e2e-azure-ovn-techpreview

Regarding the comments, I generally agree with them. Almost all of that code is simply cut-and-pasted and slightly refactored to be wrapped in goroutines; so I didn't look at it closely. If I do end up pushing this through to completion I will refactor further. 👍

@patrickdillon (Contributor Author)

Also, I was trying to dig up some logs of a CAPI install with marketplace images and it does look like a marked improvement.

time="2024-07-11T15:18:31Z" level=debug msg="Time elapsed per stage:"
time="2024-07-11T15:18:31Z" level=debug msg="       Infrastructure Pre-provisioning: 10s"
time="2024-07-11T15:18:31Z" level=debug msg="   Network-infrastructure Provisioning: 3m31s"
time="2024-07-11T15:18:31Z" level=debug msg="Post-network, pre-machine Provisioning: 34s"
time="2024-07-11T15:18:31Z" level=debug msg="       Bootstrap Ignition Provisioning: 2s"
time="2024-07-11T15:18:31Z" level=debug msg="                  Machine Provisioning: 3m45s"
time="2024-07-11T15:18:31Z" level=debug msg="      Infrastructure Post-provisioning: 8s"
time="2024-07-11T15:18:31Z" level=debug msg="                    Bootstrap Complete: 21m1s"
time="2024-07-11T15:18:31Z" level=debug msg="                                   API: 7m50s"
time="2024-07-11T15:18:31Z" level=debug msg="                     Bootstrap Destroy: 59s"
time="2024-07-11T15:18:31Z" level=debug msg="           Cluster Operators Available: 8m10s"
time="2024-07-11T15:18:31Z" level=debug msg="              Cluster Operators Stable: 41s"
time="2024-07-11T15:18:31Z" level=info msg="Time elapsed: 39m21s"

@patrickdillon (Contributor Author)

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8782/pull-ci-openshift-installer-master-e2e-azure-ovn-techpreview/1818294912215420928

b79a7.ci.azure.devcluster.openshift.com"
time="2024-07-30T16:59:15Z" level=info msg="Login to the console with user: \"kubeadmin\", and password: REDACTED
time="2024-07-30T16:59:15Z" level=debug msg="Time elapsed per stage:"
time="2024-07-30T16:59:15Z" level=debug msg="       Infrastructure Pre-provisioning: 43s"
time="2024-07-30T16:59:15Z" level=debug msg="   Network-infrastructure Provisioning: 3m30s"
time="2024-07-30T16:59:15Z" level=debug msg="Post-network, pre-machine Provisioning: 6m15s"
time="2024-07-30T16:59:15Z" level=debug msg="       Bootstrap Ignition Provisioning: 1s"
time="2024-07-30T16:59:15Z" level=debug msg="                  Machine Provisioning: 7m16s"
time="2024-07-30T16:59:15Z" level=debug msg="      Infrastructure Post-provisioning: 22s"
time="2024-07-30T16:59:15Z" level=debug msg="                    Bootstrap Complete: 13m58s"
time="2024-07-30T16:59:15Z" level=debug msg="                                   API: 12m16s"
time="2024-07-30T16:59:15Z" level=debug msg="                     Bootstrap Destroy: 1m27s"
time="2024-07-30T16:59:15Z" level=debug msg="           Cluster Operators Available: 17m46s"
time="2024-07-30T16:59:15Z" level=debug msg="              Cluster Operators Stable: 46s"
time="2024-07-30T16:59:15Z" level=info msg="Time elapsed: 52m6s"

@patrickdillon (Contributor Author)

/test e2e-azure-ovn-techpreview

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 2, 2024
@openshift-merge-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jhixson74 (Member)

@patrickdillon I like this. Any particular reason it's still WIP?

@barbacbd (Contributor)

/uncc

@openshift-ci openshift-ci bot removed the request for review from barbacbd August 15, 2024 12:34
@patrickdillon (Contributor Author) commented Oct 1, 2024

@patrickdillon I like this. Any particular reason it's still WIP?

I would want to look a bit more at the error handling before moving forward with this. In general, though, we decided it would be better to pursue Azure marketplace images, with their dramatic improvement in provisioning time, rather than this approach.

Also I think we would want to refactor quite a bit in terms of encapsulating each resource into a function to increase legibility.

openshift-ci bot (Contributor) commented Dec 18, 2024

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-ovn-shared-vpc 4fc3c63 link false /test e2e-azure-ovn-shared-vpc
ci/prow/e2e-azurestack 4fc3c63 link false /test e2e-azurestack
ci/prow/okd-unit 4fc3c63 link true /test okd-unit
ci/prow/azure-ovn-marketplace-images 4fc3c63 link false /test azure-ovn-marketplace-images
ci/prow/artifacts-images 4fc3c63 link true /test artifacts-images
ci/prow/e2e-openstack-ovn 4fc3c63 link true /test e2e-openstack-ovn
ci/prow/aro-unit 4fc3c63 link true /test aro-unit
ci/prow/openstack-manifests 4fc3c63 link true /test openstack-manifests
ci/prow/okd-scos-images 4fc3c63 link true /test okd-scos-images
ci/prow/e2e-azure-ovn 4fc3c63 link true /test e2e-azure-ovn
ci/prow/e2e-vsphere-ovn-upi 4fc3c63 link true /test e2e-vsphere-ovn-upi
ci/prow/e2e-azure-ovn-upi 4fc3c63 link true /test e2e-azure-ovn-upi
ci/prow/e2e-gcp-ovn 4fc3c63 link true /test e2e-gcp-ovn
ci/prow/e2e-metal-ipi-ovn-ipv6 4fc3c63 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/govet 4fc3c63 link true /test govet
ci/prow/integration-tests 4fc3c63 link true /test integration-tests
ci/prow/e2e-gcp-ovn-upi 4fc3c63 link true /test e2e-gcp-ovn-upi
ci/prow/terraform-verify-vendor 4fc3c63 link true /test terraform-verify-vendor
ci/prow/e2e-aws-ovn 4fc3c63 link true /test e2e-aws-ovn
ci/prow/e2e-agent-compact-ipv4 4fc3c63 link true /test e2e-agent-compact-ipv4
ci/prow/altinfra-images 4fc3c63 link true /test altinfra-images
ci/prow/integration-tests-nodejoiner 4fc3c63 link true /test integration-tests-nodejoiner
ci/prow/golint 4fc3c63 link true /test golint
ci/prow/e2e-vsphere-ovn 4fc3c63 link true /test e2e-vsphere-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
5 participants