OCPCLOUD-3327: Rewrite manifests-gen to support upgrade safety #434
mdbooth wants to merge 11 commits into openshift:main
Conversation
📝 Walkthrough
Refactors manifest generation to an object-centric model ([]client.Object), adds a Go-based kustomize compile/write pipeline, and introduces provider-image discovery/extraction, profile metadata loading, and sorting and application of provider manifests during reconciliation; integrates provider images into operator setup and manifest outputs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Reconciler as Reconciler
    participant Controller as CapiInstallerController
    participant Reader as providerimages.ReadProviderImages
    participant Fetcher as remoteImageFetcher
    participant Registry as ContainerRegistry
    participant FS as Filesystem
    Reconciler->>Controller: reconcile(ctx)
    Controller->>Reader: ReadProviderImages(ctx, k8sClient, containerImages, providerImageDir)
    Reader->>Fetcher: Fetch(imageRef) using keychain
    Fetcher->>Registry: HTTP request for image blob
    Registry-->>Fetcher: return v1.Image
    Fetcher->>FS: extract /capi-operator-manifests from layers
    FS->>FS: write files, discover profiles, parse metadata.yaml
    FS-->>Reader: []ProviderImageManifests
    Reader-->>Controller: provider list
    Controller->>Controller: sortProvidersByInstallOrder(...)
    loop per provider
        Controller->>FS: read manifests.yaml (reader)
        Controller->>Controller: extractManifests(reader) -> []docs
        Controller->>Controller: applyProviderComponents(ctx, docs)
    end
    Controller-->>Reconciler: reconciliation result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-techpreview

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584
Force-pushed from 1ab7f33 to 5087e3b.
/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584

/test e2e-aws-ovn-techpreview
Force-pushed from dab2c52 to 7c0b761, then from f3ba06c to a8a93e3.
/test e2e-aws-ovn-techpreview

/test e2e-aws-ovn-techpreview

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259
2 similar comments
The e2e-aws-ovn-techpreview pass shows that the modifications to the capiinstaller controller are backwards compatible with the current transport configmaps. openshift/cluster-api-provider-aws#584 and openshift/cluster-api#259 update CAPA and core CAPI to publish manifests only in the provider images; they do not publish transport configmaps. The multi-PR test pass with these two PRs shows that pulling manifests from images is working.
Actionable comments posted: 2
⚠️ Outside diff range comments (1)
manifests-gen/customizations.go (1)
164: ⚠️ Potential issue | 🟡 Minor
Typo in panic message.
The panic message has an errant "CustomResourceDefinition" suffix that appears to be a copy-paste artifact.
🔧 Proposed fix
- panic("can't find secret from cert:CustomResourceDefinition " + certNN)
+ panic("can't find secret from cert: " + certNN)
🤖 Fix all issues with AI agents
In `@cmd/capi-operator/main.go`:
- Around lines 147-151: setupControllers calls providerimages.ReadProviderImages with context.Background(), ignoring the signal-aware ctx it receives. Pass the received ctx instead so ReadProviderImages is cancellable on shutdown and shares the same signal-handling lifecycle as the rest of the controller initialization.
In `@manifests/0000_30_cluster-api_11_deployment.yaml`:
- Line 98: The Deployment references the webhook TLS secret as cluster-capi-operator-webhook-service-cert, but the webhook Service annotation still uses the old name capi-controllers-webhook-service-cert. Update the Service's serving-cert annotation (the metadata.annotations key that sets the webhook TLS secret) to cluster-capi-operator-webhook-service-cert so the secret name matches the Deployment's secretName value.
🧹 Nitpick comments (1)
pkg/providerimages/providerimages.go (1)
140-163: Consider propagating context cancellation errors.
The errgroup pattern here collects all errors via a channel but ignores context cancellation. If the context is cancelled (e.g., during shutdown), g.Wait() will return the context error, but it's discarded (_ = g.Wait()). This is likely acceptable for startup-time processing, but worth noting.
  secret:
    defaultMode: 420
-   secretName: capi-controllers-webhook-service-cert
+   secretName: cluster-capi-operator-webhook-service-cert
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Search for references to both old and new webhook secret names
echo "=== Searching for old secret name (should not exist) ==="
rg -n "capi-controllers-webhook-service-cert" --type yaml
echo ""
echo "=== Searching for new secret name ==="
rg -n "cluster-capi-operator-webhook-service-cert" --type yaml

Repository: openshift/cluster-capi-operator
Fix webhook certificate secret name mismatch across manifests.
The webhook certificate secret rename is incomplete. The deployment at line 98 references the new name cluster-capi-operator-webhook-service-cert, but the webhook service annotation in manifests/0000_30_cluster-api_02_webhook-service.yaml:9 still uses the old name capi-controllers-webhook-service-cert. This mismatch will cause the webhook to fail at runtime. Update the service annotation to use the new secret name consistently.
Force-pushed from f49c453 to 8f1aad3.
/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

1 similar comment
@mdbooth: you cannot LGTM your own PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pkg/providerimages/providerimages.go`:
- Around lines 140-194: readProviderImages currently discards all successful manifests when any image fetch fails (it returns nil, err), which contradicts the concurrent, fault-tolerant design. Change it to return the partial results alongside the aggregated error (i.e., return providerImages, err), ensure callers handle a non-nil slice accompanied by a non-nil error, and update the final return path, tests, and doc comments to reflect returning partial results on partial failures.
🧹 Nitpick comments (1)
pkg/providerimages/providerimages.go (1)
95-118: Documentation still mentions "map of provider names" but the parameter is []string.
The comment at line 98 says "containerImages is a map of provider names to provider image references" but the actual parameter type is []string. This appears to be a documentation inconsistency from an earlier revision.
📝 Suggested doc fix
 // ReadProviderImages returns a list of ProviderImageManifests read directly
 // from operand container images.
 //
-// containerImages is a map of provider names to provider image references
+// containerImages is a list of provider image references
 //
/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259
Adds a flag, profile-name, allowing specification of a profile (and therefore a subdirectory) to which manifests and metadata are output.
When parsing images, the logic now checks for multiple 'profiles' (subdirectories) under /capi-operator-manifests. If multiple are found, we return them all.
| Old Field | New Location |
|-----------|--------------|
| `ProviderName` | `Name` |
| `ProviderImageRef` | `SelfImageRef` |
| `ProviderType` | `Attributes["type"]` |
| `ProviderVersion` | `Attributes["version"]` |
| `OCPPlatform` | `OCPPlatform` (unchanged) |
| (new) | `InstallOrder` |

Now installs based on `InstallOrder` rather than sorting on type/version.
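Sorting by `InstallOrder` rather than type/version can be as simple as `slices.SortFunc`. A sketch with an illustrative stand-in type (not the controller's actual `ProviderImageManifests` definition), with a name tie-break for deterministic ordering:

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// provider is a stand-in for the real ProviderImageManifests type.
type provider struct {
	Name         string
	InstallOrder int
}

// sortProvidersByInstallOrder orders providers by InstallOrder,
// breaking ties by name so the install sequence is deterministic.
func sortProvidersByInstallOrder(providers []provider) {
	slices.SortFunc(providers, func(a, b provider) int {
		if c := cmp.Compare(a.InstallOrder, b.InstallOrder); c != 0 {
			return c
		}
		return cmp.Compare(a.Name, b.Name)
	})
}

func main() {
	providers := []provider{
		{Name: "aws", InstallOrder: 20},
		{Name: "cluster-api", InstallOrder: 10},
	}
	sortProvidersByInstallOrder(providers)
	fmt.Println(providers[0].Name) // cluster-api
}
```

An explicit order field means core CAPI (CRDs first) can always install before infrastructure providers without encoding knowledge of provider types into the sort.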
- Adds static related objects so we can debug when we misconfigure something
- Fixes names in related objects in operatorstatus to reflect new names in manifests
Debug Logging
- Added detailed logging when reconciling provider images (manifestsPath, type, version, profile, platform)
- Added logging for manifest extraction count
- Added component count summary before applying (CRDs, deployments, static components)
- Added per-resource logging when applying CRDs and deployments
Error Handling Improvements
- Fixed applyProviderImage to use named return so reader.Close() errors are not silently lost
- Fixed type assertion error messages to show actual type received (expected X but got %T)
- Added errUnexpectedResourceType static error for linter compliance
- Updated setDegradedCondition to accept and surface the actual error in ClusterOperator status conditions
- Changed degraded condition logging from Info to Error level
Provider Images Library
- Added error logging when provider image processing fails
- Improved error wrapping with context for:
- Manifest extraction failures (includes output directory)
- Profile discovery failures (includes directory path)
- Manifest write failures (includes profile name)
Force-pushed from 80d3621 to ab78f98.
Actionable comments posted: 0
⚠️ Outside diff range comments (3)
manifests/0000_30_cluster-api_11_deployment.yaml (1)
29-83: ⚠️ Potential issue | 🟠 Major
Add container securityContext to enforce non-root execution and match SCC requirements.
The pod annotation declares openshift.io/required-scc: restricted-v2 (line 23), which mandates runAsNonRoot: true, allowPrivilegeEscalation: false, capability dropping, and seccomp. However, the containers lack explicit securityContext fields, creating a security posture gap. Adding explicit container-level hardening ensures fail-fast validation at deployment time rather than relying solely on runtime SCC enforcement. The Dockerfile does not include a USER directive, meaning the image defaults to root; an explicit securityContext is essential to enforce the non-root intent declared by the SCC annotation.
🔒 Proposed hardening (apply to both containers)
 - name: capi-controllers
   image: registry.ci.openshift.org/openshift:cluster-capi-operator
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    capabilities:
+      drop: ["ALL"]
+    seccompProfile:
+      type: RuntimeDefault
@@
 - name: machine-api-migration
   image: registry.ci.openshift.org/openshift:cluster-capi-operator
+  securityContext:
+    allowPrivilegeEscalation: false
+    runAsNonRoot: true
+    capabilities:
+      drop: ["ALL"]
+    seccompProfile:
+      type: RuntimeDefault

Verify that both binaries (capi-controllers, machine-api-migration) and the provider-images emptyDir mount function correctly when run as non-root before enforcing runAsNonRoot: true.

pkg/conversion/capi2mapi/aws.go (2)
548-553: ⚠️ Potential issue | 🟠 Major
Guard the PlacementGroupPartition int64→int32 cast against overflow.
The function performs an unchecked cast from int64 to int32, which triggers the G115 linter warning (unsafe type conversion with potential data loss). Add bounds checking to prevent silent overflow for values outside the int32 range.
Suggested fix
 func convertAWSPlacementGroupPartition(in int64) *int32 {
 	if in == 0 {
 		return nil
 	}
+	if in < math.MinInt32 || in > math.MaxInt32 {
+		return nil // consider returning an error to the caller instead
+	}
 	return ptr.To(int32(in))
 }
525-531: ⚠️ Potential issue | 🟠 Major
Add a lower-bound check before casting Throughput to int32.
The current validation only checks the upper bound (math.MaxInt32). To satisfy gosec's G115 rule for unchecked int64→int32 conversions and defend against negative values, add a lower-bound guard.
🔧 Suggested fix
 if volume.Throughput != nil {
-	if *volume.Throughput > math.MaxInt32 {
-		return mapiv1beta1.BlockDeviceMappingSpec{}, field.Invalid(fldPath.Child("throughput"), *volume.Throughput, "throughput exceeds maximum int32 value")
-	}
+	if *volume.Throughput < 0 || *volume.Throughput > math.MaxInt32 {
+		return mapiv1beta1.BlockDeviceMappingSpec{}, field.Invalid(
+			fldPath.Child("throughput"),
+			*volume.Throughput,
+			"throughput must be non-negative and fit in int32",
+		)
+	}
 	bdm.EBS.ThroughputMib = ptr.To(int32(*volume.Throughput))
 }
🧹 Nitpick comments (3)
manifests-gen/generate.go (1)
103: Unnecessary O_APPEND flag with O_TRUNC.
When using O_TRUNC, the file is truncated to zero length, so O_APPEND has no practical effect since writes will always be at the end of an empty/growing file. This applies to both writeManifests (line 103) and writeMetadata (line 138).
🧹 Suggested simplification
- manifestsFile, err := os.OpenFile(manifestsPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY|os.O_APPEND, 0600)
+ manifestsFile, err := os.OpenFile(manifestsPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0600)

cmd/capi-operator/main.go (1)
154-166: Make provider image refs deterministic and deduplicated.
maps.Values is randomized and can include duplicates, which may lead to non-deterministic processing or redundant pulls. Sorting + compacting keeps behavior stable and avoids extra work.
♻️ Proposed fix
 containerImageRefs := slices.Collect(maps.Values(containerImages))
+slices.Sort(containerImageRefs)
+containerImageRefs = slices.Compact(containerImageRefs)

manifests/0000_30_cluster-api_11_deployment.yaml (1)
95-100: Consider adding an emptyDir.sizeLimit for the cache.
This prevents unbounded growth of provider image caches on the node.
♻️ Example (adjust size as appropriate)
 - name: provider-images
-  emptyDir: {}
+  emptyDir:
+    sizeLimit: 1Gi
Actionable comments posted: 0
⚠️ Outside diff range comments (2)
e2e/go.mod (1)
80-121: ⚠️ Potential issue | 🔴 Critical
Pin github.com/klauspost/compress to a stable version; v1.18.1 is marked retracted due to invalid encoding.
The klauspost/compress v1.18.1 indirect dependency is a retracted version with critical flate/zip/gzip encoding bugs. Replace it with a stable, non-retracted version (e.g., v1.17.x or newer stable release).
golang.org/x/oauth2 v0.33.0 and golang.org/x/tools v0.39.0 are compatible with Go 1.24 and have no breaking changes affecting this codebase.
go.mod (1)
12-33: ⚠️ Potential issue | 🔴 Critical
Do not use klauspost/compress v1.18.1; it was retracted due to invalid flate/zip/gzip encoding.
Use v1.18.2 or later instead. This version introduced bugs affecting compression correctness, making it unsuitable for production use.
Additionally, flag go-containerregistry v0.20.7 for daemon client usage: there is a known compatibility issue with Docker v29 stemming from the github.com/docker/docker module deprecation (migrating to github.com/moby/moby/...). If the provider-image workflow relies on daemon client functionality, verify that this does not impact your integration.
The remaining dependencies (docker/cli v29.2.1, golang.org/x/sync v0.18.0, golang.org/x/tools v0.39.0) have no breaking changes relative to their prior versions and are Go 1.24 compatible.
Force-pushed from e80de4b to b868525.
We need to pull in the cluster-api image, as that's where the core CAPI CRDs live.
/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259
@mdbooth: The following tests failed, say

Full PR test history. Your PR dashboard.
Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.
Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.