Skip to content

OCPCLOUD-3327: Rewrite manifests-gen to support upgrade safety#434

Open
mdbooth wants to merge 11 commits intoopenshift:mainfrom
openshift-cloud-team:manifests-gen
Open

OCPCLOUD-3327: Rewrite manifests-gen to support upgrade safety#434
mdbooth wants to merge 11 commits intoopenshift:mainfrom
openshift-cloud-team:manifests-gen

Conversation

@mdbooth
Copy link
Contributor

@mdbooth mdbooth commented Dec 16, 2025

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

    • Load and apply provider manifests from container images with profile support.
    • RBAC Role/RoleBinding for pull-secret access and a configurable provider-images volume.
  • Improvements

    • More robust manifest generation and resource processing across API groups (including webhooks and CRDs).
    • Controller reconciles provider images, extracts/applies provider manifests, and reports degraded state on errors.
    • ClusterOperator status includes related objects; CRD-driven protection policy generation.
  • Refactor

    • Migration to typed Kubernetes objects, new manifest-generation CLI flow, and dependency reorganization.
  • Tests

    • Added tests covering provider image extraction and controller helpers.

@openshift-ci-robot
Copy link

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
@coderabbitai
Copy link

coderabbitai bot commented Dec 16, 2025

📝 Walkthrough

Walkthrough

Refactors manifest generation to an object-centric model ([]client.Object), adds a Go-based kustomize compile/write pipeline, and introduces provider-image discovery/extraction, profile metadata loading, sorting and application of provider manifests during reconciliation; integrates provider-images into operator setup and manifest outputs. (32 words)

Changes

Cohort / File(s) Summary
Manifest generation & helpers
manifests-gen/customizations.go, manifests-gen/util.go, manifests-gen/generate.go, manifests-gen/kustomization.yaml
Switches to []client.Object processing, adds conversion/type helpers (mustConvert, convert, stripStatus, setTypeMetadataFromScheme), group-based filtering, CRD-driven InfraCluster protection policy generation, and a Go kustomize compile + write pipeline. Adds kustomize component and Secret local-config patch.
CLI & flags
manifests-gen/main.go
Centralizes CLI into cmdlineOptions and attributeFlags, adds validateFlags() with aggregated errors and platform validation.
Removed legacy provider pipeline
manifests-gen/providercustomizations.go (deleted), manifests-gen/providers.go (deleted)
Removes prior provider-specific customization dispatcher and the previous provider orchestration (clusterctl/kustomize pipeline).
Provider images package
pkg/providerimages/providerimages.go, pkg/providerimages/pullsecret.go, pkg/providerimages/providerimages_test.go
New package to parse pull-secrets into a Docker keychain, fetch images (imageFetcher/remoteImageFetcher), extract /capi-operator-manifests across layered image filesystem, discover profiles, load metadata, write manifests with content hashes, and return ProviderImageManifests. Includes extensive unit tests and a fake fetcher.
Installer controller changes
pkg/controllers/capiinstaller/capi_installer_controller.go, pkg/controllers/capiinstaller/capi_installer_controller_test.go
Adds ProviderImages field, reconcileProviderImages(), applyProviderImage(), sortProvidersByInstallOrder(), refactors manifest extraction to reader-based flow, improves type-safety for CRDs/Deployments, enhances degraded-condition handling, and updates tests.
Operator binary integration
cmd/capi-operator/main.go
Loads provider images via ReadProviderImages(), supplies ProviderImages to installer controller, introduces PROVIDER_IMAGE_DIR env var and default, and refactors controller setup/error propagation.
Config parsing changes
pkg/util/readconfig.go
Removes provider-file reader and YAML parsing; ReadImagesFile() now parses JSON into map[string]string.
Manifests updates
manifests/0000_30_cluster-api_03_rbac_roles.yaml, manifests/0000_30_cluster-api_04_rbac_bindings.yaml, manifests/0000_30_cluster-api_11_deployment.yaml, manifests/0000_30_cluster-api_12_clusteroperator.yaml, manifests/0000_30_cluster-api-installer_04_images.configmap.yaml, manifests/image-references
Adds RBAC Role/RoleBinding for pull-secret access, mounts provider-images volume and env var in the deployment, updates ClusterOperator status.relatedObjects, and adds image reference entries / images.json mapping.
Operator status & naming
pkg/operatorstatus/operator_status.go
Updates RelatedObjects references to new serviceaccount/deployment names and pluralized validating admission policy resource names.
Dependencies & modules
go.mod, manifests-gen/go.mod, e2e/go.mod, hack/tools/go.mod
Adds/updates dependencies for container image handling and kustomize/controller-runtime ecosystems (e.g., google/go-containerregistry, docker/cli, OpenShift APIs), plus indirects and a local replace.
Small lint edits
pkg/controllers/infracluster/azure.go, pkg/conversion/capi2mapi/aws.go
Removes nolint:gosec directives; no behavioral changes.

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler as Reconciler
    participant Controller as CapiInstallerController
    participant Reader as providerimages.ReadProviderImages
    participant Fetcher as remoteImageFetcher
    participant Registry as ContainerRegistry
    participant FS as Filesystem

    Reconciler->>Controller: reconcile(ctx)
    Controller->>Reader: ReadProviderImages(ctx, k8sClient, containerImages, providerImageDir)
    Reader->>Fetcher: Fetch(imageRef) using keychain
    Fetcher->>Registry: HTTP request for image blob
    Registry-->>Fetcher: return v1.Image
    Fetcher->>FS: extract /capi-operator-manifests from layers
    FS->>FS: write files, discover profiles, parse metadata.yaml
    FS-->>Reader: []ProviderImageManifests
    Reader-->>Controller: provider list
    Controller->>Controller: sortProvidersByInstallOrder(...)
    loop per provider
      Controller->>FS: read manifests.yaml (reader)
      Controller->>Controller: extractManifests(reader) -> []docs
      Controller->>Controller: applyProviderComponents(ctx, docs)
    end
    Controller-->>Reconciler: reconciliation result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibble layers, tar and light,

I fetch the manifests through the night.
I sniff the profiles, sort by order true,
I stitch and apply each manifest for you.
Hooray — provider images hop into view!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'OCPCLOUD-3327: Rewrite manifests-gen to support upgrade safety' clearly and specifically summarizes the main objective of the pull request: rewriting the manifests-gen module to support upgrade safety through image-embedded provider manifests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign racheljpg for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mdbooth mdbooth changed the title Rewrite manifests-gen to support upgrade safety OCPCLOUD-3327: Rewrite manifests-gen to support upgrade safety Dec 16, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 16, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 16, 2025

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Remove provider customisation for PowerVS
  • manifests-gen: Rewrite to support Update Safety
  • capiinstaller: Compatibility with new manifests-gen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 16, 2025

/test e2e-aws-ovn-techpreview

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 16, 2025

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 17, 2025

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 17, 2025

/test e2e-aws-ovn-techpreview

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2025
@mdbooth mdbooth force-pushed the manifests-gen branch 3 times, most recently from f3ba06c to a8a93e3 Compare December 18, 2025 16:54
@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 19, 2025

/test e2e-aws-ovn-techpreview
/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 19, 2025

/test e2e-aws-ovn-techpreview

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 19, 2025

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

2 similar comments
@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 19, 2025

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 19, 2025

/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

@mdbooth
Copy link
Contributor Author

mdbooth commented Dec 21, 2025

The e2e-aws-ovn-techpreview pass shows the modifications to the capiinstaller controller are backwards compatible with current transport configmaps.

openshift/cluster-api-provider-aws#584 and openshift/cluster-api#259 update CAPA and core CAPI to publish manifests only in the provider images. They do not publish transport configmaps.

The multi-pr test pass with these 2 PRs shows that pulling manifests from images is working.

@mdbooth mdbooth marked this pull request as ready for review January 5, 2026 13:09
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026
@openshift-ci openshift-ci bot requested review from nrb and theobarberbany January 5, 2026 13:09
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
manifests-gen/customizations.go (1)

164-164: ⚠️ Potential issue | 🟡 Minor

Minor: Typo in panic message.

The panic message has an errant "CustomResourceDefinition" suffix that appears to be a copy-paste artifact.

🔧 Proposed fix
-				panic("can't find secret from cert:CustomResourceDefinition " + certNN)
+				panic("can't find secret from cert: " + certNN)
🤖 Fix all issues with AI agents
In `@cmd/capi-operator/main.go`:
- Around line 147-151: The call to providerimages.ReadProviderImages is using
context.Background() which ignores the setupControllers signal-aware context
(ctx); change the call in setupControllers to pass the received ctx instead
(i.e., replace context.Background() with ctx) so
providerimages.ReadProviderImages(...) is cancellable on shutdown and uses the
same signal-handling lifecycle as the rest of the controller initialization.

In `@manifests/0000_30_cluster-api_11_deployment.yaml`:
- Line 98: The deployment references the webhook TLS secret as
cluster-capi-operator-webhook-service-cert but the webhook Service annotation
still uses the old name capi-controllers-webhook-service-cert; update the
Service's admission webhook annotation (the metadata.annotations key that sets
the webhook TLS secret) to use cluster-capi-operator-webhook-service-cert so the
secret name is consistent with the Deployment's secretName value.
🧹 Nitpick comments (1)
pkg/providerimages/providerimages.go (1)

140-163: Consider propagating context cancellation errors.

The errgroup pattern here collects all errors via channel but ignores any context cancellation. If the context is cancelled (e.g., shutdown), g.Wait() will return the context error, but it's discarded. This is likely acceptable for startup-time processing, but worth noting.

_ = g.Wait() // We're not actually returning errors directly

secret:
defaultMode: 420
secretName: capi-controllers-webhook-service-cert
secretName: cluster-capi-operator-webhook-service-cert
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Search for references to both old and new webhook secret names

echo "=== Searching for old secret name (should not exist) ==="
rg -n "capi-controllers-webhook-service-cert" --type yaml

echo ""
echo "=== Searching for new secret name ==="
rg -n "cluster-capi-operator-webhook-service-cert" --type yaml

Repository: openshift/cluster-capi-operator

Length of output: 437


Fix webhook certificate secret name mismatch across manifests.

The webhook certificate secret rename is incomplete. The deployment at line 98 references the new name cluster-capi-operator-webhook-service-cert, but the webhook service annotation in manifests/0000_30_cluster-api_02_webhook-service.yaml:9 still uses the old name capi-controllers-webhook-service-cert. This mismatch will cause the webhook to fail at runtime. Update the service annotation to use the new secret name consistently.

🤖 Prompt for AI Agents
In `@manifests/0000_30_cluster-api_11_deployment.yaml` at line 98, The deployment
references the webhook TLS secret as cluster-capi-operator-webhook-service-cert
but the webhook Service annotation still uses the old name
capi-controllers-webhook-service-cert; update the Service's admission webhook
annotation (the metadata.annotations key that sets the webhook TLS secret) to
use cluster-capi-operator-webhook-service-cert so the secret name is consistent
with the Deployment's secretName value.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Details

In response to this:

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

  • Load provider images from registries, extract and cache per-profile manifests, and surface them to the installer.

  • Refactor

  • Reworked manifest generation to an object-centric, typed pipeline and centralized CLI validation.

  • Chores

  • Bumped container/Docker/OpenShift-related dependencies and updated module manifests.

  • Added RBAC Role/RoleBinding and mounted provider-images volume with env var.

  • Tests

  • Added comprehensive tests for provider image reading and manifest extraction.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@theobarberbany
Copy link
Contributor

/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

1 similar comment
@theobarberbany
Copy link
Contributor

/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 4, 2026

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Details

In response to this:

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

  • Load provider images from registries, extract/cache per-profile manifests, and surface them to the installer; CLI supports provider-image inputs.

  • Refactor

  • Reworked manifest generation to an object-centric, typed pipeline with centralized CLI validation and platform/profile handling.

  • Chores

  • Bumped dependency versions and refreshed module manifests; added kustomize component, RBAC Role/RoleBinding, and mounted provider-images volume with env var.

  • Tests

  • Added comprehensive tests for provider image reading and manifest extraction.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor Author

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks 👍

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 4, 2026

@mdbooth: you cannot LGTM your own PR.

Details

In response to this:

/lgtm

Thanks 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 5, 2026

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Details

In response to this:

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

  • Load provider images from registries, extract/cache per-profile manifests, and surface them to the installer; CLI accepts provider-image inputs.

  • Refactor

  • Reworked manifest generation to an object-centric, typed pipeline with centralized CLI validation and platform/profile handling.

  • Chores

  • Updated module/dependency versions, added kustomize component, RBAC Role/RoleBinding, provider-images volume mount and env var.

  • Tests

  • Added comprehensive tests for provider image reading and manifest extraction.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/providerimages/providerimages.go`:
- Around line 140-194: readProviderImages currently discards all successful
manifests when any image fetch fails (it returns nil, err), which contradicts
the concurrent/fault-tolerant design; change the return behavior so that
readProviderImages returns partial results along with the aggregated error
instead of nil when some images succeeded (i.e., return providerImages, err),
and ensure any callers of readProviderImages handle a non-nil slice plus error;
locate the aggregation logic around providerImageResult, the loop collecting
into providerImages, and the final return in readProviderImages, and update the
final return path (and tests/doc comments) to reflect returning partial results
on partial failures.
🧹 Nitpick comments (1)
pkg/providerimages/providerimages.go (1)

95-118: Documentation still mentions "map of provider names" but parameter is []string.

The comment at line 98 says "containerImages is a map of provider names to provider image references" but the actual parameter type is []string. This appears to be a documentation inconsistency from an earlier revision.

📝 Suggested doc fix
 // ReadProviderImages returns a list of ProviderImageManifests read directly
 // from operand container images.
 //
-// containerImages is a map of provider names to provider image references
+// containerImages is a list of provider image references
 //

@theobarberbany
Copy link
Contributor

/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

mdbooth and others added 8 commits February 5, 2026 15:08
Adds a flag, profile-name allowing specification of a profile (and
therefore subdirectory) to output manifests and metadata to.
When parsing images, the logic now checks for multiple 'profiles' (sub
directories) under /capi-operator-manifests.

If multiple are found, we return them all.
| Old Field | New Location |
|-----------|--------------|
| `ProviderName` | `Name` |
| `ProviderImageRef` | `SelfImageRef` |
| `ProviderType` | `Attributes["type"]` |
| `ProviderVersion` | `Attributes["version"]` |
| `OCPPlatform` | `OCPPlatform` (unchanged) |
| (new) | `InstallOrder` |

Now installs based on `InstallOrder` rather than sorting on
type/version.
- Adds static related objects so we can debug when we miss-configure
  something
- Fixes names in related objects in operatorstatus to reflect new names
  in manifests
  Debug Logging

  - Added detailed logging when reconciling provider images (manifestsPath, type, version, profile, platform)
  - Added logging for manifest extraction count
  - Added component count summary before applying (CRDs, deployments, static components)
  - Added per-resource logging when applying CRDs and deployments

  Error Handling Improvements

  - Fixed applyProviderImage to use named return so reader.Close() errors are not silently lost
  - Fixed type assertion error messages to show actual type received (expected X but got %T)
  - Added errUnexpectedResourceType static error for linter compliance
  - Updated setDegradedCondition to accept and surface the actual error in ClusterOperator status conditions
  - Changed degraded condition logging from Info to Error level

  Provider Images Library

  - Added error logging when provider image processing fails
  - Improved error wrapping with context for:
    - Manifest extraction failures (includes output directory)
    - Profile discovery failures (includes directory path)
    - Manifest write failures (includes profile name)
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
manifests/0000_30_cluster-api_11_deployment.yaml (1)

29-83: ⚠️ Potential issue | 🟠 Major

Add container securityContext to enforce non-root execution and match SCC requirements.

The pod annotation declares openshift.io/required-scc: restricted-v2 (line 23), which mandates runAsNonRoot: true, allowPrivilegeEscalation: false, capability dropping, and seccomp. However, the containers lack explicit securityContext fields, creating a security posture gap. Adding explicit container-level hardening ensures fail-fast validation at deployment time rather than relying solely on runtime SCC enforcement.

The Dockerfile does not include a USER directive, meaning the image defaults to root; explicit securityContext is essential to enforce the non-root intent declared by the SCC annotation.

🔒 Proposed hardening (apply to both containers)
       - name: capi-controllers
         image: registry.ci.openshift.org/openshift:cluster-capi-operator
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop: ["ALL"]
+          seccompProfile:
+            type: RuntimeDefault
@@
       - name: machine-api-migration
         image: registry.ci.openshift.org/openshift:cluster-capi-operator
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop: ["ALL"]
+          seccompProfile:
+            type: RuntimeDefault

Verify that both binaries (capi-controllers, machine-api-migration) and the provider-images emptyDir mount function correctly when run as non-root before enforcing runAsNonRoot: true.

pkg/conversion/capi2mapi/aws.go (2)

548-553: ⚠️ Potential issue | 🟠 Major

Guard the PlacementGroupPartition int64→int32 cast against overflow.

The function performs an unchecked cast from int64 to int32, which triggers the G115 linter warning (unsafe type conversion with potential data loss). Add bounds checking to prevent silent overflow for values outside the int32 range.

Suggested fix
 func convertAWSPlacementGroupPartition(in int64) *int32 {
   if in == 0 {
     return nil
   }
 
+  if in < math.MinInt32 || in > math.MaxInt32 {
+    return nil // consider returning an error to the caller instead
+  }
   return ptr.To(int32(in))
 }

525-531: ⚠️ Potential issue | 🟠 Major

Add a lower-bound check before casting Throughput to int32.

The current validation only checks the upper bound (math.MaxInt32). To satisfy gosec's G115 rule for unchecked int64→int32 conversions and defend against negative values, add a lower-bound guard.

🔧 Suggested fix
  if volume.Throughput != nil {
-   if *volume.Throughput > math.MaxInt32 {
-     return mapiv1beta1.BlockDeviceMappingSpec{}, field.Invalid(fldPath.Child("throughput"), *volume.Throughput, "throughput exceeds maximum int32 value")
-   }
+   if *volume.Throughput < 0 || *volume.Throughput > math.MaxInt32 {
+     return mapiv1beta1.BlockDeviceMappingSpec{}, field.Invalid(
+       fldPath.Child("throughput"),
+       *volume.Throughput,
+       "throughput must be non-negative and fit in int32",
+     )
+   }
 
    bdm.EBS.ThroughputMib = ptr.To(int32(*volume.Throughput))
  }
🧹 Nitpick comments (3)
manifests-gen/generate.go (1)

103-103: Unnecessary O_APPEND flag with O_TRUNC.

When using O_TRUNC, the file is truncated to zero length, so O_APPEND has no practical effect since writes will always be at the end of an empty/growing file. This applies to both writeManifests (line 103) and writeMetadata (line 138).

🧹 Suggested simplification
-	manifestsFile, err := os.OpenFile(manifestsPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY|os.O_APPEND, 0600)
+	manifestsFile, err := os.OpenFile(manifestsPathname, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0600)
cmd/capi-operator/main.go (1)

154-166: Make provider image refs deterministic and deduplicated.

maps.Values is randomized and can include duplicates, which may lead to non-deterministic processing or redundant pulls. Sorting + compacting keeps behavior stable and avoids extra work.

♻️ Proposed fix
-	containerImageRefs := slices.Collect(maps.Values(containerImages))
+	containerImageRefs := slices.Collect(maps.Values(containerImages))
+	slices.Sort(containerImageRefs)
+	containerImageRefs = slices.Compact(containerImageRefs)
manifests/0000_30_cluster-api_11_deployment.yaml (1)

95-100: Consider adding an emptyDir.sizeLimit for the cache.

This prevents unbounded growth of provider image caches on the node.

♻️ Example (adjust size as appropriate)
       - name: provider-images
-        emptyDir: {}
+        emptyDir:
+          sizeLimit: 1Gi

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 5, 2026

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Details

In response to this:

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

  • Load provider manifests directly from container images with extraction, caching and profile support.

  • RBAC Role and RoleBinding for pull-secret access to enable provider image authentication.

  • Controller support for a provider-images volume and configurable provider image directory.

  • Improvements

  • Enhanced manifest generation and resource processing with broader API-group handling.

  • ClusterOperator status now lists related objects for improved observability.

  • Controller reconciles and applies provider image manifests.

  • Internal Changes

  • Migrated processing to typed Kubernetes objects and removed legacy provider-specific customization logic.

  • Dependency and module updates.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
e2e/go.mod (1)

80-121: ⚠️ Potential issue | 🔴 Critical

Retract github.com/klauspost/compress to a stable version; v1.18.1 is marked retracted due to invalid encoding.

The klauspost/compress v1.18.1 indirect dependency is a retracted version with critical flate/zip/gzip encoding bugs. Replace it with a stable, non-retracted version (e.g., v1.17.x or newer stable release).

golang.org/x/oauth2 v0.33.0 and golang.org/x/tools v0.39.0 are compatible with Go 1.24 and have no breaking changes affecting this codebase.

go.mod (1)

12-33: ⚠️ Potential issue | 🔴 Critical

Do not use klauspost/compress v1.18.1—it was retracted due to invalid flate/zip/gzip encoding.

Use v1.18.2 or later instead. This version introduced bugs affecting compression correctness, making it unsuitable for production use.

Additionally, flag go-containerregistry v0.20.7 for daemon client usage: there is a known compatibility issue with Docker v29 stemming from the github.com/docker/docker module deprecation (migrating to github.com/moby/moby/...). If the provider-image workflow relies on daemon client functionality, verify that this does not impact your integration.

The remaining dependencies (docker/cli v29.2.1, golang.org/x/sync v0.18.0, golang.org/x/tools v0.39.0) have no breaking changes relative to their prior versions and are Go 1.24 compatible.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 5, 2026

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Details

In response to this:

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

  • Load and apply provider manifests directly from container images with profile support.

  • RBAC Role/RoleBinding for pull-secret access and configurable provider-images volume.

  • Improvements

  • More robust manifest generation and resource processing across API groups.

  • Controller now reconciles provider images and applies provider manifests.

  • ClusterOperator status includes related objects for better observability.

  • CRD-driven policy generation and additional resource handling.

  • Internal Changes

  • Migration to typed Kubernetes objects and dependency updates.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

We need to pull in the cluster-api image as that's where the core CAPI
CRDs live
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 5, 2026

@mdbooth: This pull request references OCPCLOUD-3327 which is a valid jira issue.

Details

In response to this:

Rewrite manifests-gen to support embedding CAPI installer manifests in the provider image instead of in a transport configmap.

Update the CAPI installer controller to support the new embedded manifests in addition to the existing transport configmaps. This allows us to have a smooth transition period while we update all providers.

Summary by CodeRabbit

  • New Features

  • Load and apply provider manifests from container images with profile support.

  • RBAC Role/RoleBinding for pull-secret access and a configurable provider-images volume.

  • Improvements

  • More robust manifest generation and resource processing across API groups (including webhooks and CRDs).

  • Controller reconciles provider images, extracts/applies provider manifests, and reports degraded state on errors.

  • ClusterOperator status includes related objects; CRD-driven protection policy generation.

  • Refactor

  • Migration to typed Kubernetes objects, new manifest-generation CLI flow, and dependency reorganization.

  • Tests

  • Added tests covering provider image extraction and controller helpers.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@theobarberbany
Copy link
Contributor

/testwith openshift/cluster-capi-operator/main/e2e-aws-capi-techpreview openshift/cluster-api-provider-aws#584 openshift/cluster-api#259

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 5, 2026

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal3-capi-techpreview 3da26fb link false /test e2e-metal3-capi-techpreview
ci/prow/e2e-azure-ovn-techpreview 1b4e131 link false /test e2e-azure-ovn-techpreview
ci/prow/okd-scos-images 932ab21 link true /test okd-scos-images
ci/prow/images 932ab21 link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants