Skip to content

OCPBUGS-56104,OCPCLOUD-2893: Add related objects to must-gather config#267

Merged
openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
honza:related-objects
May 13, 2025
Merged

OCPBUGS-56104,OCPCLOUD-2893: Add related objects to must-gather config#267
openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
honza:related-objects

Conversation

@honza
Copy link
Member

@honza honza commented Mar 3, 2025

No description provided.

@honza honza changed the title Add related objects to must-gather config OCPCLOUD-2893: Add related objects to must-gather config Mar 3, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 3, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 3, 2025

@honza: This pull request references OCPCLOUD-2893 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from elmiko and racheljpg March 3, 2025 18:18
@honza honza changed the title OCPCLOUD-2893: Add related objects to must-gather config WIP: OCPCLOUD-2893: Add related objects to must-gather config Mar 4, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2025
@honza honza force-pushed the related-objects branch 2 times, most recently from 5ab4e21 to be85c0f Compare March 6, 2025 21:59
@nrb
Copy link
Contributor

nrb commented Mar 11, 2025

/assign @damdo

Copy link
Contributor

@nrb nrb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as a first pass this is good - ideally, we'd use discovery to grab all CRDs in the group, but I think that at the moment it's more important to have the functionality than to spend time on the generator.

Also, the resources should be plural; hopefully you can just accept the suggestions to have it done automatically.

@honza
Copy link
Member Author

honza commented Mar 12, 2025

ideally, we'd use discovery to grab all CRDs in the group,

Is there any prior art on this? I looked but haven't found anything. I have a WIP uncommitted code that tries to do this but it feels wrong.

@honza honza force-pushed the related-objects branch from be85c0f to 0954a1c Compare March 12, 2025 15:59
@nrb
Copy link
Contributor

nrb commented Mar 12, 2025

Is there any prior art on this?

Not that I can find; I was hoping so, but perhaps it doesn't exist on purpose.

it feels wrong.

Why's that?

@honza
Copy link
Member Author

honza commented Mar 12, 2025

func (r *ClusterOperatorStatusClient) relatedObjects() []configv1.ObjectReference {
	// TBD: Add an actual set of object references from getResources method
	infra, err := util.GetInfra(context.TODO(), r.Client)
	if err != nil {
		// TODO
	}
	platform, err := util.GetPlatform(context.TODO(), infra)
	if err != nil {
		// TODO
	}

	m := r.Scheme().AllKnownTypes()

	for groupVersionKind, t := range m {

		if strings.Contains(groupVersionKind.Kind, "Options") {
			continue
		}
		if groupVersionKind.Kind == "WatchEvent" {
			continue
		}
		if strings.Contains(groupVersionKind.Group, "cluster.x-k8s.io") {
			field, found := t.FieldByName("ObjectMeta")
			if !found {
				continue
			}

			// TODO: match based on platform ^^^
		}
	}

	return []configv1.ObjectReference{
		{Resource: "namespaces", Name: controllers.DefaultManagedNamespace},
		{Group: configv1.GroupName, Resource: "clusteroperators", Name: controllers.ClusterOperatorName},
		{Resource: "namespaces", Name: r.ManagedNamespace},
		{Group: "", Resource: "serviceaccounts", Name: "cluster-capi-operator", Namespace: controllers.DefaultManagedNamespace},
		{Group: "", Resource: "configmaps", Name: "cluster-capi-operator-images", Namespace: controllers.DefaultManagedNamespace},
		{Group: "apps", Resource: "deployments", Name: "cluster-capi-operator", Namespace: controllers.DefaultManagedNamespace},
		{Group: "cluster.x-k8s.io", Resource: "clusters", Namespace: controllers.DefaultManagedNamespace},
		{Group: "cluster.x-k8s.io", Resource: "machines", Namespace: controllers.DefaultManagedNamespace},
	}
}

@damdo
Copy link
Member

damdo commented Mar 14, 2025

Hey @honza thanks for this PR.

Looking at the gather-extra from one of the e2e jobs I only see machines/machinesets, and not for example core CAPI cluster.

From the classic must-gather instead I also see cluster, but I still don't see awscluster for example.

Any ideas on why?

@theobarberbany
Copy link
Contributor

👋 Reviving this, as having this functionality in must-gathers would be awesome.

Is there anything I can do to help?

@honza, I think your code in the above comment looks ok as a starting point. If we can get it commited we can review? :)

@honza
Copy link
Member Author

honza commented Apr 28, 2025

Let me clean it up and push.

@honza honza force-pushed the related-objects branch from 0954a1c to 5517e54 Compare April 28, 2025 18:47
@honza
Copy link
Member Author

honza commented Apr 28, 2025

In the classic gather, we now have:

$ tree namespaces/openshift-cluster-api/infrastructure.cluster.x-k8s.io/
infrastructure.cluster.x-k8s.io/
├── awsclusters
│   └── ci-op-yi9j3qyc-c3c99-58vxn.yaml
├── awsmachines
│   ├── ci-op-yi9j3qyc-c3c99-58vxn-worker-us-east-2b-7r2bh.yaml
│   ├── ci-op-yi9j3qyc-c3c99-58vxn-worker-us-east-2b-fk8k8.yaml
│   └── ci-op-yi9j3qyc-c3c99-58vxn-worker-us-east-2c-2qrcx.yaml
└── awsmachinetemplates
    ├── ci-op-yi9j3qyc-c3c99-58vxn-worker-us-east-2b.yaml
    └── ci-op-yi9j3qyc-c3c99-58vxn-worker-us-east-2c.yaml

4 directories, 6 files

@honza
Copy link
Member Author

honza commented Apr 28, 2025

I think the extra gather commands are here. We could add a big switch statement and collect the infra resources based on platform.

@honza honza changed the title WIP: OCPCLOUD-2893: Add related objects to must-gather config OCPCLOUD-2893: Add related objects to must-gather config May 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2025
@damdo
Copy link
Member

damdo commented May 2, 2025

I think the extra gather commands are here. We could add a big switch statement and collect the infra resources based on platform.

Yeah we could 👍

@honza honza force-pushed the related-objects branch from 5517e54 to 7e7360c Compare May 2, 2025 17:07
@honza
Copy link
Member Author

honza commented May 2, 2025

openshift/release#64484

@honza honza force-pushed the related-objects branch from 7e7360c to 21481d8 Compare May 5, 2025 13:18
@damdo
Copy link
Member

damdo commented May 9, 2025

@honza yes openshift/release#56322 covers the gather-extra, but for the must-gather, the change was missing. Thanks for adding it, let's confirm this is in the must-gather artifact before proceeding with the merge.

- group: "infrastructure.cluster.x-k8s.io"
name: ""
namespace: openshift-cluster-api
resource: awsclusters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're actively managing these entries in your operator (which it looks like you are), I don't think you need to go through the work of manually listing them here in your ClusterOperator manifest. You need enough in the manifest that a must-gather while your operator isn't running collects enough to figure out why the operator isn't running. And it seems unlikely that folks would need to get all the way out to cloud-specific types to figure out why the operator wasn't running?

But also, 🤷, if you don't mind managing this list by hand or you find some way to automate it, having the cloud-specific types in this manifest doesn't hurt.

@honza
Copy link
Member Author

honza commented May 9, 2025

$ tree namespaces/openshift-cluster-api/cluster.x-k8s.io/machinesets/
namespaces/openshift-cluster-api/cluster.x-k8s.io/machinesets/
├── ci-op-hq03lh3v-c3c99-2nvtb-worker-us-east-2b.yaml
└── ci-op-hq03lh3v-c3c99-2nvtb-worker-us-east-2c.yaml

1 directory, 2 files

@honza
Copy link
Member Author

honza commented May 9, 2025

I also noticed this in the Azure job:

$ tree infrastructure.cluster.x-k8s.io
infrastructure.cluster.x-k8s.io
├── azureclusteridentities
│   └── ci-op-hq03lh3v-d9d1a-g4jn8.yaml
└── azureclusters
    └── ci-op-hq03lh3v-d9d1a-g4jn8.yaml

3 directories, 2 files

@honza
Copy link
Member Author

honza commented May 9, 2025

vsphere is the same

My theory is that the e2e test deletes those resources before we gather. Does that sound right?

@damdo
Copy link
Member

damdo commented May 12, 2025

I also noticed this in the Azure job:

$ tree infrastructure.cluster.x-k8s.io
infrastructure.cluster.x-k8s.io
├── azureclusteridentities
│   └── ci-op-hq03lh3v-d9d1a-g4jn8.yaml
└── azureclusters
    └── ci-op-hq03lh3v-d9d1a-g4jn8.yaml

3 directories, 2 files

Yes in azure we create an azureclusteridentity in the infracluster controller, as that's required there.
But we only do it for Azure.

azureClusterIdentity := &azurev1.AzureClusterIdentity{
ObjectMeta: metav1.ObjectMeta{
Name: r.Infra.Status.InfrastructureName,
Namespace: defaultCAPINamespace,
Annotations: map[string]string{
// The ManagedBy Annotation is set so CAPI infra providers ignore the InfraCluster object,
// as that's managed externally, in this case by the cluster-capi-operator's infracluster controller.
clusterv1.ManagedByAnnotation: managedByAnnotationValueClusterCAPIOperatorInfraClusterController,
},
},
Spec: azurev1.AzureClusterIdentitySpec{
Type: azurev1.ServicePrincipal,
AllowedNamespaces: &azurev1.AllowedNamespaces{NamespaceList: []string{defaultCAPINamespace}},
ClientID: string(azureClientID),
TenantID: string(azureTenantID),
ClientSecret: corev1.SecretReference{Name: clusterSecretName, Namespace: defaultCAPINamespace},
},
}
// The Azure Cluster Identtiy does not exist, so it needs to be created.
if err := r.Create(ctx, azureClusterIdentity); err != nil && !cerrors.IsAlreadyExists(err) {
return fmt.Errorf("failed to create Azure Cluster Identity: %w", err)
}


vsphere is the same

My theory is that the e2e test deletes those resources before we gather. Does that sound right?

@honza What do you mean by "vsphere is the same"?
I didn't notice anything peculiar in the vpshere gathering:

[~/Downloads/must-gather (6)/registry-apps-build02-vmc-ci-openshift-org-ci-op-hq03lh3v-stable-sha256-72b6d67be00f7b3b2f2e00ac24632b102079b7d2c4f4f1effd38295b199fa19f] $ tree -L 2 namespaces/openshift-cluster-api/infrastructure.cluster.x-k8s.io/
namespaces/openshift-cluster-api/infrastructure.cluster.x-k8s.io/
└── vsphereclusters
    └── ci-op-hq03lh3v-6e495-d7vzg.yaml

2 directories, 1 file

@honza
Copy link
Member Author

honza commented May 12, 2025

What do you mean by "vsphere is the same"? I didn't notice anything peculiar in the vpshere gathering

I was distracted by the missing infra machines in the gather. It looks like we're good to go?

@damdo
Copy link
Member

damdo commented May 12, 2025

Yeah at the moment I would only expect AWS to still have machines at the end of the e2e-capi-techpreview test when the must-gather runs.

The e2e's for all of the other platforms do clean up the machines/machinesets they create during the test before considering it passed.

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/unhold

/lgtm

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2025
@damdo
Copy link
Member

damdo commented May 12, 2025

Thanks for working on this @honza 🎉

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2025
@honza
Copy link
Member Author

honza commented May 12, 2025

/test e2e-aws-ovn-serial

@damdo
Copy link
Member

damdo commented May 12, 2025

The serial job passed but timed out on tear-down hitting the 4h mark: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/267/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-ovn-serial/1920845242664226816

I am happy to override it

/override ci/prow/e2e-aws-ovn-serial

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 12, 2025

@damdo: Overrode contexts on behalf of damdo: ci/prow/e2e-aws-ovn-serial

Details

In response to this:

The serial job passed but timed out on tear-down hitting the 4h mark: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/267/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-ovn-serial/1920845242664226816

I am happy to override it

/override ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@damdo
Copy link
Member

damdo commented May 12, 2025

/retest

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD d131690 and 2 for PR HEAD 706d6f9 in total

@damdo
Copy link
Member

damdo commented May 13, 2025

The serial job passed but timed out on tear-down hitting the 4h mark: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/267/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-ovn-serial/1920845242664226816

I am happy to override it

/override ci/prow/e2e-aws-ovn-serial

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 13, 2025

@damdo: Overrode contexts on behalf of damdo: ci/prow/e2e-aws-ovn-serial

Details

In response to this:

The serial job passed but timed out on tear-down hitting the 4h mark: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/267/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-ovn-serial/1920845242664226816

I am happy to override it

/override ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@damdo
Copy link
Member

damdo commented May 13, 2025

/test unit

1 similar comment
@damdo
Copy link
Member

damdo commented May 13, 2025

/test unit

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 13, 2025

@honza: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 39748f2 into openshift:main May 13, 2025
23 checks passed
@damdo
Copy link
Member

damdo commented May 13, 2025

/cherry-pick release-4.19

@openshift-cherrypick-robot

@damdo: new pull request created: #296

Details

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@damdo
Copy link
Member

damdo commented May 13, 2025

/retitle OCPBUGS-56104,OCPCLOUD-2893: Add related objects to must-gather config

@openshift-ci openshift-ci bot changed the title OCPCLOUD-2893: Add related objects to must-gather config OCPBUGS-56104,OCPCLOUD-2893: Add related objects to must-gather config May 13, 2025
@openshift-ci-robot
Copy link

@honza: Jira Issue OCPBUGS-56104: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-56104 has been moved to the MODIFIED state.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-capi-operator
This PR has been included in build ose-cluster-capi-operator-container-v4.20.0-202505140744.p0.g39748f2.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants