Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDN-5568: only provision workloads when network creation has started #29348

Conversation

maiqueb
Copy link
Contributor

@maiqueb maiqueb commented Dec 5, 2024

No description provided.

@openshift-ci openshift-ci bot requested review from knobunc and trozet December 5, 2024 18:15
@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 5, 2024

/test e2e-aws-ovn-virt-techpreview

@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch from 47890fa to 7542b3d Compare December 9, 2024 15:30
udnManifest := generateUserDefinedNetworkManifest(&c)
By(fmt.Sprintf("Creating UserDefinedNetwork %s/%s", c.namespace, c.name))
Expect(applyManifest(c.namespace, udnManifest)).To(Succeed())
Expect(waitForUserDefinedNetworkReady(c.namespace, c.name, udnCrReadyTimeout)).To(Succeed())
return networkAttachmentConfig{networkAttachmentConfigParams{networkName: c.namespace + "." + c.name}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormergi how can I have access to the name of the network ?

Right now, the only thing I think I can do is to read the corresponding NAD, unmarshal, and return that ...

Is there something simpler ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npinaeva how about you ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetched it from the NAD generated for the UDN.

I think we can assume the name is the same (i.e. NAD.metadata.Name == UDN.metadata.Name)

If that changes, we'll need to be able to derive it somehow. Not thinking much about that, because the change must be backwards compatible.

@npinaeva

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would fetch the NAD that created by the UDN controller and parse the network-name it got.
Doing so ensure tests wont break even if network-name will be changed, no need to go back to tests and do some adjustments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that would make sense. I have the owner reference for that, right ? @ormergi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can also fetch the NAD with the meta.name as the UDN object as well

@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 9, 2024

/test e2e-aws-ovn-virt-techpreview

@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch 2 times, most recently from 3961085 to f009222 Compare December 9, 2024 16:55
@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 9, 2024

/test e2e-aws-ovn-virt-techpreview

Copy link

openshift-trt bot commented Dec 9, 2024

Job Failure Risk Analysis for sha: f009222

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (2051): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (22) are below the historical average (425): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (22) are below the historical average (949): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd Medium
[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info
This test has passed 93.44% of 61 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-network][OCPFeatureGate:PersistentIPsForVirtualization][Feature:Layer2LiveMigration] Kubevirt Virtual Machines when using openshift ovn-kubernetes with user defined networks and persistent ips configured created using UserDefinedNetwork [Suite:openshift/network/virtualization] should keep ip when the VM attached to a primary UDN is restarted
This test has passed 75.00% of 4 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 4 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 10, 2024

/test e2e-aws-ovn-virt-techpreview

@deads2k
Copy link
Contributor

deads2k commented Dec 10, 2024

/hold

fix the product, not the test.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2024
Copy link

openshift-trt bot commented Dec 10, 2024

Job Failure Risk Analysis for sha: 3e93667

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (1692): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (22) are below the historical average (363): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (22) are below the historical average (807): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 5 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

Comment on lines 486 to 507
out, err = runOcWithRetry(oc.AsAdmin(), "get",
"pods",
"-o", "name",
"-n", "openshift-ovn-kubernetes",
"--field-selector", fmt.Sprintf("spec.nodeName=%s", nodeName),
"-l", "app=ovnkube-node")
if err != nil {
return false, err
}
outReader := bufio.NewScanner(strings.NewReader(out))
re := regexp.MustCompile("^pod/(.*)")
for outReader.Scan() {
match := re.FindSubmatch([]byte(outReader.Text()))
if len(match) != 2 {
continue
}
podName = string(match[1])
break
}
if podName == "" {
return false, fmt.Errorf("could not find a valid ovnkube-node pod on node '%s'", nodeName)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this code looks for ovnkube-node pod name, cant it be done using the programmatic go-client (similar to line 508)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can, but I just copied this code from somewhere else - I still need to make this coalesce.

And still, I guess staff engineering opposes this PR: #29348 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the code. At least now it's clear I'm re-using existent code that can be improved on a follow-up.

@deads2k
Copy link
Contributor

deads2k commented Dec 12, 2024

And still, I guess staff engineering opposes this PR: #29348 (comment)

How about creating a copy of this test that is showing the UDN problem and making it minimum reproducer. Then gate that test on UDN. This ensures that the failure mode is obvious and we don't lose the signal. With the signal preserved, it is then ok to adjust other tests to compensate for a product bug because the signal won't be lost and will block the promotion of UDN.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Otherwise, we risk facing a situation where the primary UDN will not be
plumbed into the worlkload.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
This way, we can better debug when things go wrong.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
This way, we can re-use this logic in other places without duplicating
code

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch from 3e93667 to f2451e8 Compare December 13, 2024 16:18
@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 16, 2024

/test e2e-aws-ovn-virt-techpreview

How about creating a copy of this test that is showing the UDN problem and making it minimum reproducer. Then gate that test on UDN. This ensures that the failure mode is obvious and we don't lose the signal. With the signal preserved, it is then ok to adjust other tests to compensate for a product bug because the signal won't be lost and will block the promotion of UDN.

Did this in the top most commit: Isolated the raceful conditions in a test tagged w/ the UDN FG.
The other tests (not this newly introduced test) compensate for the product bug, but this particular one doesn't.

@deads2k

Let's see if CI likes the patch ...

Copy link

openshift-trt bot commented Dec 16, 2024

Job Failure Risk Analysis for sha: 5fd52b8

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (822): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 17, 2024

@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch from 5fd52b8 to db5ca09 Compare December 17, 2024 17:11
… race is avoided

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch from db5ca09 to 69491f0 Compare December 17, 2024 18:06
Copy link
Contributor

openshift-ci bot commented Dec 17, 2024

@maiqueb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-virt-techpreview 5fd52b8 link false /test e2e-aws-ovn-virt-techpreview
ci/prow/e2e-aws-ovn-single-node-upgrade 69491f0 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-single-node-serial 69491f0 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 69491f0 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/okd-scos-e2e-aws-ovn 69491f0 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@deads2k
Copy link
Contributor

deads2k commented Dec 18, 2024

Thanks for adding the targetted test

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 18, 2024
@deads2k
Copy link
Contributor

deads2k commented Dec 18, 2024

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2024
@maiqueb maiqueb changed the title Only provision workloads when network creation has started SDN-5568: only provision workloads when network creation has started Dec 18, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 18, 2024

@maiqueb: This pull request references SDN-5568 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 18, 2024
@knobunc
Copy link
Contributor

knobunc commented Dec 18, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 18, 2024
Copy link
Contributor

openshift-ci bot commented Dec 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, knobunc, maiqueb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD edec610 and 2 for PR HEAD 69491f0 in total

Copy link

openshift-trt bot commented Dec 18, 2024

Job Failure Risk Analysis for sha: 69491f0

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (442): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 57.14% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@openshift-merge-bot openshift-merge-bot bot merged commit e3f94ae into openshift:master Dec 19, 2024
23 of 29 checks passed
@maiqueb maiqueb deleted the only-provision-workloads-when-network-creation-has-started branch December 19, 2024 11:18
@jluhrsen
Copy link
Contributor

jluhrsen commented Jan 9, 2025

/cherry-pick release-4.18

@openshift-cherrypick-robot

@jluhrsen: new pull request created: #29413

In response to this:

/cherry-pick release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants