add pod log streaming to monitor for etcd so we see all intervals #28243

Merged


@deads2k deads2k commented Sep 5, 2023

This ought to handle:

  1. pod creation
  2. pod deletion
  3. pod restart
  4. pod replacement (static pods)
  5. kubelet restart
  6. node restart
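
A minimal sketch of how a watcher map could track some of these cases, assuming watchers are keyed by namespace, name, and UID. The `podKey` fields, the `stop` hook, and `reconcile` are hypothetical and are not taken from the PR. The sketch covers only creation, deletion, and static pod replacement; container, kubelet, and node restarts would also need the log stream to be reopened when it ends.

```go
package main

import "fmt"

// podKey identifies one incarnation of a pod. Including the UID is an
// assumption of this sketch: a replaced static pod keeps its name but gets a
// new UID, so it maps to a new key.
type podKey struct {
	namespace, name, uid string
}

// watcher stands in for a running log stream; stop is a hypothetical hook.
type watcher struct {
	stopped bool
}

func (w *watcher) stop() { w.stopped = true }

// reconcile starts a watcher for every current pod that lacks one and stops
// watchers whose pod is gone. It returns the keys it started and stopped.
func reconcile(watchers map[podKey]*watcher, current []podKey) (started, stopped []podKey) {
	seen := map[podKey]bool{}
	for _, k := range current {
		seen[k] = true
		if _, ok := watchers[k]; !ok {
			watchers[k] = &watcher{}
			started = append(started, k)
		}
	}
	for k, w := range watchers {
		if !seen[k] {
			w.stop()
			delete(watchers, k)
			stopped = append(stopped, k)
		}
	}
	return started, stopped
}

func main() {
	watchers := map[podKey]*watcher{}
	// Namespace and pod names are illustrative only.
	old := podKey{"openshift-etcd", "etcd-master-0", "uid-1"}
	reconcile(watchers, []podKey{old})

	// Static pod replacement: same name, new UID.
	replacement := podKey{"openshift-etcd", "etcd-master-0", "uid-2"}
	started, stopped := reconcile(watchers, []podKey{replacement})
	fmt.Println(len(started), len(stopped), len(watchers)) // prints: 1 1 1
}
```

Keying on the UID means a replacement shows up as one stopped watcher and one started watcher instead of being mistaken for the same pod.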

@openshift-ci openshift-ci bot requested review from jwforres and spadgett September 5, 2023 22:16
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 5, 2023
@openshift-trt-bot

Job Failure Risk Analysis for sha: f44ee81

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-openstack-ovn IncompleteTests
Tests for this run (23) are below the historical average (1239): IncompleteTests
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade IncompleteTests
Tests for this run (102) are below the historical average (594): IncompleteTests
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd IncompleteTests
Tests for this run (27) are below the historical average (584): IncompleteTests
pull-ci-openshift-origin-master-e2e-metal-ipi-sdn Medium
[sig-api-machinery] Aggregator Should be able to support the 1.17 Sample API Server using the current Aggregator [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
This test has passed 96.98% of 2520 runs on release 4.14 [Overall] in the last week.
pull-ci-openshift-origin-master-e2e-gcp-csi Medium
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers
This test has passed 90.32% of 31 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-ovn-csi'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Medium
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 86.67% of 30 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-single-node-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node Medium
[sig-arch] events should not repeat pathologically for ns/openshift-console-operator
This test has passed 96.15% of 26 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-single-node'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-csi Medium
External Storage [Driver: ebs.csi.aws.com] [Testpattern: Ephemeral Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works, check deletion (ephemeral)
This test has passed 96.88% of 32 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-csi'] in the last 14 days.

@deads2k deads2k force-pushed the invariant-63-stream-logs branch from 4648b43 to f086693 Compare September 6, 2023 19:34
c.watchers = map[podKey]*watcher{}
}

// Run starts the controller and blocks until stopCh is closed.

Nit: I realize this is probably boilerplate, but stopCh here refers to ctx.Done(), right? I thought it was finishedCleanup, but I think that is just the signal for a clean shutdown.
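
The distinction raised here can be sketched as follows, assuming `Run` takes a context and the controller closes a `finishedCleanup` channel when teardown is done. The `controller` type, `newController`, and `demo` are hypothetical stand-ins, not the PR's code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// controller is a hypothetical stand-in for the log-streaming controller.
type controller struct {
	// finishedCleanup is closed once Run has finished tearing down.
	finishedCleanup chan struct{}
}

func newController() *controller {
	return &controller{finishedCleanup: make(chan struct{})}
}

// Run blocks until ctx is cancelled, then cleans up and signals completion.
func (c *controller) Run(ctx context.Context) {
	defer close(c.finishedCleanup)
	<-ctx.Done() // the stop signal
	// Cleanup, such as stopping watchers, would happen here.
}

// demo starts the controller, cancels its context, and reports what happened.
func demo() string {
	ctx, cancel := context.WithCancel(context.Background())
	c := newController()
	go c.Run(ctx)

	cancel() // ask the controller to stop
	select {
	case <-c.finishedCleanup:
		return "clean shutdown"
	case <-time.After(time.Second):
		return "cleanup did not finish in time"
	}
}

func main() {
	fmt.Println(demo()) // prints: clean shutdown
}
```

In this shape, `ctx.Done()` is the input that tells `Run` to stop, while `finishedCleanup` is the output that tells callers teardown has completed.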

Comment on lines +226 to +227
// TODO set a timeout?
c.removeAllWatchers(context.TODO())

10 seconds? Should be enough for all the pods in the etcd namespace?

ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
c.removeAllWatchers(ctx)

Although the cluster is going away so it may not matter if we don't finish cleanup.
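
A minimal sketch of what a deadline-bounded cleanup might look like. `removeAll` is a hypothetical stand-in for `removeAllWatchers`, and modelling each watcher as a stop function is an assumption made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// removeAll is a hypothetical stand-in for removeAllWatchers: it runs each
// stop function in turn but gives up as soon as ctx expires. It returns how
// many watchers were fully stopped.
func removeAll(ctx context.Context, stops []func()) (int, error) {
	for i, stop := range stops {
		done := make(chan struct{})
		go func() {
			defer close(done)
			stop()
		}()
		select {
		case <-done:
		case <-ctx.Done():
			return i, ctx.Err()
		}
	}
	return len(stops), nil
}

// cleanupWithin bounds the whole cleanup by timeout.
func cleanupWithin(timeout time.Duration, stops []func()) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return removeAll(ctx, stops)
}

func main() {
	fast := func() {}
	slow := func() { time.Sleep(time.Second) }
	n, err := cleanupWithin(50*time.Millisecond, []func(){fast, slow, fast})
	fmt.Println(n, err) // prints: 1 context deadline exceeded
}
```

The trade-off is that a watcher still stopping when the deadline passes is abandoned, which matches the point that unfinished cleanup may not matter once the cluster is going away.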


var (
// "raft.node: 38360899e3c7337e elected leader d8a2c1adbed17efe at term 6"
electedLeaderRegex = regexp.MustCompile("elected leader (?P<CURR_LEADER>[a-z0-9.-]+) at term (?P<TERM>[0-9]+)")

TIL named capture groups.
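
For readers new to them, named capture groups in Go's `regexp` package can be read back through `SubexpNames`. The sketch below reuses the pattern and sample log line quoted above; the `namedGroups` helper is hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as in the diff quoted above.
var electedLeaderRegex = regexp.MustCompile("elected leader (?P<CURR_LEADER>[a-z0-9.-]+) at term (?P<TERM>[0-9]+)")

// namedGroups returns the named capture groups of the first match of re in
// line, or nil if there is no match.
func namedGroups(re *regexp.Regexp, line string) map[string]string {
	match := re.FindStringSubmatch(line)
	if match == nil {
		return nil
	}
	groups := map[string]string{}
	// SubexpNames is index-aligned with match; unnamed groups and the whole
	// match at index 0 have an empty name.
	for i, name := range re.SubexpNames() {
		if name != "" {
			groups[name] = match[i]
		}
	}
	return groups
}

func main() {
	line := "raft.node: 38360899e3c7337e elected leader d8a2c1adbed17efe at term 6"
	groups := namedGroups(electedLeaderRegex, line)
	fmt.Println(groups["CURR_LEADER"], groups["TERM"]) // prints: d8a2c1adbed17efe 6
}
```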

@hasbro17 hasbro17 left a comment

/lgtm

Still grasping the interval builder bits, but the log streaming and matching look good.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 10, 2023
@openshift-ci

openshift-ci bot commented Sep 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 139702f and 2 for PR HEAD f086693 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 15fb406 and 1 for PR HEAD f086693 in total

@openshift-trt-bot

Job Failure Risk Analysis for sha: f086693

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-openstack-ovn High
[sig-installer][Suite:openshift/openstack][lb][Serial] The Openstack platform should limit service access on an UDP Amphora LoadBalancer when an UDP LoadBalancer svc setting the loadBalancerSourceRanges spec is created on Openshift
This test has passed 100.00% of 7 runs on release 4.15 [amd64 ha openstack ovn] in the last week.
---
[sig-installer][Suite:openshift/openstack][lb][Serial] The Openstack platform should create an UDP Amphora LoadBalancer using a pre-created FIP when an UDP LoadBalancer svc setting the LoadBalancerIP spec is created on Openshift
This test has passed 100.00% of 7 runs on release 4.15 [amd64 ha openstack ovn] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial High
[sig-auth][Feature:ProjectAPI][Serial] TestUnprivilegedNewProjectDenied [apigroup:authorization.openshift.io][apigroup:project.openshift.io] [Suite:openshift/conformance/serial]
This test has passed 100.00% of 32 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-cli] oc status can show correct status after switching between projects [apigroup:project.openshift.io][apigroup:image.openshift.io][Serial] [Suite:openshift/conformance/serial]
This test has passed 100.00% of 32 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-network] services when running openshift ipv4 cluster ensures external ip policy is configured correctly on the cluster [apigroup:config.openshift.io] [Serial] [Suite:openshift/conformance/serial]
This test has passed 100.00% of 32 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node Medium
[sig-network] can collect pod-to-host poller pod logs
This test has passed 97.94% of 4716 runs on release 4.15 [Overall] in the last week.
---
[sig-network] can collect host-to-host poller pod logs
This test has passed 97.90% of 4715 runs on release 4.15 [Overall] in the last week.

@deads2k deads2k force-pushed the invariant-63-stream-logs branch from f086693 to da60f02 Compare September 19, 2023 23:08
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2023
@openshift-ci

openshift-ci bot commented Sep 19, 2023

New changes are detected. LGTM label has been removed.

@deads2k deads2k added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2023
@deads2k

deads2k commented Sep 19, 2023

simple rebase, reapplying lgtm

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD ffd16ac and 2 for PR HEAD da60f02 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 90bcab0 and 1 for PR HEAD da60f02 in total

@deads2k

deads2k commented Sep 21, 2023

/retest-required

@openshift-ci

openshift-ci bot commented Sep 21, 2023

@deads2k: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-ovn | da60f02 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-ovn-upgrade | da60f02 | link | false | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-single-node-upgrade | da60f02 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-gcp-ovn-rt-upgrade | da60f02 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-csi | da60f02 | link | false | /test e2e-aws-csi |
| ci/prow/e2e-aws-ovn-single-node-serial | da60f02 | link | false | /test e2e-aws-ovn-single-node-serial |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a2d29ea and 0 for PR HEAD da60f02 in total

@openshift-trt-bot

Job Failure Risk Analysis for sha: da60f02

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial High
[sig-storage][Feature:DisableStorageClass][Serial][apigroup:operator.openshift.io] should not reconcile the StorageClass when StorageClassState is Unmanaged [Suite:openshift/conformance/serial]
This test has passed 100.00% of 31 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-imageregistry][Feature:ImageTriggers][Serial] ImageStream admission TestImageStreamAdmitStatusUpdate [apigroup:image.openshift.io] [Suite:openshift/conformance/serial]
This test has passed 100.00% of 31 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-openstack-ovn IncompleteTests
Tests for this run (97) are below the historical average (1338): IncompleteTests
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade IncompleteTests
Tests for this run (19) are below the historical average (661): IncompleteTests
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade IncompleteTests
Tests for this run (24) are below the historical average (609): IncompleteTests
pull-ci-openshift-origin-master-e2e-aws-csi IncompleteTests
Tests for this run (22) are below the historical average (599): IncompleteTests

@openshift-merge-robot openshift-merge-robot merged commit 086eac3 into openshift:master Sep 23, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
5 participants