fix: don't log NodeReady is true and false in same reconcile loop #3112

Merged
merged 3 commits into from
Sep 4, 2024

Conversation

ejweber
Contributor

@ejweber ejweber commented Aug 27, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#7738

What this PR does / why we need it:

Key change:

  • A defer statement and a nodeReady conditional variable now ensure we don't record ready (i.e. log and emit an event) during the check for a manager pod, only to record not ready later during the Kubernetes node status check.

Additional changes:

  • A nil pointer dereference used to be possible in this controller, though both its severity and the likelihood of hitting it were extremely low. (It probably required a longhorn-manager pod running on a node that was actively being removed from the cluster. In that situation, a crash is probably fine.) The dereference should no longer be possible.
  • The logic for setting Ready and Schedulable is now broken out into a separate function.
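
The key change can be illustrated with a minimal sketch. This is not the code from controller/node_controller.go; the `recorder` type, the `syncReady` function, and the message strings are hypothetical names chosen for illustration. It only shows the general pattern: each check may flip a `nodeReady` variable to false, and a single deferred function records the final state once per reconcile.

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for the controller's logger and
// event recorder; it only collects messages.
type recorder struct {
	events []string
}

func (r *recorder) record(msg string) {
	r.events = append(r.events, msg)
}

// syncReady is a hypothetical reduction of the readiness part of a
// reconcile loop. No check records anything itself; each check can only
// set nodeReady to false, and the deferred function records the result.
func syncReady(r *recorder, managerPodRunning, kubeNodeReady bool) bool {
	nodeReady := true
	reason := ""

	// Record exactly once, after all checks have run.
	defer func() {
		if nodeReady {
			r.record("node is ready")
			return
		}
		r.record("node not ready: " + reason)
	}()

	if !managerPodRunning {
		nodeReady = false
		reason = "manager pod is not running"
		return nodeReady
	}
	if !kubeNodeReady {
		nodeReady = false
		reason = "Kubernetes node is not ready"
	}
	return nodeReady
}

func main() {
	r := &recorder{}
	// Manager pod is running, but the Kubernetes node is not ready.
	ready := syncReady(r, true, false)
	fmt.Println(ready, r.events)
}
```

In this sketch, a running manager pod combined with a not-ready Kubernetes node produces one "not ready" record instead of a "ready" record followed by a "not ready" record.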

@ejweber
Contributor Author

ejweber commented Aug 27, 2024

Before:

  1. Install Longhorn.
  2. Watch longhorn-manager logs.
  3. SSH to a Longhorn node.
  4. Stop kubelet.
  5. Wait for confusing logs that show Ready fluctuating between true and false (after about thirty seconds).
root@eweber-v126-worker-9c1451b4-6464j:~# systemctl stop k3s-agent

eweber@laptop:~/longhorn> kubetail -n longhorn-system -l app=longhorn-manager | grep Ready
[longhorn-manager-7dp4j longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617346\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-bth5g longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617346\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-7dp4j longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617346\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-7dp4j longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617771\", FieldPath:\"\"}): type: 'Normal' reason: 'Ready' Node eweber-v126-worker-9c1451b4-6464j is ready" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-7dp4j longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617771\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-bth5g longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617771\", FieldPath:\"\"}): type: 'Normal' reason: 'Ready' Node eweber-v126-worker-9c1451b4-6464j is ready" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-bth5g longhorn-manager] time="2024-08-26T22:05:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145617771\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"

After:

  1. Install Longhorn.
  2. Watch longhorn-manager logs.
  3. SSH to a Longhorn node.
  4. Stop kubelet.
  5. Only see logs that show Ready go to false (after about thirty seconds).
  6. Start kubelet.
  7. Verify the Longhorn node shows ready again.
root@eweber-v126-worker-9c1451b4-6464j:~# systemctl stop k3s-agent

eweber@laptop:~/longhorn> kubetail -n longhorn-system -l app=longhorn-manager | grep Ready
[longhorn-manager-88hwg longhorn-manager] time="2024-08-27T15:23:03Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145933310\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
[longhorn-manager-k2lpp longhorn-manager] time="2024-08-27T15:23:03Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"eweber-v126-worker-9c1451b4-6464j\", UID:\"04801ad4-91bb-480b-94e4-86638500369d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"145933310\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node eweber-v126-worker-9c1451b4-6464j not ready: NodeStatusUnknown" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"

james-munson
james-munson previously approved these changes Aug 27, 2024
Contributor

@james-munson james-munson left a comment

This has bugged me in RWX HA testing, too. Glad to see it addressed. LGTM.
(I assume that when kubelet is restored, or the node restarts, it does return to Ready.)

@ejweber
Contributor Author

ejweber commented Aug 29, 2024

It looks like the CI failure is legitimate. I will investigate.

It looks like the CI failure is a result of expecting the duplicate logging. I am reworking the tests slightly.

@ejweber
Contributor Author

ejweber commented Aug 29, 2024

(I assume that when kubelet is restored, or the node restarts, it does return to Ready.)

It does, but it is good to add this to the test plan above. I will do that.

james-munson
james-munson previously approved these changes Aug 29, 2024
derekbit
derekbit previously approved these changes Sep 2, 2024
Member

@derekbit derekbit left a comment

LGTM

controller/node_controller.go (outdated, resolved)
controller/node_controller.go (outdated, resolved)
controller/node_controller.go (resolved)
@ejweber ejweber marked this pull request as draft September 3, 2024 20:09
@ejweber ejweber dismissed stale reviews from derekbit and james-munson via b845eeb September 3, 2024 20:09
Longhorn 7738

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 7738

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 7738

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber marked this pull request as ready for review September 3, 2024 21:34
Member

@derekbit derekbit left a comment

LGTM

@derekbit derekbit merged commit 9ab53c8 into longhorn:master Sep 4, 2024
8 checks passed
@ejweber
Contributor Author

ejweber commented Sep 4, 2024

@mergify backport v1.7.x v1.6.x

mergify bot commented Sep 4, 2024

backport v1.7.x v1.6.x

✅ Backports have been created
