Add journalctl logs to failed systemctl commands #17659

Merged
merged 1 commit into kubernetes:master from addJournalctlLogs
Nov 28, 2023

Conversation

Member

@spowelljr spowelljr commented Nov 22, 2023

Problem

Users (and our testing infra) occasionally run into flaky systemctl-related failures when docker and cri-docker are being restarted. The error from systemctl doesn't include any useful information; it only says to run journalctl -xeu <service> to get the error logs. In our testing infra this is impossible because the clusters are already deleted by the time the test run completes, and because of the flakiness it's hard to reproduce a failure, so users usually aren't able to generate the logs either. We don't know the cause of these errors, so we can't debug further.

Example:

😄  minikube v1.31.2 on
✨  Using the docker driver based on user configuration
💨  For improved Docker performance, enable the overlay Linux kernel module using 'modprobe overlay'
❗  docker is currently using the btrfs storage driver, setting preload=false
📌  Using Docker driver with root privileges
👍  Starting control plane node minikube in cluster minikube
🚜  Pulling base image ...
🔥  Creating docker container (CPUs=2, Memory=3900MB) ...

❌  Exiting due to RUNTIME_ENABLE: Failed to enable container runtime: sudo systemctl restart docker: Process exited with status 1
stdout:

stderr:
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xeu docker.service" for details.


╭───────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                           │
│    😿  If the above advice does not help, please let us know:                             │
│    👉  https://github.com/kubernetes/minikube/issues/new/choose                           │
│                                                                                           │
│    Please run `minikube logs --file=logs.txt` and attach logs.txt to the GitHub issue.    │
│                                                                                           │
╰───────────────────────────────────────────────────────────────────────────────────────────╯

Solution

I've wrapped the major systemctl commands with a function that checks whether the command succeeded. If it did, nothing changes. If the systemctl command failed, we call sudo journalctl --no-pager -u <service> to get the logs and append them to the error, so no extra work is needed to get the systemctl error logs.
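
For readers who want to see the shape of the idea, here is a minimal sketch of such a wrapper. It is not the merged diff: the `runner` interface, `systemctlWithLogs`, and `fakeRunner` are hypothetical names made up for illustration, and the log line returned by the fake is invented. Only the `sudo systemctl ...` and `sudo journalctl --no-pager -u <service>` commands come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// runner is a hypothetical stand-in for whatever executes commands on the node.
type runner interface {
	Run(name string, args ...string) (string, error)
}

// systemctlWithLogs runs "sudo systemctl <action> <service>".
// On success it returns nil and runs nothing else.
// On failure it fetches the service's journal and appends it to the error.
func systemctlWithLogs(r runner, action, service string) error {
	_, err := r.Run("sudo", "systemctl", action, service)
	if err == nil {
		return nil
	}
	logs, logErr := r.Run("sudo", "journalctl", "--no-pager", "-u", service)
	if logErr != nil {
		// Keep the original error even if the logs could not be collected.
		return fmt.Errorf("%w\nfailed to get journalctl logs: %v", err, logErr)
	}
	return fmt.Errorf("%w\njournalctl logs for %s:\n%s", err, service, logs)
}

// fakeRunner is an illustrative test double: it records every command and
// can be told to fail systemctl calls. Its journalctl output is made up.
type fakeRunner struct {
	failSystemctl bool
	calls         []string
}

func (f *fakeRunner) Run(name string, args ...string) (string, error) {
	cmd := strings.Join(append([]string{name}, args...), " ")
	f.calls = append(f.calls, cmd)
	if f.failSystemctl && len(args) > 0 && args[0] == "systemctl" {
		return "", errors.New(cmd + ": Process exited with status 1")
	}
	if len(args) > 0 && args[0] == "journalctl" {
		return "dockerd: example log line", nil
	}
	return "", nil
}

func main() {
	r := &fakeRunner{failSystemctl: true}
	// Prints the systemctl error followed by the appended journal output.
	fmt.Println(systemctlWithLogs(r, "restart", "docker"))
}
```

Note that the failure path issues one extra command (the journalctl call), so any test double that expects an exact command sequence has to account for it.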

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 22, 2023
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2023
@spowelljr
Member Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 22, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: medyagh, spowelljr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@medyagh
Member

medyagh commented Nov 23, 2023

/ok-to-test

@minikube-pr-bot

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17659) |
+----------------+----------+---------------------+
| minikube start | 50.8s    | 52.1s               |
| enable ingress | 27.1s    | 27.0s               |
+----------------+----------+---------------------+

Times for minikube start: 50.9s 49.7s 50.9s 48.3s 54.2s
Times for minikube (PR 17659) start: 50.7s 52.3s 54.4s 52.5s 50.4s

Times for minikube ingress: 26.7s 28.2s 28.2s 25.1s 27.2s
Times for minikube (PR 17659) ingress: 27.6s 26.6s 28.7s 28.1s 24.2s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17659) |
+----------------+----------+---------------------+
| minikube start | 23.0s    | 23.7s               |
| enable ingress | 20.8s    | 21.3s               |
+----------------+----------+---------------------+

Times for minikube ingress: 20.8s 20.8s 18.9s 22.8s 20.8s
Times for minikube (PR 17659) ingress: 21.8s 20.9s 20.8s 20.8s 22.4s

Times for minikube start: 24.2s 22.3s 21.9s 24.6s 22.0s
Times for minikube (PR 17659) start: 22.7s 21.9s 24.1s 25.3s 24.4s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17659) |
+----------------+----------+---------------------+
| minikube start | 23.3s    | 21.9s               |
| enable ingress | 26.6s    | 29.3s               |
+----------------+----------+---------------------+

Times for minikube ingress: 19.4s 19.3s 31.3s 31.3s 31.4s
Times for minikube (PR 17659) ingress: 32.3s 31.3s 19.4s 32.3s 31.3s

Times for minikube start: 24.6s 23.6s 23.5s 21.4s 23.4s
Times for minikube (PR 17659) start: 20.1s 20.9s 23.5s 23.7s 21.4s

@minikube-pr-bot

These are the flake rates of all failed tests.

+--------------+-------------------------------------------+----------------+
| ENVIRONMENT  | FAILED TESTS                              | FLAKE RATE (%) |
+--------------+-------------------------------------------+----------------+
| Docker_macOS | TestMountStart/serial/VerifyMountPostStop | 7.75           |
| Docker_macOS | TestCertOptions                           | 13.33          |
| Docker_macOS | TestDockerFlags                           | 13.33          |
| Docker_macOS | TestForceSystemdEnv                       | 13.33          |
| Docker_macOS | TestForceSystemdFlag                      | 13.33          |
| Docker_macOS | TestInsufficientStorage                   | 15.03          |
| Docker_macOS | TestMultiNode/serial/AddNode              | 15.03          |
| Docker_macOS | TestMultiNode/serial/CopyFile             | 15.03          |
| Docker_macOS | TestMultiNode/serial/DeleteNode           | 15.03          |
| Docker_macOS | TestMultiNode/serial/DeployApp2Nodes      | 15.03          |
| Docker_macOS | TestMultiNode/serial/FreshStart2Nodes     | 15.03          |
| Docker_macOS | TestMultiNode/serial/PingHostFrom2Pods    | 15.03          |
| Docker_macOS | TestMultiNode/serial/ProfileList          | 15.03          |
| Docker_macOS | TestMultiNode/serial/RestartKeepsNodes    | 15.03          |
| Docker_macOS | TestMultiNode/serial/RestartMultiNode     | 15.03          |
| Docker_macOS | TestMultiNode/serial/StartAfterStop       | 15.03          |
| Docker_macOS | TestMultiNode/serial/StopMultiNode        | 15.03          |
| Docker_macOS | TestMultiNode/serial/StopNode             | 15.03          |
| Docker_macOS | TestOffline                               | 15.03          |
| Docker_macOS | TestScheduledStopUnix                     | 15.03          |
| Docker_macOS | TestSkaffold                              | 15.03          |
+--------------+-------------------------------------------+----------------+

To see the flake rates of all tests by environment, click here.

@prezha
Contributor

prezha commented Nov 28, 2023

@spowelljr i think this is a good call - at least until we get to the bottom of these flakes
we should fix the unit tests, though, as they are panicking now b/c of the additional (unexpected) command (the journalctl)
i remember seeing some cr-related services giving up after three retries because the dependent service was not yet up

@medyagh medyagh merged commit 3992eaf into kubernetes:master Nov 28, 2023
33 of 50 checks passed
@spowelljr spowelljr deleted the addJournalctlLogs branch November 28, 2023 18:37