[Dual-ToR] mux container can't restart due to high CPU usage after 'config reload -y -f' when feature autorestart is disabled #20414

Open
ayurkiv-nvda opened this issue Oct 3, 2024 · 1 comment
ayurkiv-nvda commented Oct 3, 2024

Description

The mux container fails to restart under high CPU load after 'config reload -y -f' when feature autorestart is disabled for mux.

Steps to reproduce the issue:

  1. Disable autorestart for the mux feature:
    config feature autorestart mux disabled
  2. Save the configuration:
    config save -y
  3. Run the following bash script:
# Number of virtual CPUs
host_vcpus=8

# Start CPU load on all CPUs
echo "Generating high CPU load..."
for i in $(seq 1 $host_vcpus); do
    nohup yes > /dev/null 2>&1 &   # Start 'yes' command in background
    sleep 1                        # Pause for 1 second between each process start
done

# Reload configuration
echo "Reloading configuration..."
config reload -y -f
sleep 120                         # Wait until all containers are up

echo "Killing CPU load processes..."
killall yes

echo "CPU load processes terminated."

  The script was taken from the sonic-mgmt test pc/test_po_cleanup.py::test_po_cleanup_after_reload, which reproduces the issue at a 100% rate.
  4. Check the container status:
    docker ps -a
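
The fixed 'sleep 120' in the script above could be replaced by polling the container state until a timeout expires. The following is a minimal sketch, assuming the docker CLI is available on the host; the function name wait_for_container and its default timeout and interval are hypothetical and not part of SONiC.

```shell
# Hypothetical helper: poll until a container reports State.Running == true.
# Usage: wait_for_container NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once the container is running, 1 if the timeout expires first.
wait_for_container() {
    name=$1
    timeout=${2:-120}
    interval=${3:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # 'docker inspect -f' prints "true" or "false" for .State.Running
        if [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

For example, 'wait_for_container mux 120 5 || echo "mux is not running"' would report the failure described in this issue without relying on a fixed delay.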

Describe the results you received:

root@sonic:/var/log# zgrep "Timed out waiting for tunnel MuxTunnel0\|Failed to start mux\|stuck in namespace 'host'\|Failed to start MUX Cable\|config reload -y" syslog
2024 Oct  3 12:35:30.272879 sonic NOTICE switch_hash: 'reload' executing with command: config reload -y -f
2024 Oct  3 12:37:52.992528 sonic ERR systemd[1]: Failed to start mux.service - MUX Cable Container.
2024 Oct  3 12:38:34.687532 sonic WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

Sometimes the error message 'ERR write_standby: Timed out waiting for tunnel MuxTunnel0, mux state will not be written' can also be seen.

The mux container always stays down:

 docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS                     PORTS     NAMES
4d12bfe1c78c   docker-snmp:latest                   "/usr/local/bin/supe…"   13 minutes ago   Up 2 minutes                         snmp
cc16c2bb6df8   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   14 minutes ago   Up 2 minutes                         mgmt-framework
7642c698c649   docker-lldp:latest                   "/usr/bin/docker-lld…"   14 minutes ago   Up 2 minutes                         lldp
226162f59821   docker-sonic-gnmi:latest             "/usr/local/bin/supe…"   14 minutes ago   Up 2 minutes                         gnmi
ace679a74cc9   docker-mux:latest                    "/usr/bin/docker-ini…"   14 minutes ago   Exited (0) 5 minutes ago             mux
d63bc1a9fa7c   642c1fefce18                         "/usr/bin/docker_ini…"   14 minutes ago   Up 3 minutes                         dhcp_relay
7e8171f39028   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   15 minutes ago   Up 3 minutes                         pmon
33a1e000fa15   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   15 minutes ago   Up 3 minutes                         radv
4a170f6648f2   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   15 minutes ago   Up 3 minutes                         syncd
0a0649f9d752   docker-teamd:latest                  "/usr/local/bin/supe…"   15 minutes ago   Up 4 minutes                         teamd
4e798ffe561a   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   15 minutes ago   Up 4 minutes                         bgp
003a50c85591   docker-orchagent:latest              "/usr/bin/docker-ini…"   15 minutes ago   Up 4 minutes                         swss
934bf78c50e2   docker-eventd:latest                 "/usr/local/bin/supe…"   15 minutes ago   Up 4 minutes                         eventd
331c59132b5d   docker-database:latest               "/usr/local/bin/dock…"   16 minutes ago   Up 16 minutes                        database
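
A listing like the one above can also be checked mechanically. The following is a minimal sketch, assuming the output is produced with 'docker ps -a --format' and a '|' separator between name and status; the function name list_down_containers is hypothetical.

```shell
# Hypothetical helper: read "NAME|STATUS" lines on stdin and print the names
# of containers whose status does not start with "Up".
# Intended input: docker ps -a --format '{{.Names}}|{{.Status}}'
list_down_containers() {
    awk -F'|' '$2 !~ /^Up/ { print $1 }'
}
```

Running 'docker ps -a --format "{{.Names}}|{{.Status}}" | list_down_containers' on the system shown above would be expected to print only mux, since it is the only container with an 'Exited' status.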

Describe the results you expected:

The mux container should be up and running.

Output of show version:

SONiC Software Version: SONiC.202405.4-efde59958_Internal
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-11-2-amd64
Build commit: efde59958
Build date: Mon Aug 19 13:36:58 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-24

Also reproduced on the latest 202405_RC.21.


@judyjoseph judyjoseph added Triaged this issue has been triaged MSFT labels Oct 9, 2024

zjswhhh commented Nov 4, 2024

Hi @ayurkiv-nvda, I'm trying to understand the context here. Wouldn't this be expected when autorestart is disabled?

Was it a test failure you were debugging? What was the scenario?
