Description
Is it platform specific
generic
Importance or Severity
High
Description of the bug
The system-health service (healthd) is responsible for setting SYSTEM_READY|SYSTEM_STATE to "UP" in Redis STATE_DB. It does this by:
1. Monitoring all services listed in the FEATURE table (CONFIG_DB) that have check_up_status=true
2. Checking their status in ALL_SERVICE_STATUS (STATE_DB)
3. Only setting SYSTEM_READY to "UP" when all monitored services report status="Up"
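The three steps above can be sketched as a small pure function. This is a simplified illustration, not the actual healthd code: the function name `system_ready_state` and the plain-dict inputs are hypothetical stand-ins for the CONFIG_DB and STATE_DB reads, and the `service_status` field name is taken from the ALL_SERVICE_STATUS output quoted later in this report.

```python
def system_ready_state(feature_table, all_service_status):
    """Return (state, down_services) for the readiness rule described above.

    feature_table:      dict mirroring FEATURE entries in CONFIG_DB (assumed shape)
    all_service_status: dict mirroring ALL_SERVICE_STATUS entries in STATE_DB
    """
    # Step 1: only services with check_up_status=true are monitored.
    monitored = [
        name for name, cfg in feature_table.items()
        if str(cfg.get("check_up_status", "false")).lower() == "true"
    ]
    # Step 2: look up each monitored service; a missing entry counts as not Up.
    down = sorted(
        name for name in monitored
        if all_service_status.get(name, {}).get("service_status") != "Up"
    )
    # Step 3: the system is "UP" only when nothing monitored is down.
    return ("UP" if not down else "DOWN"), down


# Illustrative data: one healthy service plus gnoi-shutdown on a non-DPU device.
feature = {
    "swss": {"check_up_status": "true"},
    "gnoi-shutdown": {"check_up_status": "true"},
    "telemetry": {"check_up_status": "false"},
}
status = {
    "swss": {"service_status": "Up"},
    "gnoi-shutdown": {"service_status": "Down", "fail_reason": "exec-condition"},
}
state, down = system_ready_state(feature, status)
print(state, down)
```

With this sample data a single monitored service in a Down state is enough to keep the overall result at "DOWN", which is the behavior reported here.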
The gnoi-shutdown service is intended only for DPU devices, but on non-DPU devices it shows up in a bad state, so healthd never sets SYSTEM_READY to "UP".
This bug was discovered while debugging flaky sflow tests in sonic-mgmt. The tests would pass when run with sufficient delay after boot, but fail when run immediately, because hsflowd was waiting for SYSTEM_READY.
Investigation revealed that gnoi-shutdown is only relevant for SmartSwitch platforms with DPUs, but is being monitored for system readiness on all platforms.
From sonic-buildimage/src/system-health/health_checker/sysmonitor.py:
def get_all_system_status(self):
    """ Shows the system ready status"""
    scan_srv_list = []
    scan_srv_list = self.get_all_service_list()
    for service in scan_srv_list:
        ustate = self.get_unit_status(service)
        if ustate == "NOT OK":
            if service not in self.dnsrvs_name:
                self.dnsrvs_name.add(service)
    if len(self.dnsrvs_name) == 0:
        return "UP"
    else:
        return "DOWN"
The gnoi-shutdown service has an ExecCondition that only allows it to run on SmartSwitch platforms with DPUs.
From sonic-buildimage/src/sonic-host-services/data/debian/sonic-host-services-data.gnoi-shutdown.service:
[Service]
Type=simple
ExecCondition=/usr/bin/python3 /usr/local/bin/check_platform.py
ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py
On non-SmartSwitch platforms:
1. ExecCondition=/usr/bin/python3 /usr/local/bin/check_platform.py fails (returns non-zero exit code)
2. systemd marks the service as failed with fail_reason="exec-condition"
3. system-health sees gnoi-shutdown in ALL_SERVICE_STATUS with:
`app_ready_status: Down`
`fail_reason: exec-condition`
`service_status: Down`
4. system-health considers this a "NOT OK" service
5. system-health refuses to set SYSTEM_READY to "UP"
6. Services waiting for SYSTEM_READY (like hsflowd) are stuck
gnoi-shutdown is a platform-specific service designed for SmartSwitch DPU graceful shutdown. It should NOT be monitored for system readiness on platforms where it cannot run.
The service has a valid ExecCondition that prevents it from running on incompatible platforms, but system-health treats this as a service failure rather than a platform-specific exclusion.
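One possible direction, shown only as a sketch, is to treat a service that was skipped by its ExecCondition as excluded from the readiness check rather than as failed. The names `EXCLUDED_FAIL_REASONS`, `is_blocking`, and `blocking_services` are hypothetical and do not exist in system-health; the field names come from the ALL_SERVICE_STATUS entry shown above. Whether `fail_reason` is the right signal to key on is an assumption that would need to be confirmed against the real code.

```python
# Hypothetical: fail reasons meaning "not applicable on this platform".
EXCLUDED_FAIL_REASONS = {"exec-condition"}


def is_blocking(entry):
    """Return True if this ALL_SERVICE_STATUS entry should hold back SYSTEM_READY."""
    is_up = (
        entry.get("service_status") == "Up"
        and entry.get("app_ready_status") == "Up"
    )
    if is_up:
        return False
    # A service skipped by its ExecCondition cannot run here, so do not wait for it.
    return entry.get("fail_reason") not in EXCLUDED_FAIL_REASONS


def blocking_services(all_service_status):
    """Return the sorted names of services that still block readiness."""
    return sorted(
        name for name, entry in all_service_status.items() if is_blocking(entry)
    )


# The gnoi-shutdown entry as reported on a non-SmartSwitch device.
gnoi_entry = {
    "app_ready_status": "Down",
    "fail_reason": "exec-condition",
    "service_status": "Down",
}
print(is_blocking(gnoi_entry))
```

Under this rule the gnoi-shutdown entry no longer blocks readiness, while a service that is down for any other reason still does.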
Steps to Reproduce
1. Boot a non-SmartSwitch SONiC device.
2. Check SYSTEM_READY status:
   redis-cli -n 6 HGET "SYSTEM_READY|SYSTEM_STATE" Status
   Expected: Should be "UP" after all services start
   Actual: Returns (nil) (not set)
3. Check gnoi-shutdown status:
   redis-cli -n 6 HGETALL "ALL_SERVICE_STATUS|gnoi-shutdown"
   Actual: Shows fail_reason: exec-condition, service_status: Down
4. Enable sflow and observe the startup delay (hsflowd will not send any samples during the delay period, even though it is configured correctly and running):
   config sflow enable
Actual Behavior and Expected Behavior
Expected:
The sflow feature should start working shortly after being enabled and configured.
Actual:
After sflow is enabled and configured, it captures no sample packets for 180s, until the timeout for the SYSTEM_READY check expires.
Relevant log output
Output of show version, show techsupport
Attach files (if any)
No response