Bug: gnoi-shutdown service blocks SYSTEM_READY on non-SmartSwitch platforms #25554

@anders-nexthop

Is it platform specific

generic

Importance or Severity

High

Description of the bug

The system-health service (healthd) is responsible for setting SYSTEM_READY|SYSTEM_STATE to "UP" in Redis STATE_DB. It does this by:

 1. Monitoring all services listed in the FEATURE table (CONFIG_DB) that have check_up_status=true
 2. Checking their status in ALL_SERVICE_STATUS (STATE_DB)
 3. Only setting SYSTEM_READY to "UP" when all monitored services report status="Up"
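
The three steps above can be sketched with plain dictionaries standing in for CONFIG_DB and STATE_DB. This is only an illustration under stated assumptions, not the healthd implementation: the table contents are made-up sample values, `monitored_services` and `system_state` are hypothetical helper names, and a service is treated as healthy when its `service_status` field is "Up".

```python
# Plain dicts stand in for the Redis tables; the entries are illustrative only.
FEATURE = {
    "swss": {"check_up_status": "true"},
    "gnoi-shutdown": {"check_up_status": "true"},
    "telemetry": {"check_up_status": "false"},
}

ALL_SERVICE_STATUS = {
    "swss": {"service_status": "Up"},
    "gnoi-shutdown": {"service_status": "Down", "fail_reason": "exec-condition"},
    "telemetry": {"service_status": "Down"},
}


def monitored_services(feature_table):
    """Step 1: keep only services whose check_up_status is true."""
    return sorted(
        name
        for name, cfg in feature_table.items()
        if cfg.get("check_up_status", "false").lower() == "true"
    )


def system_state(feature_table, status_table):
    """Steps 2 and 3: report "UP" only if every monitored service is Up."""
    down = [
        name
        for name in monitored_services(feature_table)
        if status_table.get(name, {}).get("service_status") != "Up"
    ]
    return ("UP" if not down else "DOWN"), down


state, down = system_state(FEATURE, ALL_SERVICE_STATUS)
print(state, down)
```

In this sample, the unmonitored service being down has no effect, while the single monitored service that is down is enough to keep the overall state at "DOWN".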

The gnoi-shutdown service is intended only for DPU devices, but it shows up in a bad state on non-DPU devices, so healthd never sets SYSTEM_READY to "UP".

This bug was discovered while debugging flaky sflow tests in sonic-mgmt. The tests would pass when run with sufficient delay after boot, but fail when run immediately, because hsflowd was waiting for SYSTEM_READY.

Investigation revealed that gnoi-shutdown is only relevant for SmartSwitch platforms with DPUs, but is being monitored for system readiness on all platforms.

From sonic-buildimage/src/system-health/health_checker/sysmonitor.py:

   def get_all_system_status(self):
       """ Shows the system ready status"""
       scan_srv_list = []
       scan_srv_list = self.get_all_service_list()
       for service in scan_srv_list:
           ustate = self.get_unit_status(service)
           if ustate == "NOT OK":
               if service not in self.dnsrvs_name:
                   self.dnsrvs_name.add(service)
       if len(self.dnsrvs_name) == 0:
           return "UP"
       else:
           return "DOWN"

The gnoi-shutdown service has an ExecCondition that only allows it to run on SmartSwitch platforms with DPUs.

From sonic-buildimage/src/sonic-host-services/data/debian/sonic-host-services-data.gnoi-shutdown.service:

   [Service]
   Type=simple
   ExecCondition=/usr/bin/python3 /usr/local/bin/check_platform.py
   ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
   ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py

On non-SmartSwitch platforms:

 1. ExecCondition=/usr/bin/python3 /usr/local/bin/check_platform.py fails (returns non-zero exit code)
 2. systemd marks the service as failed with fail_reason="exec-condition"
 3. system-health sees gnoi-shutdown in ALL_SERVICE_STATUS with:
  `app_ready_status: Down`
  `fail_reason: exec-condition`
  `service_status: Down`
 4. system-health considers this a "NOT OK" service
 5. system-health refuses to set SYSTEM_READY to "UP"
 6. Services waiting for SYSTEM_READY (like hsflowd) are stuck

gnoi-shutdown is a platform-specific service designed for SmartSwitch DPU graceful shutdown. It should NOT be monitored for system readiness on platforms where it cannot run.

The service has a valid ExecCondition that prevents it from running on incompatible platforms, but system-health treats this as a service failure rather than a platform-specific exclusion.
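
One way to picture the distinction between a failure and a platform-specific exclusion is the sketch below. It is not a proposed patch to sysmonitor.py: `classify`, `system_state_with_exclusions`, and `SKIP_REASONS` are hypothetical names, and treating `fail_reason: exec-condition` as "skipped" is an assumption about how the exclusion could be expressed.

```python
# Assumption: a Down service whose fail_reason is in this set was skipped on
# purpose by its ExecCondition and should not block readiness.
SKIP_REASONS = {"exec-condition"}


def classify(entry):
    """Return "OK", "SKIPPED", or "NOT OK" for one ALL_SERVICE_STATUS entry."""
    if entry.get("service_status") == "Up":
        return "OK"
    if entry.get("fail_reason") in SKIP_REASONS:
        return "SKIPPED"
    return "NOT OK"


def system_state_with_exclusions(status_table):
    """Report "UP" unless at least one service is classified as "NOT OK"."""
    blocking = sorted(
        name for name, entry in status_table.items() if classify(entry) == "NOT OK"
    )
    return "UP" if not blocking else "DOWN"


# Field values taken from the gnoi-shutdown entry described in this issue.
gnoi_entry = {
    "app_ready_status": "Down",
    "fail_reason": "exec-condition",
    "service_status": "Down",
}
print(classify(gnoi_entry))
print(system_state_with_exclusions({"gnoi-shutdown": gnoi_entry}))
```

A service that is down for any other reason would still be classified as "NOT OK" and keep the state at "DOWN".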

Steps to Reproduce

  1. Boot a non-SmartSwitch SONiC device

  2. Check SYSTEM_READY status:
    redis-cli -n 6 HGET "SYSTEM_READY|SYSTEM_STATE" Status
    Expected: Should be "UP" after all services start
    Actual: Returns (nil) (not set)

  3. Check gnoi-shutdown status:
    redis-cli -n 6 HGETALL "ALL_SERVICE_STATUS|gnoi-shutdown"
    Actual: Shows fail_reason: exec-condition, service_status: Down

  4. Enable sflow and observe startup delay (hsflowd will not send any samples during the delay period, even though it's configured correctly and running):
    config sflow enable
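
The check in step 3 can also be scripted. The sketch below assumes that redis-cli prints an HGETALL reply as alternating field and value lines when its output is captured rather than shown on a terminal; `parse_hgetall` and `read_service_status` are hypothetical helper names, not part of SONiC.

```python
import subprocess


def parse_hgetall(raw):
    """Turn alternating field/value lines into a dict."""
    lines = [line.strip() for line in raw.splitlines()]
    return dict(zip(lines[0::2], lines[1::2]))


def read_service_status(service, db=6):
    """Run the same redis-cli query as step 3 and parse the reply.

    Must be run on the device; it is defined here but not called.
    """
    key = "ALL_SERVICE_STATUS|{}".format(service)
    result = subprocess.run(
        ["redis-cli", "-n", str(db), "HGETALL", key],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_hgetall(result.stdout)


# Sample reply built from the field values reported in this issue.
sample = (
    "app_ready_status\nDown\n"
    "fail_reason\nexec-condition\n"
    "service_status\nDown\n"
)
status = parse_hgetall(sample)
print(status["service_status"], status["fail_reason"])
```

On a device, `read_service_status("gnoi-shutdown")` would replace the sample string with the live reply.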

Actual Behavior and Expected Behavior

Expected:
The sflow feature should start working shortly after being enabled and configured.

Actual:
After sflow is enabled and configured, it captures no sample packets for 180s, until the timeout for the SYSTEM_READY check expires.
