[RFC] How to deal in handlers with check exit status out of bounds #127

@kali-brandwatch

I recently opened a PR for handler-slack-multichannel.rb (see here) in which I suggested changing how handlers deal with check exit codes that are out of bounds (outside the [0-3] range).

In the discussion that started in the PR, I suggested following what (at least in my experience) has been a standard carried over from Nagios conventions into pretty much everything I have used that follows the Nagios path (including Icinga 1 & 2 and Sensu): when a check script returns an exit code outside [0-3], it should not be masked as UNKNOWN status.

The logic behind this approach is the following:

  • typically the person who wrote the script returns exit status 3 explicitly for scenarios in which the check cannot assess the state of the monitored service or resource, meaning the code is deliberately catching that result.
  • check scripts can, however, die ungracefully for unexpected reasons, in which case they return whatever exit code the failed execution happens to produce.
  • common causes of such non-standard exit codes (in my experience) include:
    • the check script depends on sudo privileges, but the relevant sudoers entry doesn't exist or has been removed
    • the check script has wrong permissions (wrong ownership, wrong chmod)
    • the check script tries to read a file that does not exist or that it has no permission to read
    • the check script runs some command (e.g. curl) that exits with an error code that is not caught
    • syntax errors introduced in the check script
    • with config management systems (Puppet et al.), someone on a distributed team might accidentally introduce one of the above and break a check's functionality
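To illustrate how such codes leak out, here is a minimal sketch. `raw_exit_status` is a hypothetical helper written for this example (it is not part of Sensu), and it assumes a POSIX `sh` is available on the host. POSIX shells report 127 when a command cannot be found.

```ruby
# Hypothetical helper (not part of Sensu): run a command line through `sh`
# and return its raw exit status, discarding the command's output.
def raw_exit_status(command_line)
  system('sh', '-c', command_line, out: File::NULL, err: File::NULL)
  $?.exitstatus
end

# A check that deliberately reports CRITICAL.
puts raw_exit_status('exit 2')

# An uncaught failure inside the script that leaks an arbitrary code.
puts raw_exit_status('exit 7')

# A command the shell cannot find, e.g. a missing dependency (yields 127).
puts raw_exit_status('no_such_command_for_this_example')
```

Only the first of these falls inside [0-3]; the other two are exactly the kind of result a handler would see as out of bounds.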

If we follow the convention of reporting only the standard [0-3] status codes, we might be masking any of the above behind a generic UNKNOWN status.

In my personal experience it has always been useful to see the explicit exit code and report it to the operator, since this helps identify the issue faster and more clearly.
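As a rough sketch of what reporting the explicit code could look like in a handler: the `status_label` method and the "NON-STANDARD" wording below are my own illustration, not existing Sensu or plugin code.

```ruby
# Names for the four standard check statuses, indexed by exit code.
STATUS_NAMES = %w[OK WARNING CRITICAL UNKNOWN].freeze

# Hypothetical helper: return the standard name for codes 0-3 and keep
# the raw code visible for anything else instead of masking it as UNKNOWN.
def status_label(status)
  if status.is_a?(Integer) && status.between?(0, 3)
    STATUS_NAMES[status]
  else
    "NON-STANDARD (exit code #{status})"
  end
end

puts status_label(3)    # a deliberate UNKNOWN from the check author
puts status_label(127)  # an out-of-bounds code, reported as-is
```

With this, an operator can tell a deliberate exit 3 apart from a script that died with some other code.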

Alternatively, I would make this a configurable option that lets handlers override how any non-standard status code is reported (some people would like it reported as UNKNOWN, but I have also been in scenarios where it was preferred to report such failures as CRITICAL).
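A minimal sketch of such an option, under stated assumptions: `resolve_status` and its `override:` keyword are hypothetical names chosen for this example, and no such setting exists in Sensu or the handler today. When no override is configured, the raw code is reported unmasked; when one is configured, the chosen severity is used but the original code is still shown.

```ruby
# The four standard check statuses.
STANDARD_STATUSES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Hypothetical helper: `override` is an assumed handler setting naming the
# standard status to use for out-of-bounds exit codes (nil means "do not mask").
def resolve_status(status, override: nil)
  return STANDARD_STATUSES[status] if STANDARD_STATUSES.key?(status)
  return "NON-STANDARD (exit code #{status})" if override.nil?

  name = override.to_s.upcase
  unless STANDARD_STATUSES.value?(name)
    raise ArgumentError, "override must be one of #{STANDARD_STATUSES.values.join(', ')}"
  end

  # Keep the original code visible even when the severity is overridden.
  "#{name} (exit code #{status})"
end

puts resolve_status(2)                          # standard code, unchanged
puts resolve_status(126)                        # no override configured
puts resolve_status(126, override: 'critical')  # team prefers CRITICAL
puts resolve_status(126, override: 'unknown')   # team prefers UNKNOWN
```

Keeping the raw code in the message even when an override applies preserves the debugging value described above while letting each team choose the severity.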

I would like to hear more opinions regarding this.
