Description
Describe the bug
If an object changes state during an icinga2 restart (e.g. during a deploy), the state change is sometimes not written to the database and does not trigger notifications.
To Reproduce
- Import the basket with
icingacli director basket restore < icinga-lost-statechange-basket.json
- icinga-lost-statechange-basket.json
- This basket contains:
- A check command that randomly goes into warning to generate the state changes.
- A service template that runs the check command
- A service group to quickly create lots of services, making the occurrence more likely.
- A host template as the target for the apply rule of the service group.
- Create a few hosts:
for i in $(seq --equal-width 1 100); do icingacli director host create "host-icinga-lost-statehistory-${i}" --imports 'ht-icinga-lost-statechange'; done
- Deploy the config
icingacli director config deploy
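The basket itself is only available as an attachment. As a rough illustration of its first item, a check that randomly goes into warning could look like the sketch below. This is not the check command from the basket: the function names, the 30% probability, and the output strings are assumptions made for illustration.

```python
import random

OK, WARNING = 0, 1  # conventional monitoring plugin exit codes


def random_state(warning_probability=0.3, rng=random):
    """Return (exit_code, plugin_output); WARNING with the given probability.

    Hypothetical stand-in for the basket's check command.
    """
    if rng.random() < warning_probability:
        return WARNING, "WARNING - random state change"
    return OK, "OK - nothing to report"


def main():
    """Print the plugin output and return the exit code for the caller."""
    code, output = random_state()
    print(output)
    return code
```

To run it as a plugin, end the script with `sys.exit(main())` (after `import sys`) so that Icinga 2 sees the exit code.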
With that configuration running, deploy icinga2 a few times: icingacli director config deploy --force --wait
Soon there will be state changes in the state history that should not be possible:
In this case, the service went from hard warning into soft warning. The soft warning history says that the last state was Ok, but that was never written into the history.
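The inconsistency described above can be expressed as a simple rule: for a given object, each history entry's last state should equal the state of the entry before it. The sketch below applies that rule to rows already fetched from the state history. It is not the attached script; the row keys (`object_id`, `state_time`, `state`, `last_state`) are assumed to mirror the IDO `icinga_statehistory` columns, and the function name is hypothetical.

```python
from collections import defaultdict


def find_lost_state_changes(rows):
    """Return rows whose last_state disagrees with the preceding row's state.

    rows: iterable of dicts with the keys object_id, state_time, state and
    last_state (assumed to match the IDO state history columns).
    """
    by_object = defaultdict(list)
    for row in rows:
        by_object[row["object_id"]].append(row)

    suspicious = []
    for history in by_object.values():
        # Compare each entry with its predecessor in chronological order.
        history.sort(key=lambda r: r["state_time"])
        for previous, current in zip(history, history[1:]):
            if current["last_state"] != previous["state"]:
                suspicious.append(current)
    return suspicious
```

For the case above, a hard warning (state 1) followed by a soft warning whose last state is Ok (0) would be reported. History entries removed by cleanup jobs could also trigger the rule, so hits are candidates to inspect rather than proof of a lost state change.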
To find lost state histories quicker I used the following script:
dropped_state_query.tar.gz
It takes the endpoint, user and password as parameters. If the database is PostgreSQL, it can be run with the --postgres flag.
Expected behavior
I expect icinga2 not to lose state changes like that.
Your Environment
Include as many relevant details about the environment you experienced the problem in
- Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)
Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Build information:
Compiler: GNU 8.5.0
Build host: staging5591master
OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
- Operating System and version:
System information:
Platform: Red Hat Enterprise Linux
Platform version: 8.10 (Ootpa)
Kernel: Linux
Kernel version: 4.18.0-553.el8_10.x86_64
Architecture: x86_64
- Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 journald opentsdb perfdata syslog
Enabled features: api checker ido-mysql influxdb livestatus mainlog notification
- Icinga Web 2 version and modules (System - About):
Icinga Web 2 NetEye release 4.39 (Traditional bock)
PHP Version 7.4.33
MODULE VERSION
analytics 1.58.0
auditlog 1.15.1
cube 1.1.0
customproblemview 0.0.0
director 1.11.1
geomap 1.22.0
grafana 1.4.2
neteye 1.155.0-1
host2servicedetailview 1.4.0
idoreports 0.10.1
incubator 0.22.0
ipl v0.5.0
lampo 1.2.2
leafletjs 1.9.4
loginaudit 0.0.1
mapDatatype 0.1.0
monitoring 2.10.5
monitoringview 1.7.0
nagvis 1.1.1
pdfexport 0.10.2
reactbundle 0.9.0
reporting 1.0.0
shutdownmanager 0.0.0
srwebbackend 0.0.0
tornado 2.19.2
update 1.44.1-2
- Config validation (icinga2 daemon -C):
[2024-09-30 14:45:34 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-09-30 14:45:34 +0200] information/cli: Loading configuration file(s).
[2024-09-30 14:45:34 +0200] information/ConfigItem: Committing config item(s).
[2024-09-30 14:45:34 +0200] information/ApiListener: My API identity: localhost.localdomain
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 LivestatusListener.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ServiceGroup.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 902 Services.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 3 Zones.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 2 NotificationCommands.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 101 Hosts.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 Endpoint.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 251 CheckCommands.
[2024-09-30 14:45:34 +0200] information/ScriptGlobal: Dumping variables to file '/neteye/shared/icinga2/data/cache/icinga2/icinga2.vars'
[2024-09-30 14:45:34 +0200] information/cli: Finished validating the configuration file(s).
Additional context
I observed the loss of notifications in production but have not yet reproduced it locally. I suspect, however, that the two behaviors are linked.
We also observed the same behavior when creating objects over the icinga2 API and then immediately sending a check result. Once again, I have not replicated this locally yet, but I suspect the problem is the same in all these cases.
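An attempt to reproduce the API case might start from the two requests sketched below: one that creates a service and one that submits a check result for it right afterwards. This is an untested sketch, not a confirmed reproducer. The helper name, the `dummy` check command, and the plugin output are assumptions, and authentication and TLS settings are left out; the requests are only built, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_requests(base_url, host, service, check_command="dummy"):
    """Build (create_request, check_result_request) for the Icinga 2 REST API."""
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    object_name = urllib.parse.quote(f"{host}!{service}", safe="!")

    # PUT creates the service object on the given host.
    create = urllib.request.Request(
        f"{base_url}/v1/objects/services/{object_name}",
        data=json.dumps({"attrs": {"check_command": check_command}}).encode(),
        headers=headers,
        method="PUT",
    )

    # POST submits a passive WARNING result for the service just created.
    check_result = urllib.request.Request(
        f"{base_url}/v1/actions/process-check-result",
        data=json.dumps({
            "type": "Service",
            "filter": f'host.name=="{host}" && service.name=="{service}"',
            "exit_status": 1,
            "plugin_output": "WARNING - submitted right after creation",
        }).encode(),
        headers=headers,
        method="POST",
    )
    return create, check_result
```

Sending both with `urllib.request.urlopen` back to back, with credentials for an ApiUser added, would mimic the "create, then immediately send a check result" sequence described above.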