Skip to content

Bugfix: configobject shutdown order #10191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

w1ll-i-code
Copy link

This PR fixes #10179.

Shutdown the object in the correct order.

During a restart the objects are shut down in the reversed order they should be. With this the CheckerComponent is still updating objects after the IDO, Icingadb, Notifications, etc. have already been shutdown, leading to icinga2 getting out of sync with external components.

Increase activation_priority for key config objects to ensure correct startup and shutdown order

Make sure all configobjects are started and shut down in the right order, so that each object has its possible dependencies loaded before starting execution.

Copy link

cla-bot bot commented Oct 17, 2024

Thank you for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA).

Please follow instructions at https://icinga.com/company/contributor-agreement to sign the CLA.

After that, please reply here with a comment and we'll verify.

Contributors that have not signed yet: @w1ll-i-code

  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Please contact us if you think this is the case.

  • If you signed the CLA as a corporation, your GitHub username may not have been submitted to us. Please reach out to the responsible person in your organization.

@w1ll-i-code w1ll-i-code changed the title Bugfix/configobject shutdown order 10179 Bugfix: configobject shutdown order Oct 17, 2024
@w1ll-i-code
Copy link
Author

How do I retrigger the CLA check?

@bobapple
Copy link
Member

@cla-bot check

@cla-bot cla-bot bot added the cla/signed label Oct 17, 2024
@lippserd
Copy link
Member

Hi @w1ll-i-code,

Thanks for the PR. We will look at the changes as soon as possible.

Best regards,
Eric

@yhabteab
Copy link
Member

Hi, thank you for your contribution!

With this the CheckerComponent is still updating objects after the IDO, Icingadb, Notifications, etc. have already been shutdown, leading to icinga2 getting out of sync with external components.

Can you confirm that you have tested this PR and that it resolves your problem described in the referenced issue? Did you notice that the checker is also subscribed to the checkable events? Meaning, this shouldn't provide you any benefit, since the checker will drop the checkables from its queue if they are stopped first.

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.
To illustrate, while a host object is active on one node, its associated notification object can become active on another node. Consequently, each time the host object encounters a problem state, this node must sync the information to the other node to trigger a notification. However, such cluster events can disappear mysteriously if the server is restarted or reloaded at a wrong time. There is already an issue (#10157) and a draft PR #10161 to address this.

@w1ll-i-code
Copy link
Author

Yes, I can confirm that this PR did fix the issues we were experiencing, at least from the specific tests I run and which can be found attached to the original issue. I was hoping on a sooner response, but the fix is now running in production and I am sure if anything is still missing, I'll get some feedback in. With the setup I provided in the reproduce, I could observe several lost entries to the ido on every restart. A problem which did no longer occur, once I made the change to the shutdown order. Knowing that I also took the liberty to rearrange some other objects in the startup/shutdown order to behave in accordance with my findings.

Did you notice that the checker is also subscribed to the checkable events?

I have to admit, I am usually not a C++ developer and have had no prior experience with this particular part of the icinga2 code base. I did, in fact, not know that. However I can tell you that this particular fix did resolve the issue, from the test I did perform...

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.

I don't think that's the case, as neither I for the reproducing of the issue, nor the client first reporting this issue have a cluster setup for icinga2. You can have a look at the docker-compose I attached to the issue, to reproduce the bug, to confirm this for yourself. This can be observed on a single icinga2 instance running by its own, without any other cluster nodes or even satellites.

@Al2Klimov Al2Klimov self-requested a review October 23, 2024 10:31
@yhabteab
Copy link
Member

yhabteab commented Oct 23, 2024

Yes, I can confirm that this PR did fix the issues we were experiencing, at least from the specific tests I run and which can be found attached to the original issue.

If this PR indeed fix your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the DbConnection class and the notification component. As I already said, if the checkable is stopped first the checker will immediately drop it from its queues. Likewise, the change for the API listener or the endpoint shouldn't be necessary at all, as you don't want to accept new connections while you're on the shutdown phase.

Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history.

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))? As far as I can see, there is nothing wrong with that service history and I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup:

[15823/144776] Analyzing state changes for host-icinga-lost-statehistory-00946!service-icinga-lost-statechange-05
[15824/144776] Analyzing state changes for icinga2!disk
thread 'main' panicked at src/icingadb.rs:133:75:
called `Option::unwrap()` on a `None` value

But at least it claimed to have found an error for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04:

[217/144776] Analyzing state changes for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04
Idk what's going on host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04: ObjectNotifications { service_id: "004D960037D7EAE178416B5CD110F13F28FD1542", hard_state_changes: 1, notification_count: 2 }
Bildschirmfoto 2024-10-23 um 12 48 47

As you can see in Icinga DB Web, this is not an error, the first notification is triggered due to the state change from PENDING to WARNING and the second one is a reminder notification, since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

{
    "results": [
        {
            "attrs": {
                "__name": "host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04!apply-nt-log-notifications",
                "active": true,
                "command": "nc-log-notification",
                "ha_mode": 0,
                "host_name": "host-icinga-lost-statehistory-00868",
                "interval": 1800,
...

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Oct 23, 2024

If this PR indeed fix your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the DbConnection class and the notification component.

No it is not. I made those changes only after I confirmed it works with the reversed shutdown order....

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))?

I'm sorry, I should have clarified more. From the bottom up, you see a hard state change from ok to warning and the notification that was sent. After that, you see a state change from hard ok to soft warning, meaning the service had a previous hard state changed to ok and it and the corresponding notification are not present in the history.

I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup

had removed those services manually as I did not need them. I did not think that it would break my script.

since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

IYou are right, I only changed the notification interval on the test machine, but forgot to do the same locally before sending the basket. I apologize for that...

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from 4fe3ba3 to 141dedf Compare November 5, 2024 10:19
@Al2Klimov Al2Klimov self-requested a review November 5, 2024 13:21
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so these are the current APs:

lib/base/*logger.ti:12: activation_priority -100;
lib/icinga/icingaapplication.ti:12:     activation_priority -50;
lib/icinga/downtime.ti:23:      activation_priority -10;
lib/*/*.ti:*:      //activation_priority 0;
lib/icinga/scheduleddowntime.ti:24:     activation_priority 20;
lib/remote/apilistener.ti:15:   activation_priority 50;

lib/compat/compatlogger.ti:13:  activation_priority 100;
lib/compat/externalcommandlistener.ti:13:       activation_priority 100;
lib/db_ido_*sql/ido*sqlconnection.ti:12:        activation_priority 100;
lib/icingadb/icingadb.ti:13:    activation_priority 100;
lib/livestatus/livestatuslistener.ti:12:        activation_priority 100;
lib/perfdata/*writer.ti:12:     activation_priority 100;

lib/notification/notificationcomponent.ti:12:   activation_priority 200;
lib/checker/checkercomponent.ti:12:     activation_priority 300;

All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.

<walloftext for="devs">
Now, if we travel back in time to better understand how we got here via git log -p -U0 and use the pager to search for "activation_priority", we get:

+- whole .ti files

and the stuff that matters

</walloftext>

This comment was marked as duplicate.

@cla-bot cla-bot bot removed the cla/signed label Nov 11, 2024
@w1ll-i-code
Copy link
Author

@Al2Klimov Thanks for the hints. Now everything should be in order.

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from c664d8b to f3ee453 Compare November 11, 2024 10:55
@cla-bot cla-bot bot added the cla/signed label Nov 11, 2024
@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority 250;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that NotificationComponent has a activation_priority of 200, which means, there is some time, where notifications are already/still being sent, but the event is not written into the db. I'd like to avoid that entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then lower NotificationComponent's activation_priority below 100 instead.

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from f3ee453 to c25007c Compare November 12, 2024 10:44
@Al2Klimov Al2Klimov self-requested a review November 12, 2024 11:07
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer the diff is! 👍
However:

@@ -10,6 +10,8 @@ namespace icinga

class Endpoint : ConfigObject
{
activation_priority 300;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? Neither Endpoint overrides #Start() or #Stop(), nor any code calls Endpoint#IsActive().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I did not notice till rn, that all communication with the endpoints also counts as ApiListener.... I expected them to behave like the other config objects... My brain hurts now.

@@ -9,7 +9,7 @@ namespace icinga

class NotificationComponent : ConfigObject
{
activation_priority 200;
activation_priority 75;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10191 (comment) sounds like you'd like to do the opposite, i.e to leave this as-is. Imagine, all data outputs (100) are shut down or not yet online, but notifications (75) get sent – silently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10191 (comment) This was your suggestion.... I honestly find this code so confusing that I just accepted it as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was fully focused on cleaning up the diff.
Anyway, now after thinking about what's left, I suspect what matters in your case is Notification -5 and this one is not needed or even harmful.

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Nov 13, 2024

Somewhere down the line while applying your suggested improvements, the fix broke again. I'll try to reproduce exactly what went wrong.

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from c25007c to 4a39831 Compare November 15, 2024 16:29
@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch 2 times, most recently from b97894d to 39977d9 Compare November 22, 2024 09:54
@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority -50;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No matter whether Icinga DB purges Redis and re-inserts everything – or performs a diff – given #10151, not yet activated objects – all checkables – will effectively be deleted. This is not acceptable, so if -50 is actually needed here, we have to:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet activated objects – all checkables – will effectively be deleted

Idk, how you came up to that conclusion, nor how this relates to #10057 or #10151 in anyway. Icinga DB won't do anything if the checkables aren't truly deleted.

} else if (!object->IsActive() &&
object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last in the same way they are deactivated. IMHO the possible solution I am thinking of is to not touch the activation order of these objects on startup and deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.

@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority -50;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet activated objects – all checkables – will effectively be deleted

Idk, how you came up to that conclusion, nor how this relates to #10057 or #10151 in anyway. Icinga DB won't do anything if the checkables aren't truly deleted.

} else if (!object->IsActive() &&
object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last in the same way they are deactivated. IMHO the possible solution I am thinking of is to not touch the activation order of these objects on startup and deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.

@w1ll-i-code
Copy link
Author

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

@yhabteab
Copy link
Member

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

Hi @w1ll-i-code, I'm afraid that changing the order of activation_priority won't fully solve the problem, or if it partially does, it will break the existing activation_priority on startup. I'm really sorry for the inconvenience, but I think the underlying problem is much deeper than just changing the activation and deactivation order of objects. It's going to take a lot more time and testing to fix it properly. I don't want to say that this PR is irrelevant, but even with this PR, remotely generated events will still get lost, so we'd love to put in more effort and provide a fix in addition to this PR. However, as we are currently fully occupied with preparations for the next Icinga 2 bugfix release, which unfortunately does not include this, we'll give you an update after the next 2-3 weeks once Icinga 2.14.4 is ready.

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Nov 27, 2024

@yhabteab Thanks for the quick reply. I have still some questions:

  1. What exactly do you mean with remotely generated events will still get lost? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?
  2. What does it will break the existing activation_priority on startup. mean exactly? What isn't working anymore afterwards?

@yhabteab
Copy link
Member

yhabteab commented Nov 28, 2024

  1. What exactly do you mean with remotely generated events will still get lost? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?

Yes, I'm referring to check result events coming from other endpoints.

  1. What does it will break the existing activation_priority on startup. mean exactly? What isn't working anymore afterwards?

Not exactly something is not working, but so far IcingaDB is activated after hosts, services, downtimes etc. (originally I thought it was actually the other way round, i.e. IcingaDB is activated first and then the other objects, that's my mistake), but with this PR its activation priority has been set to -50, which means it will definitely be activated before anything else and probably run into problems like in #10151 more prominently (edit: NVM! that's irrelevant here, but I'll have to test it out properly, which I don't have that much time for at the moment, but I'll give you an update in the coming weeks).

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Nov 28, 2024

Yes, I'm referring to check result events coming from other endpoints.

@yhabteab The ApiListener should already be shut down at that point (activation_priority 50), right? So it wouldn't impact it?

@yhabteab
Copy link
Member

Unfortunately, stopping the ApiListener object does not stop the listener from accepting new connections and does not immediately disconnect all already connected endpoints either. All the active connections will only be terminated when the process is terminated, not after the API listener has been stopped. @julianbrost was/is working on fixing this, i.e. when the listener is stopped, to manually disconnect all connected endpoints and reject any new connection attempts, but as mentioned before, this will require excessive testing and more time to invest in and will therefore not make it into the next bugfix release.

@w1ll-i-code
Copy link
Author

@yhabteab This should actually not be a problem anymore, as the object is already deactivated at that point (priority 0), meaning it wouldn't change the internal state or dispatch events either way....

Make sure all configobjects are started and shut down in an order, such
that all state changes of monitoring objects are correctly processed
during the shutdown.
@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from 39977d9 to 47a4423 Compare March 10, 2025 13:13
@yhabteab
Copy link
Member

yhabteab commented Mar 17, 2025

Hi @w1ll-i-code, I'm really sorry for the long delay again!

As I said in #10191 (comment), we need to fix the issue you are encountering without having to change the current activation priority of the types. However, to achieve this, we need to adjust the class parser a bit so that a configuration type can provide both activation_priority and deactivation_priority at the same time if needed. Since I'm not sure if you are familiar with the parser or not, I just built support for it and pushed 60bcdc2 right now. Can you please cherry-pick this commit to your branch and revert all the changes you made to the activation_priority and use deactivation_priority instead and see if that fixes your problem? With my commit applied, e.g. Icinga DB could then define both as follows:

activation_priority 100;
deactivation_priority -50;

And don't forget to use GetDeactivationPriority() here instead:

if (a->GetActivationPriority() > b->GetActivationPriority())

EDIT: I have just pushed another commit, please use this f118110 instead of 60bcdc2.

@w1ll-i-code
Copy link
Author

Thanks, I'll test it our first thing tomorrow. I'll let you know as soon as I can.

@w1ll-i-code
Copy link
Author

Hi @yhabteab. I have tested it and it looks stable to me! Would you like me to include your commit in this PR or do I have to wait until that landed?

Sidenote: You need to include climits in the file class_parser.hpp for your commit to compile (at least on my system)

@yhabteab
Copy link
Member

Would you like me to include your commit in this PR or do I have to wait until that landed?

I think it is better to have a separate PR for this, it also simplifies the review process for the colleges, hence I have just #10385.

Sidenote: You need to include climits in the file class_parser.hpp for your commit to compile (at least on my system)

Thanks for the hint! I didn't get any compile errors on my Mac even without that include, so it seems to be a Linux thing 😅.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Icinga2 looses state history and notifications during restart
5 participants