Skip to content

Fix scheduling of downtimes for all services on child hosts #8990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 7, 2021

Conversation

julianbrost
Copy link
Contributor

@julianbrost julianbrost commented Sep 3, 2021

Fundamentally, scheduling downtimes for all services on child hosts was implemented, it just iterated over the services of the wrong host resulting in duplicate downtimes being scheduled for services of the parent host. This PR fixes this loop and in addition prevents the creation of duplicate downtimes for individual services in case that both the service and its host are children of the parent checkable.

Tests

Test config: 3 hosts, each with two services, service on on hosts child1 and child2 depends on host parent, additionally, host child1 depends on host parent.

object Host "github-8989-parent" {
	check_command = "dummy"
}

object Host "github-8989-child1" {
	check_command = "dummy"
}

object Host "github-8989-child2" {
	check_command = "dummy"
}

apply Service "github-8989-service" for (name in ["1", "2"]) {
	check_command = "dummy"
	assign where match("github-8989-*", host.name)
}

object Dependency "github-8989-c1" {
	parent_host_name = "github-8989-parent"
	child_host_name = "github-8989-child1"
}

object Dependency "github-8989-c1-s1" {
	parent_host_name = "github-8989-parent"
	child_host_name = "github-8989-child1"
	child_service_name = "github-8989-service1"
}

object Dependency "github-8989-c2-s1" {
	parent_host_name = "github-8989-parent"
	child_host_name = "github-8989-child2"
	child_service_name = "github-8989-service1"
}

Please ignore the service keep-state in the following output, that's just an additional service I apply to every host in my test cluster.

Before

$ curl -iskSu root:icinga -H 'Accept: application/json' -X POST "https://127.0.0.1:5665/v1/actions/schedule-downtime" -d '{"type":"Host", "filter":"host.name == \"github-8989-parent\"", "start_time":'$(date +%s)', "end_time":'$(date -d +5min +%s)', "author":"icingaadmin", "comment":"42", "pretty":true, "all_services":1, "child_options":"DowntimeTriggeredChildren"}'
HTTP/1.1 200 OK
Server: Icinga/v2.13.0-22-g95cdc00ad
Content-Type: application/json
Content-Length: 2172

{
    "results": [
        {
            "child_downtimes": [
                {
                    "legacy_id": 18,
                    "name": "github-8989-child1!65a4a10e-473c-4ae1-9e46-96cf4ce7e463",
                    "service_downtimes": [
                        {
                            "legacy_id": 19,
                            "name": "github-8989-parent!github-8989-service1!8a7ad4e7-cd20-4c4e-8d7c-caeea1249503"
                        },
                        {
                            "legacy_id": 20,
                            "name": "github-8989-parent!github-8989-service2!f563e60f-7b9d-48fb-a92d-c99c51da14b3"
                        },
                        {
                            "legacy_id": 21,
                            "name": "github-8989-parent!keep-state!15a20b8b-7ba6-49c2-882f-440494443f50"
                        }
                    ]
                },
                {
                    "legacy_id": 22,
                    "name": "github-8989-child2!github-8989-service1!1673890d-ed43-43d0-8f19-787657fce90c"
                },
                {
                    "legacy_id": 23,
                    "name": "github-8989-child1!github-8989-service1!bf162feb-f79b-4286-a310-0694073005c8"
                }
            ],
            "code": 200,
            "legacy_id": 14,
            "name": "github-8989-parent!4aca458c-2c6c-49f6-ac87-e9e57f284726",
            "service_downtimes": [
                {
                    "legacy_id": 15,
                    "name": "github-8989-parent!github-8989-service1!df16c710-fd6f-48f6-973b-cbcaaad53738"
                },
                {
                    "legacy_id": 16,
                    "name": "github-8989-parent!github-8989-service2!f3f3b6a5-6b94-4e09-8266-da0ff6bafed2"
                },
                {
                    "legacy_id": 17,
                    "name": "github-8989-parent!keep-state!290c7080-6712-4d3a-befb-fadaaaa4dd99"
                }
            ],
            "status": "Successfully scheduled downtime 'github-8989-parent!4aca458c-2c6c-49f6-ac87-e9e57f284726' for object 'github-8989-parent'."
        }
    ]
}

Note how the service_downtimes of downtime github-8989-child1!65a4a10e-473c-4ae1-9e46-96cf4ce7e463 lists downtime for github-8989-parent.

After

$ curl -iskSu root:icinga -H 'Accept: application/json' -X POST "https://127.0.0.1:5665/v1/actions/schedule-downtime" -d '{"type":"Host", "filter":"host.name == \"github-8989-parent\"", "start_time":'$(date +%s)', "end_time":'$(date -d +5min +%s)', "author":"icingaadmin", "comment":"42", "pretty":true, "all_services":1, "child_options":"DowntimeTriggeredChildren"}'
HTTP/1.1 200 OK
Server: Icinga/v2.13.0-24-gbb0dcdf0b
Content-Type: application/json
Content-Length: 1991

{
    "results": [
        {
            "child_downtimes": [
                {
                    "legacy_id": 21,
                    "name": "github-8989-child2!github-8989-service1!a8d2f679-1b12-493d-97a1-d080d87a02f1"
                },
                {
                    "legacy_id": 22,
                    "name": "github-8989-child1!82279e74-7225-477b-bfe0-b150c637afaa",
                    "service_downtimes": [
                        {
                            "legacy_id": 23,
                            "name": "github-8989-child1!github-8989-service1!8c488f11-b76e-47f0-b264-7fb1b5cb8fd0"
                        },
                        {
                            "legacy_id": 24,
                            "name": "github-8989-child1!github-8989-service2!5bddb234-a4b6-4957-a42f-c45012e0c9d6"
                        },
                        {
                            "legacy_id": 25,
                            "name": "github-8989-child1!keep-state!d7d3b838-1fa2-4d17-81d1-89c985bc25cd"
                        }
                    ]
                }
            ],
            "code": 200,
            "legacy_id": 17,
            "name": "github-8989-parent!4cc56eef-6d6b-4a00-a575-3489c17e185c",
            "service_downtimes": [
                {
                    "legacy_id": 18,
                    "name": "github-8989-parent!github-8989-service1!7a59a20a-eeb0-442a-8890-aeeb3b2a50a7"
                },
                {
                    "legacy_id": 19,
                    "name": "github-8989-parent!github-8989-service2!b320919a-f888-46e5-940c-9384ab3ff369"
                },
                {
                    "legacy_id": 20,
                    "name": "github-8989-parent!keep-state!65cf7935-adb4-43ec-9605-11720322bf94"
                }
            ],
            "status": "Successfully scheduled downtime 'github-8989-parent!4cc56eef-6d6b-4a00-a575-3489c17e185c' for object 'github-8989-parent'."
        }
    ]
}

Now service_downtimes for github-8989-child1!82279e74-7225-477b-bfe0-b150c637afaa indeed contains downtimes for all services of host github-8989-child1. Also note that github-8989-child2!service1 part of that list and there is no duplicate downtime for it in child_downtimes.

Deleting the downtime for parent also deletes the downtimes created for services of that host but leaves downtimes of other hosts and their services unaffected. Deleting the downtime for child1 also deletes the downtimes for all its serivces (including service1 which is a child itself).

Special thanks to @NicolasGoeddel for pointing out the cause of this bug.

fixes #8989

@cla-bot cla-bot bot added the cla/signed label Sep 3, 2021
@icinga-probot icinga-probot bot added area/runtime Downtimes, comments, dependencies, events bug Something isn't working ref/IP labels Sep 3, 2021
@julianbrost julianbrost added the consider backporting Should be considered for inclusion in a bugfix release label Sep 3, 2021
@nilmerg
Copy link
Member

nilmerg commented Sep 3, 2021

Will cancelling the parent downtime now also cancel the service downtimes of children?

Comment on lines 465 to 472
Downtime::Ptr serviceDowntime = Downtime::AddDowntime(hostService, author, comment, startTime, endTime,
Downtime::Ptr serviceDowntime = Downtime::AddDowntime(childService, author, comment, startTime, endTime,
fixed, triggerName, duration);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will cancelling the parent downtime now also cancel the service downtimes of children?

Looks like it won't as #8913 didn't add the information about that relation to this call. But I think it should, so I'll change that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, it's not only missing in that place but also in line 442. @N-o-X was this a deliberate decision to only set the parent relation for those downtimes created by all_services but not child_options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, I think I can remember the discussion we had about that (looks like we haven't written down the result over there in the PR).

We decided on not propagating the deletion down to dependencies as it's not clear what should happen if a downtime somewhere in the middle of a dependency chain is deleted. Intuitively, one could say that this should delete all the downtimes in that sub-tree, however dependencies are a graph, not a tree, so a dependant child could be in there due to multiple dependencies.

What's definitely an oversight in the current PR that service downtimes on child hosts don't have a relation to the downtime on the host. This should be addressed in any case. I'm no longer sure about dependencies. If we want to also delete then, this should be handled separately as this needs larger changes to what we store and how we handle deletes (as described in the second paragraph) and also a downtime can only one parent right now, so it can't have a relation to both the downtime on the child host and the root downtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye fine, but then should the documentation get updated (https://icinga.com/docs/icinga-2/latest/doc/16-upgrading-icinga-2/#deletion-of-child-downtimes-on-services) before this fix is propagated. Otherwise the expected result is false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see anything wrong with the documentation. It says all_services flag and nothing about child_options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But with this change the all_services flag applies also to childs. If that's the case, and I re-read the documentation with this in mind, I expect that these service downtimes of children also get removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting a downtime of a child host will now also delete downtimes for service of that particular child host.

Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@julianbrost julianbrost marked this pull request as draft September 3, 2021 12:53
The loop iterated over the services of the wrong host resulting in duplicate
downtimes scheduled for services of the parent host instead of downtimes for
services of the child host.
@julianbrost julianbrost force-pushed the bugfix/downtime-all-services-on-child-hosts branch from b762af7 to bb0dcdf Compare September 3, 2021 13:53
@julianbrost julianbrost marked this pull request as ready for review September 3, 2021 14:04
@julianbrost
Copy link
Contributor Author

I've pushed a new version that handles the deletion of child downtimes for the all_services case. Please have another look.

@nilmerg Please ack if that behavior is ok for you. Otherwise please create an issue for also handling the propagation of deletion for child_options, but be aware that this would require some larger rework of how dependencies are handled.

@nilmerg
Copy link
Member

nilmerg commented Sep 3, 2021

I just want that creating and removing downtimes behaves the same regarding the all_services flag. If a child's services get a downtime due to that flag and the documentation tells me that service downtimes created with this flag get automatically removed, and if this is in fact the case.., then it's fine. If that's not the case the documentation has to reflect that.

@julianbrost
Copy link
Contributor Author

Does the current behavior (i.e. downtime removal on any hosts propagates to the services of that very host) match the documentation? The only problem I could see is that the term "child" is sometimes used for the relation between a service and the host it belongs to and sometimes for the relation given by an explicit Dependency object.

@nilmerg
Copy link
Member

nilmerg commented Sep 3, 2021

From: https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/#schedule-downtime

Sets downtime for all services for the matched host objects. If child_options are set, all child hosts and their services will schedule a downtime too.

From: https://icinga.com/docs/icinga-2/latest/doc/16-upgrading-icinga-2/#deletion-of-child-downtimes-on-services

Service downtimes created while using the all_services flag on the schedule-downtime API action will now automatically be deleted when deleting the hosts downtime.

The latter only speaks about the parent host's services. It needs to clarify what happens to child hosts's services.

Btw: I miss that part here: https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/#remove-downtime

@xenuser
Copy link
Contributor

xenuser commented Oct 12, 2021

@nilmerg Can you provide a short update when this will be moving forward? Thanks in advance.

@julianbrost
Copy link
Contributor Author

I've added another commit that documents how downtime removals using /v1/actions/remove-downtime propagate to other downtimes.

Copy link
Member

@nilmerg nilmerg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs are fine now, thanks @julianbrost!

@NicolasGoeddel
Copy link

NicolasGoeddel commented Dec 3, 2021

Can anyone tell me when the new release will be out? @lippserd told me he thinks it will be out in late November.
If I understand this correctly it should be available in v2.13.3 and v2.12.7. I would be happy if there is already a final date.

@Al2Klimov Al2Klimov merged commit 31c5641 into master Dec 7, 2021
@icinga-probot icinga-probot bot deleted the bugfix/downtime-all-services-on-child-hosts branch December 7, 2021 11:48
@julianbrost julianbrost added this to the 2.14.0 milestone Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/runtime Downtimes, comments, dependencies, events backported Fix was included in a bugfix release bug Something isn't working cla/signed ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Downtime not scheduled for services on child hosts
7 participants