[PR] Gracefully terminate an operator on SIGINT/SIGTERM #147

kopf-archiver · 2020-08-18T19:57:25Z

A pull request by nolar at 2019-07-14 17:39:46+00:00
Original URL: zalando-incubator/kopf#147
Merged by nolar at 2019-07-15 15:42:22+00:00

Part of an effort to make Kopf-based operators more stable and resilient.

Issue: #142, partially replaces #143.

Description

First, react to SIGINT/SIGTERM at all. Previously, the process was terminated at the OS level rather than through the framework's own handling of the signal.
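As a rough illustration of reacting to these signals inside an asyncio program, here is a minimal sketch. It is not Kopf's actual code: the function name `run_until_signalled` is hypothetical, and the approach relies on `loop.add_signal_handler()`, which is only available on Unix event loops.

```python
import asyncio
import signal


async def run_until_signalled(signals=(signal.SIGINT, signal.SIGTERM)):
    """Block until one of the given signals arrives; return that signal.

    Hypothetical helper for illustration only (not part of Kopf).
    """
    loop = asyncio.get_running_loop()
    received = loop.create_future()

    def on_signal(signum):
        # Record only the first signal; later ones are ignored in this sketch.
        if not received.done():
            received.set_result(signum)

    for sig in signals:
        # Unix-only: the event loop takes over handling of this signal.
        loop.add_signal_handler(sig, on_signal, sig)
    try:
        return await received
    finally:
        # Restore the previous state so that the handlers do not outlive the call.
        for sig in signals:
            loop.remove_signal_handler(sig)
```

A caller would typically await this next to the framework's own tasks and begin the shutdown sequence once it returns.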

Second, when exiting for any reason (e.g. SIGINT/SIGTERM, explicit cancellation, or any exception), do so as gracefully as possible. To that end, first cancel the core ever-running tasks of the framework and give them some time to finish; then cancel all remaining tasks and sub-tasks, if any are left, and give them some time again; only then actually exit.
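The two-phase sequence described above could look roughly like the following sketch. The function name `shutdown`, the `root_tasks` argument, and the `grace` timeout are assumptions made for illustration; the real implementation in this PR may be structured differently.

```python
import asyncio


async def shutdown(root_tasks, *, grace=1.0):
    """Cancel the root tasks first, then whatever else is still running.

    Hypothetical helper: returns the tasks that are still alive afterwards,
    so that the caller can log them before exiting.
    """
    current = asyncio.current_task()

    # Phase 1: cancel the long-running core tasks and give them time to finish.
    for task in root_tasks:
        task.cancel()
    if root_tasks:
        await asyncio.wait(root_tasks, timeout=grace)

    # Phase 2: cancel any tasks and sub-tasks that survived phase 1.
    leftovers = [t for t in asyncio.all_tasks()
                 if t is not current and not t.done()]
    for task in leftovers:
        task.cancel()
    if leftovers:
        await asyncio.wait(leftovers, timeout=grace)

    # Anything listed here ignored both cancellations within the grace period.
    return [t for t in asyncio.all_tasks()
            if t is not current and not t.done()]
```

Cancelling in two phases gives the core tasks a chance to wind down their own children in an orderly way before the blanket cancellation sweeps up whatever remains.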

Third, add a check to all tests for any pending tasks left unattended (i.e. scheduled but not executed), which is especially important for the e2e tests.
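One way such a check might work is sketched below. The helper name `leftover_task_names` is hypothetical, and the actual tests in this PR may implement the check differently (for example, as a test fixture).

```python
import asyncio


def leftover_task_names(coro_fn):
    """Run coro_fn() to completion and list the tasks it left pending.

    Hypothetical helper: the names are collected before asyncio.run()
    performs its own cleanup and cancels whatever is still running.
    """
    async def runner():
        me = asyncio.current_task()
        await coro_fn()
        return sorted(
            t.get_name()
            for t in asyncio.all_tasks()
            if t is not me and not t.done()
        )

    return asyncio.run(runner())
```

A test could then assert that the returned list is empty and fail with the offending task names otherwise.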

There are no behaviour changes other than better termination.

Types of Changes

Bug fix (non-breaking change which fixes an issue)
Refactor/improvements

The shielding was introduced in e0ae2ea for #147 for general stability of tests and for graceful termination of operators. However, the shielding was only done for one operation in the `finally:` block (the scheduler closing), and that shield could itself be cancelled.

When the watcher was double-cancelled (with the second cancellation hitting somewhere in the `finally:` block), it left several tasks running in the background:

* The cancellation task itself (implicitly converted to a future/task by `asyncio.shield()`).
* The scheduler's internal job closing tasks.
* The scheduler's task for processing the job failures (implementation details).

At that time, the operators/tests were already going forward and were not going to wait for these tasks to finish. This led to system resource (asyncio tasks) leakage in some tests; it was also probable in operators (though they do not cancel the watchers often).

With this change, the anti-cancellation shield is reinforced to:

1. Shield the streams depletion (limited by an "exit timeout") too; previously, it was ignored and remained cancellable without getting to the scheduler closing.
2. Wait until the real completion of both the streams depletion and the scheduler closure; previously, the wait could be cancelled.

This ensures that under no circumstances is any worker left unattended after the watcher exits: all workers are now guaranteed to finish or be cancelled as part of the watcher's cancellation/finalisation.
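The idea of a shield that survives repeated cancellations can be sketched as follows. This is an illustration under stated assumptions, not the code from #628: `finish_despite_cancellation` and `watcher` are hypothetical names, and the `cleanup` coroutine stands in for the streams depletion and the scheduler closure.

```python
import asyncio


async def finish_despite_cancellation(coro):
    """Await coro to its real completion, even if cancelled while waiting.

    Hypothetical helper: cancellations of the caller are remembered and
    re-raised only after the inner task has actually finished.
    """
    task = asyncio.ensure_future(coro)
    cancelled = False
    while not task.done():
        try:
            # shield() keeps the inner task running when the outer await is cancelled.
            await asyncio.shield(task)
        except asyncio.CancelledError:
            # Remember the cancellation, but keep waiting for the inner task.
            cancelled = True
    if cancelled:
        raise asyncio.CancelledError()
    return task.result()


async def watcher(cleanup):
    """Stand-in for a watcher: runs until cancelled, then always cleans up."""
    try:
        await asyncio.sleep(3600)  # placeholder for the real watch loop
    finally:
        # A second cancellation arriving here cannot abandon the cleanup.
        await finish_despite_cancellation(cleanup())
```

A single `await asyncio.shield(...)` protects the inner task but not the wait itself; re-entering the wait in a loop is what prevents the inner task from being left running in the background after the caller has moved on.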

kopf-archiver bot added the archive label Aug 18, 2020

kopf-archiver bot closed this as completed Aug 18, 2020

kopf-archiver bot mentioned this issue Aug 19, 2020

[PR] Die properly: fast and with no remnants #143

Closed

2 tasks

kopf-archiver bot changed the title ~~[archival placeholder]~~ [PR] Gracefully terminate an operator on SIGINT/SIGTERM Aug 19, 2020

kopf-archiver bot added the enhancement New feature or request label Aug 19, 2020

This was referenced Aug 19, 2020

[PR] Fix the stop-flag setting: no arguments are expected #151

Closed

[PR] Refactor runtime to full async/await; make officially embeddable #156

Closed

nolar mentioned this issue Jan 2, 2021

Reinforce the watchers's shields from cancellations #628

Merged
