Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PR] Gracefully terminate an operator on SIGINT/SIGTERM #147

Closed
kopf-archiver bot opened this issue Aug 18, 2020 · 0 comments
Closed

[PR] Gracefully terminate an operator on SIGINT/SIGTERM #147

kopf-archiver bot opened this issue Aug 18, 2020 · 0 comments
Labels
archive enhancement New feature or request

Comments

@kopf-archiver
Copy link

kopf-archiver bot commented Aug 18, 2020

A pull request by nolar at 2019-07-14 17:39:46+00:00
Original URL: zalando-incubator/kopf#147
Merged by nolar at 2019-07-15 15:42:22+00:00

Part of an effort to make Kopf-based operators more stable and resilient.

Issue: #142, partially replaces #143.

Description

First of all, react to SIGINT/SIGTERM at all. Previously, the process was terminated on the OS level, but not via the intended reaction.

Second, when exiting for any reason — e.g. SIGINT/SIGTERM, or explicit cancellation, or any exception — do this as gracefully as possible. For this, first cancel the core ever-running tasks of the framework; give them some time to finish; then cancel all running tasks & sub-tasks if left; give it some time again; and only then actually exit.

Third, add a check for all tests on whether any pending tasks are left unattended (i.e. scheduled but not executed) — which is especially important for the e2e tests.

There are no behaviour changes except of a better termination.

Types of Changes

  • Bug fix (non-breaking change which fixes an issue)
  • Refactor/improvements
@kopf-archiver kopf-archiver bot closed this as completed Aug 18, 2020
@kopf-archiver kopf-archiver bot changed the title [archival placeholder] [PR] Gracefully terminate an operator on SIGINT/SIGTERM Aug 19, 2020
@kopf-archiver kopf-archiver bot added the enhancement New feature or request label Aug 19, 2020
nolar added a commit that referenced this issue Jan 2, 2021
The shielding has been introduced in e0ae2ea for #147 for general stability of tests and for graceful termination of operators.

However, the shielding was only done for one operation in the `finally:` block — the scheduler closing — and that shield could itself be cancelled.

When the watcher was double-cancelled (with the second cancellation hitting somewhere in the `finally:` block), it left an several tasks running in the background:

* The cancellation task itself (implicitly converted to a future/task by `asyncio.shield()`).
* The scheduler's internal job closing tasks.
* The scheduler's task for processing the job failures (implementation details).

At that time, the operators/tests were already going forward and were not going to wait for these tasks to finish. This led to system resource (asyncio tasks) leakage in some tests; it was also probable in operators (though, they do not cancel the watchers often).

With this change, the anti-cancellation shield is reinforced to:

1. Shield the streams depletion (limited by an "exit timeout") too; previously, it was ignored and remained cancellable without getting to the scheduler closing.
2. Wait until the real completion of both the streams depletion and the scheduler closure; previously, the wait could be cancelled.

This ensures that under no circumstances any worker is left unattended after the watcher exits — all workers are now guaranteed to finish or be cancelled as part of the watcher's cancellation/finalisation.
nolar added a commit that referenced this issue Jan 2, 2021
The shielding has been introduced in e0ae2ea for #147 for general stability of tests and for graceful termination of operators.

However, the shielding was only done for one operation in the `finally:` block — the scheduler closing — and that shield could itself be cancelled.

When the watcher was double-cancelled (with the second cancellation hitting somewhere in the `finally:` block), it left several tasks running in the background:

* The cancellation task itself (implicitly converted to a future/task by `asyncio.shield()`).
* The scheduler's internal job closing tasks.
* The scheduler's task for processing the job failures (implementation details).

At that time, the operators/tests were already going forward and not going to wait for these tasks to finish. This led to system resource (asyncio tasks) leakage in some tests; it was also probable in operators (though, they do not cancel the watchers often).

With this change, the anti-cancellation shield is reinforced to:

1. Shield the streams depletion (limited by an "exit timeout") too; previously, it was ignored and remained cancellable without getting to the scheduler closing.
2. Wait until the real completion of both the streams depletion and the scheduler closure; previously, the wait could be cancelled.

This ensures that under no circumstances any worker is left unattended after the watcher exits — all workers are now guaranteed to finish or be cancelled as part of the watcher's cancellation/finalisation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
archive enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

0 participants