-
-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PR] Gracefully terminate an operator on SIGINT/SIGTERM #147
Labels
Comments
2 tasks
kopf-archiver
bot
changed the title
[archival placeholder]
[PR] Gracefully terminate an operator on SIGINT/SIGTERM
Aug 19, 2020
This was referenced Aug 19, 2020
nolar
added a commit
that referenced
this issue
Jan 2, 2021
The shielding has been introduced in e0ae2ea for #147 for general stability of tests and for graceful termination of operators. However, the shielding was only done for one operation in the `finally:` block — the scheduler closing — and that shield could itself be cancelled. When the watcher was double-cancelled (with the second cancellation hitting somewhere in the `finally:` block), it left an several tasks running in the background: * The cancellation task itself (implicitly converted to a future/task by `asyncio.shield()`). * The scheduler's internal job closing tasks. * The scheduler's task for processing the job failures (implementation details). At that time, the operators/tests were already going forward and were not going to wait for these tasks to finish. This led to system resource (asyncio tasks) leakage in some tests; it was also probable in operators (though, they do not cancel the watchers often). With this change, the anti-cancellation shield is reinforced to: 1. Shield the streams depletion (limited by an "exit timeout") too; previously, it was ignored and remained cancellable without getting to the scheduler closing. 2. Wait until the real completion of both the streams depletion and the scheduler closure; previously, the wait could be cancelled. This ensures that under no circumstances any worker is left unattended after the watcher exits — all workers are now guaranteed to finish or be cancelled as part of the watcher's cancellation/finalisation.
nolar
added a commit
that referenced
this issue
Jan 2, 2021
The shielding has been introduced in e0ae2ea for #147 for general stability of tests and for graceful termination of operators. However, the shielding was only done for one operation in the `finally:` block — the scheduler closing — and that shield could itself be cancelled. When the watcher was double-cancelled (with the second cancellation hitting somewhere in the `finally:` block), it left several tasks running in the background: * The cancellation task itself (implicitly converted to a future/task by `asyncio.shield()`). * The scheduler's internal job closing tasks. * The scheduler's task for processing the job failures (implementation details). At that time, the operators/tests were already going forward and not going to wait for these tasks to finish. This led to system resource (asyncio tasks) leakage in some tests; it was also probable in operators (though, they do not cancel the watchers often). With this change, the anti-cancellation shield is reinforced to: 1. Shield the streams depletion (limited by an "exit timeout") too; previously, it was ignored and remained cancellable without getting to the scheduler closing. 2. Wait until the real completion of both the streams depletion and the scheduler closure; previously, the wait could be cancelled. This ensures that under no circumstances any worker is left unattended after the watcher exits — all workers are now guaranteed to finish or be cancelled as part of the watcher's cancellation/finalisation.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Part of an effort to make Kopf-based operators more stable and resilient.
Description
First of all, react to SIGINT/SIGTERM at all. Previously, the process was terminated on the OS level, but not via the intended reaction.
Second, when exiting for any reason — e.g. SIGINT/SIGTERM, or explicit cancellation, or any exception — do this as gracefully as possible. For this, first cancel the core ever-running tasks of the framework; give them some time to finish; then cancel all running tasks & sub-tasks if left; give it some time again; and only then actually exit.
Third, add a check for all tests on whether any pending tasks are left unattended (i.e. scheduled but not executed) — which is especially important for the e2e tests.
There are no behaviour changes except of a better termination.
Types of Changes
The text was updated successfully, but these errors were encountered: