[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150
- Adds the configuration kyuubi.session.engine.shutdown.watchdog.timeout to manage the maximum wait time for engine shutdown.
- Updates SparkSQLEngine to utilize the new watchdog feature.

Why are the changes needed?
Currently, there are scenarios where the engine should exit but fails to do so, and the possible causes cannot be exhaustively enumerated. For example, see this discussion: #6992 (reply in thread), and these issues: #4280, #7019.
Similarly, we encountered this issue in production. For example, in the following log, after SparkContext stopped, the entire process should have executed the shutdown hook and exited. However, due to an abnormal Ranger thread, the process was blocked for over ten days until it eventually exhausted the ECS resources and was finally discovered.
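To make the proposed mechanism concrete (this is neither the production log referred to above nor the code in this PR), here is a minimal Java sketch of the watchdog idea: a daemon thread waits up to a timeout for shutdown to complete and, if it does not, prints a thread dump and runs a forced-termination action. The class name `ShutdownWatchdogSketch`, its methods, and the injectable `onTimeout` action are all hypothetical names chosen for illustration; only the JDK calls (`ThreadMXBean.dumpAllThreads`, `CountDownLatch.await`, `Runtime.halt`) are real APIs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the implementation from this PR.
public class ShutdownWatchdogSketch {
    private final long timeoutMillis;
    private final Runnable onTimeout;
    private final CountDownLatch finished = new CountDownLatch(1);

    public ShutdownWatchdogSketch(long timeoutMillis, Runnable onTimeout) {
        this.timeoutMillis = timeoutMillis;
        this.onTimeout = onTimeout;
    }

    // Start the watchdog when shutdown begins. The thread is a daemon so the
    // watchdog itself can never keep the JVM alive.
    public Thread start() {
        Thread t = new Thread(() -> {
            try {
                // await returns false if the timeout elapsed before countDown()
                if (!finished.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    System.err.println(dumpThreads()); // record what is still blocking
                    onTimeout.run();                   // e.g. force the JVM down
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "shutdown-watchdog-sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Call once normal shutdown has finished, so the watchdog stands down.
    public void markShutdownComplete() {
        finished.countDown();
    }

    // Collect a dump of all live threads via the standard management bean.
    static String dumpThreads() {
        StringBuilder sb = new StringBuilder();
        ThreadInfo[] infos =
            ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            sb.append(info.toString());
        }
        return sb.toString();
    }
}
```

In a real engine the timeout action would plausibly be something like `() -> Runtime.getRuntime().halt(1)`. `Runtime.halt` terminates the JVM without running shutdown hooks, which matters here because the hang described above occurs while the shutdown sequence is already in progress, and calling `System.exit` at that point would itself block.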
How was this patch tested?
For the ThreadDumpUtils utility class, I added unit tests; for the overall process, I conducted an end-to-end test:
Was this patch authored or co-authored using generative AI tooling?
No