[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150

wangzhigang1999 · 2025-07-22T05:52:51Z

Introduced a new configuration property kyuubi.session.engine.shutdown.watchdog.timeout to manage the maximum wait time for engine shutdown.
Updated the SparkSQLEngine to utilize the new watchdog feature.
Minor adjustments to existing configurations and documentation to reflect the new feature.

Why are the changes needed?

Currently, there are scenarios where the engine should exit but fails to do so due to various reasons, and these scenarios cannot be exhaustively enumerated. For example, see this discussion: #6992 (reply in thread), and these issues: #4280, #7019.

Similarly, we encountered this issue in production. For example, in the following log, after SparkContext stopped, the entire process should have executed the shutdown hook and exited. However, due to an abnormal Ranger thread, the process was blocked for over ten days until it eventually exhausted the ECS resources and was finally discovered.

How was this patch tested?

For the ThreadDumpUtils utility class, I added unit tests; for the overall process, I conducted an end-to-end test:

Add a thread that never exits to the source code.

Keeping our configuration unchanged, first start an engine, then use kyuubi-ctl in the command line to delete it. Below is the complete log, showing that after one minute, the engine was successfully terminated.

25/07/21 20:20:55 WARN [main-EventThread] ZookeeperDiscoveryClient: This Kyuubi instance master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:32377 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog activated. Engine will be forcefully terminated if graceful shutdown exceeds 1 minute(s).
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
25/07/21 20:20:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: backgroundOperationsLoop exiting
25/07/21 20:20:55 INFO [main-EventThread] ZooKeeper: Session: 0x10000011bf90006 closed
25/07/21 20:20:55 INFO [main-EventThread] EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service[SparkSQLBackendService] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service[SparkSQLEngine] is stopped.
25/07/21 20:20:55 WARN [main] SparkSQLEngine: Shutdown Watchdog is already running, ignoring duplicate start request
25/07/21 20:20:55 INFO [main] SparkContext: SparkContext is stopping with exitCode 0.
25/07/21 20:20:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] AbstractConnector: Stopped Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
25/07/21 20:20:55 ERROR [main-EventThread] PersistentNode: Could not delete node after close
java.lang.IllegalStateException: instance must be started before calling this method
	at org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
25/07/21 20:20:55 INFO [main-EventThread] ClientCnxn: EventThread shut down for session: 0x10000011bf90006
25/07/21 20:20:55 INFO [main] SparkUI: Stopped Spark web UI at http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:46861
25/07/21 20:20:55 INFO [YARN application state monitor] YarnClientSchedulerBackend: Interrupting monitor thread
25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: Shutting down all executors
25/07/21 20:20:55 INFO [dispatcher-CoarseGrainedScheduler] YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: YARN client scheduler backend Stopped
25/07/21 20:20:55 INFO [main] FsStats: cmd=upload, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress, dst=null, size=4259382, parameter=uuid:45f6a812-52a0-411e-985c-611811c8d5b5, time-in-ms=36, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=4, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] FsStats: cmd=rename, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress, dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] FsStats: cmd=setTime, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [dispatcher-event-loop-2] MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/07/21 20:20:55 INFO [main] MemoryStore: MemoryStore cleared
25/07/21 20:20:55 INFO [main] BlockManager: BlockManager stopped
25/07/21 20:20:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped
25/07/21 20:20:55 INFO [dispatcher-event-loop-2] OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/07/21 20:20:55 INFO [main] SparkContext: Successfully stopped SparkContext
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: EMERGENCY SHUTDOWN TRIGGERED
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Graceful shutdown exceeded 1 minute(s) timeout
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Non-daemon threads are preventing JVM exit
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Initiating forced termination...
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === THREAD DUMP FOR DIAGNOSTIC ===
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: 
================== Thread Dump Start ==================
Timestamp: 2025-07-21 20:21:55
Total threads: 20

Thread Statistics:
  Non-daemon threads:   2
  Daemon threads:      18

Threads by state:
  RUNNABLE       :   5
  TIMED_WAITING  :   9
  WAITING        :   6

==================== Non-Daemon Threads ====================
(These threads prevent JVM from exiting)

Thread: "DestroyJavaVM" #229 
   State: RUNNABLE

Thread: "forever-non-daemon-thread" #88 
   State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$1.run(SparkSQLEngine.scala:104)
        at java.lang.Thread.run(Thread.java:750)

====================== Daemon Threads ======================

Thread: "BTrace Command Queue Processor" #7 daemon
   State: WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@33d0d53>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "Finalizer" #3 daemon
   State: WAITING
   Waiting on: <java.lang.ref.ReferenceQueue$Lock@568ebe39>
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:188)

Thread: "OkHttp ConnectionPool" #79 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@461a0a75>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "Okio Watchdog" #78 daemon
   State: TIMED_WAITING
   Waiting on: <java.lang.Class@5586b369>
        at java.lang.Object.wait(Native Method)
        at com.aliyun.datalake.external.okio.AsyncTimeout.awaitTimeout(AsyncTimeout.java:347)
        at com.aliyun.datalake.external.okio.AsyncTimeout$Watchdog.run(AsyncTimeout.java:312)

Thread: "Reference Handler" #2 daemon
   State: WAITING
   Waiting on: <java.lang.ref.Reference$Lock@6ae02e16>
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

Thread: "Signal Dispatcher" #4 daemon
   State: RUNNABLE

Thread: "Thread-0" #5 daemon
   State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at com.sun.btrace.BTraceRuntime$2.idle(BTraceRuntime.java:402)
        at com.sun.btrace.org.jctools.queues.BaseMpscLinkedArrayQueue.drain(BaseMpscLinkedArrayQueue.java:670)
        at com.sun.btrace.BTraceRuntime$4.run(BTraceRuntime.java:442)
        at java.lang.Thread.run(Thread.java:750)

Thread: "Thread-17" #54 daemon
   State: RUNNABLE

Thread: "Thread-19" #57 daemon
   State: RUNNABLE

Thread: "cached-current-time-updater" #56 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2313e826>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "dlf-token-refresher-0" #76 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6574acf4>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "idle-connection-evictor-1" #66 daemon
   State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:342)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
        at com.aliyun.platform.org.apache.hc.core5.util.TimeValue.sleep(TimeValue.java:383)
        at com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor.lambda$new$0(IdleConnectionEvictor.java:60)
        at com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor$$Lambda$1199/196954001.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:750)

Thread: "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #55 daemon
   State: WAITING
   Waiting on: <java.lang.ref.ReferenceQueue$Lock@43ee1ad6>
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
        at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762)
        at java.lang.Thread.run(Thread.java:750)

Thread: "pool-29-thread-1" #68 daemon
   State: WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "pool-30-thread-1" #69 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "process reaper" #224 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@78f9176a>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "shutdown-watchdog" #214 daemon
   State: RUNNABLE
        at sun.management.ThreadImpl.dumpThreads0(Native Method)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:503)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:491)
        at org.apache.kyuubi.util.ThreadDumpUtils$.getAllExtendedThreadInfo(ThreadDumpUtils.scala:91)
        at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToWriter(ThreadDumpUtils.scala:158)
        at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToString(ThreadDumpUtils.scala:119)
        at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToLogger(ThreadDumpUtils.scala:139)
        at org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$2.run(SparkSQLEngine.scala:281)

Thread: "yarn-scheduler-endpoint" #213 daemon
   State: WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4542b55f>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

======================== Summary ========================
Thread ID | Type | State           | Name
----------|------|-----------------|---------------------------
       2 |    D | WAITING         | Reference Handler
       3 |    D | WAITING         | Finalizer
       4 |    D | RUNNABLE        | Signal Dispatcher
       5 |    D | TIMED_WAITING   | Thread-0
       7 |    D | WAITING         | BTrace Command Queue Processor
      54 |    D | RUNNABLE        | Thread-17
      55 |    D | WAITING         | org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner
      56 |    D | TIMED_WAITING   | cached-current-time-updater
      57 |    D | RUNNABLE        | Thread-19
      66 |    D | TIMED_WAITING   | idle-connection-evictor-1
      68 |    D | WAITING         | pool-29-thread-1
      69 |    D | TIMED_WAITING   | pool-30-thread-1
      76 |    D | TIMED_WAITING   | dlf-token-refresher-0
      78 |    D | TIMED_WAITING   | Okio Watchdog
      79 |    D | TIMED_WAITING   | OkHttp ConnectionPool
      88 |    U | TIMED_WAITING   | forever-non-daemon-thread
     213 |    D | WAITING         | yarn-scheduler-endpoint
     214 |    D | RUNNABLE        | shutdown-watchdog
     224 |    D | TIMED_WAITING   | process reaper
     229 |    U | RUNNABLE        | DestroyJavaVM
================== Thread Dump End ==================

25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === END OF THREAD DUMP ===
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Forcefully terminating JVM now...
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@728535c6{/kyuubi,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@bd09a26{/kyuubi/json,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@3065efd0{/kyuubi/session,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@61aa6300{/kyuubi/session/json,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called
25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-a0697749-d2d8-4cce-83ac-ddcff8d66323
25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-8931f08a-f571-45d8-8657-c541d130cb17

Set kyuubi.session.engine.shutdown.watchdog.timeout to 0, meaning disabling this feature, then repeat the steps above, and the engine will hang indefinitely without ending. It can be seen that the engine got stuck for about 7 minutes and was then manually killed.

25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext
25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 15: TERM

25/07/21 20:34:45 WARN [main-EventThread] ZookeeperDiscoveryClient: This Kyuubi instance master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:34369 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes.
25/07/21 20:34:45 INFO [main-EventThread] EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkTBinaryFrontendService: Received request of closing SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416]
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkSQLSessionManager: zhigang's SparkSessionImpl with SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416] is closed, current opening sessions 0
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkSessionImpl: sessionId=6a4e8d11-f5d3-4607-83bf-04644fc1e416, sessionRunTime=4.0 min, sessionCpuTime=37 s
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkTBinaryFrontendService: Finished closing SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416]
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog is disabled (timeout <= 0).
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
25/07/21 20:34:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: backgroundOperationsLoop exiting
25/07/21 20:34:55 INFO [main-EventThread] ZooKeeper: Session: 0x10000011bf9000d closed
25/07/21 20:34:55 INFO [main-EventThread] EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service[SparkSQLBackendService] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service[SparkSQLEngine] is stopped.
25/07/21 20:34:55 INFO [main] SparkSQLEngine: Shutdown Watchdog is disabled (timeout <= 0).
25/07/21 20:34:55 INFO [main] SparkContext: SparkContext is stopping with exitCode 0.
25/07/21 20:34:55 INFO [main] AbstractConnector: Stopped Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
25/07/21 20:34:55 INFO [main] SparkUI: Stopped Spark web UI at http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:33101
25/07/21 20:34:55 INFO [YARN application state monitor] YarnClientSchedulerBackend: Interrupting monitor thread
25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: Shutting down all executors
25/07/21 20:34:55 INFO [dispatcher-CoarseGrainedScheduler] YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
25/07/21 20:34:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=18, version=6.8.2-nextarch
25/07/21 20:34:55 ERROR [main-EventThread] PersistentNode: Could not delete node after close
java.lang.IllegalStateException: instance must be started before calling this method
	at org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
25/07/21 20:34:55 INFO [main-EventThread] ClientCnxn: EventThread shut down for session: 0x10000011bf9000d
25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: YARN client scheduler backend Stopped
25/07/21 20:34:55 INFO [main] FsStats: cmd=upload, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress, dst=null, size=4304759, parameter=uuid:bb87de74-c32e-4886-a8f1-c432052ade6c, time-in-ms=42, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [main] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=23, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [main] FsStats: cmd=rename, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress, dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, size=0, parameter=null, time-in-ms=10, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [main] FsStats: cmd=setTime, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [dispatcher-event-loop-1] MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/07/21 20:34:55 INFO [main] MemoryStore: MemoryStore cleared
25/07/21 20:34:55 INFO [main] BlockManager: BlockManager stopped
25/07/21 20:34:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped
25/07/21 20:34:55 INFO [dispatcher-event-loop-1] OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext
25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 15: TERM
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@766db6f9{/kyuubi,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@244f356{/kyuubi/json,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@495f3965{/kyuubi/session,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@7aaf6bfd{/kyuubi/session/json,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called
25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-3b2e31df-0f57-4b8b-bdd5-b9218aec09ff
25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-4adc1b87-4c50-4b77-9dfe-c40d055f5c8c

Was this patch authored or co-authored using generative AI tooling?

No

- Introduced a new configuration property `kyuubi.session.engine.shutdown.watchdog.timeout` to manage the maximum wait time for engine shutdown. - Updated the `SparkSQLEngine` to utilize the new watchdog feature. - Minor adjustments to existing configurations and documentation to reflect the new feature.

wangzhigang1999 · 2025-07-22T12:36:02Z

#7149

wForget · 2025-08-08T03:51:20Z

Thanks for your work, should we introduce this feature as a plugin? Like in kyuubi-spark-jvm-quake, and cc @pan3793

turboFei · 2025-08-08T05:29:17Z

@wangzhigang1999 could you check #7167?

Does it help for your use case?

wangzhigang1999 · 2025-08-08T05:55:08Z

@wangzhigang1999 could you check #7167?

Does it help for your use case?

The issue I’m actually facing is that the Ranger component starts a non-daemon thread, which does not exit when the engine shuts down, causing the entire process to hang and not terminate.

From what I see in this PR, while it ensures that the SessionManager will proactively stop itself once the SparkContext is stopped, it cannot resolve the situation where a non-daemon thread (like the one started by Ranger) is still alive. In such cases, even if the SessionManager stops, the JVM process will not exit.

This is why I introduced #7150. Once the stop signal is triggered, I will start a shutdown watchdog, which will forcefully terminate the entire process after the timeout.

github-actions bot added kind:documentation Documentation is a feature! module:spark module:common labels Jul 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150

[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150

Uh oh!

wangzhigang1999 commented Jul 22, 2025

Uh oh!

wangzhigang1999 commented Jul 22, 2025

Uh oh!

wForget commented Aug 8, 2025

Uh oh!

turboFei commented Aug 8, 2025

Uh oh!

wangzhigang1999 commented Aug 8, 2025

Uh oh!

Uh oh!

[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150

Are you sure you want to change the base?

[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150

Uh oh!

Conversation

wangzhigang1999 commented Jul 22, 2025

Why are the changes needed?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

wangzhigang1999 commented Jul 22, 2025

Uh oh!

wForget commented Aug 8, 2025

Uh oh!

turboFei commented Aug 8, 2025

Uh oh!

wangzhigang1999 commented Aug 8, 2025

Uh oh!

Uh oh!