Skip to content

[KYUUBI #7149] feat: Add shutdown watchdog to forcefully terminate the spark engine and prevent resource leaks. #7150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wangzhigang1999
Copy link
Contributor

  • Introduced a new configuration property kyuubi.session.engine.shutdown.watchdog.timeout to manage the maximum wait time for engine shutdown.
  • Updated the SparkSQLEngine to utilize the new watchdog feature.
  • Minor adjustments to existing configurations and documentation to reflect the new feature.

Why are the changes needed?

Currently, there are scenarios where the engine should exit but fails to do so due to various reasons, and these scenarios cannot be exhaustively enumerated. For example, see this discussion: #6992 (reply in thread), and these issues: #4280, #7019.

Similarly, we encountered this issue in production. For example, in the following log, after SparkContext stopped, the entire process should have executed the shutdown hook and exited. However, due to an abnormal Ranger thread, the process was blocked for over ten days until it eventually exhausted the ECS resources and was finally discovered.

image

How was this patch tested?

For the ThreadDumpUtils utility class, I added unit tests; for the overall process, I conducted an end-to-end test:

  1. Add a thread that never exits to the source code.
image
  1. Keeping our configuration unchanged, first start an engine, then use kyuubi-ctl in the command line to delete it. Below is the complete log, showing that after one minute, the engine was successfully terminated.
25/07/21 20:20:55 WARN [main-EventThread] ZookeeperDiscoveryClient: This Kyuubi instance master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:32377 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog activated. Engine will be forcefully terminated if graceful shutdown exceeds 1 minute(s).
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
25/07/21 20:20:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: backgroundOperationsLoop exiting
25/07/21 20:20:55 INFO [main-EventThread] ZooKeeper: Session: 0x10000011bf90006 closed
25/07/21 20:20:55 INFO [main-EventThread] EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service[SparkSQLBackendService] is stopped.
25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service[SparkSQLEngine] is stopped.
25/07/21 20:20:55 WARN [main] SparkSQLEngine: Shutdown Watchdog is already running, ignoring duplicate start request
25/07/21 20:20:55 INFO [main] SparkContext: SparkContext is stopping with exitCode 0.
25/07/21 20:20:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] AbstractConnector: Stopped Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
25/07/21 20:20:55 ERROR [main-EventThread] PersistentNode: Could not delete node after close
java.lang.IllegalStateException: instance must be started before calling this method
	at org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
25/07/21 20:20:55 INFO [main-EventThread] ClientCnxn: EventThread shut down for session: 0x10000011bf90006
25/07/21 20:20:55 INFO [main] SparkUI: Stopped Spark web UI at http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:46861
25/07/21 20:20:55 INFO [YARN application state monitor] YarnClientSchedulerBackend: Interrupting monitor thread
25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: Shutting down all executors
25/07/21 20:20:55 INFO [dispatcher-CoarseGrainedScheduler] YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: YARN client scheduler backend Stopped
25/07/21 20:20:55 INFO [main] FsStats: cmd=upload, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress, dst=null, size=4259382, parameter=uuid:45f6a812-52a0-411e-985c-611811c8d5b5, time-in-ms=36, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=4, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] FsStats: cmd=rename, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress, dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [main] FsStats: cmd=setTime, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch
25/07/21 20:20:55 INFO [dispatcher-event-loop-2] MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/07/21 20:20:55 INFO [main] MemoryStore: MemoryStore cleared
25/07/21 20:20:55 INFO [main] BlockManager: BlockManager stopped
25/07/21 20:20:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped
25/07/21 20:20:55 INFO [dispatcher-event-loop-2] OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/07/21 20:20:55 INFO [main] SparkContext: Successfully stopped SparkContext
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: EMERGENCY SHUTDOWN TRIGGERED
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Graceful shutdown exceeded 1 minute(s) timeout
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Non-daemon threads are preventing JVM exit
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Initiating forced termination...
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === THREAD DUMP FOR DIAGNOSTIC ===
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: 
================== Thread Dump Start ==================
Timestamp: 2025-07-21 20:21:55
Total threads: 20

Thread Statistics:
  Non-daemon threads:   2
  Daemon threads:      18

Threads by state:
  RUNNABLE       :   5
  TIMED_WAITING  :   9
  WAITING        :   6

==================== Non-Daemon Threads ====================
(These threads prevent JVM from exiting)

Thread: "DestroyJavaVM" #229 
   State: RUNNABLE

Thread: "forever-non-daemon-thread" #88 
   State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$1.run(SparkSQLEngine.scala:104)
        at java.lang.Thread.run(Thread.java:750)

====================== Daemon Threads ======================

Thread: "BTrace Command Queue Processor" #7 daemon
   State: WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@33d0d53>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "Finalizer" #3 daemon
   State: WAITING
   Waiting on: <java.lang.ref.ReferenceQueue$Lock@568ebe39>
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:188)

Thread: "OkHttp ConnectionPool" #79 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@461a0a75>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "Okio Watchdog" #78 daemon
   State: TIMED_WAITING
   Waiting on: <java.lang.Class@5586b369>
        at java.lang.Object.wait(Native Method)
        at com.aliyun.datalake.external.okio.AsyncTimeout.awaitTimeout(AsyncTimeout.java:347)
        at com.aliyun.datalake.external.okio.AsyncTimeout$Watchdog.run(AsyncTimeout.java:312)

Thread: "Reference Handler" #2 daemon
   State: WAITING
   Waiting on: <java.lang.ref.Reference$Lock@6ae02e16>
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

Thread: "Signal Dispatcher" #4 daemon
   State: RUNNABLE

Thread: "Thread-0" #5 daemon
   State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at com.sun.btrace.BTraceRuntime$2.idle(BTraceRuntime.java:402)
        at com.sun.btrace.org.jctools.queues.BaseMpscLinkedArrayQueue.drain(BaseMpscLinkedArrayQueue.java:670)
        at com.sun.btrace.BTraceRuntime$4.run(BTraceRuntime.java:442)
        at java.lang.Thread.run(Thread.java:750)

Thread: "Thread-17" #54 daemon
   State: RUNNABLE

Thread: "Thread-19" #57 daemon
   State: RUNNABLE

Thread: "cached-current-time-updater" #56 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2313e826>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "dlf-token-refresher-0" #76 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6574acf4>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "idle-connection-evictor-1" #66 daemon
   State: TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:342)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
        at com.aliyun.platform.org.apache.hc.core5.util.TimeValue.sleep(TimeValue.java:383)
        at com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor.lambda$new$0(IdleConnectionEvictor.java:60)
        at com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor$$Lambda$1199/196954001.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:750)

Thread: "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #55 daemon
   State: WAITING
   Waiting on: <java.lang.ref.ReferenceQueue$Lock@43ee1ad6>
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
        at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762)
        at java.lang.Thread.run(Thread.java:750)

Thread: "pool-29-thread-1" #68 daemon
   State: WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "pool-30-thread-1" #69 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "process reaper" #224 daemon
   State: TIMED_WAITING
   Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@78f9176a>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Thread: "shutdown-watchdog" #214 daemon
   State: RUNNABLE
        at sun.management.ThreadImpl.dumpThreads0(Native Method)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:503)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:491)
        at org.apache.kyuubi.util.ThreadDumpUtils$.getAllExtendedThreadInfo(ThreadDumpUtils.scala:91)
        at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToWriter(ThreadDumpUtils.scala:158)
        at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToString(ThreadDumpUtils.scala:119)
        at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToLogger(ThreadDumpUtils.scala:139)
        at org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$2.run(SparkSQLEngine.scala:281)

Thread: "yarn-scheduler-endpoint" #213 daemon
   State: WAITING
   Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4542b55f>
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

======================== Summary ========================
Thread ID | Type | State           | Name
----------|------|-----------------|---------------------------
       2 |    D | WAITING         | Reference Handler
       3 |    D | WAITING         | Finalizer
       4 |    D | RUNNABLE        | Signal Dispatcher
       5 |    D | TIMED_WAITING   | Thread-0
       7 |    D | WAITING         | BTrace Command Queue Processor
      54 |    D | RUNNABLE        | Thread-17
      55 |    D | WAITING         | org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner
      56 |    D | TIMED_WAITING   | cached-current-time-updater
      57 |    D | RUNNABLE        | Thread-19
      66 |    D | TIMED_WAITING   | idle-connection-evictor-1
      68 |    D | WAITING         | pool-29-thread-1
      69 |    D | TIMED_WAITING   | pool-30-thread-1
      76 |    D | TIMED_WAITING   | dlf-token-refresher-0
      78 |    D | TIMED_WAITING   | Okio Watchdog
      79 |    D | TIMED_WAITING   | OkHttp ConnectionPool
      88 |    U | TIMED_WAITING   | forever-non-daemon-thread
     213 |    D | WAITING         | yarn-scheduler-endpoint
     214 |    D | RUNNABLE        | shutdown-watchdog
     224 |    D | TIMED_WAITING   | process reaper
     229 |    U | RUNNABLE        | DestroyJavaVM
================== Thread Dump End ==================

25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === END OF THREAD DUMP ===
25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Forcefully terminating JVM now...
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@728535c6{/kyuubi,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@bd09a26{/kyuubi/json,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@3065efd0{/kyuubi/session,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@61aa6300{/kyuubi/session/json,null,STOPPED,@Spark}
25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called
25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-a0697749-d2d8-4cce-83ac-ddcff8d66323
25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-8931f08a-f571-45d8-8657-c541d130cb17
  1. Set kyuubi.session.engine.shutdown.watchdog.timeout to 0, meaning disabling this feature, then repeat the steps above, and the engine will hang indefinitely without ending. It can be seen that the engine got stuck for about 7 minutes and was then manually killed.
25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext
25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 15: TERM
25/07/21 20:34:45 WARN [main-EventThread] ZookeeperDiscoveryClient: This Kyuubi instance master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:34369 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes.
25/07/21 20:34:45 INFO [main-EventThread] EngineServiceDiscovery: 1 connection(s) are active, delay shutdown
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkTBinaryFrontendService: Received request of closing SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416]
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkSQLSessionManager: zhigang's SparkSessionImpl with SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416] is closed, current opening sessions 0
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkSessionImpl: sessionId=6a4e8d11-f5d3-4607-83bf-04644fc1e416, sessionRunTime=4.0 min, sessionCpuTime=37 s
25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkTBinaryFrontendService: Finished closing SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416]
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog is disabled (timeout <= 0).
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
25/07/21 20:34:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: backgroundOperationsLoop exiting
25/07/21 20:34:55 INFO [main-EventThread] ZooKeeper: Session: 0x10000011bf9000d closed
25/07/21 20:34:55 INFO [main-EventThread] EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service[SparkSQLBackendService] is stopped.
25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service[SparkSQLEngine] is stopped.
25/07/21 20:34:55 INFO [main] SparkSQLEngine: Shutdown Watchdog is disabled (timeout <= 0).
25/07/21 20:34:55 INFO [main] SparkContext: SparkContext is stopping with exitCode 0.
25/07/21 20:34:55 INFO [main] AbstractConnector: Stopped Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
25/07/21 20:34:55 INFO [main] SparkUI: Stopped Spark web UI at http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:33101
25/07/21 20:34:55 INFO [YARN application state monitor] YarnClientSchedulerBackend: Interrupting monitor thread
25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: Shutting down all executors
25/07/21 20:34:55 INFO [dispatcher-CoarseGrainedScheduler] YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
25/07/21 20:34:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=18, version=6.8.2-nextarch
25/07/21 20:34:55 ERROR [main-EventThread] PersistentNode: Could not delete node after close
java.lang.IllegalStateException: instance must be started before calling this method
	at org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
	at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
25/07/21 20:34:55 INFO [main-EventThread] ClientCnxn: EventThread shut down for session: 0x10000011bf9000d
25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: YARN client scheduler backend Stopped
25/07/21 20:34:55 INFO [main] FsStats: cmd=upload, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress, dst=null, size=4304759, parameter=uuid:bb87de74-c32e-4886-a8f1-c432052ade6c, time-in-ms=42, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [main] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=23, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [main] FsStats: cmd=rename, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress, dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, size=0, parameter=null, time-in-ms=10, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [main] FsStats: cmd=setTime, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch
25/07/21 20:34:55 INFO [dispatcher-event-loop-1] MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/07/21 20:34:55 INFO [main] MemoryStore: MemoryStore cleared
25/07/21 20:34:55 INFO [main] BlockManager: BlockManager stopped
25/07/21 20:34:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped
25/07/21 20:34:55 INFO [dispatcher-event-loop-1] OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext
25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 15: TERM
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@766db6f9{/kyuubi,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@244f356{/kyuubi/json,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@495f3965{/kyuubi/session,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@7aaf6bfd{/kyuubi/session/json,null,STOPPED,@Spark}
25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called
25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-3b2e31df-0f57-4b8b-bdd5-b9218aec09ff
25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-4adc1b87-4c50-4b77-9dfe-c40d055f5c8c

Was this patch authored or co-authored using generative AI tooling?

No

- Introduced a new configuration property `kyuubi.session.engine.shutdown.watchdog.timeout` to manage the maximum wait time for engine shutdown.
- Updated the `SparkSQLEngine` to utilize the new watchdog feature.
- Minor adjustments to existing configurations and documentation to reflect the new feature.
@wangzhigang1999
Copy link
Contributor Author

#7149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant