Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broker doesn't get killed when the shutdown exceeds brokerShutdownTimeoutMs #15650

Open
lhotari opened this issue May 18, 2022 · 7 comments
Open
Labels
area/broker Stale type/bug The PR fixed a bug or issue reported a bug

Comments

@lhotari
Copy link
Member

lhotari commented May 18, 2022

Describe the bug

Broker should get killed after brokerShutdownTimeoutMs, which is 60 seconds by default.

# Time to wait for broker graceful shutdown. After this time elapses, the process will be killed
brokerShutdownTimeoutMs=60000

I have noticed that sometimes broker pods don't get stopped and pods will have to be manually killed.

To Reproduce

I haven't found a way to reproduce the problem.

Expected behavior

Broker should get killed after brokerShutdownTimeoutMs

Additional context

It looks like the brokerShutdownTimeoutMs feature stopped working as expected when MessagingServiceShutdownHook got removed in #12146 .

It would be useful to print out the thread dump to System.out before killing the process with Runtime.getRuntime().halt(1). This could help investigate why the shutdown is stuck.

System.out.println(ThreadDumpUtil.buildThreadDiagnosticString());
@Technoboy-
Copy link
Contributor

2022-05-19T14:53:48.9444293Z "main" #1 prio=5 os_prio=0 cpu=53475.14ms elapsed=3501.00s tid=0x00007f6ba0024380 nid=0xab5 in Object.wait()  [0x00007f6ba79ba000]
2022-05-19T14:53:48.9444744Z    java.lang.Thread.State: WAITING (on object monitor)
2022-05-19T14:53:48.9445112Z 	at java.lang.Object.wait(java.base@17.0.3/Native Method)
2022-05-19T14:53:48.9445718Z 	- waiting on <no object reference available>
2022-05-19T14:53:48.9446059Z 	at java.lang.Object.wait(java.base@17.0.3/Object.java:338)
2022-05-19T14:53:48.9446564Z 	at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:275)
2022-05-19T14:53:48.9447215Z 	- locked <0x00000000c39ae400> (a io.netty.channel.DefaultChannelPromise)
2022-05-19T14:53:48.9447776Z 	at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
2022-05-19T14:53:48.9459968Z 	at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
2022-05-19T14:53:48.9460707Z 	at org.apache.bookkeeper.proto.PerChannelBookieClient.closeInternal(PerChannelBookieClient.java:1089)
2022-05-19T14:53:48.9461401Z 	at org.apache.bookkeeper.proto.PerChannelBookieClient.close(PerChannelBookieClient.java:1071)
2022-05-19T14:53:48.9462119Z 	at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.close(DefaultPerChannelBookieClientPool.java:157)
2022-05-19T14:53:48.9462799Z 	at org.apache.bookkeeper.proto.BookieClientImpl.close(BookieClientImpl.java:587)
2022-05-19T14:53:48.9463329Z 	at org.apache.bookkeeper.client.BookKeeper.close(BookKeeper.java:1454)
2022-05-19T14:53:48.9463919Z 	at org.apache.pulsar.broker.ManagedLedgerClientFactory.close(ManagedLedgerClientFactory.java:143)
2022-05-19T14:53:48.9464522Z 	at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:465)
2022-05-19T14:53:48.9465045Z 	at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:372)
2022-05-19T14:53:48.9465581Z 	at org.apache.pulsar.io.AbstractPulsarE2ETest.shutdown(AbstractPulsarE2ETest.java:242)
2022-05-19T14:53:48.9466333Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.3/Native Method)
2022-05-19T14:53:48.9466952Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.3/NativeMethodAccessorImpl.java:77)
2022-05-19T14:53:48.9467634Z 	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.3/DelegatingMethodAccessorImpl.java:43)
2022-05-19T14:53:48.9468197Z 	at java.lang.reflect.Method.invoke(java.base@17.0.3/Method.java:568)
2022-05-19T14:53:48.9468730Z 	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
2022-05-19T14:53:48.9469439Z 	at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61)
2022-05-19T14:53:48.9470116Z 	at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366)
2022-05-19T14:53:48.9470704Z 	at org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320)
2022-05-19T14:53:48.9471345Z 	at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701)
2022-05-19T14:53:48.9471915Z 	at org.testng.internal.TestInvoker.runAfterGroupsConfigurations(TestInvoker.java:677)
2022-05-19T14:53:48.9472460Z 	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:661)
2022-05-19T14:53:48.9472963Z 	at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
2022-05-19T14:53:48.9473458Z 	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
2022-05-19T14:53:48.9473974Z 	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
2022-05-19T14:53:48.9474490Z 	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
2022-05-19T14:53:48.9478020Z 	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
2022-05-19T14:53:48.9478585Z 	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
2022-05-19T14:53:48.9479040Z 	at org.testng.TestRunner$$Lambda$167/0x0000000800d9e540.accept(Unknown Source)
2022-05-19T14:53:48.9479468Z 	at java.util.ArrayList.forEach(java.base@17.0.3/ArrayList.java:1511)
2022-05-19T14:53:48.9479867Z 	at org.testng.TestRunner.privateRun(TestRunner.java:764)
2022-05-19T14:53:48.9480253Z 	at org.testng.TestRunner.run(TestRunner.java:585)
2022-05-19T14:53:48.9480642Z 	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
2022-05-19T14:53:48.9481066Z 	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
2022-05-19T14:53:48.9481513Z 	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
2022-05-19T14:53:48.9481907Z 	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
2022-05-19T14:53:48.9482316Z 	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
2022-05-19T14:53:48.9482769Z 	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
2022-05-19T14:53:48.9483209Z 	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
2022-05-19T14:53:48.9483625Z 	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
2022-05-19T14:53:48.9484009Z 	at org.testng.TestNG.runSuites(TestNG.java:1069)
2022-05-19T14:53:48.9484357Z 	at org.testng.TestNG.run(TestNG.java:1037)
2022-05-19T14:53:48.9484786Z 	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
2022-05-19T14:53:48.9485480Z 	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
2022-05-19T14:53:48.9486264Z 	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
2022-05-19T14:53:48.9486992Z 	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
2022-05-19T14:53:48.9487634Z 	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
2022-05-19T14:53:48.9488306Z 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
2022-05-19T14:53:48.9489011Z 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
2022-05-19T14:53:48.9489791Z 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
2022-05-19T14:53:48.9490339Z 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

@mattisonchao
Copy link
Member

mattisonchao commented May 20, 2022

It looks like the BK client blocked on awaitUninterruptibly , do we have another operation to use eventLoop causing the promise can not complete by thread blocked?

@horizonzy
Copy link
Member

awaitUninterruptibly

Is there more stack info?

@Technoboy-
Copy link
Contributor

Another case:

2022-05-20T03:37:05.4960249Z "main" #1 prio=5 os_prio=0 cpu=32274.29ms elapsed=2566.54s tid=0x00007fd108024380 nid=0x1af8f waiting on condition  [0x00007fd10fcd0000]
2022-05-20T03:37:05.4960659Z    java.lang.Thread.State: WAITING (parking)
2022-05-20T03:37:05.4961114Z 	at jdk.internal.misc.Unsafe.park(java.base@17.0.3/Native Method)
2022-05-20T03:37:05.4961875Z 	- parking to wait for  <0x00000000cdf00010> (a java.util.concurrent.CompletableFuture$Signaller)
2022-05-20T03:37:05.4962343Z 	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.3/LockSupport.java:211)
2022-05-20T03:37:05.4963171Z 	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.3/CompletableFuture.java:1864)
2022-05-20T03:37:05.4963683Z 	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.3/ForkJoinPool.java:3463)
2022-05-20T03:37:05.4964169Z 	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.3/ForkJoinPool.java:3434)
2022-05-20T03:37:05.4964660Z 	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.3/CompletableFuture.java:1898)
2022-05-20T03:37:05.4965158Z 	at java.util.concurrent.CompletableFuture.get(java.base@17.0.3/CompletableFuture.java:2072)
2022-05-20T03:37:05.4965715Z 	at org.apache.pulsar.broker.service.BrokerService.lambda$unloadNamespaceBundlesGracefully$21(BrokerService.java:919)
2022-05-20T03:37:05.4966467Z 	at org.apache.pulsar.broker.service.BrokerService$$Lambda$1164/0x0000000801527c70.accept(Unknown Source)
2022-05-20T03:37:05.4966882Z 	at java.lang.Iterable.forEach(java.base@17.0.3/Iterable.java:75)
2022-05-20T03:37:05.4967408Z 	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:911)
2022-05-20T03:37:05.4968078Z 	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:887)
2022-05-20T03:37:05.4968664Z 	at org.apache.pulsar.broker.service.BrokerService.closeAsync(BrokerService.java:732)
2022-05-20T03:37:05.4969579Z 	at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:450)
2022-05-20T03:37:05.4970123Z 	at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:372)
2022-05-20T03:37:05.4970720Z 	at org.apache.pulsar.functions.worker.PulsarFunctionTlsTest.tearDown(PulsarFunctionTlsTest.java:182)
2022-05-20T03:37:05.4971338Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.3/Native Method)
2022-05-20T03:37:05.4971951Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.3/NativeMethodAccessorImpl.java:77)
2022-05-20T03:37:05.4972615Z 	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.3/DelegatingMethodAccessorImpl.java:43)
2022-05-20T03:37:05.4973196Z 	at java.lang.reflect.Method.invoke(java.base@17.0.3/Method.java:568)
2022-05-20T03:37:05.4974028Z 	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
2022-05-20T03:37:05.4974709Z 	at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61)
2022-05-20T03:37:05.4975404Z 	at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366)
2022-05-20T03:37:05.4976160Z 	at org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320)
2022-05-20T03:37:05.4976700Z 	at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701)
2022-05-20T03:37:05.4977278Z 	at org.testng.internal.TestInvoker.runAfterGroupsConfigurations(TestInvoker.java:677)
2022-05-20T03:37:05.4977835Z 	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:661)
2022-05-20T03:37:05.4978285Z 	at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
2022-05-20T03:37:05.4978810Z 	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
2022-05-20T03:37:05.4979341Z 	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
2022-05-20T03:37:05.4979863Z 	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
2022-05-20T03:37:05.4980424Z 	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
2022-05-20T03:37:05.4980961Z 	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
2022-05-20T03:37:05.4981462Z 	at org.testng.TestRunner$$Lambda$167/0x0000000800d9e540.accept(Unknown Source)
2022-05-20T03:37:05.4982018Z 	at java.util.ArrayList.forEach(java.base@17.0.3/ArrayList.java:1511)
2022-05-20T03:37:05.4982483Z 	at org.testng.TestRunner.privateRun(TestRunner.java:764)
2022-05-20T03:37:05.4982908Z 	at org.testng.TestRunner.run(TestRunner.java:585)
2022-05-20T03:37:05.4983341Z 	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
2022-05-20T03:37:05.4983784Z 	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
2022-05-20T03:37:05.4984558Z 	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
2022-05-20T03:37:05.4985005Z 	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
2022-05-20T03:37:05.4985984Z 	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
2022-05-20T03:37:05.4986529Z 	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
2022-05-20T03:37:05.4986974Z 	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
2022-05-20T03:37:05.4987336Z 	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
2022-05-20T03:37:05.4987682Z 	at org.testng.TestNG.runSuites(TestNG.java:1069)
2022-05-20T03:37:05.4987975Z 	at org.testng.TestNG.run(TestNG.java:1037)
2022-05-20T03:37:05.4988374Z 	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
2022-05-20T03:37:05.4988977Z 	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
2022-05-20T03:37:05.4989654Z 	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
2022-05-20T03:37:05.4990278Z 	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
2022-05-20T03:37:05.4990835Z 	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
2022-05-20T03:37:05.4991415Z 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
2022-05-20T03:37:05.4992001Z 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
2022-05-20T03:37:05.4992518Z 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
2022-05-20T03:37:05.4992985Z 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

@lhotari
Copy link
Member Author

lhotari commented May 24, 2022

@Technoboy- The issue where the broker doesn't get killed is separate from the unloadNamespaceBundlesGracefully hanging forever. It's true that the cause might be the unloading hanging forever, but it's another problem that should be addressed. I noticed your PR, but that doesn't really deal with the underlying problem which causes the hanging.

@lhotari
Copy link
Member Author

lhotari commented May 24, 2022

@Technoboy- I reported #15753 about the unloading hanging.

@github-actions
Copy link

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker Stale type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

4 participants