Found one Java-level deadlock #12576

Closed
bes2008 opened this issue Sep 2, 2024 · 11 comments

bes2008 commented Sep 2, 2024

Describe the bug
After startup, the page cannot be loaded. jstack shows a deadlock.

Expected behavior
Normal startup.

Actual behavior
A deadlock occurs.

How to Reproduce
Not reliably reproducible. The deployment uses MySQL.
Version 2.4.1.

Deadlock thread stacks


Found one Java-level deadlock:
=============================
"nacos.plugin.control.connection.reporter":
  waiting to lock monitor 0x00007f5970643548 (object 0x00000000f044f980, a java.util.concurrent.ConcurrentHashMap),
  which is held by "main"
"main":
  waiting to lock monitor 0x00007f59702e3588 (object 0x00000000ecf7bea8, a com.alibaba.nacos.core.distributed.ProtocolManager),
  which is held by "com.alibaba.nacos.naming.timer.0"
"com.alibaba.nacos.naming.timer.0":
  waiting to lock monitor 0x00007f5970643548 (object 0x00000000f044f980, a java.util.concurrent.ConcurrentHashMap),
  which is held by "main"

Java stack information for the threads listed above:
===================================================
"nacos.plugin.control.connection.reporter":
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:186)
        - waiting to lock <0x00000000f044f980> (a java.util.concurrent.ConcurrentHashMap)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:168)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:257)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:234)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveNamedBean(DefaultListableBeanFactory.java:1284)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveNamedBean(DefaultListableBeanFactory.java:1245)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveBean(DefaultListableBeanFactory.java:494)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBean(DefaultListableBeanFactory.java:349)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBean(DefaultListableBeanFactory.java:342)
        at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1189)
        at com.alibaba.nacos.sys.utils.ApplicationUtils.getBean(ApplicationUtils.java:150)
        at com.alibaba.nacos.core.remote.LongConnectionMetricsCollector.getTotalCount(LongConnectionMetricsCollector.java:38)
        at com.alibaba.nacos.plugin.control.connection.ConnectionControlManager$ConnectionMetricsReporter$$Lambda$792/946746156.apply(Unknown Source)
        at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
        at java.util.stream.Collectors$$Lambda$663/592248663.accept(Unknown Source)
        at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at com.alibaba.nacos.plugin.control.connection.ConnectionControlManager$ConnectionMetricsReporter.run(ConnectionControlManager.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
"main":
        at com.alibaba.nacos.core.distributed.ProtocolManager.getCpProtocol(ProtocolManager.java:84)
        - waiting to lock <0x00000000ecf7bea8> (a com.alibaba.nacos.core.distributed.ProtocolManager)
        at com.alibaba.nacos.naming.misc.SwitchManager.<init>(SwitchManager.java:95)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:213)
        at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:117)
        at org.springframework.beans.factory.support.ConstructorResolver.instantiate(ConstructorResolver.java:302)
        at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:287)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.autowireConstructor(AbstractAutowireCapableBeanFactory.java:1372)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1222)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:582)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:542)
        at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:336)
        at org.springframework.beans.factory.support.AbstractBeanFactory$$Lambda$199/1551446957.getObject(Unknown Source)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
        - locked <0x00000000f044f980> (a java.util.concurrent.ConcurrentHashMap)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:334)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:209)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:955)
        at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:932)
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:591)
        - locked <0x00000000f00bd368> (a java.lang.Object)
        at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:147)
        at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:732)
        at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:409)
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:308)
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:1300)
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:1289)
        at com.alibaba.nacos.Nacos.main(Nacos.java:46)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
        at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
        at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
        at org.springframework.boot.loader.PropertiesLauncher.main(PropertiesLauncher.java:467)
"com.alibaba.nacos.naming.timer.0":
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:217)
        - waiting to lock <0x00000000f044f980> (a java.util.concurrent.ConcurrentHashMap)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:334)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:234)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveNamedBean(DefaultListableBeanFactory.java:1284)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveNamedBean(DefaultListableBeanFactory.java:1245)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveBean(DefaultListableBeanFactory.java:494)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBean(DefaultListableBeanFactory.java:349)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBean(DefaultListableBeanFactory.java:342)
        at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1189)
        at com.alibaba.nacos.sys.utils.ApplicationUtils.getBean(ApplicationUtils.java:150)
        at com.alibaba.nacos.core.distributed.ProtocolManager.lambda$initCPProtocol$3(ProtocolManager.java:126)
        at com.alibaba.nacos.core.distributed.ProtocolManager$$Lambda$522/453330704.accept(Unknown Source)
        at com.alibaba.nacos.sys.utils.ApplicationUtils.getBeanIfExist(ApplicationUtils.java:156)
        at com.alibaba.nacos.core.distributed.ProtocolManager.initCPProtocol(ProtocolManager.java:124)
        at com.alibaba.nacos.core.distributed.ProtocolManager.getCpProtocol(ProtocolManager.java:85)
        - locked <0x00000000ecf7bea8> (a com.alibaba.nacos.core.distributed.ProtocolManager)
        at com.alibaba.nacos.naming.cluster.ServerStatusManager.hasLeader(ServerStatusManager.java:70)
        at com.alibaba.nacos.naming.cluster.ServerStatusManager.refreshServerStatus(ServerStatusManager.java:62)
        at com.alibaba.nacos.naming.cluster.ServerStatusManager.access$000(ServerStatusManager.java:37)
        at com.alibaba.nacos.naming.cluster.ServerStatusManager$ServerStatusUpdater.run(ServerStatusManager.java:92)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.

bes2008 (Author) commented Sep 2, 2024

@HMYDK
The commit in #12584 does not solve this problem. A close look at the stack makes that clear.

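HMYDK (Contributor) commented Sep 3, 2024

> @HMYDK The commit in #12584 does not solve this problem. A close look at the stack makes that clear.

What this PR addresses is contention on the ProtocolManager object between its initialization and its use.
Reducing lock granularity is also one way to avoid deadlocks.

@KomachiSion The original implementation of `protocolManager.getCpProtocol()` locks the entire protocolManager object; the PR above refines the lock granularity. Could you review it and check whether there are any problems?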
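
The lock-granularity idea mentioned here can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `LazyHolder`, `initLock`, and `get()` are hypothetical names, and the only thing carried over from the discussion is the assumption that a lazily initialized getter originally synchronized on the whole object. The sketch takes a private lock only while the first initialization runs, so later readers never enter a monitor.

```java
import java.util.function.Supplier;

/** Hypothetical lazy holder for illustration; not the Nacos ProtocolManager. */
final class LazyHolder<T> {
    private final Supplier<T> initializer;
    // Private lock object: callers never contend on the holder's own monitor.
    private final Object initLock = new Object();
    // volatile so that a fully initialized value is visible to all threads.
    private volatile T value;

    LazyHolder(Supplier<T> initializer) {
        this.initializer = initializer;
    }

    /** Double-checked locking: the lock is taken only until the value exists. */
    T get() {
        T local = value;
        if (local == null) {
            synchronized (initLock) {
                local = value;
                if (local == null) {
                    local = initializer.get();
                    value = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        LazyHolder<String> holder = new LazyHolder<>(() -> "cp-protocol");
        System.out.println(holder.get());
    }
}
```

Narrowing the lock shortens the time a monitor is held, but if the initializer itself blocks on another lock, a wait cycle between threads remains possible.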

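bes2008 (Author) commented Sep 3, 2024

@HMYDK
Your change only narrows the locking scope, which is still worthwhile code.
But it does not remove the contention between two threads over two objects, so it cannot resolve this deadlock.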

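HMYDK (Contributor) commented Sep 3, 2024

> @HMYDK Your change only narrows the locking scope, which is still worthwhile code. But it does not remove the contention between two threads over two objects, so it cannot resolve this deadlock.

Right, it does not fix the root cause. It only reduces the likelihood.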

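KomachiSion (Collaborator) commented Sep 3, 2024

The ServerStatusManager deadlock has already been fixed in #12526 and #12573, so in theory this issue is resolved.

You can test with the latest branch.

SwitchManager is itself a Spring bean, so depending on other Spring beans during construction is fine.

The real cause is that ServerStatusManager's asynchronous check task also needs ProtocolManager, and that is what creates the problem.

connection.reporter is only waiting for the main thread to finish loading and release its lock, because the other two threads are already locked against each other; it does not actually take part in the contention.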
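
The wait relationships in the jstack output above can be checked with a small wait-for graph. The sketch below is only an illustration: `WaitForGraph` and its methods are hypothetical names, and the three edges are copied from the "waiting to lock ... which is held by ..." lines of the dump. A thread is part of the deadlock only if following the "held by" edges leads back to itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: models "thread X waits for a lock held by thread Y". */
final class WaitForGraph {
    // waiter thread name -> name of the thread holding the lock it waits for
    private final Map<String, String> blockedOn = new LinkedHashMap<>();

    void addWait(String waiter, String holder) {
        blockedOn.put(waiter, holder);
    }

    /** Returns the threads that lie on a wait cycle, sorted by name. */
    Set<String> threadsInCycle() {
        Set<String> inCycle = new TreeSet<>();
        for (String start : blockedOn.keySet()) {
            String current = blockedOn.get(start);
            // Follow "held by" edges; more than size() steps cannot find anything new.
            for (int steps = 0; current != null && steps < blockedOn.size(); steps++) {
                if (current.equals(start)) {
                    inCycle.add(start);
                    break;
                }
                current = blockedOn.get(current);
            }
        }
        return inCycle;
    }

    /** Edges taken from the deadlock report in this issue. */
    static WaitForGraph fromReportedDump() {
        WaitForGraph graph = new WaitForGraph();
        graph.addWait("nacos.plugin.control.connection.reporter", "main");
        graph.addWait("main", "com.alibaba.nacos.naming.timer.0");
        graph.addWait("com.alibaba.nacos.naming.timer.0", "main");
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(fromReportedDump().threadsInCycle());
    }
}
```

With these edges, the cycle contains `main` and `com.alibaba.nacos.naming.timer.0`; the reporter thread waits on `main` but is not on the cycle.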

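bes2008 (Author) commented Sep 3, 2024

@KomachiSion
Yes, connection.reporter does not take part in the contention. The deadlock is mainly caused by ServerStatusManager and Spring bean initialization running on two different threads.


I looked at the code in #12573. The following was added to ServerStatusManager: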

    private boolean isReady() {
        if (!globalConfig.isDataWarmup()) {
            return true;
        }
        if (!protocolManager.isCpInit() || protocolManager.getCpProtocol() == null) {
            return false;
        }
        return protocolManager.getCpProtocol().isReady() && distroProtocol.isInitialized();
    }

Judging from the code, this does avoid the deadlock.
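
Why a guard of this shape helps can be sketched in isolation. The classes below are hypothetical stand-ins, not Nacos code, and the thread does not show how `isCpInit()` is implemented; the sketch assumes it reads a flag without taking a lock. Under that assumption, the background check returns early while initialization is still pending and never enters the synchronized getter during startup.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a lazily initialized component with a locking getter. */
final class LazyProtocolStub {
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    private String protocol; // guarded by "this"

    /** Assumed lock-free check: reads a flag, never blocks. */
    boolean isInit() {
        return initialized.get();
    }

    /** Locking getter that initializes on first use. */
    synchronized String getProtocol() {
        if (protocol == null) {
            protocol = "cp";
            initialized.set(true);
        }
        return protocol;
    }
}

/** Hypothetical background check that refuses to trigger initialization itself. */
final class ReadinessCheckSketch {
    private final LazyProtocolStub protocolStub;

    ReadinessCheckSketch(LazyProtocolStub protocolStub) {
        this.protocolStub = protocolStub;
    }

    boolean isReady() {
        if (!protocolStub.isInit()) {
            // Not initialized yet: report "not ready" without touching the lock.
            return false;
        }
        return protocolStub.getProtocol() != null;
    }

    public static void main(String[] args) {
        LazyProtocolStub stub = new LazyProtocolStub();
        ReadinessCheckSketch check = new ReadinessCheckSketch(stub);
        System.out.println(check.isReady());
        stub.getProtocol(); // initialization happens elsewhere, e.g. on the startup thread
        System.out.println(check.isReady());
    }
}
```

The design choice is that only the startup path performs initialization; the periodic task merely observes the result and retries on its next run.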

bes2008 (Author) commented Sep 3, 2024

@KomachiSion
Also, I see that the #12573 commit is slated for version 2.4.2. When is the release expected? This issue has a fairly large impact.

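KomachiSion (Collaborator) commented

> @KomachiSion Also, I see that the #12573 commit is slated for version 2.4.2. When is the release expected? This issue has a fairly large impact.

The release is expected this week. The startup deadlock fix will be shipped as soon as possible.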

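KomachiSion (Collaborator) commented

> @KomachiSion Also, I see that the #12573 commit is slated for version 2.4.2. When is the release expected? This issue has a fairly large impact.

You are welcome to test more with the develop branch. I tested it 10 times myself and the startup deadlock no longer occurs. That said, version 2.4.1 also hits the deadlock only rarely in my own environment, so more testing in other environments may be needed.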

bes2008 (Author) commented Sep 5, 2024

@KomachiSion

I tested it 10 times and saw no startup deadlock.

KomachiSion (Collaborator) commented

Fixed in 2.4.2
