Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

集群间数据不一致,日志中持续的抛出 Request cannot be executed; I/O reactor status: STOPPED #6367

Closed
yuzhaopeng opened this issue Jul 15, 2021 · 15 comments · Fixed by #6441
Assignees
Labels
kind/enhancement Category issues or prs related to enhancement. kind/research
Milestone

Comments

@yuzhaopeng
Copy link

yuzhaopeng commented Jul 15, 2021

Describe the bug

  1. 请求会被转发到不健康的实例上
  2. 新的服务注册上去之后,在nacos控制台间歇性的看不到该服务,但是在服务侧日志显示注册成功
  3. 新注册的服务,集群之间不同的节点数据不一致,有的可以看到,有的看不到
  4. 日志中持续的输出 Request cannot be executed; I/O reactor status: STOPPED

Expected behavior

  1. 数据保持一致
  2. 请求不会被转发到不健康实例
  3. 新注册的服务可以在控制台看到

Acutally behavior

  1. 节点间数据不一致
  2. 请求会被转发到健康状态为false的实例
  3. 新注册的服务间歇的可以在控制台看到

How to Reproduce

  1. 持续运行了一个多月出现了这个问题

Desktop (please complete the following information):

  • OS: Kubernetes 1.17.3
  • Docker Image: nacos/nacos-server:1.4.0
  • Version : nacos-server:1.4.0
  • Module : naming

Additional context

naming-server.log 日志中,异常堆栈如下:

2021-07-13 17:33:57,768 WARN [STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: nacos-0.nacos.cloudpath-test.svc.cluster.local:8848

java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
        at org.apache.http.util.Asserts.check(Asserts.java:46)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90)
        at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:75)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:108)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:92)
        at com.alibaba.nacos.common.http.client.request.DefaultAsyncHttpClientRequest.execute(DefaultAsyncHttpClientRequest.java:52)
        at com.alibaba.nacos.common.http.client.NacosAsyncRestTemplate.execute(NacosAsyncRestTemplate.java:364)
        at com.alibaba.nacos.common.http.client.NacosAsyncRestTemplate.post(NacosAsyncRestTemplate.java:260)
        at com.alibaba.nacos.naming.misc.HttpClient.asyncHttpPostLarge(HttpClient.java:227)
        at com.alibaba.nacos.naming.misc.HttpClient.asyncHttpPostLarge(HttpClient.java:210)
        at com.alibaba.nacos.naming.misc.ServiceStatusSynchronizer.send(ServiceStatusSynchronizer.java:55)
        at com.alibaba.nacos.naming.core.ServiceManager$ServiceReporter.run(ServiceManager.java:1126)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
@realJackSun
Copy link
Collaborator

从提供的日志上看,集群之间数据同步是有问题的,各台Nacos server之间间歇性地互相访问不通。检查一下网络?

@yuzhaopeng
Copy link
Author

  1. 检查网络了,网络是畅通的,集群是三个节点的,只有一个节点有这个问题

  2. 通过阅读部分的Nacos源代码,我发现是 com.alibaba.nacos.naming.core.ServiceManager.ServiceReporter 类中发送异步HTTP请求的线程过早的结束了,导致HTTP请求无法发送出去。

  3. 发送异步HTTP请求的线程结束后,此时集群间的状态成为了:问题节点可以接收到其他节点发来的服务状态数据,但是问题节点自身的服务状态信息发送不出去,进而导致集群间数据不一致。

  4. 通过阅读源代码,发现数据同步的具体的执行逻辑大致如下:

    1. ServiceReporter 会把每个命名空间下的所有服务状态信息发送给其他Nacos节点。
    2. 而服务状态信息的发送是通过 com.alibaba.nacos.naming.misc.ServiceStatusSynchronizer类的 send() 方法来发送。
    3. send() 方法内部调用了工具类 com.alibaba.nacos.naming.misc.HttpClientasyncHttpPostLarge() 方法。
    4. asyncHttpPostLarge() 发送异步的HTTP请求,内部实现是通过 NacosAsyncRestTemplate 工具类完成。
    5. NacosAsyncRestTemplate 内部则是经过了多次的封装之后,最终依赖了 Apache HttpComponents 项目中的 httpasyncclient 组件,来完成异步请求。
    6. 所以发送服务状态消息的HTTP请求最终是由 httpasyncclient 组件来完成,而本次故障就是 httpasyncclient 组件中执行异步任务的线程提前结束导致异步请求无法发送。抛出异常的代码如下:
      // 这段代码是截取自 org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase 类中
     // 这个方法是执行异步HTTP请求之前,检查线程状态,如果线程结束了,则抛出异常。
     protected void ensureRunning() {
         final Status currentStatus = this.status.get();
         //  这里抛出的异常信息,就是日志中持续输出的异常信息
         Asserts.check(currentStatus == Status.ACTIVE, "Request cannot be executed; " +
                 "I/O reactor status: %s", currentStatus);
     }
    1. NacosAsyncRestTemplate 又是以单例模式在运行,httpasyncclient 的异步线程结束后,也不会在重新启动线程。
  5. 目前暂时不清楚执行异步HTTP请求的线程为什么会提前结束,但是在 NacosAsyncRestTemplate 发送异步请求时,是不是检查一下异步线程的状态会更稳妥一些呢

@879913935
Copy link

这个问题存在很长时间了,很多地方集群建的通信或者自身的调用貌似都用到了这个NacosAsyncRestTemplate,但是它里面的HttpClient有可能会意外关闭,目前的逻辑不会自动恢复或者重新创建

 // CloseableHttpAsyncClientBase.java
 public CloseableHttpAsyncClientBase(
            final NHttpClientConnectionManager connmgr,
            final ThreadFactory threadFactory,
            final NHttpClientEventHandler handler) {
        super();
        this.connmgr = connmgr;
        if (threadFactory != null && handler != null) {
            this.reactorThread = threadFactory.newThread(new Runnable() {

                @Override
                public void run() {
                    try {
                        final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler);
                        connmgr.execute(ioEventDispatch);
                    } catch (final Exception ex) {
                        log.error("I/O reactor terminated abnormally", ex);
                    } finally {
                        // reactor意外挂了,就会直接变成stopped
                        status.set(Status.STOPPED);
                    }
                }

            });
        } else {
            this.reactorThread = null;
        }
        this.status = new AtomicReference<Status>(Status.INACTIVE);
    }

@yuzhaopeng
Copy link
Author

  1. NacosAsyncRestTemplate 的异步请求是 AsyncHttpClientRequest 来完成的。
  2. 是否可以在 AsyncHttpClientRequest 中解决一下这个问题,每次在发送请求之前,先判断一下异步线程状态,如果线程关闭了,就重新启动一下线程。
  3. 我看 CloseableHttpAsyncClientBase 类中提供了 isRunning() 运行状态判断的方法和 start() 线程启动的方法,可以实现这个需求。

@KomachiSion
Copy link
Collaborator

个人觉得,如果是意外关闭导致的IO异常,异常信息还是打印出来比较好。异步调用失败,NacosAsyncRestTemplate都会调用callback.onFail或者onException,nacos会对应处理。

如果判读一下isRunning,start 最多只是把日志内容换一下,但是持续出现对应日志依旧。

最后如果判断线程或链接已关闭就自己去启动线程是绝对不可取的操作。可能会导致极其严重的内存/连接泄漏问题。

@yuzhaopeng
Copy link
Author

这个异常是请求发送之前抛出的异常,还没有走到回调的逻辑,所以不会触发onFail() 。

NacosAsyncRestTemplate 这个类使用的时候都是以单例模式来用的,一旦异步线程意外关闭,项目中所有使用 NacosAsyncRestTemplate 发送的请求都发送不出去了,这样的话可能就会使集群处在异常状态了,并产生了错误的行为(例如数据不一致,将请求转发到失效的实例上等问题)。

针对这种问题是否可以做一些其他的补救措施呢,因为目前的话,如果出现了这种情况,就只能重启问题节点。

@KomachiSion
Copy link
Collaborator

异步线程是httpclient中的实现,使用的时候好像不好判断,得看一下CloseableHttpAsyncClient是否有提供对应的机制。

我理解httpclient只要不关闭Client,下一次请求会重新补充连接。

@yuzhaopeng
Copy link
Author

异步线程是httpclient中的实现,但是具体的线程启动逻辑是在 DefaultAsyncHttpClientRequest 的构造方法中调用的,因为是单例的,构造方法中的逻辑,只会在项目启动的时候执行一次,所以项目在运行期间线程关闭了,也不会启动线程。

我也认为只要没有关闭Client,如果线程挂了,应该需要主动处理一下,而不只是打印异常信息。

@brotherlu-xcq
Copy link
Collaborator

elastic/elasticsearch#49124 也遇到类似的问题,看样子应该是IOReactor意外退出导致的异常。我跟了一下源码,关键逻辑主要在org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor#execute。我们可以增加自定义的org.apache.http.nio.reactor.IOReactorExceptionHandler 来捕捉RuntimeException尝试修复这个问题。

@Maijh97
Copy link
Collaborator

Maijh97 commented Jul 22, 2021

elastic/elasticsearch#49124 也遇到类似的问题,看样子应该是IOReactor意外退出导致的异常。我跟了一下源码,关键逻辑主要在org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor#execute。我们可以增加自定义的org.apache.http.nio.reactor.IOReactorExceptionHandler 来捕捉RuntimeException尝试修复这个问题。

👍🏻,你要是有时间,可以试着改一下这里,目前我这边事情比较多,有点搞不过来😁。提供一下我的思路:可以增加一个IOReactorExceptionHandler,然后在异常的时候,把旧的ayncClient shutdown掉,然后在重新创建一个新的asyncClient。
但是意外退出的异常,可能需要深究一下,看看具体因为什么原因导致IOReactor意外退出

@brotherlu-xcq
Copy link
Collaborator

可以,我尝试修复一下这个问题。

@brotherlu-xcq
Copy link
Collaborator

@yuzhaopeng 你好,可以发一下你的日志中有对应的I/O reactor terminated abnormally堆栈信息吗?这个是IOReactor意外退出的日志。我现在本地比较难复现这个问题,我想结合你的错误日志来验证一下我的修复逻辑。

@yuzhaopeng
Copy link
Author

我搜索了现存的日志文件,没有找到 I/O reactor terminated abnormally 报错,可能已经被清理了

@KomachiSion KomachiSion added kind/research kind/enhancement Category issues or prs related to enhancement. and removed status/need feedback labels Jul 26, 2021
Maijh97 pushed a commit that referenced this issue Jul 27, 2021
… loop when meet unknown Network error. (#6441)

* Add IOExceptionHandler to handle the Exception form IOReactor.

* remove the systemProperties for create ConnectionManager.

* fix the javadoc.
@KomachiSion KomachiSion added this to the 2.0.3 milestone Jul 28, 2021
@warmonipa
Copy link

hi @KomachiSion 1.4.x这个bug请一并修复一下吧

@huyouba1
Copy link

huyouba1 commented Jul 5, 2022

2.0.4版本依旧存在问题

2022-07-05 10:22:57,850 WARN [serverlist] updated to : [Member{ip='IP', port=8848, state=UP, extendInfo={lastRefreshTime=1656987673534, raftMetaData={metaDataMap={naming_instance_metadata={leader=IP1:7848, raftGroupMember=[IP1:7848, IP2:7848, IP3:7848], term=62}, naming_persistent_service={leader=IP1:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=35}, naming_persistent_service_v2={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=51}, naming_service_metadat
a={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848], term=52}}}, raftPort=7848, version=2.0.4}}, Member{ip='IP', port=8848, state=UP, extendInfo={lastRefreshTime=
1656987636159, raftMetaData={metaDataMap={naming_instance_metadata={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=61}, naming_persistent_service
={errMsg=Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: StateMachine meet critical error when applying one or more tasks since index=4034, Status[ESTATEMACHINE<10002>: StateMachine
meet critical error: java.lang.IllegalArgumentException: No enum constant com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor.Op.
        at java.base/java.lang.Enum.valueOf(Enum.java:240)
        at com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor$Op.valueOf(BasePersistentServiceProcessor.java:63)
        at com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor.onApply(BasePersistentServiceProcessor.java:170)
        at com.alibaba.nacos.core.distributed.raft.NacosStateMachine.onApply(NacosStateMachine.java:115)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:541)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:510)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:442)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
        at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
        at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
        at java.base/java.lang.Thread.run(Thread.java:834)
.]]], leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=35}, naming_persistent_service_v2={leader=IP:7848, raftGroupMember=[IP
:7848, IP:7848, IP:7848], term=50}, naming_service_metadata={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=52}}}, raftPort
=7848, site=unknown, version=2.0.4, weight=8}}, Member{ip='IP', port=8848, state=UP, extendInfo={lastRefreshTime=1656987529759, raftMetaData=com.alibaba.nacos.consistency.ProtocolMetaData@509cccd0,
 raftPort=7848, version=2.0.4}}]

2022-07-05 10:22:57,854 WARN [serverlist] updated to : [Member{ip='IP', port=8848, state=UP, extendInfo={raftPort=7848}}, Member{ip='IP', port=8848, state=UP, extendInfo={raftPort=7848}}
, Member{ip='IP', port=8848, state=UP, extendInfo={raftPort=7848, version=2.0.4}}]

2022-07-05 10:25:04,576 ERROR failed to report new info to target node : IP:8848, error : caused: Request cannot be executed; I/O reactor status: STOPPED;

2022-07-05 10:25:19,926 INFO Current addressing mode selection : FileConfigMemberLookup

集群节点数据不一致

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Category issues or prs related to enhancement. kind/research
Projects
None yet
8 participants