集群间数据不一致，日志中持续的抛出 Request cannot be executed； I/O reactor status: STOPPED #6367

yuzhaopeng · 2021-07-15T09:50:25Z

Describe the bug

请求会被转发到不健康的实例上
新的服务注册上去之后，在nacos控制台间歇性的看不到该服务，但是在服务侧日志显示注册成功
新注册的服务，集群之间不同的节点数据不一致，有的可以看到，有的看不到
日志中持续的输出 Request cannot be executed； I/O reactor status: STOPPED

Expected behavior

数据保持一致
请求不会被转发到不健康实例
新注册的服务可以在控制台看到

Acutally behavior

节点间数据不一致
请求会被转发到健康状态为false的实例
新注册的服务间歇的可以在控制台看到

How to Reproduce

持续运行了一个多月出现了这个问题

Desktop (please complete the following information):

OS: Kubernetes 1.17.3
Docker Image: nacos/nacos-server:1.4.0
Version : nacos-server:1.4.0
Module : naming

Additional context

naming-server.log 日志中，异常堆栈如下：

2021-07-13 17:33:57,768 WARN [STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: nacos-0.nacos.cloudpath-test.svc.cluster.local:8848

java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
        at org.apache.http.util.Asserts.check(Asserts.java:46)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90)
        at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:75)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:108)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:92)
        at com.alibaba.nacos.common.http.client.request.DefaultAsyncHttpClientRequest.execute(DefaultAsyncHttpClientRequest.java:52)
        at com.alibaba.nacos.common.http.client.NacosAsyncRestTemplate.execute(NacosAsyncRestTemplate.java:364)
        at com.alibaba.nacos.common.http.client.NacosAsyncRestTemplate.post(NacosAsyncRestTemplate.java:260)
        at com.alibaba.nacos.naming.misc.HttpClient.asyncHttpPostLarge(HttpClient.java:227)
        at com.alibaba.nacos.naming.misc.HttpClient.asyncHttpPostLarge(HttpClient.java:210)
        at com.alibaba.nacos.naming.misc.ServiceStatusSynchronizer.send(ServiceStatusSynchronizer.java:55)
        at com.alibaba.nacos.naming.core.ServiceManager$ServiceReporter.run(ServiceManager.java:1126)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

realJackSun · 2021-07-16T03:37:18Z

从提供的日志上看，集群之间数据同步是有问题的，各台Nacos server之间间歇性地互相访问不通。检查一下网络？

yuzhaopeng · 2021-07-16T06:42:25Z

检查网络了，网络是畅通的，集群是三个节点的，只有一个节点有这个问题
通过阅读部分的Nacos源代码，我发现是 com.alibaba.nacos.naming.core.ServiceManager.ServiceReporter 类中发送异步HTTP请求的线程过早的结束了，导致HTTP请求无法发送出去。
发送异步HTTP请求的线程结束后，此时集群间的状态成为了：问题节点可以接收到其他节点发来的服务状态数据，但是问题节点自身的服务状态信息发送不出去，进而导致集群间数据不一致。
通过阅读源代码，发现数据同步的具体的执行逻辑大致如下：
1. ServiceReporter 会把每个命名空间下的所有服务状态信息发送给其他Nacos节点。
2. 而服务状态信息的发送是通过 com.alibaba.nacos.naming.misc.ServiceStatusSynchronizer类的 send() 方法来发送。
3. send() 方法内部调用了工具类 com.alibaba.nacos.naming.misc.HttpClient 的 asyncHttpPostLarge() 方法。
4. asyncHttpPostLarge() 发送异步的HTTP请求，内部实现是通过 NacosAsyncRestTemplate 工具类完成。
5. NacosAsyncRestTemplate 内部则是经过了多次的封装之后，最终依赖了 Apache HttpComponents 项目中的 httpasyncclient 组件，来完成异步请求。
6. 所以发送服务状态消息的HTTP请求最终是由 httpasyncclient 组件来完成，而本次故障就是 httpasyncclient 组件中执行异步任务的线程提前结束导致异步请求无法发送。抛出异常的代码如下：
```
  // 这段代码是截取自 org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase 类中
 // 这个方法是执行异步HTTP请求之前，检查线程状态，如果线程结束了，则抛出异常。
 protected void ensureRunning() {
     final Status currentStatus = this.status.get();
     //  这里抛出的异常信息，就是日志中持续输出的异常信息
     Asserts.check(currentStatus == Status.ACTIVE, "Request cannot be executed; " +
             "I/O reactor status: %s", currentStatus);
 }
```
1. 而 NacosAsyncRestTemplate 又是以单例模式在运行，httpasyncclient 的异步线程结束后，也不会在重新启动线程。
目前暂时不清楚执行异步HTTP请求的线程为什么会提前结束，但是在 NacosAsyncRestTemplate 发送异步请求时，是不是检查一下异步线程的状态会更稳妥一些呢

879913935 · 2021-07-17T04:16:54Z

这个问题存在很长时间了，很多地方集群建的通信或者自身的调用貌似都用到了这个NacosAsyncRestTemplate，但是它里面的HttpClient有可能会意外关闭，目前的逻辑不会自动恢复或者重新创建

 // CloseableHttpAsyncClientBase.java
 public CloseableHttpAsyncClientBase(
            final NHttpClientConnectionManager connmgr,
            final ThreadFactory threadFactory,
            final NHttpClientEventHandler handler) {
        super();
        this.connmgr = connmgr;
        if (threadFactory != null && handler != null) {
            this.reactorThread = threadFactory.newThread(new Runnable() {

                @Override
                public void run() {
                    try {
                        final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler);
                        connmgr.execute(ioEventDispatch);
                    } catch (final Exception ex) {
                        log.error("I/O reactor terminated abnormally", ex);
                    } finally {
                        // reactor意外挂了，就会直接变成stopped
                        status.set(Status.STOPPED);
                    }
                }

            });
        } else {
            this.reactorThread = null;
        }
        this.status = new AtomicReference<Status>(Status.INACTIVE);
    }

yuzhaopeng · 2021-07-19T02:32:18Z

NacosAsyncRestTemplate 的异步请求是 AsyncHttpClientRequest 来完成的。
是否可以在 AsyncHttpClientRequest 中解决一下这个问题，每次在发送请求之前，先判断一下异步线程状态，如果线程关闭了，就重新启动一下线程。
我看 CloseableHttpAsyncClientBase 类中提供了 isRunning() 运行状态判断的方法和 start() 线程启动的方法，可以实现这个需求。

KomachiSion · 2021-07-19T03:56:12Z

个人觉得，如果是意外关闭导致的IO异常，异常信息还是打印出来比较好。异步调用失败，NacosAsyncRestTemplate都会调用callback.onFail或者onException，nacos会对应处理。

如果判读一下isRunning，start 最多只是把日志内容换一下，但是持续出现对应日志依旧。

最后如果判断线程或链接已关闭就自己去启动线程是绝对不可取的操作。可能会导致极其严重的内存/连接泄漏问题。

yuzhaopeng · 2021-07-19T04:32:48Z

这个异常是请求发送之前抛出的异常，还没有走到回调的逻辑，所以不会触发onFail() 。

NacosAsyncRestTemplate 这个类使用的时候都是以单例模式来用的，一旦异步线程意外关闭，项目中所有使用 NacosAsyncRestTemplate 发送的请求都发送不出去了，这样的话可能就会使集群处在异常状态了，并产生了错误的行为（例如数据不一致，将请求转发到失效的实例上等问题）。

针对这种问题是否可以做一些其他的补救措施呢，因为目前的话，如果出现了这种情况，就只能重启问题节点。

KomachiSion · 2021-07-22T02:42:00Z

异步线程是httpclient中的实现，使用的时候好像不好判断，得看一下CloseableHttpAsyncClient是否有提供对应的机制。

我理解httpclient只要不关闭Client，下一次请求会重新补充连接。

yuzhaopeng · 2021-07-22T06:54:46Z

异步线程是httpclient中的实现，但是具体的线程启动逻辑是在 DefaultAsyncHttpClientRequest 的构造方法中调用的，因为是单例的，构造方法中的逻辑，只会在项目启动的时候执行一次，所以项目在运行期间线程关闭了，也不会启动线程。

我也认为只要没有关闭Client，如果线程挂了，应该需要主动处理一下，而不只是打印异常信息。

brotherlu-xcq · 2021-07-22T09:54:07Z

elastic/elasticsearch#49124 也遇到类似的问题，看样子应该是IOReactor意外退出导致的异常。我跟了一下源码，关键逻辑主要在org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor#execute。我们可以增加自定义的org.apache.http.nio.reactor.IOReactorExceptionHandler 来捕捉RuntimeException尝试修复这个问题。

Maijh97 · 2021-07-22T09:59:15Z

elastic/elasticsearch#49124 也遇到类似的问题，看样子应该是IOReactor意外退出导致的异常。我跟了一下源码，关键逻辑主要在org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor#execute。我们可以增加自定义的org.apache.http.nio.reactor.IOReactorExceptionHandler 来捕捉RuntimeException尝试修复这个问题。

👍🏻，你要是有时间，可以试着改一下这里，目前我这边事情比较多，有点搞不过来😁。提供一下我的思路：可以增加一个IOReactorExceptionHandler，然后在异常的时候，把旧的ayncClient shutdown掉，然后在重新创建一个新的asyncClient。
但是意外退出的异常，可能需要深究一下，看看具体因为什么原因导致IOReactor意外退出

brotherlu-xcq · 2021-07-22T10:01:27Z

可以，我尝试修复一下这个问题。

brotherlu-xcq · 2021-07-23T03:41:18Z

@yuzhaopeng 你好，可以发一下你的日志中有对应的I/O reactor terminated abnormally堆栈信息吗？这个是IOReactor意外退出的日志。我现在本地比较难复现这个问题，我想结合你的错误日志来验证一下我的修复逻辑。

yuzhaopeng · 2021-07-23T05:59:21Z

我搜索了现存的日志文件，没有找到 I/O reactor terminated abnormally 报错，可能已经被清理了

… loop when meet unknown Network error. (#6441) * Add IOExceptionHandler to handle the Exception form IOReactor. * remove the systemProperties for create ConnectionManager. * fix the javadoc.

warmonipa · 2021-08-04T07:15:46Z

hi @KomachiSion 1.4.x这个bug请一并修复一下吧

huyouba1 · 2022-07-05T06:42:16Z

2.0.4版本依旧存在问题

2022-07-05 10:22:57,850 WARN [serverlist] updated to : [Member{ip='IP', port=8848, state=UP, extendInfo={lastRefreshTime=1656987673534, raftMetaData={metaDataMap={naming_instance_metadata={leader=IP1:7848, raftGroupMember=[IP1:7848, IP2:7848, IP3:7848], term=62}, naming_persistent_service={leader=IP1:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=35}, naming_persistent_service_v2={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=51}, naming_service_metadat
a={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848], term=52}}}, raftPort=7848, version=2.0.4}}, Member{ip='IP', port=8848, state=UP, extendInfo={lastRefreshTime=
1656987636159, raftMetaData={metaDataMap={naming_instance_metadata={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=61}, naming_persistent_service
={errMsg=Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: StateMachine meet critical error when applying one or more tasks since index=4034, Status[ESTATEMACHINE<10002>: StateMachine
meet critical error: java.lang.IllegalArgumentException: No enum constant com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor.Op.
        at java.base/java.lang.Enum.valueOf(Enum.java:240)
        at com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor$Op.valueOf(BasePersistentServiceProcessor.java:63)
        at com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor.onApply(BasePersistentServiceProcessor.java:170)
        at com.alibaba.nacos.core.distributed.raft.NacosStateMachine.onApply(NacosStateMachine.java:115)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:541)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:510)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:442)
        at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
        at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
        at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
        at java.base/java.lang.Thread.run(Thread.java:834)
.]]], leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=35}, naming_persistent_service_v2={leader=IP:7848, raftGroupMember=[IP
:7848, IP:7848, IP:7848], term=50}, naming_service_metadata={leader=IP:7848, raftGroupMember=[IP:7848, IP:7848, IP:7848], term=52}}}, raftPort
=7848, site=unknown, version=2.0.4, weight=8}}, Member{ip='IP', port=8848, state=UP, extendInfo={lastRefreshTime=1656987529759, raftMetaData=com.alibaba.nacos.consistency.ProtocolMetaData@509cccd0,
 raftPort=7848, version=2.0.4}}]

2022-07-05 10:22:57,854 WARN [serverlist] updated to : [Member{ip='IP', port=8848, state=UP, extendInfo={raftPort=7848}}, Member{ip='IP', port=8848, state=UP, extendInfo={raftPort=7848}}
, Member{ip='IP', port=8848, state=UP, extendInfo={raftPort=7848, version=2.0.4}}]

2022-07-05 10:25:04,576 ERROR failed to report new info to target node : IP:8848, error : caused: Request cannot be executed; I/O reactor status: STOPPED;

2022-07-05 10:25:19,926 INFO Current addressing mode selection : FileConfigMemberLookup

集群节点数据不一致

realJackSun added the status/need feedback label Jul 16, 2021

KomachiSion assigned Maijh97 Jul 22, 2021

brotherlu-xcq mentioned this issue Jul 23, 2021

[ISSUE #6367] Add IOReactorExceptionHandler to avoid IOReactor out of loop when meet unknown Network error. #6441

Merged

5 tasks

KomachiSion added kind/research kind/enhancement Category issues or prs related to enhancement. and removed status/need feedback labels Jul 26, 2021

Maijh97 closed this as completed in #6441 Jul 27, 2021

KomachiSion added this to the 2.0.3 milestone Jul 28, 2021

hoErYu mentioned this issue Sep 16, 2022

服务实例被kill后在实例列表中一直存在 #9153

Closed

KANLON mentioned this issue Dec 13, 2022

The log of the master node keeps reporting exceptions： I/O reactor status: STOPPED #5931

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

集群间数据不一致，日志中持续的抛出 Request cannot be executed； I/O reactor status: STOPPED #6367

集群间数据不一致，日志中持续的抛出 Request cannot be executed； I/O reactor status: STOPPED #6367

yuzhaopeng commented Jul 15, 2021 •

edited

Loading

realJackSun commented Jul 16, 2021

yuzhaopeng commented Jul 16, 2021

879913935 commented Jul 17, 2021

yuzhaopeng commented Jul 19, 2021

KomachiSion commented Jul 19, 2021

yuzhaopeng commented Jul 19, 2021

KomachiSion commented Jul 22, 2021

yuzhaopeng commented Jul 22, 2021

brotherlu-xcq commented Jul 22, 2021

Maijh97 commented Jul 22, 2021

brotherlu-xcq commented Jul 22, 2021

brotherlu-xcq commented Jul 23, 2021

yuzhaopeng commented Jul 23, 2021

warmonipa commented Aug 4, 2021

huyouba1 commented Jul 5, 2022

集群间数据不一致，日志中持续的抛出 Request cannot be executed； I/O reactor status: STOPPED #6367

集群间数据不一致，日志中持续的抛出 Request cannot be executed； I/O reactor status: STOPPED #6367

Comments

yuzhaopeng commented Jul 15, 2021 • edited Loading

realJackSun commented Jul 16, 2021

yuzhaopeng commented Jul 16, 2021

879913935 commented Jul 17, 2021

yuzhaopeng commented Jul 19, 2021

KomachiSion commented Jul 19, 2021

yuzhaopeng commented Jul 19, 2021

KomachiSion commented Jul 22, 2021

yuzhaopeng commented Jul 22, 2021

brotherlu-xcq commented Jul 22, 2021

Maijh97 commented Jul 22, 2021

brotherlu-xcq commented Jul 22, 2021

brotherlu-xcq commented Jul 23, 2021

yuzhaopeng commented Jul 23, 2021

warmonipa commented Aug 4, 2021

huyouba1 commented Jul 5, 2022

yuzhaopeng commented Jul 15, 2021 •

edited

Loading