[Enhancement] Improve Namesrv health check: auto-close unhealthy channel on repeated failures #9290

Closed
@Kris20030907

Before Creating the Enhancement Request

  • I have confirmed that this should be classified as an enhancement rather than a bug/feature.

Summary

Solution: Add a configurable client-side switch that enables a health check for Namesrv and records the number of consecutive failures. When the count reaches the configured threshold, the client actively calls nettyRemotingClient to close the channel to the currently selected Namesrv, so that a hung channel does not linger and the connection is refreshed. If the network to one Namesrv is abnormal, the client can then reconnect to another available Namesrv in its list and continue working normally.

Motivation

Problem scenario: After the Client (Producer or Consumer) establishes a connection with Namesrv, it periodically pulls TopicRouteInfo from Namesrv. However, if the machine hosting Namesrv loses network connectivity (at the TCP level), the network is unstable for a long time, or the firewall jitters, the application layer cannot detect that the channel is abnormal, so the channel is never closed. Every request sent over this connection then times out and fails to obtain data.

Describe the Solution You'd Like

  1. Add a client-level switch (such as enableNamesrvCheck) to turn the Namesrv health check on or off.
  2. After the scheduled task calls updateTopicRouteInfoFromNameServer(), add the following logic:
    • If the call succeeds, reset the consecutive failure count (namesrvHealthCheckFailCount) to 0.
    • If the call throws an exception, increment the failure count. When the count reaches the threshold set by clientConfig.getMaxClientNamesrvCheckFailedCnt(), actively call NettyRemotingClient.closeUnHealthyNamesrvChannel() to close the currently selected Namesrv channel, then reset the failure count to 0.
  3. Only consecutive failures trigger the disconnection. The failure count is reset to zero as soon as Namesrv returns to normal, so the disconnection logic is not triggered by isolated, intermittent failures.
  4. In addition, once the unhealthy channel is closed, the next routing update can connect to another available node in the Namesrv list, which improves the availability of the overall system.
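
The counting logic in steps 1 to 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: the class NamesrvHealthChecker, its runOnce method, and the Runnable callback are hypothetical names introduced here. In the real client the route update would be updateTopicRouteInfoFromNameServer() and the callback would be the proposed NettyRemotingClient.closeUnHealthyNamesrvChannel().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the proposed consecutive-failure check; not RocketMQ code. */
class NamesrvHealthChecker {
    // Stands in for the proposed client-level switch (e.g. enableNamesrvCheck).
    private final boolean enableNamesrvCheck;
    // Stands in for clientConfig.getMaxClientNamesrvCheckFailedCnt().
    private final int maxFailedCnt;
    // Stands in for the proposed NettyRemotingClient.closeUnHealthyNamesrvChannel().
    private final Runnable closeUnhealthyChannel;
    // Consecutive failure count (namesrvHealthCheckFailCount in the proposal).
    private final AtomicInteger failCount = new AtomicInteger();

    NamesrvHealthChecker(boolean enableNamesrvCheck, int maxFailedCnt,
                         Runnable closeUnhealthyChannel) {
        if (maxFailedCnt < 1) {
            throw new IllegalArgumentException("maxFailedCnt must be >= 1");
        }
        this.enableNamesrvCheck = enableNamesrvCheck;
        this.maxFailedCnt = maxFailedCnt;
        this.closeUnhealthyChannel = closeUnhealthyChannel;
    }

    /**
     * Runs one scheduled route update.
     * Returns true if the unhealthy channel was closed during this call.
     */
    boolean runOnce(Callable<?> routeUpdate) {
        try {
            routeUpdate.call();
            failCount.set(0);            // success: reset the consecutive count
            return false;
        } catch (Exception e) {
            if (!enableNamesrvCheck) {
                return false;            // switch off: keep the existing behavior
            }
            if (failCount.incrementAndGet() >= maxFailedCnt) {
                failCount.set(0);        // start counting again after the close
                closeUnhealthyChannel.run();
                return true;
            }
            return false;
        }
    }

    int getFailCount() {
        return failCount.get();
    }
}
```

With a threshold of 3, two failures followed by a success leave the counter at 0, and only three failures in a row invoke the close callback. Resetting the counter after the close means a Namesrv that stays unreachable is closed again only after another full run of failures.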

Describe Alternatives You've Considered

  1. Rely only on the existing IdleStateHandler and TCP-layer SO_KEEPALIVE: The default OS keepalive interval is too long, and IdleStateHandler cannot directly detect a TCP disconnection, so these mechanisms alone cannot detect the problem quickly enough.
  2. Force the use of short-timeout RPC calls to detect anomalies: This can detect anomalies to some extent, but it may increase the false-positive rate under normal network fluctuations and reduce the success rate of business requests.
  3. Use TCP_USER_TIMEOUT: This parameter can detect a network disconnection faster, but it is unavailable or inconvenient to modify in the current environment and may introduce cross-platform compatibility issues.

Therefore, adding a health check switch, recording the number of consecutive failures, and actively disconnecting unhealthy Namesrv connections is a more reliable and flexible solution.

Additional Context

This improvement has been verified through local testing. While Namesrv is abnormal, the client accurately records the number of consecutive failures and actively disconnects the unhealthy channel once the configured threshold is reached. When Namesrv recovers, the scheduled task resets the failure count and continues to update the routing information as usual.
