ArcadeDB High Availability (HA) Subsystem Improvement Plan
Executive Summary
This document outlines a comprehensive plan to improve the ArcadeDB High Availability subsystem based on analysis of the feature/2043-ha-test branch compared to main. The branch introduces several key enhancements:
- ServerInfo Record - New structured representation for server addresses with alias support
- HACluster Class - Centralized cluster configuration management
- Docker/K8s Discovery - Alias mechanism for dynamic container environments
- Resilience Module - Testcontainers + Toxiproxy-based fault injection testing
However, several issues and incomplete implementations need attention before merging.
Current State Analysis
Changes from Main Branch
Total Changes: 67 files, +1,584 lines, -653 lines
Key Architectural Changes:
| Component | Main Branch | Feature Branch |
|---|---|---|
| Server Identifier | String (host:port) | ServerInfo record (host, port, alias) |
| Cluster Configuration | Set<String> | HACluster class |
| Replica Connections | Map<String, L2RNetworkExecutor> | Map<ServerInfo, L2RNetworkExecutor> |
| Server Address | String serverAddress | ServerInfo serverAddress |
| Enum Naming | SCREAMING_CASE | PascalCase (Java standard) |
New Components
- resilience Maven Module - Dedicated module for network failure testing
- ContainersTestTemplate - Base class for Testcontainers-based HA tests
- ServerInfo Record - Type-safe server address representation
- HACluster Class - Cluster topology abstraction
Identified Issues
Critical Issues
1. Incomplete Alias Resolution in Server Discovery
Location: HAServer.java:1062, SimpleHaScenarioIT.java:29-30
The alias mechanism {arcade2}proxy:8667 is parsed but not fully resolved during cluster formation:
// Error seen in logs:
// Error connecting to the remote Leader server {proxy}proxy:8666
// (error=Invalid host proxy:8667{arcade3}proxy:8668)
Root Cause: When the server list is built, entries are concatenated incorrectly: the log shows the {arcade3} entry appended directly to proxy:8667 with no separator.
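A minimal sketch of how `{alias}host:port` entries could be parsed so that a glued entry is rejected early with a clear message. This is not the branch's actual code: the class name `AliasParsingSketch`, the `parse` and `parseList` helpers, and the comma separator are assumptions made for illustration. Only the record shape (host, port, alias) comes from the table above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code in HAServer.java.
class AliasParsingSketch {

  // Stand-in for the branch's ServerInfo record (host, port, alias).
  record ServerInfo(String host, int port, String alias) {}

  // Parses one entry of the form "{alias}host:port" or "host:port".
  static ServerInfo parse(String entry) {
    String rest = entry.trim();
    String alias = "";
    if (rest.startsWith("{")) {
      int close = rest.indexOf('}');
      if (close < 0)
        throw new IllegalArgumentException("Unterminated alias in '" + entry + "'");
      alias = rest.substring(1, close);
      rest = rest.substring(close + 1);
    }
    // A brace left over at this point means two entries were glued together.
    if (rest.indexOf('{') >= 0 || rest.indexOf('}') >= 0)
      throw new IllegalArgumentException("Invalid host '" + rest + "'");
    int colon = rest.lastIndexOf(':');
    if (colon <= 0 || colon == rest.length() - 1)
      throw new IllegalArgumentException("Expected host:port in '" + entry + "'");
    int port = Integer.parseInt(rest.substring(colon + 1));
    return new ServerInfo(rest.substring(0, colon), port, alias);
  }

  // Parses a list of entries; the comma separator is an assumption.
  static List<ServerInfo> parseList(String serverList) {
    List<ServerInfo> result = new ArrayList<>();
    for (String entry : serverList.split(",")) {
      if (!entry.isBlank())
        result.add(parse(entry));
    }
    return result;
  }
}
```

Under these assumptions, `parseList("{arcade2}proxy:8667,{arcade3}proxy:8668")` yields two entries, while `parse("proxy:8667{arcade3}proxy:8668")` throws an `IllegalArgumentException` instead of passing the glued string on as a host name.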
2. setServerAddresses() Method is Commented Out
Location: HAServer.java:540-560
The method that should update cluster configuration from replicas is entirely commented out, leaving dead code.
3. removeServer() Still Uses String
Location: HAServer.java:749
public void removeServer(final String remoteServerName) {
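final Leader2ReplicaNetworkExecutor c = replicaConnections.remove(remoteServerName);
This is incompatible with the new Map<ServerInfo, ...> signature. Because Map.remove() accepts any Object, the call still compiles, but a String never equals a ServerInfo key: with a hash-based map it returns null and the replica connection is silently left in place (a sorted map would throw ClassCastException instead).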
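The sketch below illustrates the mismatch and one possible fix: resolve the name to its ServerInfo key first, then remove by key. Everything here is hypothetical scaffolding. `RemoveServerSketch`, the `Connection` placeholder, and the rule of matching a name against either the alias or `host:port` are assumptions, not the project's API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the code in HAServer.java.
class RemoveServerSketch {

  record ServerInfo(String host, int port, String alias) {}

  // Placeholder for the real replica executor type.
  record Connection(String label) {}

  private final Map<ServerInfo, Connection> replicaConnections = new ConcurrentHashMap<>();

  void addServer(ServerInfo server, Connection connection) {
    replicaConnections.put(server, connection);
  }

  // Mirrors the current call: compiles, but a String never matches a ServerInfo key.
  Connection removeByRawString(String remoteServerName) {
    return replicaConnections.remove(remoteServerName);
  }

  // Assumed fix: find the key whose alias or host:port equals the name, then remove it.
  Optional<Connection> removeServer(String remoteServerName) {
    for (ServerInfo key : replicaConnections.keySet()) {
      String address = key.host() + ":" + key.port();
      if (key.alias().equals(remoteServerName) || address.equals(remoteServerName))
        return Optional.ofNullable(replicaConnections.remove(key));
    }
    return Optional.empty();
  }

  int size() {
    return replicaConnections.size();
  }
}
```

Changing the parameter type to ServerInfo would be the stricter alternative, since the compiler would then flag every remaining String-based caller.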
4. HTTP Address Propagation Disabled
Location: HAServer.java:697-728
Methods setReplicasHTTPAddresses() and getReplicaServersHTTPAddressesList() are commented out, breaking client redirect functionality.
5. Test Assertions Wrong in ThreeInstancesScenarioIT
Location: ThreeInstancesScenarioIT.java:103-105
// When arcade1 is disconnected, arcade2 and arcade3 should have data
// But test asserts on arcade1 which is disconnected!
db1.assertThatUserCountIs(130); // arcade1 is DISCONNECTED - can't assert
db2.assertThatUserCountIs(130); // correct
db3.assertThatUserCountIs(130); // correct
Moderate Issues
6. ReplicationServerQuorumNoneIT Reliability
Location: ReplicationServerQuorumNoneIT.java:40-47
While timeout increases help, the fundamental issue is that async replication with QUORUM=NONE allows unbounded queue growth. The test reduces load but doesn't address the underlying design issue.
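One common way to cap this kind of growth is a bounded queue that pushes back on the producer. The sketch below shows the idea only; it is not how ArcadeDB's replication queue is implemented, and the class name, the String message type, and the choice to return false on a full queue are all assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of backpressure on an async replication queue.
class BoundedReplicationQueueSketch {

  private final BlockingQueue<String> queue;

  BoundedReplicationQueueSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  // Waits up to timeoutMs for space; returns false instead of growing without bound.
  boolean enqueue(String message, long timeoutMs) {
    try {
      return queue.offer(message, timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  // Returns the next message, or null if the queue is empty.
  String poll() {
    return queue.poll();
  }

  int size() {
    return queue.size();
  }
}
```

What the caller does when `enqueue` returns false (retry, drop the replica, or fail the write) is the actual design decision and is left open here.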
7. Missing ServerInfo equals/hashCode Consideration
Location: HAServer.java:83-97
The ServerInfo record should have explicit documentation about identity semantics. Currently it compares all three fields, but for cluster membership, only host:port should matter.
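If host:port is chosen as the identity, a record can override the generated equals() and hashCode() to leave the alias out. The following is a sketch under that assumption, not the record as it exists on the branch; the wrapper class name is hypothetical.

```java
import java.util.Objects;

// Hypothetical sketch of host:port identity for ServerInfo.
class ServerIdentitySketch {

  record ServerInfo(String host, int port, String alias) {

    ServerInfo {
      Objects.requireNonNull(host, "host");
    }

    // Identity is host and port only; the alias is treated as a display label.
    @Override
    public boolean equals(Object o) {
      return o instanceof ServerInfo other && port == other.port && host.equals(other.host);
    }

    // Must stay consistent with equals, so the alias is excluded here too.
    @Override
    public int hashCode() {
      return Objects.hash(host, port);
    }
  }
}
```

With this definition, two ServerInfo values that differ only in alias collapse into one entry in a HashSet or as a HashMap key, which is the behaviour wanted for cluster membership but worth documenting because it differs from the record default.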
8. Thread Safety Concerns
Location: HAServer.java:82 (cluster field)
private HACluster cluster; // Not volatile, accessed from multiple threads
The cluster field is modified in setServerAddresses() (once that method is re-enabled) and read during elections without synchronization.
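A minimal sketch of one safe-publication pattern for this field: keep the cluster object immutable and swap the reference through a volatile field. `ClusterHolderSketch` and `ClusterSnapshot` are hypothetical stand-ins, and whether HACluster can be made immutable is an assumption.

```java
import java.util.Set;

// Hypothetical sketch, not the code in HAServer.java.
class ClusterHolderSketch {

  // Immutable stand-in for HACluster; the defensive copy prevents later mutation.
  record ClusterSnapshot(Set<String> servers) {
    ClusterSnapshot {
      servers = Set.copyOf(servers);
    }
  }

  // volatile makes a newly assigned snapshot visible to all reader threads.
  private volatile ClusterSnapshot cluster = new ClusterSnapshot(Set.of());

  // Writer side: build a fresh snapshot and publish it with a single write.
  void setServerAddresses(Set<String> servers) {
    cluster = new ClusterSnapshot(servers);
  }

  // Reader side: read the field once so the whole computation sees one snapshot.
  int countServersForElection() {
    ClusterSnapshot snapshot = cluster;
    return snapshot.servers().size();
  }
}
```

This covers visibility only. If an update must be derived from the current value (read, modify, write), an AtomicReference with compareAndSet or a lock would be needed instead.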
Minor Issues
9. Debug Logging Left in Code
Location: HAServer.java:640, 1078, 1085
Several Level.INFO log statements should be Level.FINE:
LogManager.instance().log(this, Level.INFO, "Sending request (%s) to %s", ...);
LogManager.instance().log(this, Level.INFO, "Creating client connection to '%s'", ...);
10. Commented-Out Code in ReplicationServerIT
Location: ReplicationServerIT.java:1
The file starts with ver /* (typo).