High Availability (HA) Subsystem Improvement Plan #2043

@robfrank

ArcadeDB High Availability (HA) Subsystem Improvement Plan

Executive Summary

This document outlines a plan to improve the ArcadeDB High Availability subsystem, based on an analysis of the feature/2043-ha-test branch compared to main. The branch introduces four key enhancements:

  1. ServerInfo Record - New structured representation for server addresses with alias support
  2. HACluster Class - Centralized cluster configuration management
  3. Docker/K8s Discovery - Alias mechanism for dynamic container environments
  4. Resilience Module - Testcontainers + Toxiproxy-based fault injection testing

However, several issues and incomplete implementations need attention before merging.


Current State Analysis

Changes from Main Branch

Total Changes: 67 files, +1,584 lines, -653 lines

Key Architectural Changes:

| Component             | Main Branch                     | Feature Branch                        |
|-----------------------|---------------------------------|---------------------------------------|
| Server Identifier     | String (host:port)              | ServerInfo record (host, port, alias) |
| Cluster Configuration | Set<String>                     | HACluster class                       |
| Replica Connections   | Map<String, L2RNetworkExecutor> | Map<ServerInfo, L2RNetworkExecutor>   |
| Server Address        | String serverAddress            | ServerInfo serverAddress              |
| Enum Naming           | SCREAMING_CASE                  | PascalCase (Java standard)            |

New Components

  1. resilience Maven Module - Dedicated module for network failure testing
  2. ContainersTestTemplate - Base class for Testcontainers-based HA tests
  3. ServerInfo Record - Type-safe server address representation
  4. HACluster Class - Cluster topology abstraction

Identified Issues

Critical Issues

1. Incomplete Alias Resolution in Server Discovery

Location: HAServer.java:1062, SimpleHaScenarioIT.java:29-30

An aliased address such as {arcade2}proxy:8667 is parsed, but the alias is not fully resolved during cluster formation:

// Error seen in logs:
// Error connecting to the remote Leader server {proxy}proxy:8666
// (error=Invalid host proxy:8667{arcade3}proxy:8668)

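Root Cause: When the server list is built, entries appear to be concatenated without being split first, so the next entry's {alias} prefix ends up inside the previous entry's port (proxy:8667{arcade3}proxy:8668 in the log above).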
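A minimal sketch of the parsing step, to make the expected behaviour concrete. It assumes the server list is comma-separated and that each entry has the form {alias}host:port or host:port. The ServerInfoParser class and its nested ServerInfo record are hypothetical stand-ins, not the code on the branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the real parsing lives in HAServer and may differ.
class ServerInfoParser {

  // Simplified stand-in for the branch's ServerInfo record.
  record ServerInfo(String host, int port, String alias) {}

  // Parses one entry of the form "{alias}host:port" or "host:port".
  static ServerInfo parse(final String entry) {
    String rest = entry.trim();
    String alias = ""; // assumption: an empty alias means "no alias given"
    if (rest.startsWith("{")) {
      final int close = rest.indexOf('}');
      if (close < 0)
        throw new IllegalArgumentException("Unterminated alias in '" + entry + "'");
      alias = rest.substring(1, close);
      rest = rest.substring(close + 1);
    }
    final int colon = rest.lastIndexOf(':');
    // A second '{' means two entries were glued together: fail fast with a clear message.
    if (colon <= 0 || rest.indexOf('{') >= 0)
      throw new IllegalArgumentException("Invalid host '" + rest + "'");
    final String host = rest.substring(0, colon);
    final int port = Integer.parseInt(rest.substring(colon + 1));
    return new ServerInfo(host, port, alias);
  }

  // Splits the list before parsing, so entries are never concatenated.
  static List<ServerInfo> parseList(final String serverList) {
    final List<ServerInfo> result = new ArrayList<>();
    for (final String entry : serverList.split(","))
      if (!entry.isBlank())
        result.add(parse(entry));
    return result;
  }
}
```

With this shape, a glued string such as {proxy}proxy:8667{arcade3}proxy:8668 is rejected at parse time instead of surfacing later as a connection error.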

2. setServerAddresses() Method is Commented Out

Location: HAServer.java:540-560

The method that should update cluster configuration from replicas is entirely commented out, leaving dead code.

3. removeServer() Still Uses String

Location: HAServer.java:749

public void removeServer(final String remoteServerName) {
    final Leader2ReplicaNetworkExecutor c = replicaConnections.remove(remoteServerName);

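This no longer matches the new Map<ServerInfo, ...> key type. Because Map.remove takes an Object, the call still compiles, but with a hash-based map a String never equals a ServerInfo key, so remove() returns null and the replica is never removed. A sorted map would throw ClassCastException instead.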
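One possible way to close the type gap is sketched below. ReplicaRegistry, Connection, and the alias-based overload are hypothetical simplifications. Connection stands in for Leader2ReplicaNetworkExecutor, and closing the connection on removal is an assumption about the intended behaviour.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the replica registry; not the actual HAServer code.
class ReplicaRegistry {

  record ServerInfo(String host, int port, String alias) {}

  // Stand-in for Leader2ReplicaNetworkExecutor.
  static final class Connection {
    boolean closed;

    void close() {
      closed = true;
    }
  }

  private final Map<ServerInfo, Connection> replicaConnections = new ConcurrentHashMap<>();

  void addServer(final ServerInfo server, final Connection connection) {
    replicaConnections.put(server, connection);
  }

  // Typed removal: the argument type matches the map key, so the entry is found.
  boolean removeServer(final ServerInfo server) {
    final Connection c = replicaConnections.remove(server);
    if (c == null)
      return false;
    c.close(); // assumption: removal should also close the connection
    return true;
  }

  // Alias-based removal for callers that only know the server name.
  boolean removeServer(final String alias) {
    for (final ServerInfo s : replicaConnections.keySet())
      if (s.alias().equals(alias))
        return removeServer(s);
    return false;
  }

  int size() {
    return replicaConnections.size();
  }
}
```

Keeping a String overload that resolves the alias first lets existing callers migrate gradually, while the map itself is only ever queried with ServerInfo keys.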

4. HTTP Address Propagation Disabled

Location: HAServer.java:697-728

Methods setReplicasHTTPAddresses() and getReplicaServersHTTPAddressesList() are commented out, breaking client redirect functionality.

5. Incorrect Test Assertions in ThreeInstancesScenarioIT

Location: ThreeInstancesScenarioIT.java:103-105

// When arcade1 is disconnected, arcade2 and arcade3 should have data
// But test asserts on arcade1 which is disconnected!
db1.assertThatUserCountIs(130);  // arcade1 is DISCONNECTED - can't assert
db2.assertThatUserCountIs(130);  // correct
db3.assertThatUserCountIs(130);  // correct

Moderate Issues

6. ReplicationServerQuorumNoneIT Reliability

Location: ReplicationServerQuorumNoneIT.java:40-47

While timeout increases help, the fundamental issue is that async replication with QUORUM=NONE allows unbounded queue growth. The test reduces load but doesn't address the underlying design issue.

7. Missing ServerInfo equals/hashCode Consideration

Location: HAServer.java:83-97

The ServerInfo record should explicitly document its identity semantics. As a record, its generated equals/hashCode compare all three fields (host, port, alias), but for cluster membership only host:port should matter.
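The host:port identity described above could be expressed by overriding equals and hashCode on the record, as in this sketch. The wrapper class ServerIdentityExample and the distinctMembers helper are hypothetical and exist only to make the example self-contained.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration; the actual ServerInfo record is declared in HAServer.
class ServerIdentityExample {

  // Identity is host:port only; the alias is treated as a display label.
  record ServerInfo(String host, int port, String alias) {

    @Override
    public boolean equals(final Object o) {
      return o instanceof ServerInfo other && port == other.port && host.equals(other.host);
    }

    @Override
    public int hashCode() {
      return Objects.hash(host, port);
    }
  }

  // Counts cluster members after de-duplicating by identity.
  static int distinctMembers(final ServerInfo... servers) {
    return new HashSet<>(List.of(servers)).size();
  }
}
```

With this override, two entries that share host and port but carry different aliases collapse into one set member or map key. Whether that is the desired behaviour is the decision this issue asks to have documented.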

8. Thread Safety Concerns

Location: HAServer.java:82 (cluster field)

private HACluster cluster;  // Not volatile, accessed from multiple threads

The cluster field is written in setServerAddresses() (currently commented out, see issue 2) and read during elections without synchronization, so a reader thread is not guaranteed to see the latest assignment.
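A common remedy for this pattern is to publish an immutable snapshot through a volatile field, sketched here under simplifying assumptions. ClusterHolder and the Set<String>-based HACluster record are hypothetical stand-ins, and the real HACluster may need a different approach if it is mutable.

```java
import java.util.Set;

// Hypothetical holder illustrating safe publication; not the actual HAServer code.
class ClusterHolder {

  // Immutable stand-in for HACluster: the set is defensively copied on construction.
  record HACluster(Set<String> servers) {
    HACluster {
      servers = Set.copyOf(servers);
    }
  }

  // volatile guarantees that readers see the most recently assigned snapshot.
  private volatile HACluster cluster = new HACluster(Set.of());

  // Writers replace the whole snapshot instead of mutating it in place.
  void setServerAddresses(final Set<String> servers) {
    cluster = new HACluster(servers);
  }

  int clusterSize() {
    final HACluster snapshot = cluster; // one volatile read, then work on the local copy
    return snapshot.servers().size();
  }
}
```

Reading the field once into a local variable keeps each election step on a consistent view, even if another thread swaps the snapshot in the meantime.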

Minor Issues

9. Debug Logging Left in Code

Location: HAServer.java:640, 1078, 1085

Several Level.INFO log statements should be Level.FINE:

LogManager.instance().log(this, Level.INFO, "Sending request (%s) to %s", ...);
LogManager.instance().log(this, Level.INFO, "Creating client connection to '%s'", ...);

10. Commented-Out Code in ReplicationServerIT

Location: ReplicationServerIT.java:1

The file starts with ver /* (typo).


Improvement Plan

Phase 1: Fix Critical Bugs (Priority: HIGH)

Task 1.1: Fix Alias Resolution

Task 1.2: Fix removeServer() Type Mismatch

Task 1.3: Re-enable HTTP Address Propagation

Phase 2: Complete ServerInfo Migration (Priority: HIGH)

Task 2.1: Update All Server Identifier Usage

Task 2.2: Update UpdateClusterConfiguration

Task 2.3: Implement setServerAddresses Properly

Phase 3: Improve Docker/K8s Discovery (Priority: MEDIUM)

Task 3.1: Implement DNS-Based Discovery

Task 3.2: Kubernetes Headless Service Support

Task 3.3: Add Health Check Endpoint Enhancements

Phase 4: Improve Resilience Testing (Priority: MEDIUM)

Task 4.1: Complete Toxiproxy Integration

Task 4.2: Add Chaos Engineering Test Cases

Task 4.3: Database Comparison After Tests

Phase 5: Improve Test Infrastructure (Priority: LOW)

Task 5.1: Extract Common Test Utilities

Task 5.2: Add Performance Benchmarks

Task 5.3: Improve Test Reliability
