[Feature] Remove Etcd Dependency via DNS-Based Node Discovery #13621

@hanahmily

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

1. Context & Motivation

Current State: BanyanDB currently relies on etcd as a hard dependency for cluster coordination, metadata storage, and node discovery (Meta Nodes). This requires maintaining a separate etcd cluster, managing leases for health checks, and handling complex certificate management for secure communication.

Goal: Transform BanyanDB into a "Zero-Dependency" architecture by replacing the etcd-based registry with a decentralized DNS-based Node Discovery mechanism. This simplifies deployment on Kubernetes (StatefulSets) and static environments (VMs/Edge).


2. Technical Design Specification

2.1 Core Abstraction: NodeRegistry

We will introduce a modular NodeRegistry interface to decouple the discovery logic from the specific implementation.

  • Old Flow: Liaison -> Watch etcd Key -> Update gRPC Connection.
  • New Flow: Liaison -> Poll NodeRegistry -> Update gRPC Connection.
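
As a rough illustration of the new flow, the sketch below shows one possible shape for the interface and the bookkeeping a poller needs. The names `Endpoint`, `ListNodes`, `FixedRegistry`, and `Diff` are assumptions made for this example and are not part of the proposal. Because polling returns whole snapshots rather than watch events, the caller has to compute what changed between two polls before opening or closing gRPC connections.

```go
package main

import "context"

// Endpoint is a hypothetical minimal address record (not the full Node struct).
type Endpoint struct {
	Host string
	Port uint16
}

// NodeRegistry is an assumed shape for the abstraction: callers poll it for
// the current snapshot instead of watching an etcd key.
type NodeRegistry interface {
	ListNodes(ctx context.Context) ([]Endpoint, error)
}

// FixedRegistry is a trivial in-memory implementation, handy for tests.
type FixedRegistry struct {
	Nodes []Endpoint
}

// ListNodes returns a copy so callers cannot mutate the registry's state.
func (r *FixedRegistry) ListNodes(ctx context.Context) ([]Endpoint, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return append([]Endpoint(nil), r.Nodes...), nil
}

// Compile-time check that FixedRegistry satisfies NodeRegistry.
var _ NodeRegistry = (*FixedRegistry)(nil)

// Diff reports which endpoints appeared and which disappeared between two
// polls. Duplicates within a snapshot are reported at most once.
func Diff(prev, next []Endpoint) (added, removed []Endpoint) {
	prevSet := make(map[Endpoint]struct{}, len(prev))
	for _, e := range prev {
		prevSet[e] = struct{}{}
	}
	nextSet := make(map[Endpoint]struct{}, len(next))
	for _, e := range next {
		if _, dup := nextSet[e]; dup {
			continue
		}
		nextSet[e] = struct{}{}
		if _, known := prevSet[e]; !known {
			added = append(added, e)
		}
	}
	for _, e := range prev {
		if _, present := nextSet[e]; present {
			continue
		}
		if _, pending := prevSet[e]; pending {
			removed = append(removed, e)
			delete(prevSet, e) // report each vanished endpoint once
		}
	}
	return added, removed
}
```

A poller would call `ListNodes` on each tick, pass the previous and current snapshots to `Diff`, then dial the `added` endpoints and close connections to the `removed` ones.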

2.2 Discovery Mechanism (DNS)

The primary implementation will be the DNS Registry, operating in a "Pull-based" model.

  • Query Strategy:

    1. Primary: Query SRV Records (RFC 2782) to discover target hostnames and dynamic ports (critical for K8s Headless Services).
    2. Fallback (Static Registry): To support environments without DNS, or for emergency overrides, load a fixed list of peers from a local file (topology.yml). This file must support hot reloading.
  • Polling & Caching:

    • Implement a custom gRPC resolver (Go) that polls DNS at a configurable interval (default: 30s). During startup, the interval should be shorter (5s) so that topology changes are picked up quickly while the cluster is forming. Each of the two intervals should have its own flag.
    • Two-Layer Caching: Respect DNS TTL (Infrastructure layer) and maintain an internal snapshot (Application layer).
  • Resilience (Serve Stale):

    • If the DNS server returns a failure (e.g., SERVFAIL, Timeout), the resolver MUST NOT flush the current address list.
    • It must log a warning and return the stale (last known good) list of addresses to ensure partition tolerance.
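
The SRV query and the serve-stale rule could be sketched as follows. This is a minimal illustration, not the proposed resolver: `LookupFunc`, `SRVLookup`, `ServeStale`, and `errNoRecords` are names invented for the example, and treating an empty SRV answer as a failure is an assumption the proposal does not state. The only library calls used are `net.DefaultResolver.LookupSRV` and `net.JoinHostPort` from the Go standard library.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"strconv"
	"strings"
	"sync"
	"time"
)

// errNoRecords marks an empty SRV answer. Treating it as a failure is an
// assumption of this sketch, not something the proposal specifies.
var errNoRecords = errors.New("discovery: SRV answer contained no records")

// LookupFunc resolves a DNS name to "host:port" addresses (hypothetical type).
type LookupFunc func(name string) ([]string, error)

// SRVLookup builds a LookupFunc on top of the standard library SRV query.
func SRVLookup(service, proto string, timeout time.Duration) LookupFunc {
	return func(name string) ([]string, error) {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		_, records, err := net.DefaultResolver.LookupSRV(ctx, service, proto, name)
		if err != nil {
			return nil, err
		}
		if len(records) == 0 {
			return nil, errNoRecords
		}
		addrs := make([]string, 0, len(records))
		for _, r := range records {
			host := strings.TrimSuffix(r.Target, ".") // drop the root dot
			addrs = append(addrs, net.JoinHostPort(host, strconv.Itoa(int(r.Port))))
		}
		return addrs, nil
	}
}

// ServeStale keeps the last known good address list across failed lookups.
type ServeStale struct {
	mu     sync.Mutex
	lookup LookupFunc
	last   []string
}

// NewServeStale wraps a lookup function with serve-stale behavior.
func NewServeStale(lookup LookupFunc) *ServeStale {
	return &ServeStale{lookup: lookup}
}

// Refresh performs one poll. On failure it logs a warning and returns the
// previous snapshot instead of flushing it.
func (s *ServeStale) Refresh(name string) []string {
	addrs, err := s.lookup(name)
	s.mu.Lock()
	defer s.mu.Unlock()
	if err != nil {
		log.Printf("WARN discovery: lookup of %q failed, serving %d stale address(es): %v",
			name, len(s.last), err)
	} else {
		s.last = addrs
	}
	return append([]string(nil), s.last...) // copy so callers cannot mutate the snapshot
}
```

A caller would construct `NewServeStale(SRVLookup(service, proto, timeout))` with values appropriate to its deployment and invoke `Refresh` on every polling tick. Keeping the lookup behind a function type also lets tests inject failures without a DNS server.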

2.3 Peer Discovery

  • Liaison Node Discovery: Liaison nodes will discover the data nodes.
  • Data Node Mesh: Data nodes will discover peers by resolving the same DNS name they publish themselves.
  • Lifecycle: Hot nodes discover Warm/Cold nodes.

2.4 Two-Phase Discovery

Instead of reading the full Node struct from etcd before connecting, the Liaison/Data node will first connect via DNS and then query the node directly for its details.

This requires a new gRPC service that returns the Node.
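
The two phases can be pictured with the sketch below. It is only an outline under stated assumptions: `NodeInfo` stands in for the real Node struct, `FetchFunc` stands in for the not-yet-defined gRPC call, and `ErrNodeUnreachable` and `Discover` are hypothetical names. Skipping unreachable peers rather than failing the whole round is also a choice made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeInfo is a hypothetical stand-in for the Node struct returned by a peer.
type NodeInfo struct {
	Name  string
	Roles []string
}

// ErrNodeUnreachable is a hypothetical sentinel a FetchFunc may return when
// the peer cannot be contacted.
var ErrNodeUnreachable = errors.New("discovery: node unreachable")

// FetchFunc stands in for the proposed gRPC call that asks a node for its
// own details (phase 2).
type FetchFunc func(addr string) (NodeInfo, error)

// Discover takes the addresses produced by DNS (phase 1) and queries each
// node directly (phase 2). Unreachable nodes are skipped and their errors
// are collected so one bad peer does not block the rest.
func Discover(addrs []string, fetch FetchFunc) (map[string]NodeInfo, []error) {
	nodes := make(map[string]NodeInfo, len(addrs))
	var errs []error
	for _, addr := range addrs {
		info, err := fetch(addr)
		if err != nil {
			errs = append(errs, fmt.Errorf("fetch node info from %s: %w", addr, err))
			continue
		}
		nodes[addr] = info
	}
	return nodes, errs
}
```

The returned map is keyed by address, so a later polling round can tell which peers still need their details fetched.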

2.5 Troubleshooting DNS Discovery

In the absence of etcdctl, operators need new tools.

State gRPC service: bydbctl/UI -> calls (Liaison/Data).GetClusterState() -> returns the internal node list derived from DNS. In the future, the service may return additional internal state beyond what DNS provides.

Metrics: The following new metrics are required:

  • discovery_dns_lookup_duration_seconds
  • discovery_dns_lookup_failures_total
  • discovery_cluster_size (Gauge)

3. Task List

  • Implement DNSNodeRegistry with net.LookupSRV and net.LookupHost.
  • Implement StaticNodeRegistry for fallback/file-based discovery.
  • Update Helm Charts.
  • Create E2E test suite for startup.
  • Update documentation (concept and operations documents).

Use case

No response

Related issues

No response

Are you willing to submit a pull request to implement this on your own?

  • Yes I am willing to submit a pull request on my own!

Labels

database (BanyanDB - SkyWalking native database), enhancement (Enhancement on performance or codes)
