Columnar Hub panics when TiFlash starts before PD cluster bootstrap completes #10859

@JaySon-Huang

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Build TiFlash with next-gen columnar enabled (includes contrib/tiflash-columnar-hub, introduced by *: refactor proxy to hub lib for columnar #10849).

  2. Copy the binary into the integration-test mount path:

    cmake --workflow --preset dev
    cp cmake-build-debug/dbms/src/Server/tiflash tests/.build/tiflash/tiflash

  3. Start a fresh next-gen disaggregated test cluster (TiFlash compute node with use_columnar = true):

    cd tests/fullstack-test-next-gen
    ./compose.sh down --remove-orphans
    rm -rf data log
    ./compose.sh up -d

  4. Wait only a few seconds (do not manually restart TiFlash), then verify TiFlash is down:

    ./compose.sh exec -T tiflash-cn0 bash -c \
      '/tiflash/tiflash client --host 127.0.0.1 --port 9000 --query "select 1"'

    Or run any integration test immediately:

    ./compose.sh exec -T tiflash-cn0 bash -c \
      'cd /tests && ENABLE_NEXT_GEN=true ./run-test.sh fullstack-test/sample.test'

Environment notes

  • TiFlash config: tests/docker/next-gen-config/tiflash_cn.toml with [flash] use_columnar = true.
  • Compose layout: tests/fullstack-test-next-gen/disagg_tiflash.rocky9.yaml.
  • tiflash-cn0 has depends_on: [minio0, tikv0], but depends_on only waits for those containers to start, not for the PD cluster bootstrap to finish.

Root cause in code

In contrib/tiflash-columnar-hub/hub-runtime/src/run.rs, Columnar Hub registers its store to PD without retry and panics on the first failure:

pd_client.put_store(store).unwrap_or_else(|err| {
    panic!(
        "failed to register TiFlash Columnar Hub store {} to PD: {}",
        store_id, err
    )
});

When TiFlash starts in parallel with TiKV, this call can happen before TiKV bootstraps the PD cluster.
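
A retrying registration would tolerate this ordering. The following is a minimal sketch, not the hub's actual code: `PdError`, `put_store_with_retry`, and the backoff parameters are hypothetical names introduced for illustration, and the real `pd_client` error type is not shown in this report. The closure argument stands in for `pd_client.put_store(store)`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type for this sketch; the real PD client error type
/// is not shown in this issue.
#[derive(Debug, Clone, PartialEq)]
pub enum PdError {
    /// PD has a cluster ID but the cluster has not been bootstrapped yet.
    NotBootstrapped(u64),
    /// Any other failure; treated as non-retriable here.
    Other(String),
}

/// Calls `put_store` until it succeeds, retrying only on `NotBootstrapped`
/// with exponential backoff capped at `max_backoff`.
/// Returns the number of attempts used, or the last error once
/// `max_attempts` is exhausted or a non-retriable error is seen.
pub fn put_store_with_retry<F>(
    mut put_store: F,
    max_attempts: u32,
    initial_backoff: Duration,
    max_backoff: Duration,
) -> Result<u32, PdError>
where
    F: FnMut() -> Result<(), PdError>,
{
    let mut backoff = initial_backoff;
    let mut attempt = 0;
    loop {
        attempt += 1;
        match put_store() {
            Ok(()) => return Ok(attempt),
            // Retriable: PD is reachable but TiKV has not bootstrapped it yet.
            Err(PdError::NotBootstrapped(_)) if attempt < max_attempts => {
                sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
            // Non-retriable error, or the retry budget is used up.
            Err(err) => return Err(err),
        }
    }
}
```

With a closure that fails twice with `NotBootstrapped` and then succeeds, the function returns `Ok(3)`. Any other error is returned on the first attempt, so a genuine misconfiguration still fails fast instead of being hidden behind retries.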

2. What did you expect to see? (Required)

  • TiFlash should start successfully even when it comes up at roughly the same time as TiKV.
  • Columnar Hub should retry (or wait) until PD reports the cluster is bootstrapped, then register its store.
  • TiFlash should listen on tcp_port (9000) and integration tests should proceed normally.

3. What did you see instead (Required)

TiFlash aborts during startup (~8 seconds after launch). The proxy thread panics inside Columnar Hub; because the panic reaches a function that cannot unwind, the Rust runtime aborts, which raises SIGABRT in the main process.

Container stdout (compose logs tiflash-cn0):

thread '<unnamed>' panicked at hub-runtime/src/run.rs:1200:13:
failed to register TiFlash Columnar Hub store 1 to PD: cluster 7642623563617450895 is not bootstrapped
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' panicked at library/core/src/panicking.rs:218:5:
panic in a function that cannot unwind
thread caused non-unwinding panic. aborting.

TiFlash error log:

[ERROR] [BaseDaemon.cpp:368] ["(from thread 2) Received signal Aborted(6)."]

Stack trace points to run_raftstore_proxy_ffi at contrib/tiflash-columnar-hub/hub-runtime/src/lib.rs:70, called from ProxyStateMachine.h:272.

After the crash, TiFlash does not listen on port 9000. Running tests fails immediately with:

Code: 210. DB::NetException: Connection refused: (127.0.0.1:9000)

Workaround (confirms this is a startup race)

After TiKV has bootstrapped PD (typically 15–30 seconds after compose up), manually restarting TiFlash succeeds:

./compose.sh restart tiflash-cn0
sleep 15
./compose.sh exec -T tiflash-cn0 bash -c \
  '/tiflash/tiflash client --host 127.0.0.1 --port 9000 --query "select 1"'
# returns: 1

At failure time, PD already has a cluster ID but no bootstrapped store; once TiKV has bootstrapped the cluster, the PD endpoint /pd/api/v1/stores shows the TiKV store as Up.

Suggested fix directions

  1. Retry put_store while PD returns "cluster is not bootstrapped" (similar to TiKV startup behavior).
  2. Or delay TiFlash startup until PD cluster bootstrap completes (compose healthcheck / init container).
  3. Avoid panicking across the FFI boundary; surface a retriable error instead of aborting the whole process.
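
For direction 3, one possible shape is to contain the panic at the FFI entry point and hand a status code back to the caller. This is only a sketch under assumptions: it requires the default `panic = "unwind"` strategy (`catch_unwind` catches nothing under `panic = "abort"`), and `start_hub`, `guarded`, `run_hub_ffi_sketch`, and the `HUB_*` codes are hypothetical names. The real signature of run_raftstore_proxy_ffi is not shown in this report.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical status codes handed back to the C++ caller.
pub const HUB_OK: i32 = 0;
pub const HUB_ERR_RETRIABLE: i32 = 1;
pub const HUB_ERR_PANIC: i32 = 2;

/// Hypothetical stand-in for the hub startup logic. It fails with an
/// ordinary error (not a panic) when PD is not bootstrapped yet.
fn start_hub(pd_bootstrapped: bool) -> Result<(), String> {
    if pd_bootstrapped {
        Ok(())
    } else {
        Err("cluster is not bootstrapped".to_string())
    }
}

/// Runs `body` and maps its outcome to a status code, so that neither an
/// `Err` nor a panic unwinds out of the calling `extern "C"` function.
fn guarded<F>(body: F) -> i32
where
    F: FnOnce() -> Result<(), String>,
{
    match catch_unwind(AssertUnwindSafe(body)) {
        Ok(Ok(())) => HUB_OK,
        Ok(Err(_)) => HUB_ERR_RETRIABLE,
        Err(_) => HUB_ERR_PANIC,
    }
}

/// C-ABI entry point. Symbol export attributes are omitted in this sketch.
pub extern "C" fn run_hub_ffi_sketch(pd_bootstrapped: bool) -> i32 {
    guarded(|| start_hub(pd_bootstrapped))
}
```

Under these assumptions the C++ side would see a nonzero return code instead of SIGABRT, and could decide whether to retry, log, or shut down cleanly.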

4. What is your TiFlash version? (Required)

Release Version: v9.0.0-beta.2.pre-168-g4f11187d88
Git Commit Hash: 4f11187d881270c820a8454008518a47dca8c1f5
Git Branch:      jayson/temp_fix
UTC Build Time:  2026-05-22 07:46:24
Enable Features: ... next-gen columnar ...
Profile:         DEBUG

Related upstream change: #10849 (*: refactor proxy to hub lib for columnar).
