KVStore.Close closes pdClient before RegionCache, which may cause spurious "loadStore from PD failed" (grpc: the client connection is closing) errors during shutdown #1852

@fzzf678

What happened

We frequently see error logs like:

  [ERROR] [store_cache.go:471] ["loadStore from PD failed"] [id=...] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]

This happens when TiDB runs ADD INDEX with the Lightning local backend (local sort). Each DDL job creates its own tikv.KVStore and closes it when the job ends; the error is logged during that close and triggers false alerts.

Reproduction and Investigation

sysbench --db-driver=mysql --mysql-db=sysbench_2 --report-interval=10 --mysql-user=root  --mysql-password='' --time=0 --mysql-port=4604 --mysql-host=127.0.0.1 --tables=100 --table-size=4000000 --threads=100  --mysql-ignore-errors=3989,8113,8028 --db-ps-mode=disable  oltp_read_write prepare/run

To reproduce, run the sysbench workload above and keep submitting ADD INDEX statements. Using pingcap/tidb#65703, which adds debug logs (a UUID propagated to RegionCache/StoreCache, plus logs of the shutdown sequence), we can observe the following order for the same UUID:

2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[kv.go:421] ["pdClient closed"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884104296]
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[kv.go:434] ["before close region cache"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884115585] [len(s.regionCache.GetAllStores())=1]
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[store_cache.go:471] ["loadStore from PD failed"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884136022] [id=1] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"] [scheduler.closed()=true]
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[kv.go:436] ["regionCache closed"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884226546]

This shows that the PD client is closed first while RegionCache background tasks (store re-resolve / store list refresh) are still running and issuing PD RPCs, which then fail with grpc: the client connection is closing.
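
To make the ordering concrete, the sketch below computes each event's offset from the first log line using the ts_unix_ns values copied from the four log lines above. The event type and the offsets helper are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// event pairs a log message with its ts_unix_ns value.
type event struct {
	name string
	ns   int64
}

// shutdownEvents holds the timestamps copied from the debug logs in this issue.
var shutdownEvents = []event{
	{"pdClient closed", 1768998824884104296},
	{"before close region cache", 1768998824884115585},
	{"loadStore from PD failed", 1768998824884136022},
	{"regionCache closed", 1768998824884226546},
}

// offsets returns each event's delay relative to the first event.
func offsets(evs []event) []time.Duration {
	out := make([]time.Duration, len(evs))
	for i, e := range evs {
		out[i] = time.Duration(e.ns - evs[0].ns)
	}
	return out
}

func main() {
	for i, d := range offsets(shutdownEvents) {
		fmt.Printf("%-28s +%v\n", shutdownEvents[i].name, d)
	}
}
```

By this arithmetic, the failed loadStore call is logged roughly 32 µs after the PD client is closed and roughly 90 µs before RegionCache finishes closing, so it falls inside the window in which the background tasks are still alive.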

Root cause

In tikv/kv.go, (*KVStore).Close() currently closes its components in this order:

  • pdClient.Close() (and pdHttpClient.Close())
  • regionCache.Close()

But RegionCache has its own background runner/tickers (e.g. store list refresh) that may still call pdClient.GetStore/GetAllStores until regionCache.Close() cancels them and waits for the goroutines to exit. If the PD client is closed first, those background calls fail and emit error logs.
