Description
What happened
We frequently see error logs like:
[ERROR] [store_cache.go:471] ["loadStore from PD failed"] [id=...] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
This happens when TiDB runs ADD INDEX with the Lightning local backend (local sort). Each DDL job creates its own tikv.KVStore and closes it when the job finishes. The error is logged during that close and triggers false alerts.
Reproduction and Investigation
sysbench --db-driver=mysql --mysql-db=sysbench_2 --report-interval=10 --mysql-user=root --mysql-password='' --time=0 --mysql-port=4604 --mysql-host=127.0.0.1 --tables=100 --table-size=4000000 --threads=100 --mysql-ignore-errors=3989,8113,8028 --db-ps-mode=disable oltp_read_write prepare/run
While the sysbench workload is running, keep submitting ADD INDEX statements. With pingcap/tidb#65703, which adds debug logs (a UUID propagated to RegionCache/StoreCache, plus logs of the shutdown sequence), we can observe the following order for the same UUID:
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[kv.go:421] ["pdClient closed"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884104296]
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[kv.go:434] ["before close region cache"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884115585] [len(s.regionCache.GetAllStores())=1]
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[store_cache.go:471] ["loadStore from PD failed"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884136022] [id=1] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"] [scheduler.closed()=true]
2026-01-21 20:33:44 (UTC+08:00)TiDB 10.2.4.4:4604[kv.go:436] ["regionCache closed"] [uuid=8480523e-7c0e-45a4-9eeb-7c34c85576ba] [ts_unix_ns=1768998824884226546]
This shows that the PD client is closed first, while RegionCache background tasks (store re-resolve / store list refresh) are still running and issuing PD RPCs. Those calls then fail with grpc: the client connection is closing.
Root cause
In tikv/kv.go (*KVStore).Close(), the current close order is:
- pdClient.Close() (and pdHttpClient.Close())
- …
- regionCache.Close()
But RegionCache has its own background runner/tickers (e.g. store list refresh) that may still call pdClient.GetStore/GetAllStores until regionCache.Close() cancels the goroutines and waits for them to exit. If the PD client is closed first, those background calls fail and emit error logs.