Closed
Description
PR #13481 proposes a new method of acquiring diagnostic information in TiDB and exposing it through system tables so that users can query it with SQL. The proposal aims to improve the efficiency of cluster-wide information queries, state acquisition, log retrieval, one-click inspection, and fault diagnosis.
Note
This issue is a TODO list catalog that is used to clarify the split details of the entire feature. If you are interested in a part, use the following workflow:
- Select the module you are interested in.
- Create a new issue to claim the relevant work and describe the rough implementation in the new issue.
- File a new pull request.
Issues:
- Protocol Definition
- Define the `Diagnostics gRPC Service` and the related message type in kvproto: Define the `Diagnostics gRPC Service` and the relevant message type #13581 @lonng
- Information Collection
- Cluster Topology
- Add a system table to provide cluster topology
- The current implementation should be refined and the `ID` and `NAME` columns should be deleted: Remove the `ID` and `NAME` columns in `TIDB_CLUSTER_INFO` #13586 @lonng
- Cluster Configuration
- Add a system table to provide cluster configuration
- TiDB infoschema: add TIDB_CLUSTER_CONFIG virtual table to retrieve all instance config #13063 @lonng
- TiKV infoschema: add TIDB_CLUSTER_CONFIG virtual table to retrieve all instance config #13063 @lonng
- PD infoschema: add TIDB_CLUSTER_CONFIG virtual table to retrieve all instance config #13063 @lonng
- TiDB: Predicate pushdown: executor: refactor the way of retrieving remote component configuration #13832 @lonng
- Cluster Performance Sampling
- Add HTTP API for cluster components to get performance sample data
- Add system table to provide query via SQL
- Information Collection Framework:
- Pluggable Information Collection Framework to support extended information collection rules
- Information Collection Rules
- Hardware Information
- CPU information: number of physical cores, number of logical cores, NUMA information, CPU frequency, CPU vendor, L1/L2/L3 cache size
- NIC information: NIC device name, NIC enabled, manufacturer, model, bandwidth, driver version, interface queue number (optional)
- Disk information: disk name, disk capacity, disk usage, disk partition, mount information
- USB device list
- Memory information
- System Information
- Kernel information: sysctl -a / ulimit -a
- Process information: current process name, command line parameters, executable file path, pid, environment variables, memory, startup time, uid, gid, process status
- File descriptor information: available quantity, current used quantity
- Load Information
- CPU usage, 1/5/15 minute load
- Memory: Total/Free/Available/Buffers/Cached/Active/Inactive/Swap
- Disk IO:
- tps: The number of transfers per second that were issued to the device.
- rrqm/s: The number of read requests merged per second that were queued to the device.
- wrqm/s: The number of write requests merged per second that were queued to the device.
- r/s: The number (after merges) of read requests completed per second for the device.
- w/s: The number (after merges) of write requests completed per second for the device.
- rsec/s: The number of sectors (kilobytes, megabytes) read from the device per second.
- wsec/s: The number of sectors (kilobytes, megabytes) written to the device per second.
- await: The average time (in milliseconds) for I/O requests issued to the device to be served.
- %util: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device).
- TiKV: server: collect load/hardware/system information tikv/tikv#6135 @lonng
- TiDB/PD: server: add rpcserver to get other tidb server info for diagnostics #13693 @crazycs520
- Network IO:
- IFACE: name of the network interface for which statistics are reported.
- rxpck/s: total number of packets received per second.
- txpck/s: total number of packets transmitted per second.
- rxkB/s: total number of kilobytes received per second.
- txkB/s: total number of kilobytes transmitted per second.
- rxcmp/s: number of compressed packets received per second.
- txcmp/s: number of compressed packets transmitted per second.
- rxmcst/s: number of multicast packets received per second.
- TiKV: server: collect load/hardware/system information tikv/tikv#6135 @lonng
- TiDB/PD: server: add rpcserver to get other tidb server info for diagnostics #13693 @crazycs520
- System Info Tables
- Hardware Info
- Software Info
- Cluster Topology
- Cluster Memory Table
- Memory table global view *: support read tidb cluster memory table #13065 @crazycs520
- Memory table refactor
- Use the virtual table framework to manage the information schema: infoschema: place INFORMATION_SCHEMA in new virtual table framework #13696 @lonng
- Extract the `LogicalMemTable` part from `DataSource`: planner: extract a LogicalMemTable from DataSource to decouple memory/stored tables #13741 @lonng
- Predicate pushdown framework for virtual tables: planner: support push predicates down to the memory table #13821 @lonng
- Logging framework
- Log predicate pushdown
- Log LogReader executor implementation
- gRPC Service implementation
- Log system table
- Add seqID to the TiKV slow log. This helps indicate which SQL statement in a large transaction a log entry belongs to, because every statement in the transaction has the same start_ts.
- Metrics information framework
- Basic metrics information system table query framework
- Add `remote-metrics-storage` configuration
- Implement the first version of the PromQL query interface based on Proxy
- Metrics predicate pushdown
- Query expression mapping rules
- Metric information table query framework.
- Query metrics in TiDB with PromQL and present the results as a table: infoschema: add metric database/table to query cluster metric table. #13757 @crazycs520
- Diagnostics Framework
- Inspection schema *: implement the INSPECTION_SCHEMA to provide snapshot of inspection tables #14147 @lonng
- Diagnostics framework executor *: implement the diagnostics inspection framework #14114 @lonng
- Diagnostics common rules *: implement the diagnostics inspection framework #14114 @lonng
Teachability, Documentation, Adoption, Migration Strategy:
Proposal: #13481