
PD can't redistribute the hot write regions among TiFlash nodes #1235

Closed
JaySon-Huang opened this issue Nov 17, 2020 · 7 comments
Assignees
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@JaySon-Huang
Contributor

JaySon-Huang commented Nov 17, 2020

Version v4.0.8.

In one of our customers' production environments:

  • there are 6 TiFlash nodes
  • each TiFlash-replicated table has 2 replicas
  • each TiFlash node is deployed on 2 SSD disks

But when TiFlash gets busy writing, only 2 of the TiFlash nodes get busy (their IO util reaches 100%) while the others stay idle.

@JaySon-Huang JaySon-Huang added the type/enhancement The issue or PR belongs to an enhancement. label Nov 17, 2020
@JaySon-Huang
Contributor Author

JaySon-Huang commented Nov 20, 2020

Added a script to show the hot write regions in TiFlash stores: https://github.com/pingcap/tidb-ansible/pull/1359/files
Actually, it turns out that two TiFlash stores with a similar number of hot write regions show very different IO util: one is almost 60~90% while the other is only about 10%.

The result of running store-hot-regions.py in the customer's environment:
[image]
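
For reference, a minimal sketch (not the actual tidb-ansible script) of how such a per-store count can be obtained from PD's hot write region API; the PD address and the response field names "as_peer" and "regions_count" are assumptions about the v4.0 response shape:

# Minimal sketch: count hot write regions per store via PD's API.
# The PD address and the response field names ("as_peer", "regions_count")
# are assumptions; the real store-hot-regions.py may differ.
import requests

PD = "http://127.0.0.1:2379"  # hypothetical PD address

hot = requests.get(f"{PD}/pd/api/v1/hotspot/regions/write").json()
for store_id, stats in sorted(hot.get("as_peer", {}).items()):
    print(f"store {store_id}: {stats.get('regions_count', 0)} hot write regions")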

@solotzg
Contributor

solotzg commented Nov 24, 2020

message StoreStats {
    uint64 store_id = 1;
    // Capacity for the store.
    uint64 capacity = 2;
    // Available size for the store.
    uint64 available = 3;
    // Total region count in this store.
    uint32 region_count = 4;
    // Current sending snapshot count.
    uint32 sending_snap_count = 5;
    // Current receiving snapshot count.
    uint32 receiving_snap_count = 6;
    // When the store is started (unix timestamp in seconds).
    uint32 start_time = 7;
    // How many region is applying snapshot.
    uint32 applying_snap_count = 8;
    // If the store is busy
    bool is_busy = 9;
    // Actually used space by db
    uint64 used_size = 10;
    // Bytes written for the store during this period.
    uint64 bytes_written = 11;
    // Keys written for the store during this period.
    uint64 keys_written = 12;
    // Bytes read for the store during this period.
    uint64 bytes_read = 13;
    // Keys read for the store during this period.
    uint64 keys_read = 14;
    // Actually reported time interval
    TimeInterval interval = 15;
    // Threads' CPU usages in the store
    repeated RecordPair cpu_usages = 16;
    // Threads' read disk I/O rates in the store
    repeated RecordPair read_io_rates = 17;
    // Threads' write disk I/O rates in the store
    repeated RecordPair write_io_rates = 18;
    // Operations' latencies in the store
    repeated RecordPair op_latencies = 19;
}

For now, bytes_written, keys_written, bytes_read, and keys_read are not reported to PD.
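
To illustrate which heartbeat fields are involved, here is a hypothetical sketch assuming Python bindings generated from kvproto's pdpb.proto (the module name pdpb_pb2, the TimeInterval field names, and the sample values are assumptions for illustration only):

# Hypothetical sketch: the store heartbeat fields a TiFlash proxy would need
# to fill so that PD can see store-level write flow. Module name, TimeInterval
# field names, and values are assumptions for illustration.
import time
from pdpb_pb2 import StoreStats  # assumed protoc-generated module from pdpb.proto

now = int(time.time())
stats = StoreStats()
stats.store_id = 5                      # hypothetical TiFlash store id
stats.capacity = 2 * 1024 ** 4          # 2 TiB
stats.available = 1 * 1024 ** 4
stats.region_count = 1200
stats.start_time = now - 3600
# The fields this issue is about: without them the hot-write-region scheduler
# has no store-level write flow to balance on.
stats.bytes_written = 350 * 1024 ** 2   # bytes written during this interval
stats.keys_written = 800_000
stats.interval.start_timestamp = now - 10  # assumed TimeInterval field names
stats.interval.end_timestamp = now
print(stats)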

@JaySon-Huang JaySon-Huang self-assigned this Dec 3, 2020
@JaySon-Huang
Contributor Author

JaySon-Huang commented Dec 3, 2020

Unlike TiKV, which distinguishes the leader from the followers, every TiFlash peer is ready for reading, and TiDB chooses TiFlash peers in a round-robin way. So reporting bytes_read and keys_read for redistributing hot read regions is not very meaningful for us.

We only consider reporting correct bytes_written and keys_written to PD.

@JaySon-Huang
Contributor Author

JaySon-Huang commented Dec 9, 2020

I deployed a cluster with 1 TiDB + 1 PD + 1 TiKV + 2 TiFlash based on version v4.0.8.
The TiFlash branch and its proxy branch are https://github.com/JaySon-Huang/tics/tree/store_stats_4.0 and https://github.com/JaySon-Huang/tikv/tree/store_stats_4.0 . These two branches fix the problem that the written bytes and written keys at the store level are not reported to PD.

By running a sysbench workload on this cluster, I found that:

  • move-hot-write-region operations between TiFlash stores rarely happen.
  • While running the "oltp_update_index" workload, the write pressure on the two TiFlash nodes is imbalanced: one is about 10 times higher than the other. But PD still did not generate move-hot-write-region operations between the TiFlash stores.

Another problem, maybe related or not:
I used the PD API /pd/api/v1/hotspot/regions/write to check the stats of hot write regions. For the TiFlash node, the flow bytes obtained by summing all regions are about 4 times what the TiFlash store reported; a rough sketch of this check is shown below.
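
A rough sketch of the check, assuming the response field names ("as_peer", "statistics", "flow_bytes") used in v4.0, which I have not re-verified here:

# Sketch: sum the per-region write flow that PD tracks for one TiFlash store,
# to compare against the write flow the TiFlash store itself reports.
# The response field names ("as_peer", "statistics", "flow_bytes") are
# assumptions about the v4.0 response shape.
import requests

PD = "http://127.0.0.1:2379"   # hypothetical PD address
TIFLASH_STORE_ID = "4"         # hypothetical TiFlash store id

resp = requests.get(f"{PD}/pd/api/v1/hotspot/regions/write").json()
store = resp.get("as_peer", {}).get(TIFLASH_STORE_ID, {})
region_sum = sum(r.get("flow_bytes", 0) for r in store.get("statistics", []))

print(f"sum of hot-region write flow on store {TIFLASH_STORE_ID}: {region_sum} bytes")
# Compare this sum with the store-level write flow reported by the TiFlash
# store; in the test above the sum was roughly 4 times the reported value.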

@JaySon-Huang
Contributor Author

Will continue the discussion in tikv/pd#3261

@JaySon-Huang
Contributor Author

@JaySon-Huang
Contributor Author

Done in tikv/pd#3261

Projects
None yet
Development

No branches or pull requests

2 participants