TPCH q15 performance regression after introduce `local_delta` in MemoryTracker

## Bug Report

Please answer these questions before submitting your issue. Thanks!

### 1. Minimal reproduce step (Required)


load tpch-100 data, using 1 TiFlash node
run tpch q15
### 2. What did you expect to see? (Required)
The query time varies from 1.5 second to 3.0 second randomly
### 3. What did you see instead (Required)
1. the query time should be stable
2. the query time should less than 2 second
### 4. What is your TiFlash version? (Required)
master @ 225dabe6942966d1562af738c287b50872f26be8 
### 5. Root cause
The root cause is in TiFlash, the `ParallelAggregatingBlockInputStream` calculate the aggregation in 2 stage:
- stage 1: it use `ParallelInputsProcessor` to do a partial aggregation for each input pipeline
- stage 2: it merge the result of stage 1

Obviously, stage 1 is running using multiple threads, and depends on the size of result data set, stage 2 will use 1 threads or multiple threads: if the total agg key size exceeded `group_by_two_level_threshold` or the total result size exceeded `group_by_two_level_threshold_bytes`, stage 2 will use multiple threads othewise, it will use 1 thread. And if stage 2 is executed using 1 threads, all the result will be put into 1 single `block`.

The total result size is estimated using `memory_tracker`: before executed aggregation, the overall memory usage is saved as `memory_usage_before_aggregation` in [`Aggregator`](https://github.com/pingcap/tiflash/blob/225dabe6942966d1562af738c287b50872f26be8/dbms/src/Interpreters/Aggregator.h#L908), and during the executed of stage 1, it use [`current_memory_usage - memory_usage_before_aggregation`](https://github.com/pingcap/tiflash/blob/225dabe6942966d1562af738c287b50872f26be8/dbms/src/Interpreters/Aggregator.cpp#L607) to decide if need to convert  the aggregated hash table into two-level hash table. And if the hash table is converted into two-level hash table, it will use multiple threads to do the stage 2.

The problem is after introducing `local_delta`, if the memory usage is less than 8MB, it will not update the global memory tracker. By default the `group_by_two_level_threshold_bytes` is 100MB, so assuming that the stage 1 is executed using 20 threads, and each thread uses 7.9MB, then the actual memory usage will be ~158MB,  but since all of these memory usages is tracked in `local_delta`, the global memory tracker does not see these memory usage, so the hash table will not be converted to two-level hash table, thus the stage 2 is executed using 1 thread.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TPCH q15 performance regression after introduce `local_delta` in MemoryTracker #4451

Bug Report

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

5. Root cause

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

TPCH q15 performance regression after introduce local_delta in MemoryTracker #4451

Description