Fine grained manipulation on hashing partitioned data #4631
Closed
Description
Enhancement
We know that some operators (agg, join, etc.) could benefit from prehashing. If each inputstream belongs to one hash bucket:
- For agg, we do not need the merging phase, each thread could directly generate the final blocks.
- For build side of join, we do not need any lock and the resizing of hash table could only has influence to one thread.
- For probe side of join, it could start as soon as its corresponding build job ends, not the whole build phase.
- For exchange sender, we do not need to partition the data again, just directly send them.
Note that all the potential enhancements requires:
- Use the same (or compatible) hash function.
- The prehash key is the super set of the sink hash key.
So how to do prehashing? Here are some ideas:
- For a hash partition table, just let each inputstream corresponds to one partition (N:1 or 1:1 relation). The key is how to carry hash info along with inputstream.
- For large data volume, the agg will merge data into 256 buckets, however it then discards the hashing info, merging them into one inputstream. The upper operators could benefit from disclosing the hashing info from agg.
- We could insert some virtual hash partition operators near to TableScan and Exchange operators (which are leaves of a MPP task) to generate inputstreams with hashing info.
Especially, for Exchange, ExchangeSender is a better place to partition data than ExchangeReceiver: partitioning is already part of ExchangeSender's job.
Note: blindly split data at ExchangeSender could make the packet smaller and weaken the effect of vectorization. A previous demo showed that blindly splitting will decrease the performance (using current implementation), while well-designed batching will increase the performance a lot.
Subtasks for window function:
- tiflash code: Fine grained shuffle for window function #5142
- kvproto: tiflash: add fine_grained_shuffle related field kvproto#921
- enable fine grained shuffle when dispatching MPPTasks: support fine grained shuffle when dispatching MPPTasks tidb#35342
- microbenchmark for fine grained shuffle: add microbenchmark for exchange and window function #5138
- design doc: add fine-grained shuffle design doc #5149
Subtasks for HashAgg/Join:
- HashAgg + fine-grained shuffle