Consider using a disk-based hash table for hash join avoiding OOM #11607
Feature Request
Is your feature request related to a problem? Please describe:
Consider using a disk-based hash table for hash join to avoid OOM.
`HashJoinExecutor` uses a hash table that maps join keys to inner table rows.
TiDB's hash join is implemented with `innerResult` and `mvmap.MVMap`. `innerResult` stores all the rows of the inner table, and `mvmap.MVMap` stores the map of (join key, inner table pointer). Together, these two structures give us a map from join keys to inner table rows.
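The relationship between the two structures can be sketched roughly as follows. This is a simplified illustration, not TiDB's actual code: `Row`, `rowPtr`, and `joinTable` are hypothetical names, the row store is a plain slice standing in for `innerResult`, and a built-in Go map of slices stands in for `mvmap.MVMap`.

```go
package main

import "fmt"

// Row is a hypothetical stand-in for one inner-table row.
type Row struct {
	Key   string // join key, assumed to be already encoded as a string
	Value string
}

// rowPtr is a hypothetical row pointer: here just an index into innerRows.
type rowPtr int

// joinTable pairs a row store with a multi-value map from join key to row pointers.
type joinTable struct {
	innerRows []Row               // plays the role of innerResult
	index     map[string][]rowPtr // plays the role of mvmap.MVMap
}

func newJoinTable() *joinTable {
	return &joinTable{index: make(map[string][]rowPtr)}
}

// Put stores the row and records its pointer under the row's join key.
func (t *joinTable) Put(r Row) {
	t.innerRows = append(t.innerRows, r)
	p := rowPtr(len(t.innerRows) - 1)
	t.index[r.Key] = append(t.index[r.Key], p)
}

// Probe returns every inner row whose join key equals key, in insertion order.
func (t *joinTable) Probe(key string) []Row {
	ptrs := t.index[key]
	out := make([]Row, 0, len(ptrs))
	for _, p := range ptrs {
		out = append(out, t.innerRows[p])
	}
	return out
}

func main() {
	jt := newJoinTable()
	jt.Put(Row{Key: "k1", Value: "x"})
	jt.Put(Row{Key: "k2", Value: "y"})
	jt.Put(Row{Key: "k1", Value: "z"})
	fmt.Println(jt.Probe("k1"))
}
```

In this sketch both the row store and the index live entirely in memory, which is the situation the rest of this issue is about.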
When the inner table is particularly large, `innerResult` takes up a lot of memory; when the join key is particularly large, `mvmap.MVMap` also takes up a lot of memory. Either case can lead to OOM.
Describe the feature you'd like:
- We already have a config `mem-quota-query`, which sets the memory quota for a query in bytes.
- Introduce a new config `oom-use-tmp-storage`, default is `true`. Set to `true` to enable use of temporary disk for some executors (in this issue, hash join) when `mem-quota-query` is exceeded.
- Show disk usage of an executor in `explain analyze`.
- Show disk usage of a query in `SELECT * FROM information_schema.processlist;`
- Consider disk usage in cost model.
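As a rough illustration of the intended behavior, the sketch below keeps rows in memory until a quota is exceeded and then moves them to a temporary file, tracking how many bytes are on disk. It is only a sketch under assumptions: `spillableRows` and its fields are hypothetical names, `quota` and `useDisk` merely mirror the roles of `mem-quota-query` and `oom-use-tmp-storage`, and the on-disk layout (a 4-byte length prefix per row) is invented for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

var errQuotaExceeded = errors.New("memory quota exceeded")

// spillableRows is a hypothetical row container: in memory until the quota
// is exceeded, on disk afterwards.
type spillableRows struct {
	quota   int64 // memory quota in bytes (role of mem-quota-query)
	useDisk bool  // allow spilling (role of oom-use-tmp-storage)
	memUsed int64
	mem     [][]byte
	file    *os.File
	offsets []int64 // start offset of each spilled row
	size    int64   // bytes written to the temporary file
}

// Add stores a row, spilling everything to disk once the quota is exceeded.
func (s *spillableRows) Add(row []byte) error {
	if s.file == nil {
		if s.memUsed+int64(len(row)) <= s.quota {
			s.mem = append(s.mem, row)
			s.memUsed += int64(len(row))
			return nil
		}
		if !s.useDisk {
			return errQuotaExceeded
		}
		if err := s.spill(); err != nil {
			return err
		}
	}
	return s.appendToDisk(row)
}

// spill creates the temporary file and moves all in-memory rows into it.
func (s *spillableRows) spill() error {
	f, err := os.CreateTemp("", "hashjoin-spill-*")
	if err != nil {
		return err
	}
	s.file = f
	for _, r := range s.mem {
		if err := s.appendToDisk(r); err != nil {
			return err
		}
	}
	s.mem, s.memUsed = nil, 0
	return nil
}

// appendToDisk writes a length-prefixed row at the end of the file.
func (s *spillableRows) appendToDisk(row []byte) error {
	buf := make([]byte, 4+len(row))
	binary.LittleEndian.PutUint32(buf, uint32(len(row)))
	copy(buf[4:], row)
	if _, err := s.file.WriteAt(buf, s.size); err != nil {
		return err
	}
	s.offsets = append(s.offsets, s.size)
	s.size += int64(len(buf))
	return nil
}

// Get returns the i-th row, reading it back from disk if it was spilled.
func (s *spillableRows) Get(i int) ([]byte, error) {
	if s.file == nil {
		return s.mem[i], nil
	}
	var hdr [4]byte
	if _, err := s.file.ReadAt(hdr[:], s.offsets[i]); err != nil {
		return nil, err
	}
	row := make([]byte, binary.LittleEndian.Uint32(hdr[:]))
	if _, err := s.file.ReadAt(row, s.offsets[i]+4); err != nil {
		return nil, err
	}
	return row, nil
}

// DiskBytes reports the disk usage that could be surfaced to the user.
func (s *spillableRows) DiskBytes() int64 { return s.size }

// Close removes the temporary file, if one was created.
func (s *spillableRows) Close() error {
	if s.file == nil {
		return nil
	}
	name := s.file.Name()
	if err := s.file.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	s := &spillableRows{quota: 8, useDisk: true}
	defer s.Close()
	for _, r := range []string{"aaaa", "bbbb", "cccc"} {
		if err := s.Add([]byte(r)); err != nil {
			fmt.Println("add failed:", err)
			return
		}
	}
	row, err := s.Get(1)
	fmt.Println(string(row), err, "disk bytes:", s.DiskBytes())
}
```

A value like `DiskBytes()` is the kind of per-executor figure that `explain analyze` or `information_schema.processlist` could report; how TiDB would actually track and expose it is left open here.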
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy:
tasks:
- The improvement of mvmap.MVMap
- hash join executor: decrease the memory usage of hashTable in HashJoinExec #11832
- index join
- performance and code clean executor: reorg codes for hashtable in HashJoinExec #11937
- Disk-based innerResult
- hash join
- utilities: executor: utilities for disk-based hash join #12116
- implement disk-based hash join: executor: implement disk-based hash join #12067
- index join
- cost model, explain analyze, and disk usage control
- change cost model of a hash join if it will be spilled planner: consider disk cost in hashJoin #13246
- show disk usage information in `explain analyze` #12625
Some tiny issues
- [For new contributor] Show disk usage of a query in `SELECT * FROM information_schema.processlist;` #13931
- [For new contributor] Show disk usage of a query in slow query and statement summary #16883
- add metrics for disk usage of a query #17263
- [For new contributor] change the default value of `mem-quota-query` #12937
- [For new contributor] temporary storage usage limitation of all queries #13983
- [For new contributor] Define temporary storage in config file #13982
- [help wanted] multiple instances of tidb-server may use the same temporary directory #13981