-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Is your feature request related to a problem or challenge?
DataFusion has several join implementations like Nested Loop Join, Hash Join, etc. For a given join SQL query, there might be multiple applicable physical join types, and the DF planner/optimizer will pick a optimal one to execute.
DF currently offers limited configurability. Allowing users to prefer or disable specific join types would improve the user experience, especially as more specialized join operators are on to the roadmap. (thanks to @jonathanc-n )
Example
> select * from t1 join t2 on t1.v1 = t2.v1;
-- DF choose HJ by default
> set datafusion.optimizer.prefer_nested_loop_join = true;
> select * from t1 join t2 on t1.v1 = t2.v1;
-- Now it's using NLJ
Describe the solution you'd like
Reference solution from DuckDB and PostgreSQL
pg is using configurations like enable_hashjoin, enable_nestloop. You can turn certain types on and off, I guess the planner will pick the optimal one from only enabled join types.
reference: https://www.postgresql.org/docs/7.1/runtime-config.html
DuckDB only has several configurations to guide the optimizer's decision, but it doesn't provide configurations to turn off certain join types. examples: prefer_range_joins, merge_join_threshold
reference: https://duckdb.org/docs/stable/configuration/overview.html
Idea to implement
DataFusion now has one configuration option to toggle HJ and SMJ, when they're both available
datafusion.optimizer.prefer_hash_join | true | When set to true, the physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can work more efficiently than SortMergeJoin but consumes more memory
To make the configuration easier for all join types, two sets of options can be added
enable_join_type: Iftrue, this join type will be considered by the optimizer, otherwise it will be always disabled. If there are no enabled join types to finish the query, an error will be return.prefer_join_type: If a certain join type is both enabled, and possible to use for a certain join query, the preferred join type will be prioritized. If there are multiple preferred join types, the optimizer will pick the best one.
Example
Available options (the implementation shouldn't use acronym, it's for simplicity here):
-
enable_nlj -
enable_smj -
enable_hj -
...and other available join types
-
prefer_nlj -
prefer_smj -
prefer_hj -
...and other available join types
Query: select * from t1 join t2 on t1.v1 = t2.v1;
Config 1 -- Enable all, no preference
enable_nlj = true
enable_smj = true
enable_hj = true
prefer_nlj = false
prefer_smj = false
prefer_hj = false
(all enabled; no preference; optimizer uses default heuristic) -> HJ
Config 2 -- Force NLJ
enable_nlj = true
enable_smj = false
enable_hj = false
prefer_nlj = false
prefer_smj = false
prefer_hj = false
(only NLJ enabled) -> NLJ
Config 3 -- Prefer SMJ, others allowed
enable_nlj = true
enable_smj = true
enable_hj = true
prefer_nlj = false
prefer_smj = true
prefer_hj = false
(SMJ preferred if feasible; else fallback by cost) -> SMJ
Config 4 -- Prefer SMJ and HJ equally
enable_nlj = true
enable_smj = true
enable_hj = true
prefer_nlj = false
prefer_smj = true
prefer_hj = true
(SMJ & HJ both preferred; the optimizer by default do HJ > SMJ) -> HJ
Config 5 -- Prefer disabled type (ignored)
enable_nlj = false
enable_smj = true
enable_hj = true
prefer_nlj = true
prefer_smj = false
prefer_hj = false
(NLJ preference ignored since disabled; choose among enabled) -> HJ
Config 6 -- No join type enabled
enable_nlj = false
enable_smj = false
enable_hj = false
(none enabled; cannot plan) -> ERROR
Describe alternatives you've considered
No response
Additional context
No response