Skip to content

Add configuration to choose specific join implementation #17432

@2010YOUY01

Description

@2010YOUY01

Is your feature request related to a problem or challenge?

DataFusion has several join implementations like Nested Loop Join, Hash Join, etc. For a given join SQL query, there might be multiple applicable physical join types, and the DF planner/optimizer will pick a optimal one to execute.

DF currently offers limited configurability. Allowing users to prefer or disable specific join types would improve the user experience, especially as more specialized join operators are on to the roadmap. (thanks to @jonathanc-n )

Example

> select * from t1 join t2 on t1.v1 = t2.v1;

-- DF choose HJ by default

> set datafusion.optimizer.prefer_nested_loop_join = true;
> select * from t1 join t2 on t1.v1 = t2.v1;

-- Now it's using NLJ

Describe the solution you'd like

Reference solution from DuckDB and PostgreSQL

pg is using configurations like enable_hashjoin, enable_nestloop. You can turn certain types on and off, I guess the planner will pick the optimal one from only enabled join types.
reference: https://www.postgresql.org/docs/7.1/runtime-config.html

DuckDB only has several configurations to guide the optimizer's decision, but it doesn't provide configurations to turn off certain join types. examples: prefer_range_joins, merge_join_threshold
reference: https://duckdb.org/docs/stable/configuration/overview.html

Idea to implement

DataFusion now has one configuration option to toggle HJ and SMJ, when they're both available

datafusion.optimizer.prefer_hash_join | true | When set to true, the physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can work more efficiently than SortMergeJoin but consumes more memory

To make the configuration easier for all join types, two sets of options can be added

  • enable_join_type: If true, this join type will be considered by the optimizer, otherwise it will be always disabled. If there are no enabled join types to finish the query, an error will be return.
  • prefer_join_type: If a certain join type is both enabled, and possible to use for a certain join query, the preferred join type will be prioritized. If there are multiple preferred join types, the optimizer will pick the best one.

Example

Available options (the implementation shouldn't use acronym, it's for simplicity here):

  • enable_nlj

  • enable_smj

  • enable_hj

  • ...and other available join types

  • prefer_nlj

  • prefer_smj

  • prefer_hj

  • ...and other available join types

Query: select * from t1 join t2 on t1.v1 = t2.v1;


Config 1 -- Enable all, no preference

enable_nlj = true
enable_smj = true
enable_hj  = true

prefer_nlj = false
prefer_smj = false
prefer_hj  = false

(all enabled; no preference; optimizer uses default heuristic) -> HJ


Config 2 -- Force NLJ

enable_nlj = true
enable_smj = false
enable_hj  = false

prefer_nlj = false
prefer_smj = false
prefer_hj  = false

(only NLJ enabled) -> NLJ


Config 3 -- Prefer SMJ, others allowed

enable_nlj = true
enable_smj = true
enable_hj  = true

prefer_nlj = false
prefer_smj = true
prefer_hj  = false

(SMJ preferred if feasible; else fallback by cost) -> SMJ


Config 4 -- Prefer SMJ and HJ equally

enable_nlj = true
enable_smj = true
enable_hj  = true

prefer_nlj = false
prefer_smj = true
prefer_hj  = true

(SMJ & HJ both preferred; the optimizer by default do HJ > SMJ) -> HJ


Config 5 -- Prefer disabled type (ignored)

enable_nlj = false
enable_smj = true
enable_hj  = true

prefer_nlj = true
prefer_smj = false
prefer_hj  = false

(NLJ preference ignored since disabled; choose among enabled) -> HJ


Config 6 -- No join type enabled

enable_nlj = false
enable_smj = false
enable_hj  = false

(none enabled; cannot plan) -> ERROR


Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions