
Never fallback to cartesian product for join estimation when we know the min/max values for columns #3813

@isidentical

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Computing distinct_count is usually expensive, so some systems that write Parquet files omit it from the file metadata. When it is missing, we should still be able to estimate the join cardinality from the remaining statistics before falling back to the cartesian product.

Describe the solution you'd like
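Since we already require min/max values to be present, we should be able to derive an alternative distinct count as min(num_left_rows - (num_nulls or 0), scalar_range(left_stats.min, left_stats.max)), and likewise for the right side. That is the number of non-null rows (treating an unknown null count as 0), capped by the size of the column's min/max range.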
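To make the proposal concrete, here is a minimal Python sketch of the fallback. It is an illustration, not the project's implementation. It assumes integer columns and an inclusive scalar_range (max - min + 1). The function names estimate_distinct_count and estimate_inner_join_cardinality are hypothetical. The last function uses the common rows_left * rows_right / max(ndv_left, ndv_right) rule, which is assumed here and not stated in this issue.

```python
from typing import Optional


def scalar_range(min_value: int, max_value: int) -> int:
    """Count of integers in the closed interval [min_value, max_value].

    Assumption: integer columns and an inclusive range.
    """
    return max_value - min_value + 1


def estimate_distinct_count(
    num_rows: int,
    num_nulls: Optional[int],
    min_value: int,
    max_value: int,
) -> int:
    """Fallback distinct count when the metadata has no distinct_count.

    A column cannot hold more distinct values than it has non-null rows,
    nor more than the number of values that fit between its min and max.
    """
    # An unknown null count (None) is treated as 0, as in the issue's formula.
    non_null_rows = num_rows - (num_nulls or 0)
    return min(non_null_rows, scalar_range(min_value, max_value))


def estimate_inner_join_cardinality(
    left_rows: int,
    right_rows: int,
    left_distinct: int,
    right_distinct: int,
) -> int:
    """Assumed estimation rule: divide the cartesian product by the larger
    of the two distinct counts (hypothetical helper, for illustration)."""
    max_distinct = max(left_distinct, right_distinct)
    if max_distinct == 0:
        # No non-null values on either side, so no rows can match.
        return 0
    return (left_rows * right_rows) // max_distinct
```

For example, a 1000-row column with values between 1 and 10 and an unknown null count gets a distinct count estimate of 10, because the range is the tighter bound. A 5-row column with 2 nulls and values between 1 and 100 gets 3, because the non-null row count is the tighter bound.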

Describe alternatives you've considered
None.

Additional context
The original discussion is in #3787 (comment).

Labels: enhancement (New feature or request)
