[EPIC] Improving cost calculations and cost based optimizations #3929

@isidentical

Design document: https://docs.google.com/document/d/1M4mmV7KA1LSj-D-WJA338B4ydlm-8A8D5OPuDE5_SD4/edit

This is a meta issue for improving cost calculations and cost-based optimizations (CBOs) in DataFusion. Some statistics are already collected (mainly from the table sources), and some execution plan nodes already estimate their own statistics. The overall idea is to improve both, along with the CBOs that could build on them.

Main Goals

  • Have enough statistics to start nested join optimizations (Implement nested join optimization #3843). This involves estimating the weight of each join side and globally re-ordering the join sides so that the bottom-level joins produce as little output as possible, which minimizes the overall cost of the parent joins.
  • Provide a more reliable static analysis phase for physical execution operators, so that range-based pruning and predicate pruning can build on the existing infrastructure in their implementations.
  • What else?
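
To make the join re-ordering goal more concrete, here is a minimal sketch of how a join side could be weighted and how a greedy ordering could use that weight. It is only an illustration under stated assumptions: `InputStats`, `estimate_join_rows`, and `greedy_join_order` are hypothetical names (not DataFusion APIs), all joins are assumed to be equi-joins on the same key, and the estimate is the textbook `|L| * |R| / max(ndv(L), ndv(R))` formula rather than anything decided in the design document.

```rust
/// Hypothetical per-input statistics (NOT DataFusion's `Statistics` type).
#[derive(Debug, Clone, PartialEq)]
pub struct InputStats {
    pub name: String,
    /// Estimated number of rows produced by this input.
    pub num_rows: f64,
    /// Estimated number of distinct values in the join key column.
    pub distinct_keys: f64,
}

/// Textbook equi-join estimate: |L| * |R| / max(ndv(L.key), ndv(R.key)).
pub fn estimate_join_rows(left: &InputStats, right: &InputStats) -> f64 {
    // Clamp to 1 so that missing or zero distinct counts cannot divide by zero.
    let ndv = left.distinct_keys.max(right.distinct_keys).max(1.0);
    left.num_rows * right.num_rows / ndv
}

/// Index of the input with the lowest cost (first one wins on ties).
fn index_of_min<F: Fn(&InputStats) -> f64>(inputs: &[InputStats], cost: F) -> usize {
    let mut best = 0;
    for i in 1..inputs.len() {
        if cost(&inputs[i]) < cost(&inputs[best]) {
            best = i;
        }
    }
    best
}

/// Greedy left-deep ordering: start from the smallest input, then repeatedly
/// join the remaining input that yields the smallest estimated intermediate
/// result. Assumes every input joins on the same key.
pub fn greedy_join_order(mut inputs: Vec<InputStats>) -> Vec<String> {
    let mut order = Vec::new();
    if inputs.is_empty() {
        return order;
    }
    let start = index_of_min(&inputs, |s| s.num_rows);
    let mut current = inputs.remove(start);
    order.push(current.name.clone());

    while !inputs.is_empty() {
        let next = index_of_min(&inputs, |s| estimate_join_rows(&current, s));
        let chosen = inputs.remove(next);
        // Statistics of the intermediate result feed the next iteration.
        current = InputStats {
            name: String::from("intermediate"),
            num_rows: estimate_join_rows(&current, &chosen),
            distinct_keys: current.distinct_keys.min(chosen.distinct_keys),
        };
        order.push(chosen.name);
    }
    order
}
```

A greedy pass like this is cheap but not optimal; an actual implementation might prefer dynamic programming over join subsets, and would need per-join key statistics instead of a single shared key.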

Work in Progress

Planned

Future

  • Support for histograms, which would give a better picture of value distribution for cardinality estimation and filter selectivity. Currently, none of the providers we use can pass histograms to us directly, so we either have to take a peek at the data or only expose the API to other services (like Ballista) which can actually collect histograms and pass them to us.
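
As a rough illustration of why histograms help with filter selectivity, the sketch below estimates the fraction of rows matching `col < x` from an equi-width histogram. Everything here is an assumption made for the example: `EquiWidthHistogram` and `selectivity_less_than` are hypothetical names, the histogram shape (equal-width buckets over `[min, max]`) is just one possible choice, and values are assumed to be uniformly distributed within each bucket.

```rust
/// Hypothetical equi-width histogram over a numeric column
/// (NOT an existing DataFusion type).
pub struct EquiWidthHistogram {
    /// Lower bound of the first bucket.
    pub min: f64,
    /// Upper bound of the last bucket.
    pub max: f64,
    /// Row count per bucket; all buckets span the same value range.
    pub counts: Vec<u64>,
}

impl EquiWidthHistogram {
    /// Estimated fraction of rows satisfying `col < x`, assuming values are
    /// spread uniformly inside each bucket.
    pub fn selectivity_less_than(&self, x: f64) -> f64 {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return 0.0; // no rows (or no buckets): nothing can match
        }
        if x <= self.min {
            return 0.0;
        }
        if x >= self.max {
            return 1.0;
        }
        let width = (self.max - self.min) / self.counts.len() as f64;
        let mut matching = 0.0;
        for (i, &count) in self.counts.iter().enumerate() {
            let lo = self.min + i as f64 * width;
            let hi = lo + width;
            if x >= hi {
                // The whole bucket lies below x.
                matching += count as f64;
            } else if x > lo {
                // x falls inside this bucket: take the proportional share.
                matching += count as f64 * (x - lo) / width;
            }
        }
        matching / total as f64
    }
}
```

For example, with buckets `[10, 30, 40, 20]` over `[0, 100]`, the filter `col < 50` is estimated to keep 40% of the rows, whereas a plain min/max uniform assumption would guess 50%. Multiplying the selectivity by the input row count gives the estimated output cardinality of the filter.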

P.S.: feel free to update the text directly or let me know (and I can update it myself)

Labels: PROPOSAL EPIC (a proposal being discussed that is not yet fully underway), enhancement (new feature or request)
