collect table stats by default for listing table #1347

houqp · 2021-11-21T21:21:08Z

Rationale for this change

I noticed this when I bump datafusion to in roapi and one of our tests started to fail. Two reasons that I think changing to true might be better:

For serious production use-cases, collecting stats should make a big different in performance for cases where it could help. It would be good to default to the more performant setup. Ignoring stats seems to be the less common use-case that we can ask users to manually specify it.
Be backwards compatible.

@rdettai curious what's your thought on this.

I will add a test case to guard against regression if everyone agrees with the direction.

What changes are included in this PR?

Default value for collect_stat changed to true.

Are there any user-facing changes?

Yes, reverting back to the 5.x behavior.

rdettai · 2021-11-22T09:10:52Z

Thanks for spotting this @houqp. I plan on working on the default values soon, as I find them to be a bit confused. They are sourced from multiple places and thus hard to document / understand. I don't mind switching the default for collect_stat back to true. I tried not to change the default behaviors but it seems that this one slipped away.

For serious production use-cases, collecting stats should make a big different in performance for cases where it could help

Sadly things are not that simple 😅. There are some cases where this wouldn't be true. The collect_stats parameter you are referring to here defines whether source level statistics should be fetched during the planning (this will be an overall statistics, not a file level one). This is meant primarily to enable Cost Based Optimizations. In a distributed setup like Ballista, if you activate statistics fetching, as the planning is made on the scheduler node, it implies that a single node will need to open all the files/objects to read the stats. This will be very long if there are many files. It might actually be faster to just get the list of files and distribute the work across nodes (especially if you have many nodes). If what you want is not necessarily to get the source level statistics during the planning but only have them at the file level when executing the ExecutionPlan to enable row group pruning, I think that should be a separate configuration (or maybe we don't even need a configuration, for formats like Parquet we should always get the file/row_group level statistics when reading the file and use them).

alamb

This looks like a fine change to me -- I wonder if there is some way to test it (or perhaps that will be included in @rdettai 's default work)

houqp · 2021-11-22T22:35:14Z

Thanks @rdettai for the confirmation. Yes, I was thinking of enabling stats in the context of query optimization.

This will be very long if there are many files. It might actually be faster to just get the list of files and distribute the work across nodes (especially if you have many nodes).

One thing I have noticed in existing system is to leverage distributed compute for this kind of metadata gathering during the planing phase. Similar to the datafusion in datafusion setup we have right now, but balista in balista ;) Perhaps we could have a distributed version of Listing Table implementation in ballista in the future.

rdettai · 2021-11-23T13:23:42Z

Perhaps we could have a distributed version of Listing Table implementation in ballista in the future

That would be a possibility yes, but I would rather prioritize good support for table formats like Delta and Iceberg 😉, which will solve this in a much more cost effective fashion.

collect table stats by default for listing table

07bc2c7

houqp requested review from Dandandan and alamb November 21, 2021 21:21

github-actions bot added the datafusion label Nov 21, 2021

jimexist approved these changes Nov 22, 2021

View reviewed changes

alamb approved these changes Nov 22, 2021

View reviewed changes

add test

e660248

houqp added the bug Something isn't working label Nov 22, 2021

houqp merged commit cf637e7 into apache:master Nov 23, 2021

houqp deleted the qp_stats branch November 23, 2021 00:10

rdettai mentioned this pull request Nov 26, 2021

Major performance regression in reading partitioned Parquet data on master #1363

Closed

This was referenced Oct 9, 2022

Provide a way to enable source level statistics for tables registered in the CLI #3774

Open

MINOR: correct ListingOptions.try_new docs to include the enabled stat collection #3775

Merged

adriangb mentioned this pull request May 18, 2025

Make SessionContext::register_parquet obey collect_statistics config #16080

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

collect table stats by default for listing table #1347

collect table stats by default for listing table #1347

Uh oh!

houqp commented Nov 21, 2021

Uh oh!

rdettai commented Nov 22, 2021 •

edited

Loading

Uh oh!

alamb left a comment

Uh oh!

houqp commented Nov 22, 2021 •

edited

Loading

Uh oh!

rdettai commented Nov 23, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

collect table stats by default for listing table #1347

collect table stats by default for listing table #1347

Uh oh!

Conversation

houqp commented Nov 21, 2021

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

rdettai commented Nov 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

houqp commented Nov 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rdettai commented Nov 23, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rdettai commented Nov 22, 2021 •

edited

Loading

houqp commented Nov 22, 2021 •

edited

Loading