Skip to content

Conversation

@houqp
Copy link
Member

@houqp houqp commented Nov 21, 2021

Rationale for this change

I noticed this when I bump datafusion to in roapi and one of our tests started to fail. Two reasons that I think changing to true might be better:

  • For serious production use-cases, collecting stats should make a big different in performance for cases where it could help. It would be good to default to the more performant setup. Ignoring stats seems to be the less common use-case that we can ask users to manually specify it.
  • Be backwards compatible.

@rdettai curious what's your thought on this.

I will add a test case to guard against regression if everyone agrees with the direction.

What changes are included in this PR?

Default value for collect_stat changed to true.

Are there any user-facing changes?

Yes, reverting back to the 5.x behavior.

@rdettai
Copy link
Contributor

rdettai commented Nov 22, 2021

Thanks for spotting this @houqp. I plan on working on the default values soon, as I find them to be a bit confused. They are sourced from multiple places and thus hard to document / understand. I don't mind switching the default for collect_stat back to true. I tried not to change the default behaviors but it seems that this one slipped away.

For serious production use-cases, collecting stats should make a big different in performance for cases where it could help

Sadly things are not that simple 😅. There are some cases where this wouldn't be true. The collect_stats parameter you are referring to here defines whether source level statistics should be fetched during the planning (this will be an overall statistics, not a file level one). This is meant primarily to enable Cost Based Optimizations. In a distributed setup like Ballista, if you activate statistics fetching, as the planning is made on the scheduler node, it implies that a single node will need to open all the files/objects to read the stats. This will be very long if there are many files. It might actually be faster to just get the list of files and distribute the work across nodes (especially if you have many nodes). If what you want is not necessarily to get the source level statistics during the planning but only have them at the file level when executing the ExecutionPlan to enable row group pruning, I think that should be a separate configuration (or maybe we don't even need a configuration, for formats like Parquet we should always get the file/row_group level statistics when reading the file and use them).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a fine change to me -- I wonder if there is some way to test it (or perhaps that will be included in @rdettai 's default work)

@houqp
Copy link
Member Author

houqp commented Nov 22, 2021

Thanks @rdettai for the confirmation. Yes, I was thinking of enabling stats in the context of query optimization.

This will be very long if there are many files. It might actually be faster to just get the list of files and distribute the work across nodes (especially if you have many nodes).

One thing I have noticed in existing system is to leverage distributed compute for this kind of metadata gathering during the planing phase. Similar to the datafusion in datafusion setup we have right now, but balista in balista ;) Perhaps we could have a distributed version of Listing Table implementation in ballista in the future.

@houqp houqp added the bug Something isn't working label Nov 22, 2021
@houqp houqp merged commit cf637e7 into apache:master Nov 23, 2021
@houqp houqp deleted the qp_stats branch November 23, 2021 00:10
@rdettai
Copy link
Contributor

rdettai commented Nov 23, 2021

Perhaps we could have a distributed version of Listing Table implementation in ballista in the future

That would be a possibility yes, but I would rather prioritize good support for table formats like Delta and Iceberg 😉, which will solve this in a much more cost effective fashion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants