Skip to content

Conversation

@berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented Nov 8, 2023

Which issue does this PR close?

Related to #8078.

Rationale for this change

After enum Precision is introduced in DF, some bugs are discovered. This PR resolves 3 bugs:

  1. singleton check for ColumnStatistics,
  2. limit in datasource with num rows statistics,
  3. min - max exactness of columns with constant value.

What changes are included in this PR?

1st fix: A column is labeled as singleton only if its min and max values are exact and equal.
2nd fix: To stop processing, only exact count of rows is regarded. Otherwise, we should continue to process until range estimation of precision implemented.
3rd fix: During the analysis in filter statistics, if a column is filtered with a constant value (e.g. c=1), we set its min and max values as exact.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 8, 2023
@berkaysynnada berkaysynnada marked this pull request as draft November 8, 2023 16:15
@berkaysynnada berkaysynnada marked this pull request as ready for review November 8, 2023 17:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔

@berkaysynnada
Copy link
Contributor Author

Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔

I have added tests covering the 1st and 3rd cases, but can't find a test suite for the 2nd case that I can demonstrate the need for the fix. Do you have any advice for me to do that easily?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada -- I think this is better for sure. I hope to keep improving test coverage as part of #8078

// currently ignores tables that have no statistics regarding the
// number of rows.
if num_rows.get_value().unwrap_or(&usize::MIN) <= &limit.unwrap_or(usize::MAX) {
let conservative_num_rows = match num_rows {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants