Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Consolidate statistics aggregation #8254

Closed
wants to merge 3 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 17, 2023

Which issue does this PR close?

Closes #8229

Rationale for this change

There were multiple, confusing, and inconsistent paths for aggregating statistics. This both makes it hard to change how Statistics work (e.g. #8078) it also makes for subtle bugs

What changes are included in this PR?

  1. Introduce StatisticsAggregator that handles aggregating statistics from multiple sources, and is well documented and tested
  2. Rewrite existing statistics aggregation to use the new aggregator

Note that this code is based on some authored by @crepererum for InfluxDB IOx.

Are these changes tested?

Yes, substantial new test coverage is added

Are there any user-facing changes?

No user facing change intended

@github-actions github-actions bot added the core Core DataFusion crate label Nov 17, 2023
let (min_value, max_value) = if stat.has_min_max_set() {
match stat {
ParquetStatistics::Boolean(s) => (
Some(ScalarValue::Boolean(Some(*s.min()))),
Copy link
Contributor Author

@alamb alamb Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may look like a regression in performance, but I think it will actually perform better (as the old code is creating a single row array just to call an accumulator method 😱 )

 match max_value.update_batch(&[Arc::new(BooleanArray::from(
                            vec![Some(*s.max())],
                        ))]

Some(ScalarValue::Float64(Some(*s.max()))),
),
// TODO: file ticket to support fetching byte array (aka string) metadata
ParquetStatistics::ByteArray(_) => (None, None),
Copy link
Contributor Author

@alamb alamb Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this new structure also makes it clear DataFusion currently ignores statistics for parquet string columns

let mut result_files = vec![];
// These statistics can be calculated as long as at least one file provides
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a second (different) way to aggregate statistics

// files. This only applies when we know the number of rows. It also
// currently ignores tables that have no statistics regarding the
// number of rows.
if num_rows.get_value().unwrap_or(&usize::MIN)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check may be incorrect in the sense that even if the file statistics are not precise, it may stop reading files early

@@ -143,117 +83,3 @@ pub async fn get_statistics_with_limit(

Ok((result_files, statistics))
}

pub(crate) fn create_max_min_accs(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all handled in the StatisticsAccumulator now

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 20, 2023
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An update here: I have found there are no less than 2 distinct paths to read / convert parquet statistics into DataFusion format. One used for pruning row groups and one used for reading statistics in general.

I am trying to consolidate them, and put the code into reasonable shape so it can at least read the statistics out as an ArrayRef (and computing min/max on it will be fast).

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Nov 22, 2023
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 21, 2024
@github-actions github-actions bot closed this Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consolidate statistics aggregation
1 participant