-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Consolidate statistics aggregation #8254
Conversation
let (min_value, max_value) = if stat.has_min_max_set() { | ||
match stat { | ||
ParquetStatistics::Boolean(s) => ( | ||
Some(ScalarValue::Boolean(Some(*s.min()))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may look like a regression in performance, but I think it will actually perform better (as the old code is creating a single row array just to call an accumulator method 😱 )
match max_value.update_batch(&[Arc::new(BooleanArray::from(
vec![Some(*s.max())],
))]
Some(ScalarValue::Float64(Some(*s.max()))), | ||
), | ||
// TODO: file ticket to support fetching byte array (aka string) metadata | ||
ParquetStatistics::ByteArray(_) => (None, None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this new structure also makes it clear DataFusion currently ignores statistics for parquet string columns
let mut result_files = vec![]; | ||
// These statistics can be calculated as long as at least one file provides |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically a second (different) way to aggregate statistics
// files. This only applies when we know the number of rows. It also | ||
// currently ignores tables that have no statistics regarding the | ||
// number of rows. | ||
if num_rows.get_value().unwrap_or(&usize::MIN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this check may be incorrect in the sense that even if the file statistics are not precise, it may stop reading files early
@@ -143,117 +83,3 @@ pub async fn get_statistics_with_limit( | |||
|
|||
Ok((result_files, statistics)) | |||
} | |||
|
|||
pub(crate) fn create_max_min_accs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all handled in the StatisticsAccumulator now
c43c734
to
0122a08
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An update here: I have found there are no less than 2 distinct paths to read / convert parquet statistics into DataFusion format. One used for pruning row groups and one used for reading statistics in general.
I am trying to consolidate them, and put the code into reasonable shape so it can at least read the statistics out as an ArrayRef (and computing min/max on it will be fast).
a05e5e4
to
bdba2b8
Compare
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #8229
Rationale for this change
There were multiple, confusing, and inconsistent paths for aggregating statistics. This both makes it hard to change how Statistics work (e.g. #8078) it also makes for subtle bugs
What changes are included in this PR?
StatisticsAggregator
that handles aggregating statistics from multiple sources, and is well documented and testedNote that this code is based on some authored by @crepererum for InfluxDB IOx.
Are these changes tested?
Yes, substantial new test coverage is added
Are there any user-facing changes?
No user facing change intended