Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite bloom filters to use contains API #8442

Merged
merged 1 commit into from
Dec 26, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 6, 2023

Which issue does this PR close?

Closes #8376
Closes #8397

Rationale for this change

What changes are included in this PR?

We introduced contained API in #8440, and now we can use it to simplify the bloom filter code

This also makes it possible to do things like Support IN lists with more than three constants in predicates for bloom filters

Are these changes tested?

Covered by existing tests (including that added in #8433)

Are there any user-facing changes?

No

Err(e) => {
log::error!("Error evaluating row group predicate values when using BloomFilterPruningPredicate {e}");
log::debug!("Ignoring error reading bloom filter: {e}");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, errors in logs I think are meant to be "something bad happened that caused a user error" -- in this case the query will proceed so I think debug is a better level.

If someone needs it in production logs perhaps we can change it to log

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug level looks good to me. non-fatal errors on query path are easy to flow 👍

///
/// We only checked `BinaryExpr` but it also support `InList`,
/// Because of the `optimizer` will convert `InList` to `BinaryExpr`.
fn prune_expr_with_bloom_filter(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this logic is now split between LiteralGuarantee #8437 and PruningPredicate: #8440


for column_name in literal_columns {
let Some((column_idx, _field)) =
parquet_column(builder.parquet_schema(), arrow_schema, &column_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also fixes a potential bug similar to #8335 by finding the correct fields

@alamb alamb marked this pull request as ready for review December 23, 2023 12:44
@alamb
Copy link
Contributor Author

alamb commented Dec 23, 2023

This PR is now ready for review @haohuaijin and @waynexia

Copy link
Contributor

@haohuaijin haohuaijin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for a series of excellent pr @alamb 👍, It is LGTM!

"BloomFilterPruningPredicate only support binary expr".to_string(),
)),
}
impl PruningStatistics for BloomFilterStatistics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work 💯 , thanks @alamb

Err(e) => {
log::error!("Error evaluating row group predicate values when using BloomFilterPruningPredicate {e}");
log::debug!("Ignoring error reading bloom filter: {e}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug level looks good to me. non-fatal errors on query path are easy to flow 👍

@alamb
Copy link
Contributor Author

alamb commented Dec 26, 2023

Thank you for the review @waynexia and @haohuaijin

@alamb alamb merged commit e10d3e2 into apache:main Dec 26, 2023
23 checks passed
@alamb alamb deleted the alamb/rewrite_bloom_api branch December 26, 2023 11:53
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support general pruning based on <col> = 'const' in PruningPredicate
3 participants