Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API, Core: Update inclusive metrics evaluator for extract and transforms #12311

Merged
merged 8 commits into from
Feb 26, 2025

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Feb 18, 2025

This updates InclusiveMetricsEvaluator that uses column stats to skip data files during scan planning.

The evaluator was implementing the older BoundExpressionVisitor interface that only supported BoundReference and not other BoundTerm instances like BoundTransform. After #12304, BoundExtract also needs to be supported. This PR includes #12304 and will be rebased when it is merged.

Filtering works for transformed values when the transform is order preserving. If it is not order preserving (like bucket) the bounds cannot be used.

Most of the changes to BoundExpressionVisitor are to produce the lower and upper bounds values that are tested:

  • For BoundReference, deserialize the bound from the correct lowerBounds or upperBounds map (moved to parseLowerBound and parseLowerBound
  • For BoundTransform, deserialize the bound and transform it if the transform is order-preserving (in transformLowerBound and transformUpperBound)
  • For Extract, deserialize the bound as a map from field name to VariantValue, then convert the value to the internal representation (in extractLowerBound and extractUpperBound)

This adds new test suites for the new BoundTerm cases that are supported:

  • TestInclusiveMetricsEvaluatorWithExtract tests variant cases
  • TestInclusiveMetricsEvaluatorWithTransforms tests transform cases


@ParameterizedTest
@FieldSource("MISSING_STATS_EXPRESSIONS")
public void testMissingStats(Expression expr) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the bounds are in metadata, they are trusted. Bounds should be missing in cases where there are variant values with incompatible types.

return current;
}

public static ByteBuffer serializeBounds(Map<String, VariantValue> bounds) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This serialization is temporary and will be replaced with a well-defined variant serialization.

@rdblue rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from 0c80b18 to 09fcf7b Compare February 18, 2025 21:20
@Fokko Fokko self-requested a review February 20, 2025 09:42
@rdblue rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch 2 times, most recently from 12b7f7c to b1b068d Compare February 21, 2025 17:59
@github-actions github-actions bot added the spark label Feb 21, 2025
@rdblue rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from b1b068d to 8faa5f1 Compare February 21, 2025 18:01
@rdblue
Copy link
Contributor Author

rdblue commented Feb 21, 2025

This needed changes to TestSparkScan because adding support for transforms filters out additional files in unpartitioned tables.

@rdblue
Copy link
Contributor Author

rdblue commented Feb 21, 2025

This now relies on #12374 to move the serialized variant classes into API so that InclusiveMetricsEvaluator can use them for deserializing bounds.

@rdblue rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from 37a0c04 to d1a260c Compare February 21, 2025 23:19
@rdblue rdblue merged commit 2f88ff6 into apache:main Feb 26, 2025
43 checks passed
@rdblue
Copy link
Contributor Author

rdblue commented Feb 26, 2025

Thanks for reviewing, @danielcweeks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants