API, Core: Update inclusive metrics evaluator for extract and transforms #12311

rdblue · 2025-02-18T17:03:42Z

This updates InclusiveMetricsEvaluator, which uses column stats to skip data files during scan planning.

The evaluator was implementing the older BoundExpressionVisitor interface that only supported BoundReference and not other BoundTerm instances like BoundTransform. After #12304, BoundExtract also needs to be supported. This PR includes #12304 and will be rebased when it is merged.

Filtering works for transformed values when the transform is order-preserving. If it is not order-preserving (like bucket), the bounds cannot be used because the transformed bounds no longer bracket the transformed values.
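To illustrate why order preservation matters, here is a minimal sketch. It is not the code in this PR: the class, method, and constant names are hypothetical, long values stand in for deserialized bounds, and the bucket function is a simple hash stand-in rather than Iceberg's real bucket transform.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch: names and signatures are illustrative, not Iceberg's API.
class TransformedBoundsSketch {
  // Truncating to a width of 10 never reorders values, so it preserves order.
  static final LongUnaryOperator TRUNCATE_10 = v -> Math.floorDiv(v, 10L) * 10L;

  // Stand-in for a hash-based bucket function; hashing scatters values,
  // so the result says nothing about the ordering of the inputs.
  static final LongUnaryOperator BUCKET_4 = v -> Math.floorMod(Long.hashCode(v), 4);

  // Returns false only when no row in the file can satisfy transform(col) == literal.
  static boolean mightMatchEq(
      long lower, long upper, LongUnaryOperator transform, boolean preservesOrder, long literal) {
    if (!preservesOrder) {
      // Transformed bounds would not bracket the transformed values: keep the file.
      return true;
    }
    long transformedLower = transform.applyAsLong(lower);
    long transformedUpper = transform.applyAsLong(upper);
    return literal >= transformedLower && literal <= transformedUpper;
  }

  public static void main(String[] args) {
    // File with column values in [105, 187]; truncated bounds are [100, 180].
    System.out.println(mightMatchEq(105L, 187L, TRUNCATE_10, true, 200L));
    System.out.println(mightMatchEq(105L, 187L, TRUNCATE_10, true, 150L));
    // With the bucket stand-in the bounds are ignored and the file is always kept.
    System.out.println(mightMatchEq(105L, 187L, BUCKET_4, false, 3L));
  }
}
```

In this sketch the first call reports that the file can be skipped, while the other two keep it. Returning "might match" for the non-order-preserving case keeps the check inclusive: a file is only dropped when the bounds prove no row can match.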

Most of the changes to BoundExpressionVisitor are to produce the lower and upper bound values that are tested:

For BoundReference, deserialize the bound from the correct lowerBounds or upperBounds map (moved to parseLowerBound and parseUpperBound)
For BoundTransform, deserialize the bound and transform it if the transform is order-preserving (in transformLowerBound and transformUpperBound)
For Extract, deserialize the bound as a map from field name to VariantValue, then convert the value to the internal representation (in extractLowerBound and extractUpperBound)
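The three cases above can be sketched as a single dispatch on the term type. This is a simplified illustration under stated assumptions, not the PR's implementation: the Term classes, the lowerBound method, and the map layouts are hypothetical, and Long values stand in for both deserialized bounds and VariantValue. The upper bound would be handled symmetrically.

```java
import java.util.Map;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch: these types only mimic the shape of the bound term cases.
class BoundSourceSketch {
  interface Term {}

  static final class Ref implements Term {
    final int fieldId;
    Ref(int fieldId) { this.fieldId = fieldId; }
  }

  static final class Transformed implements Term {
    final Ref ref;
    final LongUnaryOperator fn;
    final boolean preservesOrder;
    Transformed(Ref ref, LongUnaryOperator fn, boolean preservesOrder) {
      this.ref = ref;
      this.fn = fn;
      this.preservesOrder = preservesOrder;
    }
  }

  static final class Extract implements Term {
    final Ref ref;
    final String path;
    Extract(Ref ref, String path) { this.ref = ref; this.path = path; }
  }

  // Returns the lower bound to test against, or null when no usable bound exists.
  static Long lowerBound(
      Term term,
      Map<Integer, Long> lowerBounds,
      Map<Integer, Map<String, Long>> variantLowerBounds) {
    if (term instanceof Ref) {
      // Reference: read the bound for the field directly.
      return lowerBounds.get(((Ref) term).fieldId);
    } else if (term instanceof Transformed) {
      // Transform: apply it to the bound only if ordering is preserved.
      Transformed t = (Transformed) term;
      Long raw = lowerBounds.get(t.ref.fieldId);
      if (raw == null || !t.preservesOrder) {
        return null;
      }
      return t.fn.applyAsLong(raw);
    } else if (term instanceof Extract) {
      // Extract: the bound is a map keyed by field name; look up the requested path.
      Extract e = (Extract) term;
      Map<String, Long> byPath = variantLowerBounds.get(e.ref.fieldId);
      return byPath == null ? null : byPath.get(e.path);
    }
    return null;
  }

  public static void main(String[] args) {
    Map<Integer, Long> lower = Map.of(1, 105L);
    Map<Integer, Map<String, Long>> variantLower = Map.of(2, Map.of("$.event_id", 7L));
    Ref col = new Ref(1);
    System.out.println(lowerBound(col, lower, variantLower));
    System.out.println(
        lowerBound(new Transformed(col, v -> Math.floorDiv(v, 10L) * 10L, true), lower, variantLower));
    System.out.println(lowerBound(new Extract(new Ref(2), "$.event_id"), lower, variantLower));
  }
}
```

A null result in this sketch means the caller has no bound to compare against and should keep the file.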

This adds new test suites for the new BoundTerm cases that are supported:

TestInclusiveMetricsEvaluatorWithExtract tests variant cases
TestInclusiveMetricsEvaluatorWithTransforms tests transform cases

rdblue · 2025-02-18T17:15:02Z

core/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluatorWithExtract.java

+
+  @ParameterizedTest
+  @FieldSource("MISSING_STATS_EXPRESSIONS")
+  public void testMissingStats(Expression expr) {


If the bounds are in metadata, they are trusted. Bounds should be missing in cases where there are variant values with incompatible types.
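As a rough illustration of that rule, the sketch below treats a missing bound as "rows might match" and trusts any bound that is present. The class and method names are hypothetical, and a plain map from path to Long stands in for the deserialized variant bounds.

```java
import java.util.Map;

// Hypothetical sketch: a plain map stands in for deserialized variant lower bounds.
class MissingBoundsSketch {
  // Evaluates extract(col, path) < literal against a file's lower bounds.
  // Returns false only when the bounds prove that no row can match.
  static boolean mightMatchLessThan(Map<String, Long> lowerBounds, String path, long literal) {
    Long lower = lowerBounds.get(path);
    if (lower == null) {
      // No bound recorded (for example, values of incompatible types): keep the file.
      return true;
    }
    // A recorded bound is trusted: if the smallest value is not below the literal, skip.
    return lower < literal;
  }

  public static void main(String[] args) {
    Map<String, Long> lowerBounds = Map.of("$.event_id", 30L);
    System.out.println(mightMatchLessThan(lowerBounds, "$.event_id", 10L));
    System.out.println(mightMatchLessThan(lowerBounds, "$.event_id", 50L));
    System.out.println(mightMatchLessThan(lowerBounds, "$.other", 10L));
  }
}
```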

rdblue · 2025-02-18T17:15:49Z

api/src/main/java/org/apache/iceberg/variants/VariantDataUtil.java

+    return current;
+  }
+
+  public static ByteBuffer serializeBounds(Map<String, VariantValue> bounds) {


This serialization is temporary and will be replaced with a well-defined variant serialization.

rdblue · 2025-02-21T18:58:17Z

This needed changes to TestSparkScan because adding support for transforms filters out additional files in unpartitioned tables.

rdblue · 2025-02-21T21:38:47Z

This now relies on #12374 to move the serialized variant classes into API so that InclusiveMetricsEvaluator can use them for deserializing bounds.

api/src/main/java/org/apache/iceberg/variants/VariantUtil.java

rdblue · 2025-02-26T00:48:20Z

Thanks for reviewing, @danielcweeks!

github-actions bot added API core labels Feb 18, 2025

rdblue commented Feb 18, 2025

rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from 0c80b18 to 09fcf7b Compare February 18, 2025 21:20

Fokko self-requested a review February 20, 2025 09:42

rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch 2 times, most recently from 12b7f7c to b1b068d Compare February 21, 2025 17:59

github-actions bot added the spark label Feb 21, 2025

rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from b1b068d to 8faa5f1 Compare February 21, 2025 18:01

rdblue mentioned this pull request Feb 21, 2025

API: Move Variant interfaces and serialized implementations to API #12374

Merged

rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from 8faa5f1 to bcdea90 Compare February 21, 2025 21:37

rdblue added 6 commits February 21, 2025 15:18

API: Update InclusiveMetricsEvaluator for transforms, extract.

4d74173

Core: Add tests for InclusiveMetricsEvaluator with Extract functions.

9417341

Core: Add tests for InclusiveMetricsEvaluator with transforms.

ccb8d21

Fix TestSparkScan.

8b52941

API: Use VariantObject for serialized lower/upper bounds.

c3c451e

API: Remove unnecessary util class VariantDataUtil.

d1a260c

rdblue force-pushed the variant-update-inclusive-metrics-evaluator branch from 37a0c04 to d1a260c Compare February 21, 2025 23:19

Fix TestSparkScan in 3.5.

7f17553

danielcweeks reviewed Feb 25, 2025

api/src/main/java/org/apache/iceberg/variants/VariantUtil.java

Move castTo into VariantExpressionUtil.

31362e8

danielcweeks approved these changes Feb 25, 2025

rdblue merged commit 2f88ff6 into apache:main Feb 26, 2025
43 checks passed
