[SPARK-53928][SQL] Enhance DSV2 partition filtering using catalyst expression #52628

szehon-ho · 2025-10-15T22:43:15Z

What changes were proposed in this pull request?

Add new interfaces, HasPartitionKeys and KeyedPartitioning, to DSV2 so that a data source can report partition values. These are a superset of HasPartitionKey and KeyGroupedPartitioning (the latter requires the data source to group its InputPartitions by partition values and is mainly used for SPJ, storage-partitioned join). Spark uses the reported values for further partition-column filtering.

Why are the changes needed?

Currently, Spark converts a Catalyst Expression to either a Filter or a Predicate and pushes it to DSV2 via the SupportsPushDownFilters and SupportsPushDownV2Filters APIs.

However, some Spark filters do not convert cleanly, for example trim(part_col) = 'a'. In some cases a DSV2 source can report the exact partition value(s) of each InputPartition to Spark, and Spark can then use the original Catalyst expression for filtering.
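The idea can be illustrated with a minimal sketch in plain Scala, with no Spark dependency. `InputPartitionInfo`, `trimEqualsA`, and `prune` are hypothetical names chosen for illustration; they are not the interfaces added by this PR. The sketch assumes each input partition reports its partition values as a simple map, and models a predicate like trim(part_col) = 'a' as an ordinary function evaluated against those values.

```scala
// Hypothetical model, not Spark's API: each input partition reports the
// exact values of its partition columns.
object PartitionPruningSketch {
  final case class InputPartitionInfo(id: Int, partitionValues: Map[String, String])

  // Stand-in for a Catalyst expression such as trim(part_col) = 'a',
  // which has no direct V2 Filter/Predicate equivalent.
  val trimEqualsA: Map[String, String] => Boolean =
    values => values.get("part_col").exists(_.trim == "a")

  // Keep only the partitions whose reported values satisfy the predicate.
  def prune(
      partitions: Seq[InputPartitionInfo],
      predicate: Map[String, String] => Boolean): Seq[InputPartitionInfo] =
    partitions.filter(p => predicate(p.partitionValues))
}
```

Because the predicate runs on the reported values themselves, no translation to a V2 filter is needed; partitions whose values fail the predicate are simply never scanned.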

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

peter-toth · 2025-10-16T11:23:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala

-      val filterableScan = scan.asInstanceOf[SupportsRuntimeV2Filtering]
-      filterableScan.filter(dataSourceFilters.toArray)
+    // Apply additional filtering based on partition keys if available
+    if (allFilters.nonEmpty) {


It seems that when you create BatchScanExec you pass in allFilters, which contains both runtimeFilters and postScanFilters, but we already have runtimeFilters.
Would it make sense to pass in only postScanFilters and compute allFilters here?

Can allFilters be computed as the runtimeFilters that can't be translated to V2, plus postScanFilters? Or can we not be sure that filterableScan.filter() applies all translated runtimeFilters?
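A rough sketch of the split suggested above, in plain Scala. `Expr`, `V2Predicate`, `translate`, and `split` are hypothetical stand-ins, not Spark classes or methods, and the `translatable` flag is an assumption that replaces the real Catalyst-to-V2 translation logic.

```scala
object AllFiltersSketch {
  // Hypothetical stand-ins for Catalyst expressions and V2 predicates.
  final case class Expr(sql: String, translatable: Boolean)
  final case class V2Predicate(sql: String)

  // Stand-in for the Catalyst-to-V2 translation, which may fail.
  def translate(e: Expr): Option[V2Predicate] =
    if (e.translatable) Some(V2Predicate(e.sql)) else None

  // Returns (predicates pushed to the scan,
  //          expressions left for partition-key filtering).
  def split(
      runtimeFilters: Seq[Expr],
      postScanFilters: Seq[Expr]): (Seq[V2Predicate], Seq[Expr]) = {
    val pushed = runtimeFilters.flatMap(e => translate(e))
    val untranslated = runtimeFilters.filter(e => translate(e).isEmpty)
    (pushed, untranslated ++ postScanFilters)
  }
}
```

This split is only safe if the scan is guaranteed to apply every translated filter, which is exactly the open question in this comment; otherwise the translated runtime filters would also have to stay in the second list.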

peter-toth · 2025-10-16T11:35:47Z

...main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala

      }

-      d.copy(keyGroupedPartitioning = catalystPartitioning)
+      val catalystKeyedPartitioning = scan.outputPartitioning() match {


Since scan.outputPartitioning() is either KeyGroupedPartitioning or KeyedPartitioning (or something that we don't care about), can we merge the two matches that compute catalystPartitioning and catalystKeyedPartitioning into one?
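As a sketch of what a single match could look like, in plain Scala: the types `ReportedPartitioning`, `KeyGrouped`, `Keyed`, and `Unknown` below are hypothetical stand-ins for the DSV2 partitioning classes, and the sketch assumes the two cases are disjoint, which may not hold for the real interfaces.

```scala
object PartitioningMatchSketch {
  // Hypothetical stand-ins for what scan.outputPartitioning() may report.
  sealed trait ReportedPartitioning
  final case class KeyGrouped(keys: Seq[String]) extends ReportedPartitioning
  final case class Keyed(keys: Seq[String]) extends ReportedPartitioning
  case object Unknown extends ReportedPartitioning

  // One match yields both results:
  // (key-grouped partitioning keys, keyed partitioning keys).
  def classify(p: ReportedPartitioning): (Option[Seq[String]], Option[Seq[String]]) =
    p match {
      case KeyGrouped(keys) => (Some(keys), None)
      case Keyed(keys)      => (None, Some(keys))
      case Unknown          => (None, None)
    }
}
```

A single match keeps the two results consistent by construction, since at most one of them can be defined for a given reported partitioning.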

szehon-ho added 2 commits October 1, 2025 23:03

[SPARK-53786][SQL] Default value should not conflict with special col…

b0350a5

…umn name

Double filter

764bf7c

github-actions bot added SQL AVRO labels Oct 15, 2025

peter-toth reviewed Oct 16, 2025

add more tests

2b07587
