[SPARK-52195][PYTHON][SS] Fix initial state column dropping issue for Python TWS #50926

Open · wants to merge 4 commits into master
Conversation

bogao007
Contributor

@bogao007 bogao007 commented May 16, 2025

What changes were proposed in this pull request?

Fix the initial state column dropping issue for Python TWS. The issue may occur when the user adds extra transformations after the TransformWithStateInPandas operator; in that case the initial state columns get pruned during optimization.

Why are the changes needed?

Without this fix, users cannot use an initial state with TransformWithStateInPandas if they apply extra transformations afterwards.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test case.

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -811,7 +811,7 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       isStreaming = true,
       hasInitialState,
       planLater(initialState),
-      t.rightAttributes,
+      t.rightAttributes(),
Contributor

Why do we need this?

Contributor Author

Because we changed rightAttributes to take a parameter. Even though the parameter has a default value, the compiler still requires the parentheses at the call site.
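Editorial aside: the compiler behavior mentioned here is specific to Scala, where a method declared with a parameter list must be applied with parentheses even when every parameter is defaulted. A loose Python analogy (hypothetical names, not code from this PR) shows why the bare name and the parenthesized call are different things:

```python
class Plan:
    """Toy stand-in for a logical plan node (hypothetical, not Spark's)."""

    def right_attributes(self, include_initial_state_columns=False):
        # The parameter has a default, but invoking the method still
        # requires parentheses; the bare name is just the bound method.
        cols = ["key", "value"]
        if include_initial_state_columns:
            cols.append("init_state")
        return cols

p = Plan()
print(p.right_attributes())          # ['key', 'value'] -- default used
print(p.right_attributes(True))      # ['key', 'value', 'init_state']
print(callable(p.right_attributes))  # True -- bare name is a method object
```

In Scala the analogous mistake is a compile error rather than a silently unused method value, which is why the call sites had to change to `rightAttributes()`.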

Contributor

@jingz-db jingz-db May 16, 2025

Why do we pass includesInitialStateColumns as false here, while we pass it as true inside references?

Contributor Author

@bogao007 bogao007 May 16, 2025

Here we need to pass initialStateGroupingAttrs as the input to TransformWithStateInPySparkExec, which should not include the other initial state columns. We only need to add those columns to references.

assert(resolved, "This method is expected to be called after resolution.")
if (hasInitialState) {
right.output.take(initGroupingAttrsLen)
if (includesInitialStateColumns) {
// Include the initial state columns in the references to avoid being column pruned.
Contributor

@jingz-db jingz-db May 16, 2025

If I understand your PR description correctly, the column pruning happens inside the optimizer? Do you have a code pointer to where in the optimizer the columns get pruned?

Contributor Author

Yeah, it happens when Spark applies the ColumnPruning rule. Since we didn't add these columns to references, the ColumnPruning rule assumes they can be dropped.
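To illustrate the mechanism being discussed, here is a minimal self-contained sketch (hypothetical toy classes, not Spark's actual ColumnPruning implementation) of how a pruning rule drops any child output column the operator does not declare in its references:

```python
from dataclasses import dataclass

@dataclass
class Operator:
    """Toy logical operator: its child's output and its declared references."""
    child_output: list
    references: set

def prune_columns(op: Operator) -> list:
    # A column-pruning rule keeps only the child columns the operator
    # declares in `references`; anything missing is treated as droppable.
    return [c for c in op.child_output if c in op.references]

# Before the fix: initial state columns absent from references get pruned.
before = Operator(child_output=["key", "value", "init_state"],
                  references={"key", "value"})

# After the fix: declaring them in references keeps them alive.
after = Operator(child_output=["key", "value", "init_state"],
                 references={"key", "value", "init_state"})

print(prune_columns(before))  # ['key', 'value']
print(prune_columns(after))   # ['key', 'value', 'init_state']
```

This is the essence of the bug: the operator genuinely needed the initial state columns, but because they were not listed in references, the optimizer considered them safe to drop.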

Contributor

@jingz-db jingz-db left a comment

Approved, and left some curious (but non-blocking) questions. Thanks for making the change! This was a difficult one to debug; thanks for your efforts!

@HyukjinKwon
Member

cc @HeartSaVioR

Contributor

@HeartSaVioR HeartSaVioR left a comment

Thanks for the fix. Just one suggestion. I'm not enforcing this; I just feel it'd be clearer. I'm OK if folks think this doesn't need further change.

@@ -215,10 +216,15 @@ case class TransformWithStateInPySpark(
       left.output.take(groupingAttributesLen)
     }

-  def rightAttributes: Seq[Attribute] = {
+  def rightAttributes(includesInitialStateColumns: Boolean = false): Seq[Attribute] = {
Contributor

Let's not make a single method serve two different purposes. Shall we add a rightReferences method to cover the new case?

Contributor Author

Sure, updated.
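The refactor agreed on here, splitting the flag-parameterized method into two purpose-named methods, can be sketched as follows (hypothetical Python names mirroring the Scala discussion, not the actual Spark code):

```python
class TransformWithStateNode:
    """Toy stand-in for the TransformWithStateInPySpark logical node."""

    def __init__(self, right_output, init_grouping_len):
        self._right_output = right_output
        self._init_grouping_len = init_grouping_len

    def right_attributes(self):
        # Input attributes for the physical exec: grouping attributes only.
        return self._right_output[:self._init_grouping_len]

    def right_references(self):
        # Everything the operator must declare as referenced, including the
        # initial state columns, so column pruning does not drop them.
        return list(self._right_output)

node = TransformWithStateNode(["group_key", "init_a", "init_b"],
                              init_grouping_len=1)
print(node.right_attributes())   # ['group_key']
print(node.right_references())   # ['group_key', 'init_a', 'init_b']
```

Two single-purpose methods make each call site self-documenting, which is the reviewer's point: a boolean flag forces every reader to look up what the flag changes.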

@HeartSaVioR
Contributor

@bogao007 Sorry, but could you please re-trigger the CI via an empty commit, or re-run just this module https://github.com/bogao007/spark/actions/runs/15123895213/job/42527399451 in the GitHub UI? I'd like to make sure no relevant modules are failing. Thanks!

5 participants