[SPARK-49601][SS][PYTHON] Support Initial State Handling for TransformWithStateInPandas #48005
Conversation
Force-pushed from 253e56d to 099d827.
sql/core/src/main/java/org/apache/spark/sql/execution/streaming/StateMessage.proto (outdated, resolved)
...main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala (outdated, resolved)
case _ =>
  throw new IllegalArgumentException("Invalid method call")
}
}

private def handleStatefulProcessorUtilRequest(message: UtilsCallCommand): Unit = {
Should we add some Scala unit tests for these two new APIs?
...main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala (outdated, resolved)
...main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala (outdated, resolved)
python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py (outdated, resolved)
yield pd.DataFrame({"id": key, "value": str(accumulated_value)})

def handleInitialState(self, key, initialState) -> None:
    initVal = initialState.at[0, "initVal"]
Can we add verifications on the initVal here?
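A hedged sketch of the verification the reviewer asks for. Here `state_row` is a plain dict standing in for the single-row pandas DataFrame that `handleInitialState` receives; the function name and error messages are illustrative, not Spark APIs:

```python
def validated_init_val(state_row: dict) -> int:
    # Illustrative check: fail fast if the initial-state row is malformed,
    # instead of silently seeding state with a bad value.
    if "initVal" not in state_row:
        raise ValueError("initial state row is missing 'initVal'")
    init_val = state_row["initVal"]
    if not isinstance(init_val, int):
        raise TypeError(f"expected int initVal, got {type(init_val).__name__}")
    return init_val
```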
@@ -402,6 +404,9 @@ def transformWithStateInPandas(
    The output mode of the stateful processor.
timeMode : str
    The time mode semantics of the stateful processor for timers and TTL.
initialState: "GroupedData"
Let's use something like below to represent the actual type.
:class:`pyspark.sql.types.DataType`
...main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala (outdated, resolved)
...ain/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasPythonRunner.scala (outdated, resolved)
) -> Iterator["PandasDataFrameLike"]:
    handle = StatefulProcessorHandle(statefulProcessorApiClient)

    if statefulProcessorApiClient.handle_state == StatefulProcessorHandleState.CREATED:
There's something not very clear to me here, could you help me understand more? We only call `handleInitialState` when the handle state is CREATED, but after we process the initial state of the first grouping key, we update the state to INITIALIZED. Wouldn't that skip the initial state for all other grouping keys?
If my understanding is correct, we should move the `handleInitialState` call outside the handle-state check and do it after the `init` call.
You are correct. I moved the code block out and ran a local test with the partition number set to "1" to confirm the implementation is correct.
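The fix discussed above can be sketched as below (stdlib-only, hypothetical names): `init` stays guarded by the handle state so it runs exactly once, while `handleInitialState` now runs for every grouping key that carries initial state:

```python
CREATED, INITIALIZED = "CREATED", "INITIALIZED"

class RecordingProcessor:
    """Toy processor that records which hooks were invoked."""
    def __init__(self):
        self.calls = []
    def init(self, handle):
        self.calls.append("init")
    def handleInitialState(self, key, initial_state):
        self.calls.append(("initial_state", key))
    def handleInputRows(self, key, rows):
        self.calls.append(("input_rows", key))
        return rows

def process_keys(processor, groups):
    # groups: {key: (input_rows, initial_state_or_None)}
    handle_state = CREATED
    out = []
    for key, (rows, init_state) in groups.items():
        if handle_state == CREATED:      # init runs exactly once
            processor.init(None)
            handle_state = INITIALIZED
        if init_state is not None:       # ...but initial state is seeded per key
            processor.handleInitialState(key, init_state)
        out.extend(processor.handleInputRows(key, rows))
    return out
```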
statefulProcessorApiClient: StatefulProcessorApiClient,
key: Any,
inputRows: Iterator["PandasDataFrameLike"],
# for non first batch, initialStates will be None
For a non-first batch, would initialStates be None or empty?
Added the above in the comments with other input combinations.
It would be None. This is a bit hacky: we pick the Python eval type purely based on whether the input `initialState` dataframe is None or not. For a non-empty input initial state and a non-first batch, we will still eval the UDF as `transformWithStateWithInitStateUDF` here. Since the JVM starts a PythonRunner with eval type `transformWithStateUDF` for non-first batches, `initialStates` falls back to its default positional value: `initialStates: Iterator["PandasDataFrameLike"] = None`.
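The dispatch described above can be sketched as follows (hypothetical helper returning the UDF name as a string; the real eval-type constants live in `PythonEvalType`):

```python
def pick_udf(initial_state) -> str:
    # The eval type is chosen purely from whether an initialState DataFrame
    # was provided at plan time; for non-first batches the JVM starts the
    # plain runner and initialStates arrives as None in the Python worker.
    if initial_state is not None:
        return "transformWithStateWithInitStateUDF"
    return "transformWithStateUDF"
```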
inputRows: Iterator["PandasDataFrameLike"],
# for non first batch, initialStates will be None
initialStates: Iterator["PandasDataFrameLike"] = None
) -> Iterator["PandasDataFrameLike"]:
Can we add some comments on the possible input combinations that we need to handle in this udf, for people to understand easier? IIUC there should be 3 cases:
- Both `inputRows` and `initialStates` contain data. This would only happen in the first batch, when the associated grouping key has both input data and initial state.
- Only `inputRows` contains data. This could happen when either the grouping key doesn't have any initial state to process, or it's a non-first batch.
- Only `initialStates` contains data. This could happen when the grouping key doesn't have any associated input data but has initial state to process.
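The three cases the reviewer lists can be sketched as a small classifier (stdlib-only; in the real UDF both arguments are iterators of pandas DataFrames, and the names here are illustrative):

```python
def classify(input_rows: list, initial_states) -> str:
    # initial_states is None outside the first batch; see discussion below.
    has_input = len(input_rows) > 0
    has_init = initial_states is not None and len(initial_states) > 0
    if has_input and has_init:
        return "first batch: key has both input data and initial state"
    if has_input:
        return "key has input data only (no initial state, or non-first batch)"
    if has_init:
        return "first batch: key has initial state only"
    return "nothing to do"
```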
Added the above in the comment.
LGTM overall, just some nits.
seen_init_state_on_key = False
for cur_initial_state in initialStates:
    if seen_init_state_on_key:
        raise Exception(f"TransformWithStateWithInitState: Cannot have more "
Nit: let's include the TODO for classifying the errors here.
I am removing this check as we'll allow multiple value rows for the same grouping key as part of the integration of supporting initial state handling with state reader source (for flattened list/map state, there will be multiple value rows with the same grouping key in the output dataframe).
Force-pushed from 8e90c2e to 45459d9.
First pass
@@ -409,6 +410,9 @@ def transformWithStateInPandas(
    The output mode of the stateful processor.
timeMode : str
    The time mode semantics of the stateful processor for timers and TTL.
initialState : :class:`pyspark.sql.GroupedData`
    Optional. The grouped dataframe on given grouping key as initial states used for initialization
nit: Now the method docs for the Scala version and the PySpark version have diverged, not only in the type (which is expected) but also in the description itself. For example, here is the explanation of `initialState` in the Scala API:

User provided initial state that will be used to initiate state for the query in the first batch.

Probably better to revisit both API docs at some point and sync the two. Before doing that, I think the part "on given grouping key" is redundant and causes confusion. We should have checked the compatibility of the grouping key between the two groups (the current Dataset, and the Dataset for initialState), right? If so, we could just remove it.
""" | ||
UDF for TWS operator with non-empty initial states. Possible input combinations | ||
of inputRows and initialStates iterator: | ||
- Both `inputRows` and `initialStates` are non-empty: for the given key, both input rows |
nit: "both input rows and initial states contains the grouping key" sounds redundant, since we already call out "for the given key". inputRows and initialStates are expected to be a flattened Dataset (not a grouped one), right? Their grouping key is the given key.
ditto for all others
Good points! Removed redundant words.
of inputRows and initialStates iterator:
- Both `inputRows` and `initialStates` are non-empty: for the given key, both input rows
  and initial states contains the grouping key, both input rows and initial states contains data.
- `InitialStates` is non-empty, while `initialStates` is empty. For the given key, only
nit: "`InitialStates` is non-empty, while `initialStates` is empty" — you may want to change either one.
  initial states contains the grouping key and data, and it is first batch.
- `initialStates` is empty, while `inputRows` is not empty. For the given grouping key, only inputRows
  contains the grouping key and data, and it is first batch.
- `initialStates` is None, while `inputRows` is not empty. This is not first batch. `initialStates`
This represents the difference between an empty Dataset (or iterator) and None, right? Just to make clear.
Yes, an empty Dataset is different from None. When we are in a non-first batch, `initialStates` will be None.
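A tiny sketch of that distinction, using the naming above (illustrative helper, not a Spark API): an empty iterator means a first-batch key without initial state, while None means the initial-state channel does not exist at all because this is not the first batch:

```python
def initial_state_kind(initial_states) -> str:
    # None and "empty" are deliberately distinguished here.
    if initial_states is None:
        return "non-first batch: no initial-state channel"
    if len(list(initial_states)) == 0:
        return "first batch: key has no initial state"
    return "first batch: key has initial state"
```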
# only process initial state if first batch
is_first_batch = statefulProcessorApiClient.is_first_batch()
if is_first_batch and initialStates is not None:
I'd expect caller to handle this; providing initialStates for non-first batch is already adding unnecessary overhead and ideally caller should provide None for non-first batch. I'm OK to double check here for safety purpose, but maybe I'd do opposite, assert that (!is_first_batch and initialStates is None) is True.
Yeah, we are only making an API call for safety purposes, and it introduces a small overhead. I am removing the check entirely; as you commented below, the API itself is a bit confusing.
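The reviewer's suggested invariant can be sketched as a cheap assert with no extra API round-trip (names illustrative):

```python
def check_init_state_invariant(is_first_batch: bool, initial_states) -> None:
    # For any batch after the first, the caller must pass initialStates=None;
    # anything else indicates a wiring bug upstream.
    assert is_first_batch or initial_states is None, (
        "initialStates must be None for non-first batches"
    )
```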
funcs, evalType, argOffsets, dataSchema, processorHandle, _timeZoneId,
initialWorkerConf, pythonMetrics, jobArtifactUUID, groupingKeySchema,
batchTimestampMs, eventTimeWatermarkForEviction, hasInitialState)
with PythonArrowInput[GroupedInType] {
ditto
eventTimeWatermarkForEviction: Option[Long],
hasInitialState: Boolean)
extends BasePythonRunner[I, ColumnarBatch](funcs.map(_._1), evalType, argOffsets, jobArtifactUUID)
with PythonArrowInput[I]
ditto for all `with` lines
writer: ArrowStreamWriter,
dataOut: DataOutputStream,
inputIterator:
    Iterator[GroupedInType]): Boolean = {
nit: shift it one line up (any reason it's placed on the next line?)
If the combined line exceeds 100 chars, `: Boolean = {` should only be on this line, with 2 spaces shifted left from the parameters.
)
return table_from_fields

for batch in batches:
Maybe better to have a brief comment about how the batch is constructed or some of its characteristics, or even where to read the code to understand the data structure. Personally I read this code before reading the part that builds the batch, and had to assume that a batch must only contain data from a single grouping key, otherwise it won't work.
@@ -536,6 +536,108 @@ def check_results(batch_df, batch_id):
    EventTimeStatefulProcessor(), check_results
)

def _test_transform_with_state_init_state_in_pandas(self, stateful_processor, check_results):
    input_path = tempfile.mkdtemp()
    self._prepare_test_resource1(input_path)
I see you are covering both cases in this test, which is great!
- grouping key in input, but not in initial state (1)
- grouping key in initial state, but not in input (3)
https://github.com/jingz-db/spark/actions/runs/11620673481/job/32364544070
SimpleStatefulProcessorWithInitialState(), check_results
)

def _test_transform_with_state_non_contiguous_grouping_cols(
shall we have the same test (non-contiguous grouping keys) for the path of initial state for completeness sake?
Second pass, I added a couple comments to address. Looks good to me otherwise.
@@ -567,7 +568,9 @@ class PythonEvalType:
SQL_GROUPED_MAP_ARROW_UDF: "ArrowGroupedMapUDFType" = 209
SQL_COGROUPED_MAP_ARROW_UDF: "ArrowCogroupedMapUDFType" = 210
SQL_TRANSFORM_WITH_STATE_PANDAS_UDF: "PandasGroupedMapUDFTransformWithStateType" = 211

SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF: "PandasGroupedMapUDFTransformWithStateInitStateType" = (  # noqa: E501
Had to add the `# noqa` here, else we won't pass the `./dev/lint-python` or flake8 check.
+1
Thanks! Merging to master.
What changes were proposed in this pull request?
This PR adds support for users to provide a DataFrame that can be used to instantiate state for the query in the first batch, for arbitrary state API v2 in Python.

The Scala PR for supporting initial state is here: #45467

We propose to create a new PythonRunner that handles initial state specifically for TransformWithStateInPandas. On the JVM side, we coGroup input rows and initial state rows on the same grouping key. Then we create a new row that contains one row from the input rows iterator and one row from the initial state iterator, and send the new grouped row to Py4J. Inside the Python worker, we deserialize the grouped row into input rows and initial state rows separately and feed those into `handleInitialState` and `handleInputRows`.

We will launch a Python worker for each partition that has non-empty rows in either the input rows or the initial states. This guarantees that all keys in the initial state will be processed, even if they do not appear in the first batch or do not lie in the same partition as keys in the first batch.
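The cogrouping described above can be sketched in plain Python (the real implementation cogroups on the JVM side and ships Arrow batches to the worker; this stand-in just shows the key-pairing behavior):

```python
from collections import defaultdict

def cogroup_by_key(input_rows, init_rows):
    # Pair input rows and initial-state rows by grouping key, so that every
    # key appearing on EITHER side yields exactly one group for the worker.
    grouped = defaultdict(lambda: ([], []))
    for key, row in input_rows:
        grouped[key][0].append(row)
    for key, row in init_rows:
        grouped[key][1].append(row)
    return dict(grouped)
```

Note that a key present only in the initial state still produces a group (with an empty input-rows side), which is what guarantees such keys get processed in the first batch.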
Why are the changes needed?
We need to keep the Python API in sync with Scala, which already supports initial state handling.
Does this PR introduce any user-facing change?
Yes.
This PR introduces a new API in the `StatefulProcessor` which allows users to define their own UDF for processing initial state. Implementing this function is optional; if not defined, it acts as a no-op.
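Roughly, the new hook has this shape (a simplified sketch, not the exact PySpark signatures): the base class provides a no-op default, so existing processors keep working unchanged, and overriding it lets a processor seed state from the initial-state rows:

```python
class StatefulProcessorSketch:
    def handleInitialState(self, key, initialState) -> None:
        # Optional hook: override to seed state from the initial-state rows.
        pass  # default is a no-op

class SeedingProcessor(StatefulProcessorSketch):
    """Toy override that records the initial state it was handed per key."""
    def __init__(self):
        self.seeded = {}
    def handleInitialState(self, key, initialState) -> None:
        self.seeded[key] = initialState
```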
How was this patch tested?
Unit tests & integration tests.
Was this patch authored or co-authored using generative AI tooling?
No.