[SPARK-53614][PYTHON] Add Iterator[pandas.DataFrame] support to applyInPandas
#52716
Conversation
zhengruifeng left a comment:
LGTM, only a few minor comments
python/pyspark/sql/connect/group.py (Outdated)

```python
@@ -294,14 +294,26 @@ def applyInPandas(
) -> "DataFrame":
    from pyspark.sql.connect.udf import UserDefinedFunction
    from pyspark.sql.connect.dataframe import DataFrame
    from pyspark.sql.pandas.typehints import infer_group_pandas_eval_type_from_func
    import warnings
```
Suggestion: remove the `import warnings` line.
removed
```python
        self.assertEqual(expected, result)

    def test_apply_in_pandas_iterator_with_keys_batch_slicing(self):
        from typing import Iterator, Tuple, Any
```
such imports should be moved to the head of the file
moved
```python
    def test_apply_in_pandas_iterator_process_multiple_input_batches(self):
        from typing import Iterator
        import builtins
```
Why do we need to import `builtins`?
I think there is no name conflict if we use sf.max/min/sum in this file.
Somehow when I use `sum` directly it would use `column.sum`. Do you know the reason? I changed it to use `builtins` to avoid this conflict.
moved typing import
We don't have `column.sum`; do you mean `sf.sum`?
In some test files, `sum` is imported, so the builtin `sum` is overridden.
```python
        )

        # Verify that all rows are present after concatenation
        self.assertEqual(len(result), 6)
```
let's directly compare the rows
self.assertEqual(result, [Row(...), Row(...), ...])
updated
```python
    if is_iterator_dataframe or is_iterator_dataframe_with_keys:
        return PythonEvalType.SQL_GROUPED_MAP_PANDAS_ITER_UDF

    # Default to non-iterator (standard grouped map)
    return PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF
```
Suggested change:

```python
if is_iterator_dataframe or is_iterator_dataframe_with_keys:
    return PythonEvalType.SQL_GROUPED_MAP_PANDAS_ITER_UDF
# Default to non-iterator (standard grouped map)
return PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF
```
this part should match
spark/python/pyspark/sql/pandas/typehints.py
Lines 368 to 379 in 9e12201
```python
# pa.Table -> pa.Table
is_table = (
    len(parameters_sig) == 1 and parameters_sig[0] == pa.Table and return_annotation == pa.Table
)
# Tuple[pa.Scalar, ...], pa.Table -> pa.Table
is_table_with_keys = (
    len(parameters_sig) == 2 and parameters_sig[1] == pa.Table and return_annotation == pa.Table
)
if is_table or is_table_with_keys:
    return PythonEvalType.SQL_GROUPED_MAP_ARROW_UDF
return None
```
we can align it in a follow-up
let's discuss and do it in a follow-up if needed.
thanks, merged to master
…plyInPandas`
### What changes were proposed in this pull request?
This PR adds support for the `Iterator[pandas.DataFrame]` API in `groupBy().applyInPandas()`, enabling batch-by-batch processing of grouped data for improved memory efficiency and scalability.
#### Key Changes:
1. **New PythonEvalType**: Added `SQL_GROUPED_MAP_PANDAS_ITER_UDF` to distinguish iterator-based UDFs from standard grouped map UDFs
2. **Type Inference**: Implemented automatic detection of iterator signatures:
- `Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]`
- `Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]`
3. **Streaming Serialization**: Created `GroupPandasIterUDFSerializer` that streams results without materializing all DataFrames in memory
4. **Configuration Change**: Updated `FlatMapGroupsInPandasExec` which was hardcoding `pythonEvalType = 201` instead of extracting it from the UDF expression (mirrored fix from `FlatMapGroupsInArrowExec`)
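The iterator signatures in item 2 can be detected from type hints alone. Below is a minimal, self-contained sketch of that detection; the real helper is `infer_group_pandas_eval_type_from_func` in `pyspark.sql.pandas.typehints`, and the name `matches_iterator_signature` and the exact checks here are simplified illustrations, not Spark's implementation:

```python
# Hypothetical sketch of signature-based detection for iterator UDFs.
import inspect
from typing import Any, Iterator, Tuple, get_type_hints

import pandas as pd

ITER_PDF = Iterator[pd.DataFrame]

def matches_iterator_signature(func) -> bool:
    """True if func is Iterator[pd.DataFrame] -> Iterator[pd.DataFrame],
    or Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]."""
    hints = get_type_hints(func)
    params = list(inspect.signature(func).parameters)
    if hints.get("return") != ITER_PDF:
        return False
    if len(params) == 1:
        # Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
        return hints.get(params[0]) == ITER_PDF
    if len(params) == 2:
        # Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
        return (
            hints.get(params[0]) == Tuple[Any, ...]
            and hints.get(params[1]) == ITER_PDF
        )
    return False
```

Functions matching neither shape fall back to the standard grouped-map path.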
### Why are the changes needed?
The existing `applyInPandas()` API loads entire groups into memory as single DataFrames. For large groups, this can cause OOM errors. The iterator API allows:
- **Memory Efficiency**: Process data batch-by-batch instead of materializing entire groups
- **Scalability**: Handle arbitrarily large groups that don't fit in memory
- **Consistency**: Mirrors the existing `applyInArrow()` iterator API design
### Does this PR introduce any user-facing changes?
Yes, this PR adds a new API variant for `applyInPandas()`:
#### Before (existing API, still supported):
```python
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
return pdf.assign(v=(pdf.v - pdf.v.mean()) / pdf.v.std())
df.groupBy("id").applyInPandas(normalize, schema="id long, v double")
```
#### After (new iterator API):
```python
from typing import Iterator
def normalize(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
# Process data batch-by-batch
for batch in batches:
yield batch.assign(v=(batch.v - batch.v.mean()) / batch.v.std())
df.groupBy("id").applyInPandas(normalize, schema="id long, v double")
```
#### With Grouping Keys:
```python
from typing import Iterator, Tuple, Any
def sum_by_key(key: Tuple[Any, ...], batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
total = 0
for batch in batches:
total += batch['v'].sum()
yield pd.DataFrame({"id": [key[0]], "total": [total]})
df.groupBy("id").applyInPandas(sum_by_key, schema="id long, total double")
```
**Backward Compatibility**: The existing DataFrame-to-DataFrame API is fully preserved and continues to work without changes.
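To make the contract concrete, here is a Spark-free sketch that simulates what the engine does for the keyed variant: slice each group into batches, feed them to the UDF as an iterator, and concatenate the yielded results. `run_grouped_iter_udf` and `batch_size` are illustrative names, not Spark APIs:

```python
# Local simulation of the keyed iterator UDF contract (illustrative only).
from typing import Any, Iterator, Tuple

import pandas as pd

def sum_by_key(key: Tuple[Any, ...],
               batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0.0
    for batch in batches:  # only one batch is held at a time
        total += float(batch["v"].sum())
    yield pd.DataFrame({"id": [key[0]], "total": [total]})

def run_grouped_iter_udf(df: pd.DataFrame, keys, func, batch_size: int = 2) -> pd.DataFrame:
    """Group df by `keys`, slice each group into batches, apply func."""
    out = []
    for key, group in df.groupby(keys):
        key = key if isinstance(key, tuple) else (key,)
        batches = (group.iloc[i:i + batch_size]
                   for i in range(0, len(group), batch_size))
        out.extend(func(key, batches))
    return pd.concat(out, ignore_index=True)
```

In Spark, the slicing is driven by the Arrow batch size rather than a `batch_size` argument, and batches arrive over the worker's Arrow stream instead of from `iloc`.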
### How was this patch tested?
- Added `test_apply_in_pandas_iterator_basic` - Basic functionality test
- Added `test_apply_in_pandas_iterator_with_keys` - Test with grouping keys
- Added `test_apply_in_pandas_iterator_batch_slicing` - Pressure test with 10M rows, 20 columns
- Added `test_apply_in_pandas_iterator_with_keys_batch_slicing` - Pressure test with keys
### Was this patch authored or co-authored using generative AI tooling?
Yes, tests generated by Cursor.
Closes apache#52716 from Yicong-Huang/SPARK-53614/feat/add-apply-in-pandas.
Lead-authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Co-authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…lizer` with `GroupPandasUDFSerializer`

### What changes were proposed in this pull request?
This PR consolidates `GroupPandasUDFSerializer` to support both `SQL_GROUPED_MAP_PANDAS_UDF` and `SQL_GROUPED_MAP_PANDAS_ITER_UDF`, aligning with the design pattern used by `GroupArrowUDFSerializer`.

### Why are the changes needed?
When the `Iterator[pandas.DataFrame]` API was added to `groupBy().applyInPandas()` in SPARK-53614 (#52716), a new `GroupPandasIterUDFSerializer` class was created. However, this class is nearly identical to `GroupPandasUDFSerializer`, differing only in whether batches are processed lazily (iterator mode) or all at once (regular mode).

### Does this PR introduce _any_ user-facing change?
No, this is an internal refactoring that maintains backward compatibility. The API behavior remains the same from the user's perspective.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
Co-Generated-by: Cursor with Claude 4.5 Sonnet

Closes #53043 from Yicong-Huang/SPARK-54316/refactor/consolidate-pandas-iter-serializer.

Lead-authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
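The lazy-versus-eager distinction that motivates the consolidation can be sketched as a single code path with one flag. This is an illustrative toy, not Spark's actual `GroupPandasUDFSerializer`:

```python
# Toy sketch of one serializer handling both modes (hypothetical class name).
from typing import Iterator

import pandas as pd

class GroupPandasSerializerSketch:
    def __init__(self, lazy: bool):
        # lazy=True mimics the SQL_GROUPED_MAP_PANDAS_ITER_UDF path,
        # lazy=False mimics the regular grouped-map path.
        self.lazy = lazy

    def dump_stream(self, result_batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        if self.lazy:
            # Stream: at most one result batch is buffered at a time.
            for batch in result_batches:
                yield batch
        else:
            # Materialize all results first, then emit them.
            for batch in list(result_batches):
                yield batch
```

Since the two branches differ only in when results are materialized, a single class with a mode flag avoids maintaining two nearly identical serializers.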