[SPARK-52450][CONNECT] Improve performance of schema deepcopy #51157


Closed
xi-db wants to merge 9 commits into master from xi-db:schema-deepcopy-improvement

Conversation

xi-db (Contributor) commented Jun 11, 2025

What changes were proposed in this pull request?

In Spark Connect, DataFrame.schema returns a deep copy of the schema to prevent unexpected behavior caused by user modifications to the returned schema object. However, if a user accesses df.schema repeatedly on a DataFrame with a complex schema, it can lead to noticeable performance degradation.

The performance issue can be reproduced using the code snippet below. Since copy.deepcopy is known to be slow on complex objects, this PR replaces it with pickle-based serialization/deserialization to improve the performance of df.schema access. Given pickle's limitations, the implementation falls back to deepcopy in cases where pickling fails.

from pyspark.sql.types import StructType, StructField, StringType

def make_nested_struct(level, max_level, fields_per_level):
    if level == max_level - 1:
        return StructType(
            [StructField(f"f{level}_{i}", StringType(), True) for i in range(fields_per_level)])
    else:
        return StructType(
            [StructField(f"s{level}_{i}",
                         make_nested_struct(level + 1, max_level, fields_per_level), True) for i in
             range(fields_per_level)])

# Create a 4-level nested schema with 10,000 leaf fields in total
schema = make_nested_struct(0, 4, 10)

The existing approach needs 21.9s to copy the schema 100 times.

import copy
import timeit

timeit.timeit(lambda: copy.deepcopy(schema), number=100)
# 21.9

The updated approach only needs 2.0s to copy it 100 times:

from pyspark.serializers import CPickleSerializer
cached_schema_serialized = CPickleSerializer().dumps(schema)

timeit.timeit(lambda: CPickleSerializer().loads(cached_schema_serialized), number=100)
# 2.0

Why are the changes needed?

It improves the performance when calling df.schema many times.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests and new tests.

Was this patch authored or co-authored using generative AI tooling?

No.

xi-db (Contributor, Author) commented Jun 11, 2025

Hi @hvanhovell @vicennial , could you take a look at this PR? Thanks.

try:
    self._cached_schema_serialized = CPickleSerializer().dumps(self._schema)
except Exception as e:
    logger.warn(f"DataFrame schema pickle dumps failed with exception: {e}.")
Contributor:

In what cases do we think this will happen?

xi-db (Contributor, Author):

I think it never happens, because the schema consists only of nested Spark type classes. It shouldn't hit any of the special types that pickle doesn't support (link). Still, maybe we should handle the exception just in case. What do you think?
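
For illustration only (not part of the PR): plain Spark type objects pickle fine, while something like a lambda does not, which is the kind of failure the except branch above would catch:

```
import pickle

from pyspark.sql.types import StringType, StructField, StructType

# Nested Spark type objects are ordinary picklable classes.
pickle.dumps(StructType([StructField("f0", StringType(), True)]))

# pickle serializes functions by reference, so lambdas cannot be pickled;
# an error like this is what the fallback path is guarding against.
try:
    pickle.dumps(lambda: "x")
except Exception as e:
    print(type(e).__name__)  # e.g. PicklingError
```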

heyihong (Contributor) commented Jun 12, 2025:

It would be nice to add some comments to the code to make it easier for future readers to understand. Without looking at the corresponding pull request, it's currently not very clear why a CPickleSerializer is used or why an error is being handled.

Also, it may be clearer to create a function called _fast_cached_schema_deepcopy that caches the serialized schema and then deserializes it.
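
For concreteness, a rough sketch of what such a helper might look like; the names `_cached_schema_serialized` and `_schema` mirror the snippet above, but everything else here is illustrative rather than the PR's actual implementation:

```
import copy

from pyspark.serializers import CPickleSerializer


def _fast_cached_schema_deepcopy(df):
    """Return an independent copy of df's schema, reusing cached pickle bytes when possible."""
    if getattr(df, "_cached_schema_serialized", None) is None:
        try:
            # Serialize once; every later copy is a cheap loads() call.
            df._cached_schema_serialized = CPickleSerializer().dumps(df._schema)
        except Exception:
            # Leave the cache empty; the deepcopy fallback below handles this rare case.
            df._cached_schema_serialized = None
    if df._cached_schema_serialized is not None:
        return CPickleSerializer().loads(df._cached_schema_serialized)
    return copy.deepcopy(df._schema)
```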

grundprinzip (Contributor) left a comment:

Before merging, it would be great to test @hvanhovell's proposal: always reconstruct the schema from the proto response, and measure what the performance impact is.
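
For anyone running that comparison, a rough timeit harness could look like the following, reusing `schema` from the snippet in the description. The helpers `pyspark_types_to_proto_types` and `proto_schema_to_pyspark_data_type` from `pyspark.sql.connect.types` are assumptions here (their names and location may differ across versions), so treat this as a sketch of the measurement, not the exact setup:

```
import copy
import timeit

from pyspark.serializers import CPickleSerializer
# Assumed internal Connect helpers; adjust the imports to match the actual client code.
from pyspark.sql.connect.types import (
    proto_schema_to_pyspark_data_type,
    pyspark_types_to_proto_types,
)

schema_proto = pyspark_types_to_proto_types(schema)
pickled = CPickleSerializer().dumps(schema)

print(timeit.timeit(lambda: copy.deepcopy(schema), number=100))                             # deepcopy baseline
print(timeit.timeit(lambda: CPickleSerializer().loads(pickled), number=100))                # cached pickle
print(timeit.timeit(lambda: proto_schema_to_pyspark_data_type(schema_proto), number=100))   # rebuild from proto
```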

Co-authored-by: Martin Grund <grundprinzip@gmail.com>
xi-db force-pushed the schema-deepcopy-improvement branch from 9d02fdf to af1b276 on June 18, 2025 14:41
xi-db (Contributor, Author) commented Jun 18, 2025

Hi @zhengruifeng, could you help with the CI failures in pyspark-pandas-connect-part1? This PR doesn't change any Scala code, but sql/hive, connector/kafka, and connect/server fail to compile due to an sbt OutOfMemoryError. Do you have any idea what's going on? I've retriggered CI, but it still fails. Thanks.

[error] ## Exception when compiling 156 sources to /__w/spark/spark/sql/hive/target/scala-2.13/test-classes
[error] java.lang.OutOfMemoryError: Java heap space
[error] 
[error]            
[error] ## Exception when compiling 21 sources to /__w/spark/spark/connector/kafka-0-10-sql/target/scala-2.13/test-classes
[error] java.lang.OutOfMemoryError: Java heap space
[error] 
[error]            
[error] ## Exception when compiling 41 sources to /__w/spark/spark/sql/connect/server/target/scala-2.13/test-classes
[error] java.lang.OutOfMemoryError: Java heap space
[error] 
[error]            
[warn] javac exited with exit code -1
[info] Compilation has been cancelled
[info] Compilation has been cancelled
[warn] In the last 10 seconds, 5.032 (50.6%) were spent in GC. [Heap: 2.45GB free of 4.00GB, max 4.00GB] Consider increasing the JVM heap using `-Xmx` or try a different collector, e.g. `-XX:+UseG1GC`, for better performance.
java.lang.OutOfMemoryError: Java heap space
Error:  [launcher] error during sbt launcher: java.lang.OutOfMemoryError: Java heap space

Update: Never mind, it works now. Thanks anyway.

hvanhovell (Contributor):

Merging to master/4.0. Thanks!

asf-gitbox-commits pushed a commit that referenced this pull request Jun 20, 2025
Closes #51157 from xi-db/schema-deepcopy-improvement.

Lead-authored-by: Xi Lyu <xi.lyu@databricks.com>
Co-authored-by: Xi Lyu <159039256+xi-db@users.noreply.github.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit f502d66)
Signed-off-by: Herman van Hovell <herman@databricks.com>
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jun 25, 2025
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025