[SPARK-48508][CONNECT][PYTHON] Cache user specified schema in `DataFrame.{to, mapInPandas, mapInArrow}`

### What changes were proposed in this pull request?
Cache the user-specified schema in `DataFrame.{to, mapInPandas, mapInArrow}`.

### Why are the changes needed?
To avoid an extra RPC to get the schema.

### Does this PR introduce _any_ user-facing change?
No, it should only be an optimization.

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46848 from zhengruifeng/py_user_define_schema.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
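
The optimization described above can be illustrated with a minimal, self-contained sketch. It does not use PySpark: `FakeSession`, `analyze_schema`, `rpc_count`, `map_partitions`, and the toy `StructType` and `DataFrame` classes are all hypothetical stand-ins, and the `schema` property is an assumption about how a cached value would be consumed. Only the `_cached_schema` attribute and the `isinstance(schema, StructType)` guard come from the diff.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class StructType:
    """Hypothetical stand-in for pyspark.sql.types.StructType."""

    field_names: Tuple[str, ...]


class FakeSession:
    """Hypothetical client that counts schema round trips to the server."""

    def __init__(self) -> None:
        self.rpc_count = 0

    def analyze_schema(self, plan: str) -> StructType:
        # Stands in for the RPC that asks the server to resolve a plan's schema.
        self.rpc_count += 1
        return StructType(field_names=("id",))


class DataFrame:
    """Toy DataFrame that mirrors the caching pattern in the diff."""

    def __init__(self, plan: str, session: FakeSession) -> None:
        self._plan = plan
        self._session = session
        self._cached_schema: Optional[StructType] = None

    @property
    def schema(self) -> StructType:
        # Assumed behavior: fall back to an RPC only when nothing is cached.
        if self._cached_schema is None:
            self._cached_schema = self._session.analyze_schema(self._plan)
        return self._cached_schema

    def to(self, schema: StructType) -> "DataFrame":
        res = DataFrame(f"ToSchema({self._plan})", self._session)
        # The caller already supplied the result schema, so remember it.
        res._cached_schema = schema
        return res

    def map_partitions(self, schema: Union[StructType, str]) -> "DataFrame":
        res = DataFrame(f"MapPartitions({self._plan})", self._session)
        # Cache only a structured schema; a string is left for the server.
        if isinstance(schema, StructType):
            res._cached_schema = schema
        return res


session = FakeSession()
df = DataFrame("Range(0, 10)", session)
target = StructType(field_names=("id",))

# A user-specified StructType is served from the cache: no round trip.
cached = df.to(target).schema
rpcs_after_to = session.rpc_count

# A string schema is not cached, so reading the schema costs one round trip.
resolved = df.map_partitions("id long").schema
rpcs_after_string = session.rpc_count
```

In this sketch, reading `schema` after `to(target)` leaves `rpc_count` at zero, while the string-schema path increments it once. The guard presumably exists because a schema passed as a string would still need to be parsed before it could serve as the cached `StructType`.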
tlm365 · Jun 3, 2024 · 6272c05
1 parent cfb79d9
Showing 1 changed file with 7 additions and 2 deletions.
diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
@@ -1825,10 +1825,12 @@ def inputFiles(self) -> List[str]:
 
     def to(self, schema: StructType) -> ParentDataFrame:
         assert schema is not None
-        return DataFrame(
+        res = DataFrame(
             plan.ToSchema(child=self._plan, schema=schema),
             session=self._session,
         )
+        res._cached_schema = schema
+        return res
 
     def toDF(self, *cols: str) -> ParentDataFrame:
         for col_ in cols:
@@ -2009,7 +2011,7 @@ def _map_partitions(
             evalType=evalType,
         )
 
-        return DataFrame(
+        res = DataFrame(
             plan.MapPartitions(
                 child=self._plan,
                 function=udf_obj,
@@ -2019,6 +2021,9 @@ def _map_partitions(
             ),
             session=self._session,
         )
+        if isinstance(schema, StructType):
+            res._cached_schema = schema
+        return res
 
     def mapInPandas(
         self,