apache · zhengruifeng · Sep 10, 2024
diff --git a/...ore/src/main/scala/org/apache/spark/sql/execution/python/FlatMapCoGroupsInArrowExec.scala b/...ore/src/main/scala/org/apache/spark/sql/execution/python/FlatMapCoGroupsInArrowExec.scala
@@ -23,21 +23,21 @@ import org.apache.spark.sql.execution.SparkPlan
 
 
 /**
- * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas]]
+ * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInArrow]]
  *
  * The input dataframes are first Cogrouped.  Rows from each side of the cogroup are passed to the
  * Python worker via Arrow.  As each side of the cogroup may have a different schema we send every
  * group in its own Arrow stream.
- * The Python worker turns the resulting record batches to `pandas.DataFrame`s, invokes the
- * user-defined function, and passes the resulting `pandas.DataFrame`
+ * The Python worker turns the resulting record batches to `pyarrow.Table`s, invokes the
+ * user-defined function, and passes the resulting `pyarrow.Table`
  * as an Arrow record batch. Finally, each record batch is turned to
  * Iterator[InternalRow] using ColumnarBatch.
  *
  * Note on memory usage:
  * Both the Python worker and the Java executor need to have enough memory to
  * hold the largest cogroup. The memory on the Java side is used to construct the
  * record batches (off heap memory). The memory on the Python side is used for
- * holding the `pandas.DataFrame`. It's possible to further split one group into
+ * holding the `pyarrow.Table`. It's possible to further split one group into
  * multiple record batches to reduce the memory footprint on the Java side, this
  * is left as future work.
  */

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInArrowExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInArrowExec.scala
@@ -25,19 +25,19 @@ import org.apache.spark.sql.types.{StructField, StructType}
 
 
 /**
- * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas]]
+ * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInArrow]]
  *
  * Rows in each group are passed to the Python worker as an Arrow record batch.
- * The Python worker turns the record batch to a `pandas.DataFrame`, invoke the
- * user-defined function, and passes the resulting `pandas.DataFrame`
+ * The Python worker turns the record batch to a `pyarrow.Table`, invokes the
+ * user-defined function, and passes the resulting `pyarrow.Table`
  * as an Arrow record batch. Finally, each record batch is turned to
  * Iterator[InternalRow] using ColumnarBatch.
  *
  * Note on memory usage:
  * Both the Python worker and the Java executor need to have enough memory to
  * hold the largest group. The memory on the Java side is used to construct the
  * record batch (off heap memory). The memory on the Python side is used for
- * holding the `pandas.DataFrame`. It's possible to further split one group into
+ * holding the `pyarrow.Table`. It's possible to further split one group into
  * multiple record batches to reduce the memory footprint on the Java side, this
  * is left as future work.
  */