[MINOR][DOCS] Fix scaladoc for `FlatMapGroupsInArrowExec` and `FlatMapCoGroupsInArrowExec`

### What changes were proposed in this pull request?
Fix scaladoc for `FlatMapGroupsInArrowExec` and `FlatMapCoGroupsInArrowExec`

### Why are the changes needed?
The existing scaladoc for both nodes was copy-pasted from their pandas counterparts and still referred to `pandas.DataFrame`, where these nodes actually exchange `pyarrow.Table`s.

### Does this PR introduce _any_ user-facing change?
Documentation change only; no behavior change.

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48052 from zhengruifeng/py_type_applyinxxx.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
zhengruifeng committed Sep 10, 2024
1 parent e918fb6 · commit ab7aea1
Showing 2 changed files with 8 additions and 8 deletions.
Scaladoc for `FlatMapCoGroupsInArrowExec`:

@@ -23,21 +23,21 @@ import org.apache.spark.sql.execution.SparkPlan
 
 
 /**
- * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas]]
+ * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInArrow]]
  *
  * The input dataframes are first Cogrouped. Rows from each side of the cogroup are passed to the
  * Python worker via Arrow. As each side of the cogroup may have a different schema we send every
  * group in its own Arrow stream.
- * The Python worker turns the resulting record batches to `pandas.DataFrame`s, invokes the
- * user-defined function, and passes the resulting `pandas.DataFrame`
+ * The Python worker turns the resulting record batches to `pyarrow.Table`s, invokes the
+ * user-defined function, and passes the resulting `pyarrow.Table`
  * as an Arrow record batch. Finally, each record batch is turned to
  * Iterator[InternalRow] using ColumnarBatch.
  *
  * Note on memory usage:
  * Both the Python worker and the Java executor need to have enough memory to
  * hold the largest cogroup. The memory on the Java side is used to construct the
  * record batches (off heap memory). The memory on the Python side is used for
- * holding the `pandas.DataFrame`. It's possible to further split one group into
+ * holding the `pyarrow.Table`. It's possible to further split one group into
  * multiple record batches to reduce the memory footprint on the Java side, this
  * is left as future work.
  */
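For context on the exchange described above: a minimal sketch of the cogrouped `applyInArrow` API that `FlatMapCoGroupsInArrowExec` executes. The dataframes, column names (`id`, `v1`, `v2`), and the summarizing logic are illustrative assumptions, not code from this commit.

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v1"))
df2 = spark.createDataFrame([(1, 10.0), (2, 20.0)], ("id", "v2"))

def summarize(left: pa.Table, right: pa.Table) -> pa.Table:
    # Each side of the cogroup arrives as its own pyarrow.Table, matching
    # the one-Arrow-stream-per-side exchange the scaladoc describes.
    # A side with no rows for this key arrives as an empty table.
    key = (left if left.num_rows > 0 else right).column("id")[0].as_py()
    return pa.table({
        "id": [key],
        "sum_v1": [pc.sum(left.column("v1")).as_py() or 0.0],
        "sum_v2": [pc.sum(right.column("v2")).as_py() or 0.0],
    })

out = (
    df1.groupBy("id")
    .cogroup(df2.groupBy("id"))
    .applyInArrow(summarize, schema="id long, sum_v1 double, sum_v2 double")
)
out.show()
```

Since each cogroup is materialized as whole `pyarrow.Table`s on the Python side, the largest cogroup bounds the worker's memory use, which is exactly the memory note in the scaladoc.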
Scaladoc for `FlatMapGroupsInArrowExec`:

@@ -25,19 +25,19 @@ import org.apache.spark.sql.types.{StructField, StructType}
 
 
 /**
- * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas]]
+ * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInArrow]]
  *
  * Rows in each group are passed to the Python worker as an Arrow record batch.
- * The Python worker turns the record batch to a `pandas.DataFrame`, invoke the
- * user-defined function, and passes the resulting `pandas.DataFrame`
+ * The Python worker turns the record batch to a `pyarrow.Table`, invokes the
+ * user-defined function, and passes the resulting `pyarrow.Table`
  * as an Arrow record batch. Finally, each record batch is turned to
  * Iterator[InternalRow] using ColumnarBatch.
  *
  * Note on memory usage:
  * Both the Python worker and the Java executor need to have enough memory to
  * hold the largest group. The memory on the Java side is used to construct the
  * record batch (off heap memory). The memory on the Python side is used for
- * holding the `pandas.DataFrame`. It's possible to further split one group into
+ * holding the `pyarrow.Table`. It's possible to further split one group into
  * multiple record batches to reduce the memory footprint on the Java side, this
  * is left as future work.
  */
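Likewise, a minimal sketch of the grouped `applyInArrow` API executed by `FlatMapGroupsInArrowExec`; the mean-centering UDF and column names are illustrative only.

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

def center(table: pa.Table) -> pa.Table:
    # The whole group arrives as a single pyarrow.Table and must fit in
    # memory on the Python side (see the memory note in the scaladoc).
    v = table.column("v")
    centered = pc.subtract(v, pc.mean(v))
    return table.set_column(table.schema.get_field_index("v"), "v", centered)

df.groupBy("id").applyInArrow(center, schema="id long, v double").show()
```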
