[SPARK-43789][R] Uses 'spark.sql.execution.arrow.maxRecordsPerBatch' in R createDataFrame with Arrow by default

### What changes were proposed in this pull request?

This PR proposes to pick a proper number of partitions when creating a DataFrame from an R DataFrame with Arrow. Previously, the number of partitions was always `1` if not specified. Now, the input R DataFrame is split by `spark.sql.execution.arrow.maxRecordsPerBatch`, and the number of partitions is set to the number of batches (see the illustrative sketch below). This matches the PySpark code path: https://github.com/apache/spark/blob/46949e692e863992f4c50bdd482d5216d4fd9221/python/pyspark/sql/pandas/conversion.py#L618C11-L626

### Why are the changes needed?

To avoid OOM when the R DataFrame is too large, and to enable proper distributed computation.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the default number of partitions when users call `createDataFrame` with an R DataFrame and Arrow optimization is enabled. Partitioning is an internal concept, however, so the default behaviour is otherwise unchanged.

### How was this patch tested?

Manually tested with a large CSV file (3 GB). Also added a unit test.

Closes apache#41307 from HyukjinKwon/default-batch-size.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
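A minimal SparkR sketch of the behaviour described above, for illustration only: the local session settings, the 1,000-row example data frame, and the batch size of 100 are assumptions, not part of this change, and `getNumPartitions()` is used only to inspect the result (it is available for `SparkDataFrame` in recent SparkR releases). With Arrow optimization enabled, the resulting SparkDataFrame should end up with roughly `ceiling(nrow(df) / maxRecordsPerBatch)` partitions instead of always `1`.

```r
library(SparkR)

# Start a local session with Arrow optimization for SparkR enabled and a small
# Arrow batch size (100 rows) so the effect on partitioning is easy to observe.
sparkR.session(
  master = "local[4]",
  sparkConfig = list(
    "spark.sql.execution.arrow.sparkr.enabled" = "true",
    "spark.sql.execution.arrow.maxRecordsPerBatch" = "100"
  )
)

# An example local R data.frame with 1,000 rows (arbitrary size for illustration).
df <- data.frame(id = 1:1000, value = rnorm(1000))

sdf <- createDataFrame(df)

# With this change, the input is split into batches of at most
# maxRecordsPerBatch rows, so roughly 1000 / 100 = 10 partitions are expected
# rather than the previous default of a single partition.
getNumPartitions(sdf)

sparkR.session.stop()
```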