Adds support for R UDFs #106

Merged
merged 28 commits into main on Feb 2, 2024
Conversation

edgararuiz
Collaborator

Builds the foundation that allows R functions to run in Spark/Databricks Connect

  • It uses the rpy2 Python library to serialize and run the R code inside the cluster
  • It uses applyInPandas() to run the R function (via rpy2) against a grouped data frame
  • It uses mapInPandas() to run the R function (via rpy2) against non-grouped data frames
  • Properly sets the maximum Arrow batch size used to partition the data
  • Wraps functionality into spark_apply(). Most of the integration is in functions that spark_apply() calls, so the functionality can be re-used by future interfaces
  • Adds rpy2 to the list of packages to be installed by install_databricks()/install_pyspark()
  • Verifies whether rpy2 is installed in the current Python environment and, if not, offers to install it
  • Adds tests
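
The mapInPandas() pattern described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual implementation: the real package evaluates the serialized R function with rpy2 against Arrow-backed pandas batches, while here the rpy2 call is replaced by a plain Python stand-in (`run_r_function`, a hypothetical name) and batches are lists of dicts, so the sketch is self-contained and runnable without Spark.

```python
from typing import Callable, Dict, Iterator, List

# A "batch" stands in for one Arrow-delivered pandas DataFrame chunk.
Batch = List[Dict[str, float]]

def run_r_function(r_code: str, batch: Batch) -> Batch:
    # Stand-in for rpy2: in the real package, the serialized R code would be
    # evaluated inside the cluster via rpy2 and applied to the batch.
    # Here we simulate an R function that doubles the "x" column.
    return [{**row, "x": row["x"] * 2} for row in batch]

def make_map_fn(r_code: str) -> Callable[[Iterator[Batch]], Iterator[Batch]]:
    # mapInPandas() expects a function mapping an iterator of input batches
    # to an iterator of output batches. Each batch is processed independently,
    # which is why capping the Arrow batch size controls memory per chunk.
    def map_fn(batches: Iterator[Batch]) -> Iterator[Batch]:
        for batch in batches:
            yield run_r_function(r_code, batch)
    return map_fn

if __name__ == "__main__":
    fn = make_map_fn("function(df) { df$x <- df$x * 2; df }")
    out = list(fn(iter([[{"x": 1.0}, {"x": 2.0}], [{"x": 3.0}]])))
    print(out)  # two output batches, "x" doubled in each
```

The grouped case (applyInPandas()) follows the same shape, except Spark delivers one batch per group key instead of arbitrary partition chunks.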

@edgararuiz edgararuiz merged commit 93c022e into main Feb 2, 2024
9 checks passed