Adds support for R UDFs #106

Merged
merged 28 commits into main on Feb 2, 2024
Conversation

edgararuiz
Collaborator

Builds the foundation that allows R functions to run in Spark/Databricks Connect

  • It uses the rpy2 Python library to serialize and run the R code inside the cluster
  • It uses applyInPandas() to run the R function (via rpy2) against a grouped data frame
  • It uses mapInPandas() to run the R function (via rpy2) against non-grouped data frames
  • Properly sets the maximum Arrow batch size used to partition the data
  • Wraps functionality into spark_apply(). Most of the integration is in functions that spark_apply() calls, so the functionality can be re-used by future interfaces
  • Adds rpy2 to the list of packages to be installed by install_databricks()/install_pyspark()
  • Verifies whether rpy2 is installed in the current Python environment and, if not, offers to install it
  • Adds tests
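
The mapInPandas() pattern described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual implementation: the real package evaluates the serialized R function with rpy2 against Arrow-backed pandas batches, while here the rpy2 call is replaced by a plain Python stand-in (`run_r_function`, a hypothetical name) and batches are lists of dicts, so the sketch is self-contained and runnable without Spark.

```python
from typing import Callable, Dict, Iterator, List

# A "batch" stands in for one Arrow-delivered pandas DataFrame chunk.
Batch = List[Dict[str, float]]

def run_r_function(r_code: str, batch: Batch) -> Batch:
    # Stand-in for rpy2: in the real package, the serialized R code would be
    # evaluated inside the cluster via rpy2 and applied to the batch.
    # Here we simulate an R function that doubles the "x" column.
    return [{**row, "x": row["x"] * 2} for row in batch]

def make_map_fn(r_code: str) -> Callable[[Iterator[Batch]], Iterator[Batch]]:
    # mapInPandas() expects a function mapping an iterator of input batches
    # to an iterator of output batches. Each batch is processed independently,
    # which is why capping the Arrow batch size controls memory per chunk.
    def map_fn(batches: Iterator[Batch]) -> Iterator[Batch]:
        for batch in batches:
            yield run_r_function(r_code, batch)
    return map_fn

if __name__ == "__main__":
    fn = make_map_fn("function(df) { df$x <- df$x * 2; df }")
    out = list(fn(iter([[{"x": 1.0}, {"x": 2.0}], [{"x": 3.0}]])))
    print(out)  # two output batches, "x" doubled in each
```

The grouped case (applyInPandas()) follows the same shape, except Spark delivers one batch per group key instead of arbitrary partition chunks.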

@edgararuiz edgararuiz merged commit 93c022e into main Feb 2, 2024
9 checks passed