[SPARK-11775][PYSPARK][SQL] Allow PySpark to register Java UDF #9766


Closed
zjffdu wants to merge 5 commits into apache:master from zjffdu:SPARK-11775

Conversation

@zjffdu (Contributor) commented Nov 17, 2015

Currently PySpark can only call built-in Java UDFs, but cannot call custom Java UDFs. It would be better to allow that, for two benefits:

  • Leverage the power of rich third-party Java libraries.
  • Improve performance: with a Python UDF, Python daemons have to be started on the workers, which hurts performance (see the sketch below).
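
As a rough sketch of the API this PR proposes, assuming a hypothetical Java class com.example.udf.StringLength that implements Spark's UDF1<String, Integer> interface (the class, table, and column names are placeholders, not part of this patch):

    from pyspark.sql.types import IntegerType

    # Register a custom Java UDF by its fully qualified class name --
    # the API this PR adds. com.example.udf.StringLength is hypothetical.
    sqlContext.registerJavaFunction("strLen", "com.example.udf.StringLength", IntegerType())

    # The UDF then runs entirely in the JVM; no Python daemons are started.
    sqlContext.sql("SELECT strLen(name) FROM people").show()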

@SparkQA commented Nov 17, 2015

Test build #46078 has finished for PR 9766 at commit dd1e269.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • throw new IOException(s"UDF class $
      • throw new IOException(s"It is invalid to implement multiple UDF interfaces, UDF class $
      • case n => logError(s"UDF class with $
      • logError(s"Can not instantiate class $
      • case e: ClassNotFoundException => logError(s"Can not load class $

@zjffdu (Contributor, Author) commented Dec 1, 2015

Could anyone help review this? Thanks.

@SparkQA commented Jan 26, 2016

Test build #50066 has finished for PR 9766 at commit 2e17865.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor, Author) commented Jan 26, 2016

Please test it again.

@@ -224,6 +224,10 @@ def registerFunction(self, name, f, returnType=StringType()):
udf = UserDefinedFunction(f, returnType, name)
self._ssql_ctx.udf().registerPython(name, udf._judf)

def registerJavaFunction(self, name, javaClassName, returnType):
Contributor: You'll probably want a since annotation as well as some PyDoc here.

Contributor: Also add some tests.
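
A sketch of what the requested since annotation and PyDoc might look like (the version number and wording here are assumptions, not taken from the patch):

    @since(2.1)
    def registerJavaFunction(self, name, javaClassName, returnType):
        """Register a Java UDF so it can be used in SQL statements.

        :param name: name of the UDF
        :param javaClassName: fully qualified name of the Java UDF class
        :param returnType: the return type of the registered UDF
        """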

@holdenk (Contributor) commented Jan 27, 2016

I think this could be useful, although if people are already intermixing Scala (or Java) and Python code, maybe they should just write the registration code in Java and call the register function using py4j (a sketch of that route follows below)? This could be more convenient, though. Where do you think this would be most useful?
(Also, please add tests :))
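
As a rough illustration of the py4j route described above: a minimal sketch, assuming a hypothetical Scala helper object com.example.udf.Registrator whose register method calls sqlContext.udf.register(...) on the JVM side (the helper and all names are placeholders):

    # Call the hypothetical JVM-side helper over py4j; _ssql_ctx is
    # PySpark's handle to the underlying Java SQLContext.
    sc._jvm.com.example.udf.Registrator.register(sqlContext._ssql_ctx)

    # Once registered on the JVM side, the UDF is visible from SQL.
    sqlContext.sql("SELECT strLen(name) FROM people").show()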

@zjffdu (Contributor, Author) commented Jan 27, 2016

Thanks @holdenk for reviewing this. One scenario is that the UDF has already been implemented in Java, so the user doesn't need to wrap it using py4j and register it as a Python UDF. This might be valuable when a company has its own UDF repository: each UDF can be implemented in one language (Java/Python), and PySpark users can register it without wrapping it via py4j.

@holdenk (Contributor) commented Jan 28, 2016

Oh right, I wasn't suggesting wrapping the UDF and turning it into a Python UDF (that's going to kill performance). Rather: if one has Scala/Java UDFs, is the overhead of registering them in Scala code and using them from Python (e.g. writing a function like https://github.com/sparklingpandas/sparklingpandas/blob/master/src/main/scala/com/sparklingpandas/AggregationUDFs.scala#L36 and then calling it with py4j) high enough that a wrapper to do it directly from Python is useful?

@zjffdu (Contributor, Author) commented Feb 2, 2016

Do you mean wrapping a Java/Scala UDF in Python and calling it through py4j? One concern is performance, because that way we still need to launch a Python process. And if you already have a Java/Scala UDF, why not register it directly from Python (what this ticket does) and call it using the DataFrame API?

@davies (Contributor) commented Apr 18, 2016

@zjffdu Will CREATE FUNCTION work for this case?

@zjffdu (Contributor, Author) commented Apr 19, 2016

@davies It seems that is not supported, based on my experiment.
Besides, I found two other issues:

  1. The name of the created function gets "default" prepended (probably the database name); is that expected?
  2. It seems HiveContext can only create Hive UDFs, not generic Spark UDFs. I don't think that makes sense:
Py4JJavaError: An error occurred while calling o504.select.
: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:520)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:611)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$35.apply(Analyzer.scala:837)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$35.apply(Analyzer.scala:837)
    at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:836)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:824)

@davies (Contributor) commented Apr 19, 2016

@zjffdu Maybe we should support registering a Java UDF using SQL (SQL users could also benefit from that).

@davies (Contributor) commented Apr 19, 2016

@zjffdu This seems useful; could you add docs and tests for it?

@zjffdu (Contributor, Author) commented Apr 19, 2016

@davies Sure, I will add tests and docs for it.

@SparkQA commented May 12, 2016

Test build #58468 has finished for PR 9766 at commit 2e17865.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented May 12, 2016

Test build #58482 has finished for PR 9766 at commit ed275c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CosmoYs commented May 13, 2016

@zjffdu I'm working on a project that integrates PySpark and Java via UDFs, so this feature could be really useful. I read through your discussion, but one thing isn't clear to me: without this feature, can we still call a Java UDF in PySpark through py4j? Could you explain this in more detail? Thanks.

@zjffdu (Contributor, Author) commented May 13, 2016

Without this feature, you can only call the built-in UDFs (org.apache.spark.sql.functions), but can't register your own custom UDF. This PR allows users to register their custom Java UDFs; a sketch of the difference follows below.
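
To illustrate, a minimal sketch of what already works without the patch: built-in functions in org.apache.spark.sql.functions can be reached over py4j by unwrapping and rewrapping the JVM Column (table and column names are placeholders), while a custom Java UDF class has no such entry point:

    from pyspark.sql import Column

    df = sqlContext.table("people")

    # Built-in JVM functions are callable via py4j; df["name"]._jc is the
    # underlying JVM Column, and Column(...) wraps the JVM result again.
    jcol = sc._jvm.org.apache.spark.sql.functions.upper(df["name"]._jc)
    df.select(Column(jcol)).show()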

@zjffdu (Contributor, Author) commented May 30, 2016

@davies Would you mind taking a look at it? Thanks

@SparkQA commented May 30, 2016

Test build #59615 has finished for PR 9766 at commit 5feb2e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Jul 22, 2016

I was looking at some similar stuff as part of #13571, and I was thinking that (to match the Scala API) it would be good to return the UDF object as well, so people can use it programmatically with the DataFrame API instead of being limited to using it inside SQL queries.
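
Until the UDF object is returned, a UDF registered by name can still be used from the DataFrame API via expr() or selectExpr(), e.g. (reusing the placeholder names from the sketches above):

    from pyspark.sql.functions import expr

    # A registered UDF is reachable by name from the DataFrame API,
    # without writing a full SQL query.
    df.select(expr("strLen(name)").alias("name_len")).show()
    df.selectExpr("strLen(name) AS name_len").show()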

@holdenk (Contributor) commented Aug 3, 2016

Does @davies perhaps have bandwidth to look at this? (Also, @zjffdu, maybe consider merging in master? In the spark-pr dashboard this is shown as unable to merge against master, even though GitHub shows no conflicts.)

* @param className
* @param returnType
*/
def registerJava(name: String, className: String, returnType: DataType): Unit = {
Contributor: I don't know if we want to expose this API for general use - maybe make it private so that it can only be called from Python? And maybe update the scaladoc to something like "Register a Java UDF class using reflection - for use from Python".

@viirya (Member) commented Oct 9, 2016: +1 for updating the scaladoc. "Register a Java UDF class" doesn't exactly convey the meaning of this function.

Member: Besides, there is no documentation for the parameters. It would be better to add docs for them.

Contributor: +1

@GregBowyer commented:

Where do we stand on this? I just reapplied this patch to a Spark 2.1-xxx build to get the same behaviour.

@zjffdu (Contributor, Author) commented Sep 22, 2016

Rebased the PR. @davies @JoshRosen, could you help review it? Thanks

@SparkQA commented Sep 23, 2016

Test build #65797 has finished for PR 9766 at commit dc31d78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor) commented Oct 7, 2016

+1 to this functionality, but also to the request to add more tests and documentation. It would also be good to comment on the idea of using SQL as a more general way to implement this.

@SparkQA commented Oct 9, 2016

Test build #66599 has finished for PR 9766 at commit 93d565c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""Register a java UDF so it can be used in SQL statements.

In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given it default to a string and conversion will automatically
Contributor: Where does this conversion happen? Are we sure that it works (given that there are no tests that I can see)?

Contributor (Author): That was my mistake in copying these comments; actually there's no conversion.

def registerJava(name: String, className: String, returnType: DataType): Unit = {

try {
// scalastyle:off classforname
Contributor: This style rule is here to prevent misuse. Is there a reason we aren't using our utility functions?

Contributor (Author): Fixed

import scala.reflect.runtime.universe.TypeTag
import scala.util.Try

import sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl
Contributor: Is this JVM specific? What is this being used for? Is there another way?

Contributor (Author): Fixed

@SparkQA commented Oct 12, 2016

Test build #66791 has finished for PR 9766 at commit 9de8c0e.

  • This patch fails RAT tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2016

Test build #66793 has finished for PR 9766 at commit dc6d5f9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2016

Test build #66794 has finished for PR 9766 at commit 45a9b7a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2016

Test build #66796 has finished for PR 9766 at commit e9832f6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2016

Test build #66801 has finished for PR 9766 at commit d481821.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor) left a review: A few more comments. I think this is going to be a popular feature!

"""Register a java UDF so it can be used in SQL statements.

In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given it would infer the returnType via reflection.
Contributor: nit: it's a little odd to mix "return type" with "returnType". Perhaps: "When the return type is not specified we attempt to infer it using reflection".

Contributor (Author): Fixed
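
With the inference in place, the return type argument can presumably be omitted and recovered from the UDF's generic signature, along these lines (class and names remain the placeholders used above):

    # returnType omitted: inferred via reflection from UDF1<String, Integer>.
    sqlContext.registerJavaFunction("strLen", "com.example.udf.StringLength")
    sqlContext.sql("SELECT strLen(name) FROM people").show()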

/**
 * It is used for registering a Java UDF from PySpark.
 */
public class JavaStringLength implements UDF1<String, Integer> {
  @Override
  public Integer call(String str) { return str.length(); }
}
Contributor: Could this be moved to src/test? It would be better to not distribute it.

Contributor (Author): Fixed

@@ -414,6 +418,84 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
//////////////////////////////////////////////////////////////////////////////////////////////

/**
@marmbrus (Contributor) commented Oct 12, 2016: It would be nice to turn style back on here, since most of this function is not auto-generated.

Contributor (Author): I can turn it on, but it would make the function less readable, especially for statements like the following, which go beyond the line-length limit.

case 14 => register(name, udf.asInstanceOf[UDF13[_, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)

* @param returnDataType return type of udf. If it is null, spark would try to infer
* via reflection.
*/
def registerJava(name: String, className: String, returnDataType: DataType): Unit = {
Contributor: Is it possible to make this non-public? I believe we do this in other cases for code only called from Python.

Contributor (Author): Fixed

var returnType = returnDataType
if (returnType == null) {
if (udfReturnType.isInstanceOf[Class[_]]) {
returnType = udfReturnType.asInstanceOf[Class[_]].getCanonicalName match {
Contributor: Can we use JavaTypeInference here?

Contributor (Author): Thanks for the hint, fixed.

@SparkQA commented Oct 13, 2016

Test build #66866 has finished for PR 9766 at commit 18fa6e3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 13, 2016

Test build #66868 has finished for PR 9766 at commit 00f65cd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 13, 2016

Test build #66870 has finished for PR 9766 at commit 8171b85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor) commented:
Thanks, merging to master.

@asfgit closed this in f00df40 on Oct 14, 2016
@Orhideous commented:
Many thanks for this feature! I've been using this code in production with Spark 1.6 (as a patch) and have never seen any stability issues. 😄

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016 (Author: Jeff Zhang <zjffdu@apache.org>; Closes apache#9766 from zjffdu/SPARK-11775).
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017.