
[SPARK-14123][SQL] Implement function related DDL commands #12036


Closed · wants to merge 16 commits

Conversation

@viirya (Member) commented Mar 29, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14123

This PR adds support for native DDL commands to create and drop (temporary) functions.

For temporary functions, these DDL commands call FunctionRegistry-related APIs. This PR modifies the current FunctionRegistry and HiveFunctionRegistry APIs to properly support temporary functions.

For permanent functions, we look up the function through the catalog API and get a CatalogFunction if the function exists. Then we load the necessary resources, such as JARs and files. Once the resources are loaded, we can create a FunctionBuilder for it and generate an Expression.
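
Sketched in simplified Scala (the types and helper names below are illustrative stand-ins, not this PR's actual code):

```scala
// Sketch of the permanent-function path described above: look up the
// CatalogFunction, load its resources, then build an Expression via a
// FunctionBuilder. All types here are simplified stand-ins.
object PermanentFunctionSketch {
  trait Expression
  case class CatalogFunction(className: String, resources: Seq[(String, String)])

  // e.g. add a JAR or file to the session's classloader
  def loadResource(resourceType: String, path: String): Unit = ()

  // reflection on className would happen here in real code
  def makeFunctionBuilder(className: String): Seq[Expression] => Expression =
    args => new Expression {}

  def lookup(f: CatalogFunction, args: Seq[Expression]): Expression = {
    f.resources.foreach { case (tpe, path) => loadResource(tpe, path) }
    makeFunctionBuilder(f.className)(args)
  }
}
```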

How was this patch tested?

Existing tests.

SparkQA commented Mar 29, 2016

Test build #54434 has finished for PR 12036 at commit 35ad7ae.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 29, 2016

Test build #54437 has finished for PR 12036 at commit cb29f0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 29, 2016

Test build #54439 has finished for PR 12036 at commit 77848c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HiveSqlParser(sessionState: HiveSessionState) extends AbstractSqlParser
    • class HiveSqlAstBuilder(sessionState: HiveSessionState) extends SparkSqlAstBuilder

@viirya (Member, Author) commented Mar 29, 2016

The failed tests are due to the DESCRIBE FUNCTION DDL command. Let me fix it.

SparkQA commented Mar 29, 2016

Test build #54442 has finished for PR 12036 at commit b8dda84.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -71,7 +71,7 @@ private[hive] class HiveSessionState(ctx: HiveContext) extends SessionState(ctx)
   /**
    * Parser for HiveQl query texts.
    */
-  override lazy val sqlParser: ParserInterface = HiveSqlParser
+  override lazy val sqlParser: ParserInterface = HiveSqlParser(this)
Contributor commented on the diff:

Don't pass in the SessionState here if you're just going to use the function registry. In general it's better to pass in narrower things so we don't have tight coupling across classes.
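
The difference between the two constructor shapes, sketched with simplified stand-in classes (assuming the parser only needs the registry):

```scala
// Sketch of the narrower-dependency suggestion: depend on just the
// FunctionRegistry instead of the whole SessionState. Simplified stand-ins.
class FunctionRegistry
class SessionState { val functionRegistry = new FunctionRegistry }

// tightly coupled: the parser can reach everything on SessionState
case class WideParser(sessionState: SessionState)

// narrower: the parser sees only the piece it actually uses
case class NarrowParser(functionRegistry: FunctionRegistry)
```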

Member Author replied:

ok. got it.

@andrewor14 (Contributor) commented:

@viirya I took a look at the existing code in HiveFunctionRegistry and I think there's some cleanup we need to do before we can implement these DDLs. In particular, temporary functions should not go through Hive (SPARK-14253). Instead, we should just create a function builder ourselves every time we get CREATE TEMPORARY FUNCTION and use that to call sessionState.catalog.createTempFunction. This ensures all temporary functions, even the ones extending Hive UDF classes, are maintained only in Spark. THEN it will be much easier to implement the DDLs in a clean way.

I would recommend putting this task on hold. In the meantime I'll start working on SPARK-14253.
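
A minimal sketch of that proposed flow, with stand-in types (the helper names here are illustrative, not Spark's exact API):

```scala
// Sketch of the proposed CREATE TEMPORARY FUNCTION handling: Spark builds the
// FunctionBuilder itself and registers it only in its own session catalog,
// never in Hive. Types and helpers are simplified stand-ins.
object TempFunctionSketch {
  trait Expression
  type FunctionBuilder = Seq[Expression] => Expression

  class SessionCatalog {
    private val tempFunctions =
      scala.collection.mutable.Map.empty[String, FunctionBuilder]
    def createTempFunction(name: String, builder: FunctionBuilder): Unit =
      tempFunctions(name) = builder
  }

  // wrap a Hive UDF class here if needed, via reflection on className
  def makeFunctionBuilder(className: String): FunctionBuilder =
    _ => new Expression {}

  def runCreateTempFunction(catalog: SessionCatalog,
                            name: String,
                            className: String): Unit =
    catalog.createTempFunction(name, makeFunctionBuilder(className))
}
```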

@viirya (Member, Author) commented Mar 29, 2016

@andrewor14 Thanks for the comment, but I think you might have missed the code changes in this PR.

This PR already does what you said, i.e., it avoids registering temporary functions in Hive. In fact, that is the most important change in this PR. The other parts just call the Catalog API.

Temporary functions do not go through Hive after this PR, and we create the function builder ourselves in HiveFunctionRegistry.

So temporary functions extending Hive UDF classes are maintained in Spark with this PR.

I hope we don't duplicate our work. Thanks!

-  def expression[T <: Expression](name: String)
-      (implicit tag: ClassTag[T]): (String, (ExpressionInfo, FunctionBuilder)) = {
+  def expression[T <: Expression](
Member Author commented on the diff:

We create the FunctionBuilder here for temporary functions.

@andrewor14 (Contributor) commented:

@viirya Thanks for the clarification. That said, I think the interfaces here can be simplified significantly. For instance, we don't need to expose ExpressionInfo, since we can just get that from the FunctionBuilder, and then we don't need to introduce the clunky tuple interfaces in FunctionRegistry.

I did the first part of the cleanup in #12051. With those changes, I think implementing the function DDLs will be pretty straightforward. The only additional thing that needs to be implemented is adding the resource list to CatalogFunction, but that's also simple to do.

@viirya (Member, Author) commented Mar 29, 2016

My previous commit had a test failure due to an incorrect ExpressionInfo. The current method in FunctionRegistry for retrieving a FunctionBuilder does not work: it gets the canonical name directly from the FunctionBuilder, but that does not give you the correct class name of your original function class. So I changed it to pass the ExpressionInfo.
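
A standalone illustration of the problem (simplified, not Spark's code):

```scala
// Why reading the class name off the FunctionBuilder fails: the builder is a
// Scala function value, so its runtime class is a synthetic function class,
// not the original function class the user registered.
object ClassNameSketch extends App {
  class MyUpper  // stands in for the user's function/Expression class

  val builder: Seq[Any] => Any = _ => new MyUpper
  // Prints a synthetic name such as "ClassNameSketch$$anonfun$1" (or a
  // "$$Lambda$..." name on newer Scala versions), never "MyUpper":
  println(builder.getClass.getName)
}
```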

@viirya (Member, Author) commented Mar 29, 2016

@andrewor14 I quickly went through #12051. You simplify the existing code. Basically I did the same thing here by adding a method to create the FunctionBuilder (i.e., makeFunctionBuilder in #12051).

@viirya (Member, Author) commented Mar 29, 2016

@andrewor14 So I think our work overlaps in some places. Do you want me to wait for #12051 to be merged first and then rebase against it? Thanks!

@andrewor14 (Contributor) commented:

> So I think our work overlaps in some places. Do you want me to wait for #12051 to be merged first and then rebase against it? Thanks!

That would be best.

SparkQA commented Mar 30, 2016

Test build #54471 has finished for PR 12036 at commit ee957db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya added 2 commits March 30, 2016 02:34
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
@viirya (Member, Author) commented Mar 30, 2016

@andrewor14 I went through #12051 again. Your main change is in HiveFunctionRegistry. I compared our changes there and they do nearly the same thing, except for a few differences.

However, in order to make temporary functions work in Spark instead of passing through Hive, there are many places that need to be handled, as I did in this PR.

I don't think duplicating our work makes sense. Your PR is just getting started with a few changes, while my changes in this PR are functional and ready for review, so I suggest we focus on this change.

What do you think?

cc @yhuai @rxin

@@ -611,6 +615,9 @@ private[hive] class HiveClientImpl(
.asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]]

private def toHiveFunction(f: CatalogFunction, db: String): HiveFunction = {
val resourceUris = f.resources.map { resource =>
Contributor commented on the diff:

f.resources.map { case (resourceType, resourcePath) =>
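
A standalone sketch of the suggested destructuring (ResourceUri below is a simplified stand-in for the Hive metastore class):

```scala
// Destructure each (resourceType, resourcePath) pair directly in the pattern
// match instead of binding the whole tuple to a single name.
object ResourceMappingSketch extends App {
  case class ResourceUri(resourceType: String, uri: String)  // simplified stand-in

  val resources = Map("jar" -> "/tmp/my-udf.jar")
  val resourceUris = resources.map { case (resourceType, resourcePath) =>
    ResourceUri(resourceType, resourcePath)
  }
  println(resourceUris)  // e.g. List(ResourceUri(jar,/tmp/my-udf.jar))
}
```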

@andrewor14 (Contributor) commented:

I just did a close review and there were a few design choices that I would like to discuss further. The biggest one is that I'm not sure it's the responsibility of HiveFunctionRegistry.lookupFunction to load the Hive JARs as well. Right now we need to pass HiveContext and SessionCatalog into HiveFunctionRegistry, which seems like a messy design because it couples these classes too closely. I think a better place to load the Hive JARs would be HiveSessionCatalog itself, which already takes in a HiveContext.

@yhuai What do you think?

@viirya (Member, Author) commented Mar 31, 2016

@andrewor14 Thanks for reviewing this. The current changes might need further refactoring. As you said, HiveFunctionRegistry now refers to HiveContext and SessionCatalog. It is messy, but I just wanted it to be functional and to prove this approach works. I will do the refactoring today to make it clearer. Thanks.

SparkQA commented Mar 31, 2016

Test build #54612 has finished for PR 12036 at commit 2cab41c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Resource(resourceType: String, path: String)

SparkQA commented Mar 31, 2016

Test build #54617 has finished for PR 12036 at commit 65d9dbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkQl.scala
SparkQA commented Apr 1, 2016

Test build #54680 has finished for PR 12036 at commit 05709f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Apr 1, 2016

@viirya Thank you for working on it. I think the overall approach makes sense. It looks like we can adjust how we organize SessionState, SessionCatalog, and FunctionRegistry to improve the structure of the code (@andrewor14 also mentioned this). I'd like to experiment with it based on this PR. Would you mind holding off on further changes until I send out commits based on your current branch? Thanks!

@viirya (Member, Author) commented Apr 1, 2016

@yhuai OK. Please let me know if any changes are not clear to you. Thanks!

@viirya closed this Apr 5, 2016
asfgit pushed a commit that referenced this pull request Apr 5, 2016
## What changes were proposed in this pull request?
This PR implements CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how to manage functions. Here are the main changes.
* `FunctionRegistry` will be a container to store all function builders, and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext, so `HiveFunctionRegistry` is deleted.
* SessionCatalog takes care of loading a function if the function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. In this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, and (4) register the function builder in the `FunctionRegistry` (see the sketch after this list).
* An `UnresolvedGenerator` is created, so the parser does not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we resolve `UnresolvedGenerator`.
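
A minimal sketch of that lazy-load path (simplified stand-in types and helper names, not the PR's literal code):

```scala
// Sketch of steps (1)-(4) above: on a registry miss, read the metadata from
// the external catalog, load its resources, build the function, and cache the
// builder in the registry so later lookups hit it directly.
object LazyFunctionLoadSketch {
  trait Expression
  type FunctionBuilder = Seq[Expression] => Expression
  case class CatalogFunction(className: String, resources: Seq[(String, String)])

  val functionRegistry =
    scala.collection.mutable.Map.empty[String, FunctionBuilder]

  def loadResource(tpe: String, path: String): Unit = ()        // add JAR/file
  def externalCatalogLookup(name: String): CatalogFunction =    // (1) metadata
    CatalogFunction("com.example.MyUDF", Seq("jar" -> "/tmp/my-udf.jar"))
  def makeFunctionBuilder(className: String): FunctionBuilder = // (3) builder
    _ => new Expression {}

  def lookupFunction(name: String, args: Seq[Expression]): Expression = {
    val builder = functionRegistry.getOrElseUpdate(name, {
      val f = externalCatalogLookup(name)                       // (1)
      f.resources.foreach { case (t, p) => loadResource(t, p) } // (2)
      makeFunctionBuilder(f.className)                          // (3) + (4) cache
    })
    builder(args)
  }
}
```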

This PR is based on viirya's #12036

## How was this patch tested?
Existing tests and new tests.

## TODOs
- [x] Self-review
- [x] Cleanup
- [x] More tests for create/drop functions (we need more tests for permanent functions).
- [ ] File JIRAs for all TODOs
- [x] Standardize the error message when a function does not exist.

Author: Yin Huai <yhuai@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12117 from yhuai/function.
@viirya deleted the native-ddl-function branch December 27, 2023 18:33