
[SPARK-14123][SQL] Implement function related DDL commands #12036


Closed · wants to merge 16 commits

Conversation

@viirya (Member) commented Mar 29, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14123

This PR adds support for native DDL commands to create and drop (temporary) functions.

For temporary functions, these DDL commands call FunctionRegistry-related APIs. This PR modifies the current FunctionRegistry and HiveFunctionRegistry APIs to properly support temporary functions.

For permanent functions, we look up the function through the catalog API and get a CatalogFunction if the function exists. Then we load the necessary resources, such as JARs and files. Once the resources are loaded, we can create a FunctionBuilder for it and generate an Expression.
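
Sketched in simplified Scala (the types and helper names below are illustrative stand-ins, not this PR's actual code):

```scala
// Sketch of the permanent-function path described above: look up the
// CatalogFunction, load its resources, then build an Expression via a
// FunctionBuilder. All types here are simplified stand-ins.
object PermanentFunctionSketch {
  trait Expression
  case class CatalogFunction(className: String, resources: Seq[(String, String)])

  // e.g. add a JAR or file to the session's classloader
  def loadResource(resourceType: String, path: String): Unit = ()

  // reflection on className would happen here in real code
  def makeFunctionBuilder(className: String): Seq[Expression] => Expression =
    args => new Expression {}

  def lookup(f: CatalogFunction, args: Seq[Expression]): Expression = {
    f.resources.foreach { case (tpe, path) => loadResource(tpe, path) }
    makeFunctionBuilder(f.className)(args)
  }
}
```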

How was this patch tested?

Existing tests.

SparkQA commented Mar 29, 2016

Test build #54434 has finished for PR 12036 at commit 35ad7ae.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 29, 2016

Test build #54437 has finished for PR 12036 at commit cb29f0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 29, 2016

Test build #54439 has finished for PR 12036 at commit 77848c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HiveSqlParser(sessionState: HiveSessionState) extends AbstractSqlParser
    • class HiveSqlAstBuilder(sessionState: HiveSessionState) extends SparkSqlAstBuilder

@viirya (Member, Author) commented Mar 29, 2016

The failed tests are due to the DESCRIBE FUNCTION DDL command. Let me fix it.

SparkQA commented Mar 29, 2016

Test build #54442 has finished for PR 12036 at commit b8dda84.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -71,7 +71,7 @@ private[hive] class HiveSessionState(ctx: HiveContext) extends SessionState(ctx)
   /**
    * Parser for HiveQl query texts.
    */
-  override lazy val sqlParser: ParserInterface = HiveSqlParser
+  override lazy val sqlParser: ParserInterface = HiveSqlParser(this)
Contributor commented on the diff:

Don't pass in the SessionState here if you're just going to use the function registry. In general it's better to pass in narrower things so we don't have tight coupling across classes.
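
The difference between the two constructor shapes, sketched with simplified stand-in classes (assuming the parser only needs the registry):

```scala
// Sketch of the narrower-dependency suggestion: depend on just the
// FunctionRegistry instead of the whole SessionState. Simplified stand-ins.
class FunctionRegistry
class SessionState { val functionRegistry = new FunctionRegistry }

// tightly coupled: the parser can reach everything on SessionState
case class WideParser(sessionState: SessionState)

// narrower: the parser sees only the piece it actually uses
case class NarrowParser(functionRegistry: FunctionRegistry)
```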

Member Author replied:

ok. got it.

@andrewor14 (Contributor) commented:

@viirya I took a look at the existing code in HiveFunctionRegistry and I think there's some cleanup we need to do before we can implement these DDLs. In particular, temporary functions should not go through Hive (SPARK-14253). Instead, we should just create a function builder ourselves every time we get CREATE TEMPORARY FUNCTION and use that to call sessionState.catalog.createTempFunction. This ensures all temporary functions, even the ones extending Hive UDF classes, are maintained only in Spark. THEN it will be much easier to implement the DDLs in a clean way.

I would recommend putting this task on hold. In the meantime I'll start working on SPARK-14253.
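
A minimal sketch of that proposed flow, with stand-in types (the helper names here are illustrative, not Spark's exact API):

```scala
// Sketch of the proposed CREATE TEMPORARY FUNCTION handling: Spark builds the
// FunctionBuilder itself and registers it only in its own session catalog,
// never in Hive. Types and helpers are simplified stand-ins.
object TempFunctionSketch {
  trait Expression
  type FunctionBuilder = Seq[Expression] => Expression

  class SessionCatalog {
    private val tempFunctions =
      scala.collection.mutable.Map.empty[String, FunctionBuilder]
    def createTempFunction(name: String, builder: FunctionBuilder): Unit =
      tempFunctions(name) = builder
  }

  // wrap a Hive UDF class here if needed, via reflection on className
  def makeFunctionBuilder(className: String): FunctionBuilder =
    _ => new Expression {}

  def runCreateTempFunction(catalog: SessionCatalog,
                            name: String,
                            className: String): Unit =
    catalog.createTempFunction(name, makeFunctionBuilder(className))
}
```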

@viirya (Member, Author) commented Mar 29, 2016

@andrewor14 Thanks for the comment, but I think you might have missed the code changes in this PR.

This PR already does what you said, i.e., it avoids registering temporary functions in Hive. In fact, that is the most important change in this PR. The other parts just call the Catalog API.

Temporary functions do not go through Hive after this PR, and we create the function builder ourselves in HiveFunctionRegistry.

So temporary functions extending Hive UDF classes are maintained in Spark with this PR.

I hope we don't duplicate our work. Thanks!

-  def expression[T <: Expression](name: String)
-      (implicit tag: ClassTag[T]): (String, (ExpressionInfo, FunctionBuilder)) = {
+  def expression[T <: Expression](
Member Author commented on the diff:

We create the FunctionBuilder here for temporary functions.

@andrewor14 (Contributor) commented:

@viirya Thanks for the clarification. That said, I think the interfaces here can be simplified significantly. For instance, we don't need to expose ExpressionInfo, since we can just get that from the FunctionBuilder, and then we don't need to introduce the clunky tuple interfaces in FunctionRegistry.

I did the first part of the cleanup in #12051. With those changes, I think implementing the function DDLs will be pretty straightforward. The only additional thing that needs to be implemented is adding the resource list to CatalogFunction, but that's also simple to do.

@viirya (Member, Author) commented Mar 29, 2016

My previous commit had a test failure due to an incorrect ExpressionInfo. The current method in FunctionRegistry for retrieving a FunctionBuilder does not work: it gets the canonical name directly from the FunctionBuilder, but that does not give you the correct class name of your original function class. So I changed it to pass the ExpressionInfo.
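
A standalone illustration of the problem (simplified, not Spark's code):

```scala
// Why reading the class name off the FunctionBuilder fails: the builder is a
// Scala function value, so its runtime class is a synthetic function class,
// not the original function class the user registered.
object ClassNameSketch extends App {
  class MyUpper  // stands in for the user's function/Expression class

  val builder: Seq[Any] => Any = _ => new MyUpper
  // Prints a synthetic name such as "ClassNameSketch$$anonfun$1" (or a
  // "$$Lambda$..." name on newer Scala versions), never "MyUpper":
  println(builder.getClass.getName)
}
```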

@viirya (Member, Author) commented Mar 29, 2016

@andrewor14 I quickly went through #12051. You simplify the existing code. Basically I did the same thing here by adding a method to create the FunctionBuilder (i.e., makeFunctionBuilder in #12051).

@viirya (Member, Author) commented Mar 29, 2016

@andrewor14 So I think our work overlaps in some places. Do you want me to wait for #12051 to be merged first and then rebase against it? Thanks!

@andrewor14 (Contributor) commented:

> So I think our work overlaps in some places. Do you want me to wait for #12051 to be merged first and then rebase against it? Thanks!

That would be best.

SparkQA commented Mar 30, 2016

Test build #54471 has finished for PR 12036 at commit ee957db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya added 2 commits March 30, 2016 02:34
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
@viirya (Member, Author) commented Mar 30, 2016

@andrewor14 I went through #12051 again. Your main change is in HiveFunctionRegistry. I compared our changes there and they do nearly the same thing, except for a few differences.

However, in order to make temporary functions work in Spark instead of passing through Hive, there are many places that need to be handled, as I did in this PR.

I don't think duplicating our work makes sense. Your PR is just getting started with a few changes, while my changes in this PR are functional and ready for review, so I suggest we focus on this change.

What do you think?

cc @yhuai @rxin

@@ -611,6 +615,9 @@ private[hive] class HiveClientImpl(
.asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]]

private def toHiveFunction(f: CatalogFunction, db: String): HiveFunction = {
val resourceUris = f.resources.map { resource =>
Contributor commented on the diff:

f.resources.map { case (resourceType, resourcePath) =>
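
A standalone sketch of the suggested destructuring (ResourceUri below is a simplified stand-in for the Hive metastore class):

```scala
// Destructure each (resourceType, resourcePath) pair directly in the pattern
// match instead of binding the whole tuple to a single name.
object ResourceMappingSketch extends App {
  case class ResourceUri(resourceType: String, uri: String)  // simplified stand-in

  val resources = Map("jar" -> "/tmp/my-udf.jar")
  val resourceUris = resources.map { case (resourceType, resourcePath) =>
    ResourceUri(resourceType, resourcePath)
  }
  println(resourceUris)  // e.g. List(ResourceUri(jar,/tmp/my-udf.jar))
}
```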

@andrewor14 (Contributor) commented:

I just did a close review and there were a few design choices that I would like to discuss further. The biggest one is that I'm not sure it's the responsibility of HiveFunctionRegistry.lookupFunction to load the Hive JARs as well. Right now we need to pass HiveContext and SessionCatalog into HiveFunctionRegistry, which seems like a messy design because it couples these classes too closely. I think a better place to load the Hive JARs would be HiveSessionCatalog itself, which already takes in a HiveContext.

@yhuai What do you think?

@viirya (Member, Author) commented Mar 31, 2016

@andrewor14 Thanks for reviewing this. The current changes might need further refactoring. As you said, HiveFunctionRegistry now refers to HiveContext and SessionCatalog. It is messy, but I just wanted it to be functional and to prove this approach works. I will do the refactoring today to make it clearer. Thanks.

SparkQA commented Mar 31, 2016

Test build #54612 has finished for PR 12036 at commit 2cab41c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Resource(resourceType: String, path: String)

SparkQA commented Mar 31, 2016

Test build #54617 has finished for PR 12036 at commit 65d9dbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkQl.scala
SparkQA commented Apr 1, 2016

Test build #54680 has finished for PR 12036 at commit 05709f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Apr 1, 2016

@viirya Thank you for working on it. I think the overall approach makes sense. It looks like we can adjust how we organize SessionState, SessionCatalog, and FunctionRegistry to improve the structure of the code (@andrewor14 also mentioned this). I'd like to experiment with it based on this PR. Would you mind holding off on further changes until I send out commits based on your current branch? Thanks!

@viirya (Member, Author) commented Apr 1, 2016

@yhuai OK. Please let me know if any changes are not clear to you. Thanks!

@viirya closed this Apr 5, 2016
asfgit pushed a commit that referenced this pull request Apr 5, 2016
## What changes were proposed in this pull request?
This PR implements CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how to manage functions. Here are the main changes.
* `FunctionRegistry` will be a container to store all function builders, and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext, so `HiveFunctionRegistry` is deleted.
* SessionCatalog takes care of loading a function if the function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. In this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, and (4) register the function builder in the `FunctionRegistry` (see the sketch after this list).
* An `UnresolvedGenerator` is created, so the parser does not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we resolve `UnresolvedGenerator`.
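
A minimal sketch of that lazy-load path (simplified stand-in types and helper names, not the PR's literal code):

```scala
// Sketch of steps (1)-(4) above: on a registry miss, read the metadata from
// the external catalog, load its resources, build the function, and cache the
// builder in the registry so later lookups hit it directly.
object LazyFunctionLoadSketch {
  trait Expression
  type FunctionBuilder = Seq[Expression] => Expression
  case class CatalogFunction(className: String, resources: Seq[(String, String)])

  val functionRegistry =
    scala.collection.mutable.Map.empty[String, FunctionBuilder]

  def loadResource(tpe: String, path: String): Unit = ()        // add JAR/file
  def externalCatalogLookup(name: String): CatalogFunction =    // (1) metadata
    CatalogFunction("com.example.MyUDF", Seq("jar" -> "/tmp/my-udf.jar"))
  def makeFunctionBuilder(className: String): FunctionBuilder = // (3) builder
    _ => new Expression {}

  def lookupFunction(name: String, args: Seq[Expression]): Expression = {
    val builder = functionRegistry.getOrElseUpdate(name, {
      val f = externalCatalogLookup(name)                       // (1)
      f.resources.foreach { case (t, p) => loadResource(t, p) } // (2)
      makeFunctionBuilder(f.className)                          // (3) + (4) cache
    })
    builder(args)
  }
}
```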

This PR is based on viirya's #12036

## How was this patch tested?
Existing tests and new tests.

## TODOs
- [x] Self-review
- [x] Cleanup
- [x] More tests for create/drop functions (we need more tests for permanent functions).
- [ ] File JIRAs for all TODOs
- [x] Standardize the error message when a function does not exist.

Author: Yin Huai <yhuai@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12117 from yhuai/function.
@viirya deleted the native-ddl-function branch December 27, 2023 18:33