[SPARK-14123][SQL] Implement function related DDL commands #12036
Conversation
Test build #54434 has finished for PR 12036 at commit
Test build #54437 has finished for PR 12036 at commit
Test build #54439 has finished for PR 12036 at commit
The failed tests are due to the `describe function` DDL command. Let me fix it.
Test build #54442 has finished for PR 12036 at commit
```diff
@@ -71,7 +71,7 @@ private[hive] class HiveSessionState(ctx: HiveContext) extends SessionState(ctx)
   /**
    * Parser for HiveQl query texts.
    */
-  override lazy val sqlParser: ParserInterface = HiveSqlParser
+  override lazy val sqlParser: ParserInterface = HiveSqlParser(this)
```
Don't pass in the `SessionState` here if you're just going to use the function registry. In general, it's better to pass in narrower things so we don't have tight coupling across classes.
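For illustration, a minimal sketch of the narrower-dependency idea; all class names below are hypothetical stand-ins, not the actual Spark types:

```scala
// Hypothetical sketch: depend only on what is actually used.
class FunctionRegistry {
  def lookupFunction(name: String): Option[String] = None // stub
}

// Tightly coupled: the parser can reach everything on SessionState.
class SessionState(val functionRegistry: FunctionRegistry)
class WideParser(state: SessionState) // depends on the whole session state

// Narrower: the parser declares exactly what it uses.
class NarrowParser(registry: FunctionRegistry)

object Wiring {
  val state  = new SessionState(new FunctionRegistry)
  val parser = new NarrowParser(state.functionRegistry) // pass only the registry
}
```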
ok. got it.
@viirya I took a look at the existing code. I would recommend putting this task on hold; in the meantime I'll start working on SPARK-14253.
@andrewor14 Thanks for the comment. I think you might have missed the code changes in this PR: it already does what you said, i.e. it avoids registering temporary functions in Hive. In fact, this change is the most important one in this PR; the other parts just call the Catalog API. Temporary functions do not go through Hive after this PR, and we create the function builder ourselves. So temporary functions extending Hive UDF classes are maintained in Spark with this PR. I hope we don't duplicate our work. Thanks!
```diff
-  def expression[T <: Expression](name: String)
-      (implicit tag: ClassTag[T]): (String, (ExpressionInfo, FunctionBuilder)) = {
+  def expression[T <: Expression](
```
We create the `FunctionBuilder` here for temporary functions.
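For illustration, a minimal sketch of what a function builder amounts to, using simplified stand-ins for Spark's internal types (the real `FunctionBuilder` and `HiveSimpleUDF` differ in detail):

```scala
object TempFunctionSketch {
  // Simplified stand-ins for Spark's internal types.
  trait Expression
  case class HiveSimpleUDF(className: String, children: Seq[Expression]) extends Expression

  // A function builder maps argument expressions to a concrete Expression.
  type FunctionBuilder = Seq[Expression] => Expression

  // Build the wrapper in Spark instead of registering the UDF with Hive.
  def makeHiveUdfBuilder(className: String): FunctionBuilder =
    children => HiveSimpleUDF(className, children)

  // Hypothetical usage with a made-up UDF class name.
  val upperBuilder: FunctionBuilder = makeHiveUdfBuilder("com.example.MyUpperUDF")
}
```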
@viirya Thanks for the clarification. That said, I think the interfaces here can be simplified significantly; for instance, we don't need to expose as much as we currently do. I did a first part of the clean-up in #12051. With those changes, I think implementing the function DDLs will be pretty straightforward. The only additional thing that needs to be implemented is adding the resource list to `CatalogFunction`.
My previous commit got a test failure due to an incorrect change.
@andrewor14 I quickly went through #12051. You simplified the existing code; basically, I did a similar thing here, adding a method to create a `FunctionBuilder`.
@andrewor14 So I think our work overlaps in some places. Do you want me to wait for #12051 to be merged first and then rebase against it? Thanks!
That would be best.
Test build #54471 has finished for PR 12036 at commit
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
@andrewor14 I went through #12051 again. However, in order to make temporary functions work in Spark instead of passing them through Hive, there are many places that need to be taken care of, as I did in this PR. I don't think duplicating our work makes sense, and your PR is only just beginning with a few changes. I think my changes in this PR are functional and ready for review, so I suggest we focus on this change. What do you think?
```diff
@@ -611,6 +615,9 @@ private[hive] class HiveClientImpl(
       .asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]]

+  private def toHiveFunction(f: CatalogFunction, db: String): HiveFunction = {
+    val resourceUris = f.resources.map { resource =>
```
```scala
f.resources.map { case (resourceType, resourcePath) =>
```
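The suggestion is to destructure the `(type, path)` pair directly in the pattern. A minimal self-contained sketch of that style, with a stand-in for Hive's `ResourceUri` (the real class lives in `org.apache.hadoop.hive.metastore.api`):

```scala
object ResourceMappingSketch {
  // Stand-in for Hive's metastore ResourceUri.
  case class ResourceUri(resourceType: String, uri: String)

  // Function resources as (type, path) pairs, e.g. ("jar", "/tmp/udfs.jar").
  val resources: Seq[(String, String)] =
    Seq(("jar", "/tmp/udfs.jar"), ("file", "/tmp/init.sql"))

  // Destructure each pair in the pattern instead of calling ._1/._2 on it.
  val resourceUris: Seq[ResourceUri] = resources.map { case (resourceType, resourcePath) =>
    ResourceUri(resourceType.toLowerCase, resourcePath)
  }
}
```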
I just did a close review, and there are a few design choices that I would like to discuss further. The biggest one is that I'm not sure where this responsibility should live. @yhuai What do you think?
@andrewor14 Thanks for reviewing this. The current changes might need further refactoring, as you said.
Test build #54612 has finished for PR 12036 at commit
Test build #54617 has finished for PR 12036 at commit
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkQl.scala
Test build #54680 has finished for PR 12036 at commit
@viirya Thank you for working on it. I think the overall approach makes sense. It looks like we can adjust how we organize `SessionState`, `SessionCatalog`, and `FunctionRegistry` to improve the structure of the code (@andrewor14 also mentioned this). I'd like to play with it based on this PR. Would you mind holding off further changes until I send out commits based on your current branch? Thanks!
@yhuai OK. Please let me know if any changes are not clear to you. Thanks!
## What changes were proposed in this pull request?

This PR implements the CreateFunction and DropFunction commands. Besides implementing these two commands, it also changes how we manage functions. Here are the main changes.

* `FunctionRegistry` will be a container to store all function builders and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext, so `HiveFunctionRegistry` is deleted.
* `SessionCatalog` takes care of loading a function if the function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. For this case, `SessionCatalog` will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, and (4) register the function builder in the `FunctionRegistry` (see the sketch below).
* An `UnresolvedGenerator` is created, so the parser does not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we resolve `UnresolvedGenerator`.

This PR is based on viirya's #12036.

## How was this patch tested?

Existing tests and new tests.

## TODOs

- [x] Self-review
- [x] Cleanup
- [x] More tests for create/drop functions (we need more tests for permanent functions).
- [ ] File JIRAs for all TODOs
- [x] Standardize the error message when a function does not exist.

Author: Yin Huai <yhuai@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12117 from yhuai/function.
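A schematic sketch of that four-step lookup path, with simplified stand-in types rather than the actual Spark signatures:

```scala
object LazyFunctionLoadSketch {
  trait Expression
  type FunctionBuilder = Seq[Expression] => Expression

  // Function metadata as stored in the external catalog (simplified).
  case class CatalogFunction(name: String, className: String, resources: Seq[(String, String)])

  trait ExternalCatalog { def getFunction(name: String): CatalogFunction }

  class FunctionRegistry {
    private val builders = scala.collection.mutable.Map.empty[String, FunctionBuilder]
    def lookup(name: String): Option[FunctionBuilder] = builders.get(name)
    def register(name: String, b: FunctionBuilder): Unit = builders(name) = b
  }

  class SessionCatalog(
      external: ExternalCatalog,
      registry: FunctionRegistry,
      loadResource: ((String, String)) => Unit,   // e.g. add a jar to the classpath
      makeBuilder: String => FunctionBuilder) {

    def lookupFunction(name: String, args: Seq[Expression]): Expression = {
      val builder = registry.lookup(name).getOrElse {
        val fn = external.getFunction(name)       // (1) load metadata
        fn.resources.foreach(loadResource)        // (2) load jars and files
        val b = makeBuilder(fn.className)         // (3) create a builder
        registry.register(name, b)                // (4) register the builder
        b
      }
      builder(args)
    }
  }
}
```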
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14123

This PR adds support for native DDL commands to create and drop (temporary) functions.

For temporary functions, these DDL commands call `FunctionRegistry`-related APIs. This PR modifies the current `FunctionRegistry` and `HiveFunctionRegistry` APIs to properly support temporary functions.

For permanent functions, we look up the function through the catalog API and get a `CatalogFunction` if the function exists. Then we load the necessary resources, such as JARs and files. Once the resources are loaded, we can create a `FunctionBuilder` for it and generate an `Expression`.

## How was this patch tested?
Existing tests.
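As a usage illustration, assuming a `HiveContext` and made-up UDF class and jar names (the exact syntax accepted after this PR may differ slightly):

```scala
import org.apache.spark.sql.hive.HiveContext

object FunctionDdlExample {
  def registerUdfs(ctx: HiveContext): Unit = {
    // Temporary function: kept in Spark's FunctionRegistry, not registered with Hive.
    ctx.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpperUDF'")

    // Permanent function: metadata goes to the external catalog; the jar is
    // loaded when the function is first resolved.
    ctx.sql("CREATE FUNCTION mydb.my_lower AS 'com.example.MyLowerUDF' USING JAR '/tmp/udfs.jar'")

    // Drop the temporary function again.
    ctx.sql("DROP TEMPORARY FUNCTION IF EXISTS my_upper")
  }
}
```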