[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan #25626
Conversation
ok to test

add to whitelist
```scala
    DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

  case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
    val attrsNames = attrs.map(_.name)
```
shall we fail if some attrs are resolved to a nested field?
Added a check for nested fields; it now throws an exception if a field is nested.
BTW, I believe it would be helpful to support nested field updates, but that may be difficult to support, e.g. how would we pass the nested field to the data source?
Test build #109938 has finished for PR 25626 at commit
Review threads on sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (3 outdated, resolved)
gatorsmile left a comment:
The test coverage is not good enough. For example, all the test cases just update a single column. Try checking the test cases in other open source databases?
Hi @gatorsmile, I added more test cases for update, including multi-field updates, nested field updates, etc. Please review.
Test build #110012 has finished for PR 25626 at commit

Test build #110016 has finished for PR 25626 at commit
Review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (outdated, resolved)
```scala
    DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

  case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
    val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])
```
why do we need the `.asInstanceOf[Seq[Any]]`?
If we don't, it will regard `attrs` as being of type `Seq[Attribute]`, and `filterNot` will treat each element as an `Attribute`, which throws a cast exception at runtime.
As you mentioned in #25626 (comment), a nested field will not be resolved to an `Attribute` but to `GetStructField` or something similar, so if we change the type of `attrs` to `Seq[Expression]`, we can eliminate the `.asInstanceOf[Seq[Any]]` here.
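To illustrate the typing issue under discussion, here is a minimal, self-contained sketch. The class names mimic Catalyst's but are simplified stand-ins, not the real Spark classes: once the list is typed as `Seq[Expression]`, the runtime-type check needs no cast.

```scala
// Simplified stand-ins for Catalyst's expression classes (illustrative only).
sealed trait Expression
case class AttributeReference(name: String) extends Expression
case class GetStructField(child: Expression, field: String) extends Expression

object NestedCheck {
  // A top-level column resolves to an AttributeReference; a nested field may
  // resolve to a different expression such as GetStructField.
  val attrs: Seq[Expression] = Seq(
    AttributeReference("id"),
    GetStructField(AttributeReference("point"), "x"))

  // With Seq[Expression], no .asInstanceOf[Seq[Any]] is needed for the check.
  val nested: Seq[Expression] = attrs.filterNot(_.isInstanceOf[AttributeReference])

  def main(args: Array[String]): Unit = {
    // The nested field is detected and could trigger an analysis error.
    println(nested) // prints List(GetStructField(AttributeReference(point),x))
  }
}
```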
Review thread on ...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (outdated, resolved)
Review thread on sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala (outdated, resolved)
```scala
case class UpdateTable(
    child: LogicalPlan,
    attrs: Seq[Attribute],
```
can we really use `Seq[Attribute]`? When Spark resolves it to a nested field, it will be an `Alias`, which is not an `Attribute`, and we will get weird errors.
As mentioned in #25626 (comment), `Seq[Attribute]` involves the ugly `.asInstanceOf[Seq[Any]]` when checking the type of fields. But do we need to change it to `Seq[Expression]`? My feeling is the latter is too general to represent a field.
how about `Seq[NamedExpression]`?
`Seq[NamedExpression]` is OK for me, and I updated the code. But `Seq[NamedExpression]` is not a perfect solution either, as some nested fields do not have a name; they would be resolved to something like `GetStructField`, which is a subclass of `Expression`.
Updating a nested field is complex. For a named field, a user can update it by specifying the name, but for an array, or a more complex JSON value composed of nested objects and arrays, it's hard to specify the field to be updated.
So currently, my thinking is we can limit updates to the scope of non-nested fields, which can be resolved to a `NamedExpression`. What's your opinion? @cloud-fan
Then I think we have no choice but to use `Seq[Expression]`. We should add comments to explain that `attrs` here holds expressions that may be attributes.
Or we could change the analyzer to only resolve column names in `UpdateTable` as attributes, but I don't think that is worthwhile.
Updated it to `Seq[Expression]` and added some explanatory doc; please review.
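For reference, a minimal sketch of the plan-node shape this discussion converged on, with `attrs` typed as `Seq[Expression]` and an explanatory comment. The types here are simplified stand-ins, not the real Catalyst `LogicalPlan`/`Expression` classes:

```scala
// Simplified stand-ins; the real classes live in Spark's catalyst packages.
sealed trait Expression
case class AttributeReference(name: String) extends Expression
case class Literal(value: Any) extends Expression
trait LogicalPlan
case object DummyRelation extends LogicalPlan

// `attrs` holds expressions that may be attributes: a top-level column
// resolves to an AttributeReference, while a nested field can resolve to a
// non-Attribute expression, hence Seq[Expression] rather than Seq[Attribute].
case class UpdateTable(
    child: LogicalPlan,
    attrs: Seq[Expression],
    values: Seq[Expression],
    condition: Option[Expression]) {
  require(attrs.length == values.length,
    "each updated field needs a corresponding value")
}

object UpdateTableDemo {
  val plan: UpdateTable = UpdateTable(
    DummyRelation,
    Seq(AttributeReference("status")),
    Seq(Literal("inactive")),
    condition = None)

  def main(args: Array[String]): Unit = {
    println(plan.attrs.collect { case AttributeReference(n) => n }) // prints List(status)
  }
}
```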
Test build #110037 has finished for PR 25626 at commit
@xianyinxin, can you explain the required semantics for your proposed API? It isn't clear what the requirement for a source would be. In addition,

@rdblue this PR uses the public expression
Review thread on ...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (outdated, resolved)
Thank you @rdblue. Here `Expression` is the public data source expression `org.apache.spark.sql.catalog.v2.expressions.Expression`. The data source needs to know the value that the field should be updated to, so the key of `sets` specifies the field to be updated and the value of `sets` is the updated value. The value can be an expression, not just a literal, IMHO, as in `UPDATE tbl SET a=a+1 WHERE ...`
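As a rough illustration of these semantics only — the DS v2 update API was ultimately dropped from this PR, so the names below are invented, not real Spark API — `sets` maps each field to the expression that produces its new value:

```scala
// Hypothetical sketch: a toy evaluator showing "key = field to update,
// value = expression for the new value", covering UPDATE tbl SET a = a + 1.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object UpdateSemantics {
  type Row = Map[String, Int]

  // Evaluate an expression against the current row, so the new value can
  // depend on existing column values (a = a + 1), not just literals.
  def eval(e: Expr, row: Row): Int = e match {
    case Col(n)    => row(n)
    case Lit(v)    => v
    case Add(l, r) => eval(l, row) + eval(r, row)
  }

  // Apply a SET clause (field -> expression) to one row.
  def applySets(sets: Map[String, Expr], row: Row): Row =
    row ++ sets.map { case (field, expr) => field -> eval(expr, row) }

  def main(args: Array[String]): Unit = {
    val row = Map("a" -> 1, "b" -> 10)
    // UPDATE tbl SET a = a + 1
    println(applySets(Map("a" -> Add(Col("a"), Lit(1))), row)) // prints Map(a -> 2, b -> 10)
  }
}
```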
Test build #110102 has finished for PR 25626 at commit
Review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (2 outdated, resolved)
```scala
    DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

  case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
    val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])
```
Can we remove the `.asInstanceOf[Seq[Any]]` now?
No. As I explained above, `Seq[NamedExpression]` cannot solve the problem. See #25626 (comment).
Review thread on ...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (outdated, resolved)
Review thread on sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala (outdated, resolved)
If there isn't an urgent use case, I'd prefer to wait on this until after we support UPDATE natively in Spark. But I don't want to block this if someone does need it.
I'm preparing JDBC v2 as a showcase of Data Source V2, which supports many catalog functionalities (CREATE/ALTER/DROP TABLE) and DELETE/UPDATE, so that people can see the benefits of DS v2 for end users. I'd like to get this in if there is no objection. @xianyinxin can you resolve the code conflicts?
Test build #111041 has finished for PR 25626 at commit

Test build #111043 has finished for PR 25626 at commit

retest this please

Test build #111055 has finished for PR 25626 at commit
I don't think that an updated JDBC with UPDATE support qualifies as urgent. I would support adding this without the DSv2 API if you want to get just the parser changes in, but I think it is a bad idea to add a pushdown API without an implementation.
Yes, I can implement JDBC update without the DS v2 API, as JDBC is an internal source. @xianyinxin can you exclude the DS v2 API changes from this PR? i.e. only keep the parser change and the corresponding logical plan. When I finish the JDBC update, we can revisit it and see how to generalize the UPDATE API design.
OK, it looks like a nice plan. I'll update the code soon.
Test build #111196 has finished for PR 25626 at commit
thanks, merging to master!
Thanks for coming up with a compromise, @xianyinxin and @cloud-fan!

Thank you for comments & suggestions! @cloud-fan @rdblue
### What changes were proposed in this pull request?
Add back the resolved logical plan for UPDATE TABLE. It was in #25626 before but was removed later.

### Why are the changes needed?
In #25626, we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin sources like JDBC. We should at least add the resolved logical plan.

### Does this PR introduce any user-facing change?
No, UPDATE is still not supported yet.

### How was this patch tested?
New tests.

Closes #26025 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
What changes were proposed in this pull request?
This PR supports UPDATE in the parser and adds the corresponding logical plan. The SQL syntax is a standard UPDATE statement:
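For example (an illustrative statement in the standard form described here; the table and column names are invented):

```sql
UPDATE target_table
SET col1 = 1,
    col2 = col2 + 1
WHERE col3 = 'spark';
```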
Why are the changes needed?
With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test cases added.