
Conversation

xy-xin commented Aug 30, 2019

What changes were proposed in this pull request?

This PR adds support for UPDATE in the parser and adds the corresponding logical plan. The SQL syntax is a standard UPDATE statement:

UPDATE tableName [tableAlias] SET colName=value [, colName=value ...] [WHERE predicate]
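
For example, the following statement uses the new syntax (the table and column names are made up; this PR adds only the parsing and the logical plan, so executing such a statement against a real source is not yet supported):

// Hypothetical usage, assuming an active SparkSession named `spark`; the
// table and columns are invented. The alias and WHERE clause are optional.
spark.sql("UPDATE students SET age = 20, name = 'unknown' WHERE id = 1")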

Why are the changes needed?

With this change, we can start to implement UPDATE in built-in sources and think about how to design the UPDATE API in DS v2.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test cases added.

xy-xin (Author) commented Aug 30, 2019

cc @cloud-fan @rdblue

cloud-fan (Contributor):

ok to test

cloud-fan (Contributor):

add to whitelist

      DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

    case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
      val attrsNames = attrs.map(_.name)
Contributor:

shall we fail if some attrs are resolved to a nested field?

Author:

Added a check for nested fields; an exception is thrown if a field is nested.

BTW, I believe it would be helpful to support updating nested fields, but it may be difficult to do so, e.g. how would we pass the nested field to the data source?
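
A minimal sketch of such a check (assuming the SET targets arrive as Seq[Expression], which is where the discussion below ends up; the exact message and location in the PR may differ):

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}

// Reject SET targets that did not resolve to a plain top-level column:
// anything other than an AttributeReference is treated as a nested field.
def assertNoNestedFields(attrs: Seq[Expression]): Unit = {
  val nested = attrs.filterNot(_.isInstanceOf[AttributeReference])
  if (nested.nonEmpty) {
    throw new AnalysisException(
      s"Updating nested fields is not supported: ${nested.mkString(", ")}")
  }
}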


SparkQA commented Aug 30, 2019

Test build #109938 has finished for PR 25626 at commit a58a87b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) left a comment:

The test coverage is not good enough. For example, all the test cases update just a single column. Could you check the test cases in other open-source databases?

xy-xin (Author) commented Sep 2, 2019

> The test coverage is not good enough. For example, all the test cases update just a single column. Could you check the test cases in other open-source databases?

Hi @gatorsmile, I added more test cases for UPDATE, including multi-field updates, nested-field updates, etc. Please review.
I also checked the test cases in PostgreSQL. There are many cases for other syntax, such as updating via a sub-query or updating multiple fields with one assignment, which Spark SQL does not currently support.
More test cases will be added once we support a more powerful UPDATE, e.g. updating a field with an expression, or updating multiple fields with a select/multiple values in one assignment as mentioned above.


SparkQA commented Sep 2, 2019

Test build #110012 has finished for PR 25626 at commit a76062c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 2, 2019

Test build #110016 has finished for PR 25626 at commit 5b738cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

    case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
      val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])
Contributor:

why do we need the .asInstanceOf[Seq[Any]]?

Author (xy-xin, Sep 3, 2019):

Without it, attrs is treated as having type Seq[Attribute], and filterNot treats each element as an Attribute, which throws a cast exception.
As you mentioned in #25626 (comment), a nested field is not resolved to an Attribute but to GetStructField or something similar, so if we change the type of attrs to Seq[Expression], we can eliminate the .asInstanceOf[Seq[Any]] here.
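
A self-contained sketch of the typing issue, using hypothetical stand-ins for the catalyst classes (not the real hierarchy):

// Hypothetical, simplified stand-ins for the catalyst classes discussed above.
trait Expression
trait Attribute extends Expression
case class AttributeReference(name: String) extends Attribute
case class GetStructField(child: Expression, name: String) extends Expression

object SeqTypeDemo extends App {
  // Had attrs been declared Seq[Attribute] while actually holding a
  // GetStructField, the predicate below would be compiled against Attribute
  // and fail with a ClassCastException at runtime. Declaring the element type
  // as Expression avoids both the crash and the .asInstanceOf[Seq[Any]] hack.
  val attrs: Seq[Expression] =
    Seq(AttributeReference("a"), GetStructField(AttributeReference("s"), "b"))
  val nested = attrs.filterNot(_.isInstanceOf[AttributeReference])
  println(nested) // List(GetStructField(AttributeReference(s),b))
}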


case class UpdateTable(
    child: LogicalPlan,
    attrs: Seq[Attribute],
Contributor:

Can we really use Seq[Attribute]? When Spark resolves it to a nested field, it will be an Alias, which is not an Attribute, and we will get weird errors.

Author:

As mentioned in #25626 (comment), Seq[Attribute] involves the ugly .asInstanceOf[Seq[Any]] when checking the types of the fields. But do we need to change it to Seq[Expression]? My feeling is that the latter is too general to represent a field.

Contributor:

how about Seq[NamedExpression]?

Author:

Seq[NamedExpression] is OK for me, and I updated the code. But Seq[NamedExpression] is not a perfect solution either, as some nested fields do not have a name: they resolve to something like GetStructField, which is a subclass of Expression.
Updating a nested field is complex. A named field can be updated by specifying its name, but for an array, or a more complex JSON value composed of nested objects and arrays, it is hard to specify the field to be updated.
So currently my thinking is that we can limit updates to non-nested fields, which resolve to a NamedExpression. What's your opinion? @cloud-fan

Contributor:

Then I think we have no choice but to use Seq[Expression]. We should add comments explaining that attrs here holds expressions that may be attributes.

Contributor:

Or we can change the analyzer to only resolve column names in UpdateTable as attributes, which I don't think is worthwhile.

Author:

Updated it to Seq[Expression] and added some explanatory doc; please review.
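
For reference, a sketch of the shape this thread settles on (simplified; the exact node in the PR may differ in names and details):

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

case class UpdateTable(
    child: LogicalPlan,
    // The expressions the SET column names resolve to. Usually these are
    // AttributeReferences, but a nested field resolves to e.g. GetStructField,
    // hence Seq[Expression] rather than Seq[Attribute] or Seq[NamedExpression].
    attrs: Seq[Expression],
    values: Seq[Expression],
    condition: Option[Expression]) extends Command {
  override def children: Seq[LogicalPlan] = child :: Nil
}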


SparkQA commented Sep 3, 2019

Test build #110037 has finished for PR 25626 at commit 4d048b3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue (Contributor) commented Sep 3, 2019

@xianyinxin, can you explain the required semantics for your proposed API?

void updateWhere(Map<String, Expression> sets, Filter[] filters);

It isn't clear what the requirement for a source would be.

In addition, Expression is internal to catalyst and should be removed from the API.

cloud-fan (Contributor):

@rdblue, this PR uses the public expression org.apache.spark.sql.catalog.v2.expressions.Expression.

xy-xin (Author) commented Sep 4, 2019

> @xianyinxin, can you explain the required semantics for your proposed API?
>
>     void updateWhere(Map<String, Expression> sets, Filter[] filters);
>
> It isn't clear what the requirement for a source would be.
>
> In addition, Expression is internal to catalyst and should be removed from the API.

Thank you @rdblue. Here Expression is the public data source expression org.apache.spark.sql.catalog.v2.expressions.Expression. The data source needs to know the value each field is updated to, so the key of sets specifies the field to be updated, and the value of sets is the new value. The value can be an expression, not just a literal, IMHO, as in UPDATE tbl SET a=a+1 WHERE ...
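
To make the intended contract concrete, here is a sketch of the mixin shape implied by that signature (the trait name is invented, and as discussed below this API was ultimately dropped from the PR):

import java.util.{Map => JMap}
import org.apache.spark.sql.catalog.v2.expressions.Expression
import org.apache.spark.sql.sources.Filter

// Hypothetical mixin for a DS v2 table that can execute an UPDATE pushed
// down by Spark.
trait SupportsUpdate {
  // `sets` maps each column name to the expression its new value is computed
  // from (an expression such as a + 1, not just a literal); `filters` is the
  // translated WHERE clause selecting the rows to update.
  def updateWhere(sets: JMap[String, Expression], filters: Array[Filter]): Unit
}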


SparkQA commented Sep 4, 2019

Test build #110102 has finished for PR 25626 at commit 0fda5c8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

    case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
      val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])
Contributor:

Can we remove .asInstanceOf[Seq[Any]] now?

Author:

No. As I explained above, Seq[NamedExpression] cannot solve the problem; see #25626 (comment).

rdblue (Contributor) commented Sep 19, 2019

If there isn't an urgent use case, I'd prefer to wait on this until after we support UPDATE natively in Spark. But I don't want to block this if someone does need it.

cloud-fan (Contributor) commented Sep 20, 2019

I'm preparing JDBC v2 as a showcase of Data Source V2, which supports many catalog functionalities (CREATE/ALTER/DROP TABLE) and DELETE/UPDATE, so that people can see the benefits of DS v2 for end users. I'd like to get this in if there is no objection. @xianyinxin, can you resolve the code conflicts?


SparkQA commented Sep 20, 2019

Test build #111041 has finished for PR 25626 at commit bcafbdb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTable(
  • case class UpdateTableStatement(
  • case class UpdateTableExec(


SparkQA commented Sep 20, 2019

Test build #111043 has finished for PR 25626 at commit 84fcd91.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTable(
  • case class UpdateTableStatement(
  • case class UpdateTableExec(

cloud-fan (Contributor):

retest this please


SparkQA commented Sep 20, 2019

Test build #111055 has finished for PR 25626 at commit 84fcd91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTable(
  • case class UpdateTableStatement(
  • case class UpdateTableExec(

rdblue (Contributor) commented Sep 20, 2019

I don't think that an updated JDBC with UPDATE support qualifies as urgent. I would support adding this without the DSv2 API if you want to get just the parser changes in, but I think it is a bad idea to add a pushdown API without an implementation.

cloud-fan (Contributor):

Yes, I can implement JDBC UPDATE without the DS v2 API, as JDBC is an internal source. @xianyinxin, can you exclude the DS v2 API changes from this PR, i.e. only keep the parser change and the UpdateStatement, as well as the parser tests?

When I finish the JDBC update, we can revisit it and see how to generalize the UPDATE API design.
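
For context, the parser-only piece that remains would be shaped roughly like the sketch below (illustrative; the actual UpdateTableStatement in the PR extends Spark's ParsedStatement base class and may differ in its fields):

import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical sketch of an unresolved UPDATE statement as produced by the
// parser; names stay as raw identifiers until the analyzer resolves them.
case class UpdateTableStatement(
    tableName: Seq[String],          // possibly multi-part, e.g. catalog.db.tbl
    tableAlias: Option[String],      // the optional alias after the table name
    columns: Seq[Seq[String]],       // SET targets as multi-part column names
    values: Seq[Expression],         // parsed SET values
    condition: Option[Expression])   // the optional WHERE predicate
  extends LogicalPlan {
  override def output: Seq[Attribute] = Nil
  override def children: Seq[LogicalPlan] = Nil
}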

xy-xin (Author) commented Sep 23, 2019

OK, that looks like a nice plan. I'll update the code soon.


SparkQA commented Sep 23, 2019

Test build #111196 has finished for PR 25626 at commit 3732a18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTableStatement(

cloud-fan changed the title from "[SPARK-28892][SQL] Add UPDATE support for DataSource V2" to "[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan" on Sep 23, 2019
cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in 655356e on Sep 23, 2019
rdblue (Contributor) commented Sep 23, 2019

Thanks for coming up with a compromise, @xianyinxin and @cloud-fan!

xy-xin (Author) commented Sep 24, 2019

Thank you for the comments and suggestions! @cloud-fan @rdblue

gatorsmile pushed a commit that referenced this pull request Oct 8, 2019
### What changes were proposed in this pull request?

Add back the resolved logical plan for UPDATE TABLE. It was in #25626 before but was removed later.

### Why are the changes needed?

In #25626, we decided not to add the UPDATE API in DS v2, but we still want to implement UPDATE for built-in sources like JDBC. We should at least add the resolved logical plan.

### Does this PR introduce any user-facing change?

no, UPDATE is still not supported yet.

### How was this patch tested?

new tests.

Closes #26025 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>