
Conversation

@brkyvz
Contributor

@brkyvz brkyvz commented Sep 20, 2019

What changes were proposed in this pull request?

It is very confusing that the default save mode differs depending on the internal implementation of a data source. The reason we had to have saveModeForDSV2 was that there was no easy way to check the existence of a table in DataSource V2. Now we have catalogs for that, so we should be able to remove the divergent save modes. We also have a path forward for `save`, where we can't really check the existence of a table and therefore have to create one; that will come in a future PR.
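The behavior change can be sketched with a minimal Python model (illustrative only — names like `default_save_mode` are hypothetical, not Spark's actual API; the real resolution happens inside `DataFrameWriter`):

```python
from enum import Enum

class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"

def default_save_mode(is_v2_source: bool, after_this_pr: bool) -> SaveMode:
    """Model of the unification: before this PR, V2 sources silently
    defaulted to Append (via the internal saveModeForDSV2) while V1
    sources defaulted to ErrorIfExists; after it, both share one default."""
    if after_this_pr:
        return SaveMode.ERROR_IF_EXISTS
    return SaveMode.APPEND if is_v2_source else SaveMode.ERROR_IF_EXISTS

# Old split: v1 and v2 disagreed on the default.
assert default_save_mode(is_v2_source=True, after_this_pr=False) == SaveMode.APPEND
assert default_save_mode(is_v2_source=False, after_this_pr=False) == SaveMode.ERROR_IF_EXISTS
# New behavior: one default regardless of the implementation.
assert default_save_mode(is_v2_source=True, after_this_pr=True) == SaveMode.ERROR_IF_EXISTS
```

The point of the sketch is that which default a user got previously depended on an implementation detail they could not see.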

Why are the changes needed?

Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.

Does this PR introduce any user-facing change?

It changes the default save mode for V2 Tables in the DataFrameWriter APIs

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Sep 21, 2019

Test build #111097 has finished for PR 25876 at commit 502cd1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Sep 23, 2019

cc @cloud-fan @rdblue

@cloud-fan
Contributor

In hindsight, it's more confusing to have different default save modes for DS v1 and v2 than to ask users to specify append mode explicitly when writing to DS v2.

+1 for this change, with the expectation to support all save modes later. @brkyvz do we have a JIRA for it?

@SparkQA

SparkQA commented Sep 23, 2019

Test build #111177 has finished for PR 25876 at commit 7135164.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Sep 23, 2019

@cloud-fan
Contributor

LGTM if tests pass

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111299 has finished for PR 25876 at commit 8710266.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Before:

    val command = modeForDSV2 match {
      case SaveMode.Append =>

After:

    val command = mode match {
      case SaveMode.Append | SaveMode.ErrorIfExists | SaveMode.Ignore =>
Contributor

@cloud-fan cloud-fan Sep 24, 2019


A note to future readers: this is the old behavior, where non-overwrite modes mean append. This is due to the bad design of DataFrameWriter: we only need to know overwrite-or-not when calling insert, but DataFrameWriter gives you a full save mode. Since the default save mode is ErrorIfExists, treating non-overwrite modes as append is a reasonable compromise.

Note that, we don't have this problem in the new DataFrameWriterV2.

Contributor


Looks like the previous version used this:

      InsertIntoTable(
        table = UnresolvedRelation(tableIdent),
        partition = Map.empty[String, Option[String]],
        query = df.logicalPlan,
        overwrite = mode == SaveMode.Overwrite, // << Either overwrite or append
        ifPartitionNotExists = false)

So I agree that this is using the same behavior that v1 did.
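The mapping under discussion — only Overwrite matters for insertInto, and every other mode collapses to append — can be modeled as a tiny sketch (hypothetical Python names, mirroring the `overwrite = mode == SaveMode.Overwrite` line above, not Spark's actual code):

```python
from enum import Enum

class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"

def insert_overwrite_flag(mode: SaveMode) -> bool:
    # Mirrors `overwrite = mode == SaveMode.Overwrite` in the v1 plan:
    # every non-Overwrite mode, including ErrorIfExists and Ignore,
    # behaves as an append when going through insertInto.
    return mode == SaveMode.OVERWRITE

assert insert_overwrite_flag(SaveMode.OVERWRITE) is True
assert insert_overwrite_flag(SaveMode.APPEND) is False
assert insert_overwrite_flag(SaveMode.ERROR_IF_EXISTS) is False  # appends
assert insert_overwrite_flag(SaveMode.IGNORE) is False           # appends
```

This is exactly the behavior rdblue flags below as surprising for Ignore and ErrorIfExists.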

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111306 has finished for PR 25876 at commit 792bd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Sep 25, 2019

It looks like this changes the meaning of ErrorIfExists and Ignore to Append, but that's not safe. I understand wanting to make incremental changes, but it seems to me that this should be combined with the addition of the catalog and identifier methods that we discussed in the v2 sync, to avoid code in master where Ignore actually appends data. That's a correctness problem.

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

@rdblue It actually doesn't change the meaning. The only change is in insertInto, where SaveMode only matters for distinguishing overwrite from append, and the other modes are meaningless. That's consistent with DataSource V1 behavior (although I agree it's weird).

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

retest this please

@cloud-fan
Contributor

@rdblue IIUC, what you were talking about is #25876 (comment).

This is the v1 behavior. I agree it's weird but I think it's a reasonable compromise to the design problem in DataFrameWriter.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111316 has finished for PR 25876 at commit 792bd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

There are still some test failures in Kafka.

@rdblue
Contributor

rdblue commented Sep 25, 2019

I'm fine going ahead with this since the v1 behavior is, evidently, to ignore some save modes for insertInto. Can we add this behavior to the documentation for insertInto so that it is at least stated somewhere?

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

@rdblue and @cloud-fan I had to modify Kafka test code (add mode("append")) for it to work, and added Kafka to the blacklisted sources list. Let me know if this is unacceptable.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111365 has finished for PR 25876 at commit 3e054c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

retest this please

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111373 has finished for PR 25876 at commit 3e054c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.format("kafka")
.option("kafka.bootstrap.servers", testUtils.brokerAddress)
.option("topic", topic)
.mode("append")
Contributor


Do we still need to change this file since we disable kafka v2 by default?

@brkyvz
Contributor Author

brkyvz commented Sep 26, 2019 via email

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c8159c7 Sep 26, 2019
dongjoon-hyun pushed a commit that referenced this pull request Oct 2, 2019
### What changes were proposed in this pull request?
In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks:
- JsonBenchmark
- CSVBenchmark
- UDFBenchmark
- MakeDateTimeBenchmark
- ExtractBenchmark
- DateTimeBenchmark
- NestedSchemaPruningBenchmark

### Why are the changes needed?
Otherwise writing to `noop` fails with:
```
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)
```
most likely due to #25876
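The failure mode can be sketched as a small Python model (hypothetical `validate_v2_write_mode` name, not Spark's code path): a v2 `TableProvider` without a catalog cannot check table existence, so only Append and Overwrite can be honored, and the post-#25876 default of ErrorIfExists is rejected unless a mode is set explicitly.

```python
def validate_v2_write_mode(provider: str, mode: str) -> None:
    """Model of the check that raised the AnalysisException above:
    without a catalog, existence-dependent modes cannot be honored."""
    if mode not in ("append", "overwrite"):
        raise ValueError(
            f"TableProvider implementation {provider} cannot be written "
            f"with {mode} mode, please use Append or Overwrite modes instead."
        )

# The benchmark fix: set Overwrite explicitly.
validate_v2_write_mode("noop", "overwrite")

# The old implicit default after the PR (ErrorIfExists) is rejected.
try:
    validate_v2_write_mode("noop", "errorifexists")
    raise RuntimeError("unreachable")
except ValueError:
    pass
```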

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I generated results of `ExtractBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```

Closes #25988 from MaxGekk/noop-overwrite-mode.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cloud-fan pushed a commit that referenced this pull request Apr 2, 2020
### What changes were proposed in this pull request?

The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`.

In #25876, we removed the save mode for DSV2 from DataFrameWriter, so the `mode` method is never called, and `validateInputs` fails deterministically because `mode` is never set.
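The dead check being removed can be modeled schematically (hypothetical Python, not the actual Scala `FileWriteBuilder`): once DataFrameWriter stops passing a SaveMode to the builder, any validation that requires one fails unconditionally, so it is dead code.

```python
class FileWriteBuilder:
    """Model: after #25876, DataFrameWriter never sets save_mode here."""

    def __init__(self):
        self.save_mode = None  # no longer populated by DataFrameWriter

    def validate_inputs_old(self):
        # Old check: demands a mode that is now never provided,
        # so this always fails.
        assert self.save_mode is not None, "mode must be set"

    def validate_inputs_new(self):
        # New behavior: SaveMode is resolved before the builder
        # is created, so there is nothing to validate here.
        pass

b = FileWriteBuilder()
try:
    b.validate_inputs_old()
    raise RuntimeError("unreachable")
except AssertionError:
    pass  # the old check always trips now
b.validate_inputs_new()  # fine
```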

### Why are the changes needed?
rm dead code.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests.

Closes #28090 from yaooqinn/SPARK-31321.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 2, 2020
### What changes were proposed in this pull request?

The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`.

In #25876, we removed the save mode for DSV2 from DataFrameWriter, so the `mode` method is never called, and `validateInputs` fails deterministically because `mode` is never set.

### Why are the changes needed?
rm dead code.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests.

Closes #28090 from yaooqinn/SPARK-31321.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1ce584f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
5 participants