
Conversation

@brkyvz
Contributor

@brkyvz brkyvz commented Sep 20, 2019

What changes were proposed in this pull request?

It is very confusing that the default save mode differs depending on the internal implementation of a data source. The reason we had to have saveModeForDSV2 was that there was no easy way to check the existence of a table in DataSource V2. Now we have catalogs for that, so we should be able to remove the divergent save modes. We also have a path forward for `save`, where we can't really check the existence of a table and therefore have to create one; that will come in a future PR.
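The behavior change can be sketched with a minimal Python model (illustrative only — names like `default_save_mode` are hypothetical, not Spark's actual API; the real resolution happens inside `DataFrameWriter`):

```python
from enum import Enum

class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"

def default_save_mode(is_v2_source: bool, after_this_pr: bool) -> SaveMode:
    """Model of the unification: before this PR, V2 sources silently
    defaulted to Append (via the internal saveModeForDSV2) while V1
    sources defaulted to ErrorIfExists; after it, both share one default."""
    if after_this_pr:
        return SaveMode.ERROR_IF_EXISTS
    return SaveMode.APPEND if is_v2_source else SaveMode.ERROR_IF_EXISTS

# Old split: v1 and v2 disagreed on the default.
assert default_save_mode(is_v2_source=True, after_this_pr=False) == SaveMode.APPEND
assert default_save_mode(is_v2_source=False, after_this_pr=False) == SaveMode.ERROR_IF_EXISTS
# New behavior: one default regardless of the implementation.
assert default_save_mode(is_v2_source=True, after_this_pr=True) == SaveMode.ERROR_IF_EXISTS
```

The point of the sketch is that which default a user got previously depended on an implementation detail they could not see.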

Why are the changes needed?

Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.

Does this PR introduce any user-facing change?

It changes the default save mode for V2 Tables in the DataFrameWriter APIs

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Sep 21, 2019

Test build #111097 has finished for PR 25876 at commit 502cd1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Sep 23, 2019

cc @cloud-fan @rdblue

@cloud-fan
Contributor

In hindsight, it's more confusing to have different default save modes for DS v1 and v2 than to ask users to specify append mode explicitly when writing to DS v2.

+1 for this change, with the expectation to support all save modes later. @brkyvz do we have a JIRA for it?

@SparkQA

SparkQA commented Sep 23, 2019

Test build #111177 has finished for PR 25876 at commit 7135164.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Sep 23, 2019

@cloud-fan
Contributor

LGTM if tests pass

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111299 has finished for PR 25876 at commit 8710266.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Before:

    val command = modeForDSV2 match {
      case SaveMode.Append =>

After:

    val command = mode match {
      case SaveMode.Append | SaveMode.ErrorIfExists | SaveMode.Ignore =>
Contributor

@cloud-fan cloud-fan Sep 24, 2019


A note to future readers: this is the old behavior, where non-overwrite modes mean append. This is due to the bad design of DataFrameWriter: we only need to know overwrite-or-not when calling insert, but DataFrameWriter gives you a full save mode. Since the default save mode is ErrorIfExists, treating non-overwrite modes as append is a reasonable compromise.

Note that, we don't have this problem in the new DataFrameWriterV2.

Contributor


Looks like the previous version used this:

      InsertIntoTable(
        table = UnresolvedRelation(tableIdent),
        partition = Map.empty[String, Option[String]],
        query = df.logicalPlan,
        overwrite = mode == SaveMode.Overwrite, // << Either overwrite or append
        ifPartitionNotExists = false)

So I agree that this is using the same behavior that v1 did.
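The mapping under discussion — only Overwrite matters for insertInto, and every other mode collapses to append — can be modeled as a tiny sketch (hypothetical Python names, mirroring the `overwrite = mode == SaveMode.Overwrite` line above, not Spark's actual code):

```python
from enum import Enum

class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"
    IGNORE = "ignore"

def insert_overwrite_flag(mode: SaveMode) -> bool:
    # Mirrors `overwrite = mode == SaveMode.Overwrite` in the v1 plan:
    # every non-Overwrite mode, including ErrorIfExists and Ignore,
    # behaves as an append when going through insertInto.
    return mode == SaveMode.OVERWRITE

assert insert_overwrite_flag(SaveMode.OVERWRITE) is True
assert insert_overwrite_flag(SaveMode.APPEND) is False
assert insert_overwrite_flag(SaveMode.ERROR_IF_EXISTS) is False  # appends
assert insert_overwrite_flag(SaveMode.IGNORE) is False           # appends
```

This is exactly the behavior rdblue flags below as surprising for Ignore and ErrorIfExists.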

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111306 has finished for PR 25876 at commit 792bd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Sep 25, 2019

It looks like this changes the meaning of ErrorIfExists and Ignore to Append, but that's not safe. I understand wanting to make incremental changes, but it seems to me that this should be combined with the addition of the catalog and identifier methods that we discussed in the v2 sync, to avoid code in master where Ignore actually appends data. That's a correctness problem.

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

@rdblue It actually doesn't change the meaning. The only change is in insertInto, where SaveMode only matters for distinguishing overwrite from append, and the other modes are meaningless. That's consistent with DataSource V1 behavior (although I agree it's weird).

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

retest this please

@cloud-fan
Contributor

@rdblue IIUC, what you were talking about is #25876 (comment).

This is the v1 behavior. I agree it's weird but I think it's a reasonable compromise to the design problem in DataFrameWriter.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111316 has finished for PR 25876 at commit 792bd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

There are still some test failures in Kafka.

@rdblue
Contributor

rdblue commented Sep 25, 2019

I'm fine going ahead with this since the v1 behavior is, evidently, to ignore some save modes for insertInto. Can we add this behavior to the documentation for insertInto so that it is at least stated somewhere?

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

@rdblue and @cloud-fan I had to modify Kafka test code (add mode("append")) for it to work, and added Kafka to the blacklisted sources list. Let me know if this is unacceptable.

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111365 has finished for PR 25876 at commit 3e054c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Sep 25, 2019

retest this please

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111373 has finished for PR 25876 at commit 3e054c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.format("kafka")
.option("kafka.bootstrap.servers", testUtils.brokerAddress)
.option("topic", topic)
.mode("append")
Contributor


Do we still need to change this file since we disable kafka v2 by default?

@brkyvz
Contributor Author

brkyvz commented Sep 26, 2019 via email

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c8159c7 Sep 26, 2019
dongjoon-hyun pushed a commit that referenced this pull request Oct 2, 2019
### What changes were proposed in this pull request?
In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks:
- JsonBenchmark
- CSVBenchmark
- UDFBenchmark
- MakeDateTimeBenchmark
- ExtractBenchmark
- DateTimeBenchmark
- NestedSchemaPruningBenchmark

### Why are the changes needed?
Otherwise writing to `noop` fails with:
```
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)
```
most likely due to #25876
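The failure mode can be sketched as a small Python model (hypothetical `validate_v2_write_mode` name, not Spark's code path): a v2 `TableProvider` without a catalog cannot check table existence, so only Append and Overwrite can be honored, and the post-#25876 default of ErrorIfExists is rejected unless a mode is set explicitly.

```python
def validate_v2_write_mode(provider: str, mode: str) -> None:
    """Model of the check that raised the AnalysisException above:
    without a catalog, existence-dependent modes cannot be honored."""
    if mode not in ("append", "overwrite"):
        raise ValueError(
            f"TableProvider implementation {provider} cannot be written "
            f"with {mode} mode, please use Append or Overwrite modes instead."
        )

# The benchmark fix: set Overwrite explicitly.
validate_v2_write_mode("noop", "overwrite")

# The old implicit default after the PR (ErrorIfExists) is rejected.
try:
    validate_v2_write_mode("noop", "errorifexists")
    raise RuntimeError("unreachable")
except ValueError:
    pass
```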

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I generated results of `ExtractBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```

Closes #25988 from MaxGekk/noop-overwrite-mode.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cloud-fan pushed a commit that referenced this pull request Apr 2, 2020
### What changes were proposed in this pull request?

The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`.

In #25876, we removed the save mode for DSV2 from DataFrameWriter, so the `mode` method is never called, and `validateInputs` fails deterministically because `mode` is never set.
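The dead check being removed can be modeled schematically (hypothetical Python, not the actual Scala `FileWriteBuilder`): once DataFrameWriter stops passing a SaveMode to the builder, any validation that requires one fails unconditionally, so it is dead code.

```python
class FileWriteBuilder:
    """Model: after #25876, DataFrameWriter never sets save_mode here."""

    def __init__(self):
        self.save_mode = None  # no longer populated by DataFrameWriter

    def validate_inputs_old(self):
        # Old check: demands a mode that is now never provided,
        # so this always fails.
        assert self.save_mode is not None, "mode must be set"

    def validate_inputs_new(self):
        # New behavior: SaveMode is resolved before the builder
        # is created, so there is nothing to validate here.
        pass

b = FileWriteBuilder()
try:
    b.validate_inputs_old()
    raise RuntimeError("unreachable")
except AssertionError:
    pass  # the old check always trips now
b.validate_inputs_new()  # fine
```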

### Why are the changes needed?
rm dead code.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests.

Closes #28090 from yaooqinn/SPARK-31321.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 2, 2020
### What changes were proposed in this pull request?

The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`.

In #25876, we removed the save mode for DSV2 from DataFrameWriter, so the `mode` method is never called, and `validateInputs` fails deterministically because `mode` is never set.

### Why are the changes needed?
rm dead code.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests.

Closes #28090 from yaooqinn/SPARK-31321.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1ce584f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
5 participants