[SPARK-29197][SQL] Remove saveModeForDSV2 from DataFrameWriter #25876
Conversation
Test build #111097 has finished for PR 25876 at commit
With hindsight, it's more confusing to have different default save modes for DS v1 and v2 than to ask users to specify append mode explicitly when writing to DS v2. +1 for this change, with the expectation to support all save modes later. @brkyvz do we have a JIRA for it?
Test build #111177 has finished for PR 25876 at commit
Yes, JIRA is: https://issues.apache.org/jira/browse/SPARK-29219
LGTM if tests pass
Test build #111299 has finished for PR 25876 at commit
```diff
- val command = modeForDSV2 match {
-   case SaveMode.Append =>
+ val command = mode match {
+   case SaveMode.Append | SaveMode.ErrorIfExists | SaveMode.Ignore =>
```
A note to future readers: this is the old behavior, where any non-overwrite mode means append. This is due to the bad design of DataFrameWriter: we only need to know whether to overwrite when calling insert, but DataFrameWriter gives you a full save mode. Since the default save mode is ErrorIfExists, treating every non-overwrite mode as append is a reasonable compromise.
Note that we don't have this problem in the new DataFrameWriterV2.
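To make the contrast concrete, a minimal Scala sketch (the table name `db.t` is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

def contrast(df: DataFrame): Unit = {
  // v1-style DataFrameWriter: intent rides on a SaveMode, and every
  // non-overwrite mode ends up behaving as an append on insert.
  df.write.mode(SaveMode.Append).insertInto("db.t")

  // DataFrameWriterV2: intent is an explicit method call, so there is
  // no SaveMode left to reinterpret.
  df.writeTo("db.t").append()
  df.writeTo("db.t").overwritePartitions()
}
```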
Looks like the previous version used this:

```scala
InsertIntoTable(
  table = UnresolvedRelation(tableIdent),
  partition = Map.empty[String, Option[String]],
  query = df.logicalPlan,
  overwrite = mode == SaveMode.Overwrite, // << Either overwrite or append
  ifPartitionNotExists = false)
```
So I agree that this is using the same behavior that v1 did.
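For skimmers, a small hypothetical helper (not Spark source) capturing the mapping both comments describe:

```scala
import org.apache.spark.sql.SaveMode

// Only Overwrite flips the insert to overwrite; everything else appends.
def isOverwrite(mode: SaveMode): Boolean = mode match {
  case SaveMode.Overwrite => true
  case SaveMode.Append | SaveMode.ErrorIfExists | SaveMode.Ignore => false
}
```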
Test build #111306 has finished for PR 25876 at commit
It looks like this changes the meaning of …
@rdblue It actually doesn't change the meaning. The only meaning change is in …
retest this please
@rdblue IIUC, what you were talking about is #25876 (comment). This is the v1 behavior. I agree it's weird, but I think it's a reasonable compromise to the design problem in …
Test build #111316 has finished for PR 25876 at commit
There are still some test failures in Kafka.
I'm fine going ahead with this since the v1 behavior is, evidently, to ignore some save modes for insertInto. Can we add this behavior to the documentation for …
@rdblue and @cloud-fan I had to modify Kafka test code (add …)
Test build #111365 has finished for PR 25876 at commit
retest this please
Test build #111373 has finished for PR 25876 at commit
| .format("kafka") | ||
| .option("kafka.bootstrap.servers", testUtils.brokerAddress) | ||
| .option("topic", topic) | ||
| .mode("append") |
Do we still need to change this file since we disable kafka v2 by default?
Yes, because the conf turns it back on, specifically to use it.
thanks, merging to master!
### What changes were proposed in this pull request?
In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks:
- JsonBenchmark
- CSVBenchmark
- UDFBenchmark
- MakeDateTimeBenchmark
- ExtractBenchmark
- DateTimeBenchmark
- NestedSchemaPruningBenchmark

### Why are the changes needed?
Otherwise writing to `noop` fails with:
```
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)
```
most likely due to #25876

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I generated results of `ExtractBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```

Closes #25988 from MaxGekk/noop-overwrite-mode.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
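As a rough illustration of that fix, the benchmarks now pass an explicit mode to the built-in `noop` sink (a minimal sketch, not the benchmark code itself):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// The noop sink rejects the new ErrorIfExists default, so set a mode explicitly.
def writeNoop(df: DataFrame): Unit =
  df.write.mode(SaveMode.Overwrite).format("noop").save()
```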
### What changes were proposed in this pull request?
The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`. In #25876, we removed the save mode for DSV2 from DataFrameWriter, so the `mode` method is never used, which makes `validateInputs` fail deterministically when `mode` is not set.

### Why are the changes needed?
Remove dead code.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28090 from yaooqinn/SPARK-31321.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
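A rough, hypothetical sketch of the dead-code shape described above (not the actual `FileWriteBuilder` source):

```scala
import org.apache.spark.sql.SaveMode

class FileWriteBuilderSketch {
  private var saveMode: Option[SaveMode] = None

  // After #25876 nothing calls this setter anymore...
  def mode(m: SaveMode): this.type = { saveMode = Some(m); this }

  // ...so a validation that requires a mode could only ever fail.
  private def validateInputs(): Unit =
    assert(saveMode.isDefined, "save mode must be set")
}
```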
What changes were proposed in this pull request?
It is very confusing that the default save mode differs depending on the internal implementation of a data source. The reason we had to have saveModeForDSV2 was that there was no easy way to check the existence of a table in DataSource v2. Now we have catalogs for that, so we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table and therefore create one; that will come in a future PR.
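As a hedged sketch of that point, the DSv2 `TableCatalog` API (Spark 3.0+) makes the existence check a direct call:

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

// Table existence no longer needs a save-mode workaround.
def exists(catalog: TableCatalog, ident: Identifier): Boolean =
  catalog.tableExists(ident)
```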
Why are the changes needed?
Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.
Does this PR introduce any user-facing change?
It changes the default save mode for V2 tables in the DataFrameWriter APIs.
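A hypothetical illustration of the change (`myV2Source` and the output path are made up):

```scala
import org.apache.spark.sql.DataFrame

// Before this PR, a DSv2 write with no mode implicitly appended; now the
// default is ErrorIfExists, so callers must ask for append explicitly.
def nowRequired(df: DataFrame): Unit =
  df.write.format("myV2Source").mode("append").save("/tmp/out")
```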
How was this patch tested?
Existing tests