[SPARK-26915][SQL] File source should write without schema validation in DataFrameWriter.save() #23829


Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

Spark supports writing to file data sources without fetching or validating against the existing table schema.
For example,

spark.range(10).write.orc(path)
val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
newDF.write.mode("overwrite").orc(path)
  1. There is no need to get/infer the schema from the table/path
  2. The schema of newDF can be different from the original table schema.

However, from https://github.com/apache/spark/pull/23606/files#r255319992 we can see that the feature above is missing in data source V2. Currently, data source V2 always validates the output query with the table schema. Even after the catalog support of DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.

This PR proposes to handle file sources as a special case in DataFrameWriter.save(), so that we can keep the original behavior for this DataFrame API.
The PR also re-enables the write path of the ORC data source V2.
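
A minimal, self-contained sketch of the proposed special case (toy types only; this is not the actual DataFrameWriter implementation):

// Toy sketch: file sources skip write-side schema validation in save(),
// while other V2 sources keep validating the output query against the table schema.
object SaveSpecialCaseSketch {
  sealed trait Provider
  case object FileSourceV2 extends Provider                        // e.g. the ORC/Parquet file sources
  case class OtherV2Source(schemaMatches: Boolean) extends Provider

  def save(provider: Provider): Unit = provider match {
    case FileSourceV2 =>
      // Do not get/infer or validate the existing schema; just write.
      println("write without schema validation (original file source behavior)")
    case OtherV2Source(ok) =>
      // Validate the output query schema against the table schema first.
      require(ok, "query schema does not match the table schema")
      println("write after schema validation")
  }
}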

How was this patch tested?

Unit test

…tion tests for FileDataSourceV2 (partially revert )"

This reverts commit a0e81fc.
@gengliangwang
Member Author

@cloud-fan @rdblue

@cloud-fan
Contributor

I think this is a case where the file source has unexpected SaveMode behavior: for the append and overwrite modes, Spark expects the table to already exist, but the file source doesn't follow that.

@rdblue are you OK with special-casing the file source? I'm a little worried about changing the behavior of the file source in Spark 3.0.

@rdblue
Contributor

rdblue commented Feb 18, 2019

@cloud-fan, one of the goals of v2 is to avoid special cases for internal sources. I think we should continue to avoid them.

I'm happy to discuss a proposal for this, but I think we need a real proposal that has been thought through. Quick PRs to just make it work right away are the reason we have unknown and unpredictable behavior in v1, and we need to be deliberate about what is introduced in v2.

I think it is fine to have cases where validation is turned off, but those need to be well-defined. This is another reason to introduce a v2 write API where the behavior is obvious to the caller.

@rdblue
Contributor

rdblue commented Feb 18, 2019

I'm a little worried about changing the behavior of file source in Spark 3.0.

Sorry I missed replying to this comment earlier. The reason we agreed to develop v2 in parallel is to avoid behavior changes. The behavior in v1 is too unpredictable to be confident that we have not changed anything. Plus, we know that some behaviors must necessarily change in v2 (standardizing on the file source behavior where it is consistent).

@SparkQA

SparkQA commented Feb 18, 2019

Test build #102479 has finished for PR 23829 at commit 8316c07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Feb 19, 2019

@rdblue there are two problems here:

  1. file source should not have schema validation during write
  2. file source can't report its schema during write if the output path doesn't exist

For 1, I think we can introduce a new trait (or capability API) to indicate that a data source doesn't need schema validation during write.
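
A minimal sketch of what such a trait could look like (hypothetical names, not an existing Spark API):

// Hypothetical marker trait: a V2 source mixes this in to tell the write
// planner to skip matching the query schema against the table schema.
trait AcceptsAnyWriteSchema {
  def acceptsAnyWriteSchema: Boolean = true
}

// Example: a file-based table that opts out of write-side validation.
class OrcLikeTable extends AcceptsAnyWriteSchema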

For 2, I think we need the CTAS(and RTAS) operator.

One thing to note: the DataFrameWriter API can mix data and metadata operations, e.g. df.write.mode("append") can append data to a non-existing table, with CTAS semantics. I can't find a corresponding SQL operator; maybe we need to create a new one like CREATE OR INSERT TABLE ... AS SELECT.
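
For example, with today's V1 file source (the path here is just illustrative):

spark.range(5).write.mode("append").parquet("/tmp/new_table") // path missing: behaves like CTAS
spark.range(5).write.mode("append").parquet("/tmp/new_table") // path exists: behaves like INSERT INTO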

How would the ongoing catalog API proposal solve this issue?

@gengliangwang
Member Author

Even after the catalog support of DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.

That's my major concern. The behavior of DataFrameWriter.save is so different from table insertion in traditional databases.

@HyukjinKwon
Member

I think we should add an interface for backward compatibility, of course, and I thought that's what we had already agreed upon.

@gengliangwang
Member Author

@HyukjinKwon I created #23824 to add a new interface, but it got -1 from @cloud-fan and @rdblue. I agree with their comments.

@gengliangwang
Member Author

gengliangwang commented Feb 19, 2019

On further thought, I don't think DataFrameWriter.save has to be involved with the table concept. Otherwise, things get complex and there can be two different behaviors in the single API DataFrameWriter.save.

That is to say, let's remove the AppendData and OverwriteByExpression plans from DataFrameWriter.save, since their behaviors are different from the API's.

The target of save just doesn't have to be a table. We can use AppendData and OverwriteByExpression in DataFrameWriter.saveAsTable, which is more appropriate.

@gengliangwang
Member Author

Created #23836 and closing this one.

@rdblue
Contributor

rdblue commented Feb 19, 2019

@cloud-fan, here are my answers:

  1. file source should not have schema validation during write

Validation should be configured by the source, just like we talked about for sources that can accept data with missing columns.

I think the larger issue is finding out what the correct behavior is. What tables should opt out of validation? What tables should just use different rules, like allowing new columns?

  2. file source can't report its schema during write if the output path doesn't exist

In this case, the table catalog that supports path-based tables will check existence. If the path doesn't exist, then the table doesn't exist, and the writer should use a CreateTableAsSelect plan instead of an overwrite plan. CTAS doesn't validate against an existing schema; it creates the table using the given schema.
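
A self-contained sketch of that rule (toy types, not the actual catalog API):

// Toy sketch: a path-based catalog checks existence via the path itself.
// A missing path means the table doesn't exist, so plan a CTAS (which uses the
// query's schema, no validation) instead of an overwrite (which validates).
object PathWritePlanningSketch {
  sealed trait Plan
  case class CreateTableAsSelect(path: String) extends Plan     // creates the table with the query schema
  case class OverwriteByExpression(path: String) extends Plan   // validates against the existing schema

  def tableExists(path: String): Boolean = new java.io.File(path).exists()

  def planOverwrite(path: String): Plan =
    if (!tableExists(path)) CreateTableAsSelect(path) else OverwriteByExpression(path)
}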
