[SPARK-26915][SQL]File source should write without schema validation in DataFrameWriter.save() #23829
Conversation
…tion tests for FileDataSourceV2 (partially revert )" This reverts commit a0e81fc.
I think this is the case where the file source has unexpected `SaveMode` behavior: for append and overwrite modes, Spark expects the table to already exist, but the file source doesn't follow that rule. @rdblue are you OK with special-casing the file source? I'm a little worried about changing the behavior of the file source in Spark 3.0.
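A minimal sketch of the file source behavior being described, assuming Spark in local mode; the path and data below are illustrative:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// With a file source, append/overwrite succeeds even when nothing exists at the
// path yet: the file source simply creates the directory and writes the files.
val df = Seq((1L, "a"), (2L, "b")).toDF("id", "value")
df.write.mode(SaveMode.Append).parquet("/tmp/not-yet-existing-table") // illustrative path
```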
@cloud-fan, one of the goals of v2 is to avoid special cases for internal sources. I think we should continue to avoid them. I'm happy to discuss a proposal for this, but I think we need a real proposal that has been thought through. Quick just-make-it-work-right-away PRs are the reason we have unknown and unpredictable behavior in v1, and we need to be deliberate about what is introduced in v2. I think it is fine to have cases where validation is turned off, but those need to be well defined. This is another reason to introduce a v2 write API where the behavior is obvious to the caller.
Sorry I missed replying to this comment earlier. The reason why we agreed to develop v2 in parallel is to avoid behavior changes. The behavior in v1 is too unpredictable to be confident that we haven't changed anything. Plus, we know that some behaviors must necessarily change in v2 (standardizing on the file sources' behavior, where that behavior is consistent).
Test build #102479 has finished for PR 23829 at commit
@rdblue there are 2 problems here:
For 1, I think we can introduce a new trait (or capability API) to indicate that a data source doesn't need schema validation during write. For 2, I think we need the CTAS (and RTAS) operators. One thing to note: how would the ongoing catalog API proposal solve this issue?
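For the first point, a hypothetical marker trait could look roughly like the sketch below; the names are illustrative only and this is not the interface the project ultimately adopted:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical marker trait: a source that mixes this in asks the planner to
// skip output-schema validation on write.
trait AcceptsAnyWriteSchema

// Illustrative validation hook in the write path.
def validateWriteSchema(source: AnyRef, table: StructType, query: StructType): Unit =
  source match {
    case _: AcceptsAnyWriteSchema => // opted out: no validation
    case _ =>
      require(query.fieldNames.sameElements(table.fieldNames),
        s"Schema mismatch: expected ${table.simpleString}, got ${query.simpleString}")
  }
```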
That's my major concern. The behavior of …
I think we should add an interface for backward compatibility of course, and I thought that's what we agreed upon already. |
@HyukjinKwon I have created #23824 to add a new interface, but got -1 from @cloud-fan and @rdblue. I agree with their comments.
On second thought, I don't think … That is to say, let's remove the expression … It just doesn't have to be a table. We can use …
Created #23836; closing this one.
@cloud-fan, here are my answers:
Validation should be configured by the source, just like we talked about for sources that can accept data with missing columns. I think the larger issue is finding out what the correct behavior is. Which tables should opt out of validation? Which tables should just use different rules, like allowing new columns?
In this case, the table catalog that supports path-based tables will check existence. If the path doesn't exist, then the table doesn't exist. Then the writer should use a …
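A self-contained sketch of the decision flow described above; the plan types and names are hypothetical stand-ins, not Spark's actual logical plans:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical logical plans, only to illustrate the flow.
sealed trait WritePlan
case class CreateTableAsSelect(path: String) extends WritePlan
case class ReplaceTableAsSelect(path: String) extends WritePlan
case class AppendToTable(path: String) extends WritePlan

def planPathWrite(tableExists: String => Boolean, path: String, mode: SaveMode): WritePlan =
  if (!tableExists(path)) {
    // The path does not exist, so the table does not exist: the write becomes a CTAS.
    CreateTableAsSelect(path)
  } else {
    mode match {
      case SaveMode.Append    => AppendToTable(path)
      case SaveMode.Overwrite => ReplaceTableAsSelect(path)
      case other              => throw new IllegalArgumentException(s"Unsupported mode: $other")
    }
  }
```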
What changes were proposed in this pull request?
Spark supports writing to file data sources without first fetching the existing table schema and validating against it. For example, the schema of `newDF` can be different from the original table schema; a hedged sketch of this kind of write is shown below.
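The exact snippet from the original description was not preserved here, so the following is an illustrative reconstruction, assuming a local path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/orc-demo"            // illustrative path
spark.range(10).write.orc(path)       // original data: a single bigint column "id"

// newDF has a completely different schema, yet overwriting the path succeeds,
// because the file source does not validate against the existing schema.
val newDF = spark.range(5).map(i => (i.toDouble, i.toString)).toDF("d", "s")
newDF.write.mode("overwrite").orc(path)
```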
However, from https://github.com/apache/spark/pull/23606/files#r255319992 we can see that this capability is missing in data source V2: currently, data source V2 always validates the output query against the table schema. Even after catalog support for DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.
This PR proposes to handle file sources as a special case in `DataFrameWriter.save()`, so that we can keep the original behavior for this `DataFrame` API. The PR also re-enables the write path of the Orc data source V2.
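A rough, self-contained sketch of the special-casing idea; the types below are hypothetical stand-ins, not the actual internals changed by this patch:

```scala
// Hypothetical stand-ins for the providers resolved inside DataFrameWriter.save().
sealed trait Provider
case object OrcFileSourceV2 extends Provider      // a file source with a V2 implementation
case object OtherTableProviderV2 extends Provider

def chooseWritePath(provider: Provider): String = provider match {
  // File sources are special-cased: keep the original path-based behavior,
  // i.e. write without validating against any existing schema at the destination.
  case OrcFileSourceV2 => "v1-style path write, no schema validation"
  // All other V2 sources keep the V2 write path, which validates the query schema.
  case _               => "v2 write path, schema validated"
}
```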
How was this patch tested?
Unit test