[SPARK-26915][SQL] File source should write without schema validation in DataFrameWriter.save() #23829


Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

Spark supports writing to file data sources without fetching or validating against the existing table schema.
For example,

spark.range(10).write.orc(path)
val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
newDF.write.mode("overwrite").orc(path)
  1. There is no need to get/infer the schema from the table/path
  2. The schema of newDF can be different from the original table schema.

However, from https://github.com/apache/spark/pull/23606/files#r255319992 we can see that the feature above is missing in data source V2. Currently, data source V2 always validates the output query with the table schema. Even after the catalog support of DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.

This PR proposes to handle file sources as a special case in DataFrameWriter.save(), so that we can keep the original behavior for this DataFrame API.
The PR also re-enables the write path of the ORC data source V2.
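
A minimal, self-contained sketch of the proposed special case (toy types only; this is not the actual DataFrameWriter implementation):

// Toy sketch: file sources skip write-side schema validation in save(),
// while other V2 sources keep validating the output query against the table schema.
object SaveSpecialCaseSketch {
  sealed trait Provider
  case object FileSourceV2 extends Provider                        // e.g. the ORC/Parquet file sources
  case class OtherV2Source(schemaMatches: Boolean) extends Provider

  def save(provider: Provider): Unit = provider match {
    case FileSourceV2 =>
      // Do not get/infer or validate the existing schema; just write.
      println("write without schema validation (original file source behavior)")
    case OtherV2Source(ok) =>
      // Validate the output query schema against the table schema first.
      require(ok, "query schema does not match the table schema")
      println("write after schema validation")
  }
}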

How was this patch tested?

Unit test

…tion tests for FileDataSourceV2 (partially revert )"

This reverts commit a0e81fc.
@gengliangwang
Member Author

@cloud-fan @rdblue

@cloud-fan
Contributor

I think this is a case where the file source has unexpected SaveMode behavior: for the append and overwrite modes, Spark expects the table to already exist, but the file source doesn't follow that.

@rdblue are you OK with special-casing the file source? I'm a little worried about changing the behavior of the file source in Spark 3.0.

@rdblue
Contributor

rdblue commented Feb 18, 2019

@cloud-fan, one of the goals of v2 is to avoid special cases for internal sources. I think we should continue to avoid them.

I'm happy to discuss a proposal for this, but I think we need a real proposal that has been thought through. Quick PRs to just make it work right away are the reason we have unknown and unpredictable behavior in v1, and we need to be deliberate about what is introduced in v2.

I think it is fine to have cases where validation is turned off, but those need to be well-defined. This is another reason to introduce a v2 write API where the behavior is obvious to the caller.

@rdblue
Contributor

rdblue commented Feb 18, 2019

I'm a little worried about changing the behavior of file source in Spark 3.0.

Sorry I missed replying to this comment earlier. The reason we agreed to develop v2 in parallel is to avoid behavior changes. The behavior in v1 is too unpredictable to be confident that we have not changed anything. Plus, we know that some behaviors must necessarily change in v2 (standardizing on the file source behavior where it is consistent).

@SparkQA

SparkQA commented Feb 18, 2019

Test build #102479 has finished for PR 23829 at commit 8316c07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Feb 19, 2019

@rdblue there are two problems here:

  1. file source should not have schema validation during write
  2. file source can't report its schema during write if the output path doesn't exist

For 1, I think we can introduce a new trait (or capability API) to indicate that a data source doesn't need schema validation during write.
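
A minimal sketch of what such a trait could look like (hypothetical names, not an existing Spark API):

// Hypothetical marker trait: a V2 source mixes this in to tell the write
// planner to skip matching the query schema against the table schema.
trait AcceptsAnyWriteSchema {
  def acceptsAnyWriteSchema: Boolean = true
}

// Example: a file-based table that opts out of write-side validation.
class OrcLikeTable extends AcceptsAnyWriteSchema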

For 2, I think we need the CTAS(and RTAS) operator.

One thing to note: the DataFrameWriter API can mix data and metadata operations, e.g. df.write.mode("append") can append data to a non-existing table, with CTAS semantics. I can't find a corresponding SQL operator; maybe we need to create a new one like CREATE OR INSERT TABLE ... AS SELECT.
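
For example, with today's V1 file source (the path here is just illustrative):

spark.range(5).write.mode("append").parquet("/tmp/new_table") // path missing: behaves like CTAS
spark.range(5).write.mode("append").parquet("/tmp/new_table") // path exists: behaves like INSERT INTO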

How would the ongoing catalog API proposal solve this issue?

@gengliangwang
Member Author

Even after the catalog support of DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.

That's my major concern. The behavior of DataFrameWriter.save is so different from table insertion in traditional databases.

@HyukjinKwon
Member

I think we should add an interface for backward compatibility, of course, and I thought that's what we had already agreed upon.

@gengliangwang
Member Author

@HyukjinKwon I created #23824 to add a new interface, but it got -1 from @cloud-fan and @rdblue. I agree with their comments.

@gengliangwang
Member Author

gengliangwang commented Feb 19, 2019

On further thought, I don't think DataFrameWriter.save has to be involved with the table concept. Otherwise, things get complex and there can be two different behaviors in the single API DataFrameWriter.save.

That is to say, let's remove the AppendData and OverwriteByExpression plans from DataFrameWriter.save, since their behaviors are different from the API's.

The target of save just doesn't have to be a table. We can use AppendData and OverwriteByExpression in DataFrameWriter.saveAsTable, which is more appropriate.

@gengliangwang
Member Author

Created #23836 and closing this one.

@rdblue
Contributor

rdblue commented Feb 19, 2019

@cloud-fan, here are my answers:

  1. file source should not have schema validation during write

Validation should be configured by the source, just like we talked about for sources that can accept data with missing columns.

I think the larger issue is finding out what the correct behavior is. What tables should opt out of validation? What tables should just use different rules, like allowing new columns?

  2. file source can't report its schema during write if the output path doesn't exist

In this case, the table catalog that supports path-based tables will check existence. If the path doesn't exist, then the table doesn't exist, and the writer should use a CreateTableAsSelect plan instead of an overwrite plan. CTAS doesn't validate against an existing schema; it creates the table using the given schema.
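
A self-contained sketch of that rule (toy types, not the actual catalog API):

// Toy sketch: a path-based catalog checks existence via the path itself.
// A missing path means the table doesn't exist, so plan a CTAS (which uses the
// query's schema, no validation) instead of an overwrite (which validates).
object PathWritePlanningSketch {
  sealed trait Plan
  case class CreateTableAsSelect(path: String) extends Plan     // creates the table with the query schema
  case class OverwriteByExpression(path: String) extends Plan   // validates against the existing schema

  def tableExists(path: String): Boolean = new java.io.File(path).exists()

  def planOverwrite(path: String): Plan =
    if (!tableExists(path)) CreateTableAsSelect(path) else OverwriteByExpression(path)
}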
