[SPARK-22245][SQL] partitioned data set should always put partition columns at the end #19471
Conversation
Does this change affect some other tests for the overlapped cases, like DataStreamReaderWriterSuite and …?
Test build #82633 has finished for PR 19471 at commit …
Waiting for more feedback before moving forward :) Another thing I want to point out: it's been a long time since Spark 2.1 was released and no one has reported this behavior change. It seems this is really a corner case, which makes me feel we should not complicate our code too much for it.
Fair enough to me. To check whether this change is reasonable, we might send an email to the dev/user list to solicit feedback. I saw marmbrus doing so when adding the JSON API.
+1 for this change. BTW, wow, there are lots of test case failures: 81 failures.
Test build #82671 has finished for PR 19471 at commit …
We may need to document this change in …
@cloud-fan Could you check why so many test cases failed?
Closing in favor of #19579.
Test build #83066 has finished for PR 19471 at commit …
Background
In Spark SQL, partition columns always appear at the end of the schema, even with a user-specified schema, as the sketch below illustrates:
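A minimal sketch of this behavior, assuming a spark-shell session (the path, data, and column names are hypothetical):

```scala
import org.apache.spark.sql.types._

// Write a data set partitioned by column "a" (hypothetical path).
spark.range(10).selectExpr("id AS a", "id AS b")
  .write.partitionBy("a").parquet("/tmp/tbl")

// Even though the user-specified schema lists "a" first, the partition
// column ends up at the end of the resulting schema.
val userSchema = new StructType().add("a", LongType).add("b", LongType)
spark.read.schema(userSchema).parquet("/tmp/tbl").printSchema()
// root
//  |-- b: long (nullable = true)
//  |-- a: long (nullable = true)
```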
This behavior also aligns with tables:
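For instance (a sketch; the table name is hypothetical):

```scala
// Declaring the partition column first in the column list still yields a
// table schema with the partition column at the end.
spark.sql("CREATE TABLE t (a INT, b INT) USING parquet PARTITIONED BY (a)")
spark.table("t").printSchema()
// root
//  |-- b: integer (nullable = true)
//  |-- a: integer (nullable = true)
```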
However, for historical reasons, Spark SQL supports partition columns appearing in data files; it respects the order of the partition columns in the data schema but picks their values from the partition directories:
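A sketch of this overlapped case, using a hypothetical CSV layout where the partition column also appears inside the data file:

```scala
// Hypothetical layout:
//   /tmp/overlap/b=1/part-0.csv containing:
//     b,a
//     9,0
val df = spark.read.option("header", "true").csv("/tmp/overlap")
df.printSchema()   // schema order (b, a) follows the data files
df.show()
// +---+---+
// |  b|  a|
// +---+---+
// |  1|  0|   <- b's value comes from the directory, not the file
// +---+---+
```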
The behavior in this case is a little weird and causes problems when dealing with tables (with a Hive metastore):
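One way this can surface (a sketch under the same hypothetical layout, not an exact reproduction): creating a table over such data stores a schema whose partition column is not at the end.

```scala
// Schema inference keeps b first while b is also discovered as a partition
// column, so the catalog ends up with a table schema whose partition column
// is not at the end.
spark.sql("""
  CREATE TABLE over_tbl USING csv
  OPTIONS (header 'true')
  LOCATION '/tmp/overlap'
""")
spark.table("over_tbl").printSchema()
```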
The reason for this bug is that, when we respect the order of partition columns in the data schema, we get an invalid table schema, which breaks the assumption that partition columns are always at the end.
Proposal
My proposal: first, we should always put partition columns at the end, to have consistent behavior; second, we should ignore the partition columns in data files when dealing with tables.
One problem is that we don't have the corrected data/physical schema in the metastore, and may fail to read non-self-describing file formats like CSV. I think this is really a corner case (having overlapped columns in the data and partition schemas), and the table schema can't have overlapped data and partition columns anyway (unless we hack it into table properties), so we don't have a better choice.
Another problem is that, for tables created before Spark 2.2, we may already have an invalid table schema in the metastore. We should handle this case and adjust the table schema before reading the table.
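A minimal sketch of such an adjustment (not the PR's actual code; the helper name is hypothetical):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: move the partition columns of a stored table schema
// to the end, leaving the data columns in their original order.
def putPartitionColumnsAtEnd(schema: StructType, partCols: Seq[String]): StructType = {
  val partSet = partCols.map(_.toLowerCase).toSet
  val (part, data) = schema.fields.partition(f => partSet.contains(f.name.toLowerCase))
  StructType(data ++ part)
}
```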
Changed behavior
No behavior change if there are no overlapping columns in the data and partition schemas.
The schema changes (partition columns move to the end) when reading a file-format data source with partition columns in its data files, as sketched below.
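Continuing the hypothetical overlapped-CSV layout from the Background section, a sketch of the new result:

```scala
// After this change, the partition column appears at the end of the schema,
// regardless of its position in the data files.
val df = spark.read.option("header", "true").csv("/tmp/overlap")
df.printSchema()   // b (the partition column) is now the last field
```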