
[SPARK-22245][SQL] partitioned data set should always put partition columns at the end #19471

Closed
wants to merge 1 commit

Conversation

@cloud-fan (Contributor) commented Oct 11, 2017

Background

In Spark SQL, partition columns always appear at the end of the schema, even with a user-specified schema:

scala> Seq(1->1).toDF("i", "j").write.partitionBy("i").parquet("/tmp/t")

scala> spark.read.parquet("/tmp/t").show
+---+---+
|  j|  i|
+---+---+
|  1|  1|
+---+---+

scala> spark.read.schema("i int, j int").parquet("/tmp/t").show
+---+---+
|  j|  i|
+---+---+
|  1|  1|
+---+---+

scala> spark.read.schema("j int, i int").parquet("/tmp/t").show
+---+---+
|  j|  i|
+---+---+
|  1|  1|
+---+---+

This behavior also aligns with tables:

scala> sql("create table t(i int, j int) using parquet partitioned by (i)")
res5: org.apache.spark.sql.DataFrame = []

scala> spark.table("t").printSchema
root
 |-- j: integer (nullable = true)
 |-- i: integer (nullable = true)

However, for historical reasons, Spark SQL supports partition columns appearing in data files: it respects the order of the partition columns in the data schema, but picks their values from the partition directories:

scala> Seq(1->1, 2 -> 1).toDF("i", "j").write.parquet("/tmp/t/i=1")

// You can see that the value of column i is always 1, so the values of partition columns are picked
// from the partition directories.
scala> spark.read.parquet("/tmp/t").show
17/10/11 16:28:28 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `i`;
+---+---+
|  i|  j|
+---+---+
|  1|  1|
|  1|  1|
+---+---+

The behavior in this case is a little weird and causes problems when dealing with tables (with the Hive metastore):

// With user-specified schema, partition columns are always at the end now.
scala> spark.read.schema("i int, j int").parquet("/tmp/t").show
+---+---+
|  j|  i|
+---+---+
|  1|  1|
|  1|  1|
+---+---+

scala> spark.read.schema("j int, i int").parquet("/tmp/t").show
+---+---+
|  j|  i|
+---+---+
|  1|  1|
|  1|  1|
+---+---+

// `skipHiveMetadata=true` simulates a hive-incompatible schema.
scala> sql("create table t using parquet options(skipHiveMetadata=true) location '/tmp/t'")
17/10/11 16:57:00 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `i`;
17/10/11 16:57:00 WARN HiveExternalCatalog: Persisting data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at org.apache.spark.sql.catalyst.catalog.CatalogTable.partitionSchema(interface.scala:242)
  at org.apache.spark.sql.hive.HiveExternalCatalog.newSparkSQLSpecificMetastoreTable$1(HiveExternalCatalog.scala:299)
...

The reason for this bug is that, when we respect the order of partition columns in the data schema, we get an invalid table schema, which breaks the assumption that partition columns should be at the end.
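
The failed assertion relies on the invariant that the partition columns are the trailing fields of the table schema. Here is a minimal sketch of that invariant, using a hypothetical helper rather than the actual Spark implementation:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical helper mirroring the assumption: partition columns must be the
// last fields of the table schema, in the declared partition-column order.
def partitionSchemaOf(schema: StructType, partitionColumnNames: Seq[String]): StructType = {
  val tail = schema.fields.takeRight(partitionColumnNames.length)
  assert(tail.map(_.name).toSeq == partitionColumnNames,
    s"partition columns [${partitionColumnNames.mkString(", ")}] must be at the end of the schema")
  StructType(tail)
}

// Valid layout: partition column `i` is last.
partitionSchemaOf(StructType(Seq(StructField("j", IntegerType), StructField("i", IntegerType))), Seq("i"))

// Layout taken from the data files: `i` comes first, so the assertion fails.
partitionSchemaOf(StructType(Seq(StructField("i", IntegerType), StructField("j", IntegerType))), Seq("i"))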

Proposal

My proposal is: first, we should always put partition columns at the end, to have consistent behavior; second, we should ignore the partition columns in data files when dealing with tables. A sketch of the reordering is shown below.
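
A minimal sketch of the proposed normalization, with hypothetical helper names (not the actual patch): drop any partition columns from the data schema and append the partition schema, so the partition columns always end up last.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical sketch: remove partition columns from the data schema and
// append the partition schema, so partition columns are always at the end.
def normalizeSchema(dataSchema: StructType, partitionSchema: StructType): StructType = {
  val partitionNames = partitionSchema.fieldNames.toSet
  val dataOnly = dataSchema.fields.filterNot(f => partitionNames.contains(f.name))
  StructType(dataOnly ++ partitionSchema.fields)
}

val dataSchema = StructType(Seq(StructField("i", IntegerType), StructField("j", IntegerType)))
val partitionSchema = StructType(Seq(StructField("i", IntegerType)))

normalizeSchema(dataSchema, partitionSchema).fieldNames
// => Array(j, i): the partition column `i` moves to the end.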

One problem is that we don't have the correct data/physical schema in the metastore and may fail to read non-self-describing file formats like CSV. I think this is really a corner case (having overlapping columns in the data and partition schemas), and the table schema can't have overlapping columns in the data and partition schemas (unless we hack it into table properties), so we don't have a better choice.

Another problem is that, for tables created before Spark 2.2, we may already have an invalid table schema in the metastore. We should handle this case and adjust the table schema before reading the table.

Changed behavior

No behavior change if there are no overlapping columns in the data and partition schemas.

The schema changes (partition columns go to the end) when reading a file-format data source with partition columns in the data files.

@maropu (Member) commented Oct 11, 2017

Does this change affect other tests for the overlapping cases, like DataStreamReaderWriterSuite and OrcPartitionDiscoverySuite? Since we already have a number of these tests in multiple places (I know you've already considered this aspect though...), I'm a little worried that this change in a minor release might confuse users.

@SparkQA commented Oct 11, 2017

Test build #82633 has finished for PR 19471 at commit ac7ae6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

waiting for more feedback before moving forward :)

Another thing I wanna point out: sql("create table t using parquet options(skipHiveMetadata=true) location '/tmp/t'") works in Spark 2.0, and the created table has a schema with the partition column at the beginning. In Spark 2.1, it also works, and DESC TABLE also shows the table schema with the partition column at the beginning. However, if you query the table, the output schema has the partition column at the end.

It's been a long time since Spark 2.1 was released and no one has reported this behavior change. It seems this is really a corner case, which makes me feel we should not complicate our code too much for it.

@maropu (Member) commented Oct 11, 2017

Fair enough to me. To check whether this change is reasonable, we might send an email to the dev/user list to gather feedback. I saw marmbrus doing so when adding the json API:
#15274 (comment)
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-JSON-Column-Support-td19132.html
If we get no responses, or only positive feedback, we could quickly and safely drop the support.

@dongjoon-hyun (Member)

+1 for this change. BTW, wow, there are lots of test case failures: 81 failures.

@SparkQA commented Oct 12, 2017

Test build #82671 has finished for PR 19471 at commit dea7037.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Oct 15, 2017

We may need to document this change in the Migration Guide in the SQL programming guide.

@gatorsmile (Member)

No behavior change if there are no overlapping columns in the data and partition schemas.

The schema changes (partition columns go to the end) when reading a file-format data source with partition columns in the data files.

@cloud-fan Could you check why so many test cases failed?

@cloud-fan (Contributor, Author)

closing in favor of #19579

@cloud-fan closed this Oct 26, 2017
@SparkQA commented Oct 26, 2017

Test build #83066 has finished for PR 19471 at commit d21ebaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
