[SPARK-25391][SQL] Make behaviors consistent when converting parquet hive table to parquet data source #22343


Closed
seancxmao wants to merge 1 commit

Conversation

@seancxmao (Contributor) commented Sep 5, 2018

What changes were proposed in this pull request?

Parquet data source tables and Hive parquet tables have different behaviors for Parquet field resolution, so when spark.sql.hive.convertMetastoreParquet is true, users might see inconsistent results depending on whether the conversion happens.

This PR aims to make the behaviors consistent when converting a Hive parquet table to a Parquet data source table:

  • The behavior must be consistent for the conversion to be safe, so we skip the conversion in case-sensitive mode, because Hive parquet tables always do case-insensitive field resolution.
  • In case-insensitive mode, when converting a Hive parquet table to a Parquet data source, we switch the duplicated-fields resolution mode so that the Parquet data source picks the first matched field, the same behavior as a Hive parquet table, to keep behaviors consistent (see the sketch below).
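
To illustrate the inconsistency being fixed, here is a minimal spark-shell sketch (the path and table name are made up, and the expected outcomes in the comments simply restate the description above):

scala> // Write a Parquet file whose only physical field is named "A"
scala> Seq("x").toDF("A").write.parquet("/tmp/mixed_case_t")
scala> sql("CREATE TABLE mixed_case_t(a STRING) STORED AS PARQUET LOCATION '/tmp/mixed_case_t'")
scala> sql("SET spark.sql.caseSensitive=true")
scala> // With spark.sql.hive.convertMetastoreParquet=true, the converted Parquet reader
scala> // resolves column "a" case-sensitively, finds no match, and returns null; with the
scala> // conversion disabled, Hive resolves "a" case-insensitively and returns "x".
scala> spark.table("mixed_case_t").show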

How was this patch tested?

Unit tests added.

@seancxmao (Contributor, Author)

@HyukjinKwon @cloud-fan @gatorsmile Could you please help review this when you have time?

@cloud-fan (Contributor)

ok to test

caseSensitive = true)
conf = {
  val conf = new Configuration()
  conf.setBoolean(SQLConf.CASE_SENSITIVE.key, true)
Contributor

isn't it the default value?

Contributor Author

@cloud-fan There is no default value for spark.sql.caseSensitive in Configuration. Let me explain in more detail below.

This is one of the overloaded methods of testSchemaClipping. I tried to give this testSchemaClipping method a default conf; however, Scalac complains that

in class ParquetSchemaSuite, multiple overloaded alternatives of testSchemaClipping define default arguments

private def testSchemaClipping(
    testName: String,
    parquetSchema: String,
    catalystSchema: StructType,
    expectedSchema: String,
    conf: Configuration = {

private def testSchemaClipping(
    testName: String,
    parquetSchema: String,
    catalystSchema: StructType,
    expectedSchema: MessageType,
    conf: Configuration): Unit = {

It seems a little confusing, because these two methods have different parameter types. After a brief investigation, I found that the Scala compiler simply disallows overloaded methods with default arguments, even when those methods have different parameter types.

https://stackoverflow.com/questions/4652095/why-does-the-scala-compiler-disallow-overloaded-methods-with-default-arguments
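
For reference, a standalone snippet (with illustrative names) that reproduces the restriction:

object OverloadDefaults {
  // Scalac rejects this pair with "multiple overloaded alternatives of method f
  // define default arguments", even though the defaulted parameters differ in type.
  def f(name: String, conf: Int = 0): Unit = println(s"$name: $conf")
  def f(name: String, conf: String = "default"): Unit = println(s"$name: $conf")
}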

@cloud-fan (Contributor)

@dongjoon-hyun does the ORC conversion need the same fix?

@dongjoon-hyun (Member)

Thank you for pinging me. I'll take a look tomorrow, @cloud-fan .

BTW, @seancxmao, can we handle this convertMetastoreXXX case in a new JIRA issue? The title "The behavior must be consistent to do the conversion" doesn't look good to me because it's not complete as a single patch title.

@HyukjinKwon (Member)

@seancxmao, mind fixing the PR title BTW? As it stands, it's unclear which behaviour you mean in the PR title.

 * behavior as hive parquet table - to keep behaviors consistent.
 */
val duplicatedFieldsResolutionMode: String = {
  parameters.getOrElse(DUPLICATED_FIELDS_RESOLUTION_MODE,
Member

I don't think we should leave this for Parquet options for now. Can we just have a SQL config to control this?

Contributor

whether we have a SQL config for it or not, we must define an option here. The conversion happens per-query, so we must have a per-query option to switch the behavior, instead of a per-session SQL config.
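
A rough sketch of that distinction, assuming a SparkSession named spark (the option key and value strings below are hypothetical; only the DUPLICATED_FIELDS_RESOLUTION_MODE constant appears in this diff):

// Per-query: the conversion rule can attach the option to the single relation it
// rewrites, leaving every other query in the session untouched.
val df = spark.read
  .option("duplicatedFieldsResolutionMode", "PICK_FIRST")  // hypothetical key/value
  .parquet("/path/to/table")

// Per-session: a SQL config would flip the behavior for all subsequent Parquet reads.
spark.conf.set("spark.sql.parquet.duplicatedFieldsResolutionMode", "PICK_FIRST")  // hypothetical key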

Member

The conversion itself happens per query, but my impression is that different values don't usually occur per query. I mean, I was wondering whether users would want to set this query by query.

Contributor

I agree this is a little unusual. Usually we have a SQL config first, then we create an option for it if necessary. In this case, we are not adding a config/option from a user requirement; we need it for an internal optimization.

If we can, I would suggest we make it an internal option. But anyway, we shouldn't rush to add a SQL config until we get requirements from users.

@seancxmao changed the title from "[SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the conversion" to "[SPARK-25391][SQL] Make behaviors consistent when converting parquet hive table to parquet data source" on Sep 10, 2018
@seancxmao (Contributor, Author)

@dongjoon-hyun @HyukjinKwon I created a new JIRA ticket and tried to use a more complete and clearer title for this PR. What do you think?

@SparkQA

SparkQA commented Sep 10, 2018

Test build #95857 has finished for PR 22343 at commit 95673cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA

SparkQA commented Sep 10, 2018

Test build #95864 has finished for PR 22343 at commit 95673cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Hi, @seancxmao. Should we be consistent? IIRC, all the previous PRs raise exceptions to prevent any potential issues. In this case, I have a feeling that convertMetastoreXXX should be used to prevent the problem of the Hive behavior by raising an exception, not to hide the problem of the Hive behavior.

In case-insensitive mode, when converting a Hive parquet table to a Parquet data source, we switch the duplicated-fields resolution mode so that the Parquet data source picks the first matched field, the same behavior as a Hive parquet table, to keep behaviors consistent.

@seancxmao (Contributor, Author)

Hi, @dongjoon-hyun,
When we find duplicated field names in the convertMetastoreXXX case, we have two options:
(1) Raise an exception, as the Parquet data source does. Most end users do not know the difference between a Hive parquet table and a Parquet data source; if the conversion leads to different behaviors, they may be confused, and in some cases this can even silently lead to tricky data issues.
(2) Adjust the behavior of the Parquet data source to keep behaviors consistent. This seems friendlier to end users and avoids any potential issues introduced by the conversion.

BTW, for a Parquet data source that is not converted from a Hive parquet table, we still raise an exception when there is ambiguity, since this is more intuitive and reasonable.

@dongjoon-hyun (Member)

What I asked was the following, wasn't it?

In case-insensitive mode, when converting a Hive parquet table to a Parquet data source, we switch the duplicated-fields resolution mode so that the Parquet data source picks the first matched field, the same behavior as a Hive parquet table, to keep behaviors consistent.

Spark should not pick up the first matched field in any case, because this was considered a correctness issue in the previous PR, which was backported to branch-2.3 (#22183). I don't think we need to follow the incorrect Hive behavior.
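
For context, a minimal sketch of the exception-raising behavior referenced above (the path is made up, and the exact error message is intentionally not reproduced):

scala> sql("SET spark.sql.caseSensitive=true")
scala> Seq((1, 2)).toDF("A", "a").write.parquet("/tmp/dup_fields_t")
scala> sql("SET spark.sql.caseSensitive=false")
scala> // The requested field "a" now matches both physical fields "A" and "a", so the
scala> // Parquet data source raises an exception instead of silently picking one.
scala> spark.read.schema("a INT").parquet("/tmp/dup_fields_t").show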

@seancxmao (Contributor, Author)

seancxmao commented Sep 11, 2018

@dongjoon-hyun It is a little complicated. There has been a discussion about this in #22184. Below are some key comments from @cloud-fan and @gatorsmile, just FYI.

BTW, it was finally decided that #22148 should not be backported to branch-2.3.

@cloud-fan (Contributor)

To clarify: this is just a workaround for when we hit a problematic Hive parquet table (one having case-insensitively duplicated field names in the parquet file) and we want to read it with the native Parquet reader. The Hive behavior is weird, but we need to follow it as we are reading a Hive table.

Personally I think it's not a big deal. If the Hive table is malformed, I think we don't have to follow Hive's buggy behavior. If people are confused by this patch and think it isn't worth it, I'm OK to just leave it.

@dongjoon-hyun (Member)

Thank you for the pointer, @seancxmao . And thank you for clarification, @cloud-fan .

It looks like we are somewhat re-creating the correctness issue in this PR when caseSensitive=true.

BEFORE THIS PR (master)

scala> sql("INSERT OVERWRITE DIRECTORY '/tmp/hive_t' STORED AS PARQUET SELECT 'A', 'a'")
scala> sql("CREATE TABLE hive_t(a STRING) STORED AS PARQUET LOCATION '/tmp/hive_t'")
scala> sql("CREATE TABLE spark_t(a STRING) USING PARQUET LOCATION '/tmp/hive_t'")
scala> sql("set spark.sql.caseSensitive=true")
scala> spark.table("hive_t").show
+---+
|  a|
+---+
|  a|
+---+

scala> spark.table("spark_t").show
+---+
|  a|
+---+
|  a|
+---+

AFTER THIS PR

scala> sql("set spark.sql.caseSensitive=true")
scala> spark.table("hive_t").show
+---+
|  a|
+---+
|  A|
+---+

scala> spark.table("spark_t").show
+---+
|  a|
+---+
|  a|
+---+

@seancxmao (Contributor, Author)

Could we treat this as a behavior change? We could add a legacy conf (e.g. spark.sql.hive.legacy.convertMetastoreParquet, maybe defined in HiveUtils) to let users revert to the previous behavior for backward compatibility. If this legacy conf is set to true, behaviors would be reverted in both case-sensitive and case-insensitive modes.

| caseSensitive | legacy behavior                    | new behavior                               |
|---------------|------------------------------------|--------------------------------------------|
| true          | convert anyway                     | skip conversion, log warning message       |
| false         | convert, fail if there's ambiguity | convert, first match if there's ambiguity  |
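
If that direction were taken, the legacy conf could be declared next to the existing entries in HiveUtils, roughly as sketched below (a proposal only: the key name comes from this comment, everything else is illustrative):

import org.apache.spark.sql.internal.SQLConf.buildConf

// Hypothetical legacy switch; false keeps the new behavior proposed in this PR.
val HIVE_LEGACY_CONVERT_METASTORE_PARQUET =
  buildConf("spark.sql.hive.legacy.convertMetastoreParquet")
    .doc("When true, revert to the old conversion behavior: convert even in " +
      "case-sensitive mode, and fail on ambiguous fields in case-insensitive mode.")
    .booleanConf
    .createWithDefault(false)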

@dongjoon-hyun (Member)

@seancxmao, for Hive compatibility, spark.sql.hive.convertMetastoreParquet=false looks sufficient to me.

@seancxmao (Contributor, Author)

Setting spark.sql.hive.convertMetastoreParquet=false keeps Hive compatibility but loses the performance benefit. We can do better by enabling the conversion while still keeping Hive compatibility. Though this makes our implementation more complex, I guess most end users keep spark.sql.hive.convertMetastoreParquet=true and spark.sql.caseSensitive=false, which are the default values, so this brings benefits to end users.

@dongjoon-hyun (Member)

Compatibility is not a golden rule if it sacrifices correctness. A fast but wrong result doesn't look like a benefit to me. Do you think customers want to get a wrong result, as in Hive?

@seancxmao (Contributor, Author)

seancxmao commented Sep 12, 2018

I agree that correctness is more important. If we should not make behaviors consistent when doing the conversion, I will close this PR. @cloud-fan @gatorsmile, what do you think?

@dongjoon-hyun (Member)

Thank you for understanding, @seancxmao .

Also, I want to add a note to this PR. The following is a well-known example of Hive incompatibility since Apache Spark 1.6.3; we get a correct result only when spark.sql.hive.convertMetastoreParquet=false. Users should know what they are using.

scala> sql("CREATE TABLE t1(a CHAR(3))")
scala> sql("CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET")

scala> sql("INSERT INTO TABLE t1 SELECT 'a '")
scala> sql("INSERT INTO TABLE t3 SELECT 'a '")

scala> sql("SELECT a, length(a) FROM t1").show
+---+---------+
|  a|length(a)|
+---+---------+
|a  |        3|
+---+---------+

scala> sql("SELECT a, length(a) FROM t3").show
+---+---------+
|  a|length(a)|
+---+---------+
| a |        2|
+---+---------+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...|false|
+--------------------+-----+


scala> sql("SELECT a, length(a) FROM t3").show
+---+---------+
|  a|length(a)|
+---+---------+
|a  |        3|
+---+---------+

@dongjoon-hyun (Member)

Could you close this PR and JIRA, @seancxmao ?

@seancxmao (Contributor, Author)

Sure, closing this PR. Thank you all for your time and insights.

@seancxmao closed this Sep 16, 2018
@dongjoon-hyun (Member)

Thank YOU for your PR and the open discussion on this, @seancxmao. See you in other PRs.
