[SPARK-16628][SQL] Translate file-based relation schema when file schema is inconsistent with catalog schema #14365


Closed · wants to merge 8 commits

Conversation

@viirya (Member) commented Jul 26, 2016

What changes were proposed in this pull request?

When spark.sql.hive.convertMetastoreOrc is enabled, we convert metastore ORC tables (represented by MetastoreRelation) to data source tables (represented by HadoopFsRelation) for better performance.

However, due to a Hive issue, an ORC table created by Hive does not store column names correctly in the ORC files. For these ORC tables, the converted relation has the wrong schema.

To fix this, we assume the metastore schema (dataSchema) can be matched against physicalSchema column by column, disregarding column names. If it cannot, we throw an exception suggesting that users disable the conversion of Hive ORC tables.
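Conceptually, the positional check is something like the following minimal sketch, written against Spark's public StructType (the object and method names here are illustrative, not the actual patch code):

import org.apache.spark.sql.types.StructType

object SchemaCompatibility {
  // The catalog schema can be mapped onto the file schema column by column,
  // ignoring column names, only if the lengths and the per-position data
  // types agree.
  def canMapPositionally(dataSchema: StructType, physicalSchema: StructType): Boolean = {
    dataSchema.length == physicalSchema.length &&
      dataSchema.fields.zip(physicalSchema.fields).forall {
        case (catalogField, fileField) => catalogField.dataType == fileField.dataType
      }
  }
}

When this check fails, the only safe behavior is to surface an error rather than silently read the wrong columns.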

How was this patch tested?

Jenkins tests.

@SparkQA commented Jul 26, 2016

Test build #62875 has finished for PR 14365 at commit f9fab59.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2016

Test build #62880 has finished for PR 14365 at commit cc953a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jul 28, 2016

cc @cloud-fan @yhuai @liancheng @rxin Please review this change. Thanks.

/**
 * An interface for mapping two different schemas. For relations that are backed by files,
 * the schema inferred from the files might differ from the schema stored in the catalog. In
 * such cases, this interface helps map the inconsistent schemas.
(Contributor) Can you put the detailed description of this mapping in the doc here? Thanks.

(Member, Author) I've added more detailed documentation. Please take a look. Thanks.
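For context, a minimal sketch of what such a mapping interface could look like (the trait and method names here are illustrative assumptions, not the patch's actual API):

import org.apache.spark.sql.types.StructType

// Maps a possibly inconsistent catalog schema onto the schema actually
// stored in the files, so the reader can resolve the requested columns.
trait SchemaMapping {
  def mapSchema(
      catalogSchema: StructType,   // schema recorded in the metastore
      fileSchema: StructType,      // schema inferred from the files
      requestedSchema: StructType  // columns the query asks for
    ): StructType
}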

@SparkQA commented Jul 29, 2016

Test build #62993 has finished for PR 14365 at commit 0c17748.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Aug 2, 2016

cc @yhuai @liancheng @rxin

@cloud-fan (Contributor)
@rxin @yhuai, is it worth adding this complex logic to work around this Hive bug? This feature (converting metastore ORC tables to data source tables) is not a major feature, and the broken case seems like a corner case. I think it's OK to disable this feature when we hit this Hive bug. What do you think?

@viirya (Member, Author) commented Aug 2, 2016

@cloud-fan I previously submitted PR #14282, which disables the conversion if the schema is inconsistent.

…sistent-schema

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@SparkQA commented Aug 4, 2016

Test build #63214 has finished for PR 14365 at commit 5cfdff1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Aug 12, 2016

@cloud-fan So do we have a decision on this? The simpler approach that disables the conversion if the schema is inconsistent, or the complex one that works around this Hive bug?

@viirya (Member, Author) commented Oct 6, 2016

Ping @cloud-fan. Can you make a decision on this? Are we going to add complex logic for this issue, or just disable it?

@cloud-fan (Contributor)
I noticed that spark.sql.hive.convertMetastoreOrc is false by default, so this bug is not so critical.

And we may stop inferring the file schema when reading ORC tables for performance reasons (#16980); then we won't have the chance to detect the mismatch and disable this feature.

We may remove this feature entirely, or think harder to come up with a better fix. cc @yhuai

@yhuai (Contributor) commented Oct 10, 2016

@cloud-fan Actually, this conversion was disabled because of this bug.

BTW, the PR that @cloud-fan mentioned is #14690.

I think it is better to hold this change until we finish #14690.

@dongjoon-hyun (Member) commented Dec 2, 2016

Hi, @viirya, @cloud-fan, @yhuai.
#14690 and the parent issue SPARK-17861 seem to be resolved.
How about restarting this issue?

@viirya (Member, Author) commented Dec 7, 2016

Since #14690 completely replaces schema inference with the metastore schema for converted Hive tables, we may not have the chance to detect a mismatch between an ORC file's physical schema and the metastore schema.

@dongjoon-hyun (Member)
There seem to be cases like the following (on the current master).

In Hive:

CREATE TABLE t1 (a string) PARTITIONED BY (b string) STORED AS ORC;
INSERT INTO TABLE t1 PARTITION (b='01') VALUES ('1');

In spark-shell:

scala> sql("select * from t1").show
+---+---+
|  a|  b|
+---+---+
|  1| 01|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreOrc=true").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...| true|
+--------------------+-----+

scala> sql("select * from t1").show
16/12/07 05:17:19 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.IllegalArgumentException: Field "a" does not exist.

@viirya (Member, Author) commented Dec 8, 2016

@dongjoon-hyun Yeah, I see. Because we directly use the metastore schema for a converted ORC table, this issue happens when the physical schema in the ORC file and the metastore schema don't match.

@viirya (Member, Author) commented Dec 8, 2016

We have two options. The first is to map the metastore schema to the physical ORC schema, as this PR does. But we don't infer the physical schema of ORC files now, so I will update this PR to do the mapping in OrcFileFormat.

The other is like #14282. But since we don't infer the schema from ORC files now, we can't disable the conversion when the mismatch is detected. One possibility is to throw an exception in OrcFileFormat when the mismatch is detected before reading, with a message asking users to disable spark.sql.hive.convertMetastoreOrc.

@cloud-fan @yhuai @dongjoon-hyun What do you think?

@SparkQA commented Dec 8, 2016

Test build #69852 has started for PR 14365 at commit 2bb8368.

@viirya (Member, Author) commented Dec 8, 2016

@cloud-fan @yhuai @dongjoon-hyun I've updated this as follows:

  • Assume the metastore schema matches the physical ORC schema column by column, disregarding column names.

  • Map the required schema to columns in the physical ORC schema (sketched below).

  • If the lengths or data types of the metastore schema and the physical schema do not match, throw an exception suggesting that users disable spark.sql.hive.convertMetastoreOrc.

Please let me know what you think about this approach. Thanks.
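As a rough illustration of those steps, a hypothetical helper (not the patch's actual code) could resolve each requested catalog column to the file column at the same ordinal, failing with a hint about the config when the schemas cannot line up:

import org.apache.spark.sql.types.StructType

def mapRequestedToPhysical(
    dataSchema: StructType,      // metastore (catalog) schema
    physicalSchema: StructType,  // schema stored in the ORC files
    requestedSchema: StructType  // columns the query actually reads
  ): StructType = {
  val typesAgree = dataSchema.length == physicalSchema.length &&
    dataSchema.zip(physicalSchema).forall { case (c, f) => c.dataType == f.dataType }
  if (!typesAgree) {
    throw new UnsupportedOperationException(
      "Metastore schema does not match the ORC file schema; " +
        "please disable spark.sql.hive.convertMetastoreOrc.")
  }
  // Pick the file column at the same ordinal position, ignoring its name.
  StructType(requestedSchema.map(f => physicalSchema(dataSchema.fieldIndex(f.name))))
}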

@viirya (Member, Author) commented Dec 8, 2016

retest this please.

@SparkQA commented Dec 8, 2016

Test build #69856 has finished for PR 14365 at commit 2bb8368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -303,6 +330,10 @@ private[orc] object OrcRelation extends HiveInspectors {
maybeStructOI.map(unwrap).getOrElse(Iterator.empty)
}

// True only when none of the requested columns can be found by name in the
// physical file schema, i.e. the schemas are fully mismatched.
def isMismatchSchema(physicalSchema: StructType, requestedSchema: StructType): Boolean = {
  requestedSchema.forall(a => physicalSchema.getFieldIndex(a.name).isEmpty)
}
(Member) Maybe the following?

!requestedSchema.forall(a => physicalSchema.getFieldIndex(a.name).isDefined)

(Member) Oh, never mind. It's really checking whether none of the requested columns match.

@dongjoon-hyun (Member)
Looks good to me. Thank you for updating, @viirya!

@viirya (Member, Author) commented Dec 19, 2016

Ping @cloud-fan @yhuai. Could you take a look? Thanks.

@SparkQA commented Dec 19, 2016

Test build #70336 has finished for PR 14365 at commit 141cb1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya closed this Feb 23, 2017
@viirya viirya deleted the fix-file-ds-inconsistent-schema branch December 27, 2023 18:20