[SPARK-16628][SQL] Translate file-based relation schema when file schema is inconsistent with catalog schema #14365
Conversation
… for file-based data source relation.
Test build #62875 has finished for PR 14365 at commit
Test build #62880 has finished for PR 14365 at commit
cc @cloud-fan @yhuai @liancheng @rxin Please review this change. Thanks.
/**
 * An interface for mapping two different schemas. For relations that are backed by files,
 * the schema inferred from the files might be different from the schema stored in the catalog. In
 * such cases, the interface helps map the inconsistent schemas.
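For readers of this thread, here is a rough sketch of the kind of mapping such an interface is meant to express; the trait name and methods below are hypothetical and are not the code added by this PR:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of a schema-mapping abstraction: relate the schema stored
// in the catalog to the schema physically found in the data files.
trait SchemaMapping {
  // Schema recorded in the catalog (metastore).
  def catalogSchema: StructType

  // Schema inferred from the underlying files.
  def fileSchema: StructType

  // The file-side column name that corresponds to a catalog column,
  // or None if the column cannot be mapped.
  def mapColumn(catalogColumnName: String): Option[String]
}
```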
Can you put the detailed description of this mapping in the doc here? thanks.
I've added more detailed documentation. Please take a look. Thanks.
Test build #62993 has finished for PR 14365 at commit
@cloud-fan I previously submitted PR #14282, which disables the conversion if the schema is inconsistent.
…sistent-schema Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
Test build #63214 has finished for PR 14365 at commit
@cloud-fan So do we have a decision on this? The simpler approach of disabling the conversion if the schema is inconsistent, or the more complex one of working around this Hive bug?
ping @cloud-fan Can you make a decision about this? Are we going to have complex logic for this issue, or just disable it?
I noticed that … And we may stop inferring the file schema when reading ORC tables for performance reasons (#16980); then we won't have the chance to detect the mismatch and disable this feature. We may remove this feature entirely, or think harder to come up with a better fix. cc @yhuai
@cloud-fan Actually, this conversion was disabled because of this bug. By the way, the PR that @cloud-fan mentioned is #14690. I think it is better to hold this change until we finish #14690.
Hi, @viirya, @cloud-fan, @yhuai.
As #14690 completely replaces schema inference with the metastore schema for converted Hive tables, we may not have the chance to detect a mismatch between the ORC file's physical schema and the metastore schema.
There seem to be some cases like the following (on the current master).

On Hive:

CREATE TABLE t1 (a string) PARTITIONED BY (b string) STORED AS ORC;
INSERT INTO TABLE t1 PARTITION (b='01') VALUES ('1');

On the Spark shell:

scala> sql("select * from t1").show
+---+---+
|  a|  b|
+---+---+
|  1| 01|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreOrc=true").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...| true|
+--------------------+-----+

scala> sql("select * from t1").show
16/12/07 05:17:19 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.IllegalArgumentException: Field "a" does not exist.
@dongjoon-hyun Yeah, I see. Because we directly use the metastore schema of a converted ORC table, this issue happens when the physical schema in the ORC file and the metastore schema don't match.
We have two options. The first one is to map the metastore schema to the physical ORC schema, as this PR does. But we don't infer the physical schema of the ORC files now, so I will update this to do the mapping in OrcFileFormat. The other one is like #14282, but as we don't infer the schema from the ORC files now, we can't disable the conversion when a mismatch is detected. One possibility is to throw an exception in OrcFileFormat when a mismatch is detected before reading, with a message asking the user to disable the conversion. @cloud-fan @yhuai @dongjoon-hyun What do you think?
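As a minimal sketch of the first option, assuming the file schema and the metastore schema have the same number of columns in the same order (the helper name and its placement are illustrative, not the actual OrcFileFormat change):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

object PositionalSchemaMapping {
  // Sketch of the first option: give the physical (file) columns the metastore
  // names by position, so requested metastore columns can be resolved against
  // the file schema. Assumes both schemas have the same number of columns in
  // the same order with compatible types.
  def mapPhysicalToMetastore(
      physicalSchema: StructType,
      metastoreSchema: StructType): StructType = {
    require(physicalSchema.length == metastoreSchema.length,
      "Cannot map schemas with different numbers of columns")
    StructType(physicalSchema.fields.zip(metastoreSchema.fields).map {
      case (physicalField, metastoreField) =>
        physicalField.copy(name = metastoreField.name)
    })
  }
}
```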
Test build #69852 has started for PR 14365 at commit
@cloud-fan @yhuai @dongjoon-hyun I've updated this as:
Please let me know what you think about this approach. Thanks.
retest this please.
Test build #69856 has finished for PR 14365 at commit
@@ -303,6 +330,10 @@ private[orc] object OrcRelation extends HiveInspectors {
    maybeStructOI.map(unwrap).getOrElse(Iterator.empty)
  }

  def isMismatchSchema(physicalSchema: StructType, requestedSchema: StructType): Boolean = {
    requestedSchema.forall(a => physicalSchema.getFieldIndex(a.name).isEmpty)
Maybe, the following?
!requestedSchema.forall(a => physicalSchema.getFieldIndex(a.name).isDefined)
Oh, never mind. It's really about checking whether all requested columns are not matched.
It looks good to me. Thank you for updating, @viirya!
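For reference, a small spark-shell-style snippet (with hypothetical schemas, not taken from the PR) showing the two predicates side by side:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical example: the file stores a placeholder name such as "_col0",
// while the catalog requests column "a".
val physicalSchema = StructType(Seq(StructField("_col0", StringType)))
val requestedSchema = StructType(Seq(StructField("a", StringType)))

// What the PR checks: every requested column is absent from the physical schema.
val allMissing = requestedSchema.forall(f => physicalSchema.getFieldIndex(f.name).isEmpty)

// The alternative suggested above: at least one requested column is absent.
val anyMissing = !requestedSchema.forall(f => physicalSchema.getFieldIndex(f.name).isDefined)

// Both are true here; they diverge only when the requested schema mixes
// matched and unmatched column names.
```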
ping @cloud-fan @yhuai Could you take a look? Thanks.
Test build #70336 has finished for PR 14365 at commit
What changes were proposed in this pull request?
We will convert Metastore ORC tables (represented by MetastoreRelation) to data source tables (represented by HadoopFsRelation) if spark.sql.hive.convertMetastoreOrc is enabled, for better performance.

However, due to a Hive issue, an ORC table created by Hive does not store the column names correctly in the ORC files. For these ORC tables, the converted relation has the wrong schema.

To fix this, we assume the metastore schema dataSchema can be matched to physicalSchema column by column, disregarding the column names. If not, we throw an exception that suggests users disable the conversion of Hive ORC tables.

How was this patch tested?
Jenkins tests.
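As a rough illustration of the column-by-column check described in the description above, under the assumption that matching is done by position and data type while ignoring names (the object, method, and message below are illustrative, not the actual patch):

```scala
import org.apache.spark.sql.types.StructType

object SchemaMatchCheck {
  // Illustrative only: check that the metastore schema (dataSchema) can be matched
  // to the file schema (physicalSchema) column by column -- same number of columns
  // and same data types -- while ignoring column names. If not, fail with a hint
  // to turn off the ORC conversion. The message wording is a placeholder.
  def checkSchemasMatch(dataSchema: StructType, physicalSchema: StructType): Unit = {
    val matches = dataSchema.length == physicalSchema.length &&
      dataSchema.fields.zip(physicalSchema.fields).forall {
        case (catalogField, fileField) => catalogField.dataType == fileField.dataType
      }
    if (!matches) {
      throw new IllegalArgumentException(
        "The physical ORC file schema does not match the metastore schema. " +
          "Consider setting spark.sql.hive.convertMetastoreOrc to false.")
    }
  }
}
```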