[SPARK-23421] [SQL] Document the behavior change in SPARK-22356 #20606

2 changes: 2 additions & 0 deletions docs/sql-programming-guide.md
@@ -1963,6 +1963,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see
## Upgrading From Spark SQL 2.1 to 2.2

- Spark 2.1.1 introduced a new configuration key: `spark.sql.hive.caseSensitiveInferenceMode`. It had a default setting of `NEVER_INFER`, which kept behavior identical to 2.1.0. However, Spark 2.2.0 changes this setting's default value to `INFER_AND_SAVE` to restore compatibility with reading Hive metastore tables whose underlying file schemas have mixed-case column names. With the `INFER_AND_SAVE` configuration value, on first access Spark will perform schema inference on any Hive metastore table for which it has not already saved an inferred schema. Note that schema inference can be a very time-consuming operation for tables with thousands of partitions. If compatibility with mixed-case column names is not a concern, you can safely set `spark.sql.hive.caseSensitiveInferenceMode` to `NEVER_INFER` to avoid the initial overhead of schema inference (see the session-construction sketch after this list). Note that with the new default `INFER_AND_SAVE` setting, the results of the schema inference are saved as a metastore key for future use. Therefore, the initial schema inference occurs only at a table's first access.

- Since Spark 2.2.1 and 2.3.0, the schema is always inferred at runtime when the data source tables have columns that exist in both the partition schema and the data schema. The inferred schema does not include the partitioned columns. When reading the table, Spark respects the partition values of these overlapping columns instead of the values stored in the data source files. In the 2.2.0 and 2.1.x releases, the inferred schema is partitioned but the data of the table is invisible to users (i.e., the result set is empty). A sketch of such a table follows this list.
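As a concrete illustration of the first bullet (the `spark.sql.hive.caseSensitiveInferenceMode` key), here is a minimal sketch of opting back into the pre-2.2.0 behavior when building a Hive-enabled session. The application name and the choice to set the key at session construction are assumptions for illustration, not part of this change.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: disable the new INFER_AND_SAVE default when mixed-case column
// names in Hive metastore tables are not a concern, avoiding the one-time
// schema-inference cost on first access.
val spark = SparkSession.builder()
  .appName("case-sensitive-inference-example")  // hypothetical name
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()
```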
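For the overlapping-column bullet, a hedged sketch of a data source table whose data files contain a column that also appears in the partition spec, reusing a Hive-enabled `spark` session like the one above. The table name, path, and values are made up for illustration. On 2.2.1 and 2.3.0+ the read should return rows with the overlapping column `p` taken from the partition value, whereas 2.2.0 and 2.1.x produced an empty result for such a table.

```scala
// Sketch: the parquet files written under p=1 also contain a column named `p`,
// so the data schema and the partition schema overlap.
spark.range(1).selectExpr("1 as i", "1 as p", "1 as j")
  .write.parquet("/tmp/overlap_example/p=1")   // hypothetical path

spark.sql(
  """CREATE TABLE tbl_with_col_overlap (i int, p int, j int)
    |USING parquet
    |OPTIONS (path '/tmp/overlap_example')
    |PARTITIONED BY (p)""".stripMargin)

spark.sql("MSCK REPAIR TABLE tbl_with_col_overlap")

// 2.2.1 / 2.3.0+: rows are visible and `p` comes from the partition value.
// 2.2.0 / 2.1.x: the result set was empty.
spark.table("tbl_with_col_overlap").show()
```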

## Upgrading From Spark SQL 2.0 to 2.1

@@ -195,7 +195,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {

object PROCESS_TABLES extends QueryTest with SQLTestUtils {
// Tests the latest version of every release line.
- val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0")
+ val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0", "2.2.1")
Member:
Please don't mix this into a PR with title Document.

Member Author:
This is to verify that what we explained in the doc is correct.

Member:
Yep. Please update the title to include that, too.

Member Author:
Already has it in the PR description.

Member Author (@gatorsmile, Feb 14, 2018):
The main goal of this PR is not to test it but to document it. We need a separate backport PR for Spark 2.2 without the test.


protected var spark: SparkSession = _

@@ -249,7 +249,7 @@ object PROCESS_TABLES extends QueryTest with SQLTestUtils {

// SPARK-22356: overlapped columns between data and partition schema in data source tables
val tbl_with_col_overlap = s"tbl_with_col_overlap_$index"
- // For Spark 2.2.0 and 2.1.x, the behavior is different from Spark 2.0.
+ // For Spark 2.2.0 and 2.1.x, the behavior is different from Spark 2.0, 2.2.1, 2.3+
if (testingVersions(index).startsWith("2.1") || testingVersions(index) == "2.2.0") {
spark.sql("msck repair table " + tbl_with_col_overlap)
assert(spark.table(tbl_with_col_overlap).columns === Array("i", "j", "p"))
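The excerpt stops inside the version check that exercises the old behavior. For context, here is a sketch of what the complementary branch for 2.2.1 and later could assert, following the doc bullet added above; the exact rows and column order checked in the actual test are not reproduced here and are assumptions.

```scala
} else {
  // Assumption for illustration: from 2.2.1 / 2.3.0 on, the table's data is
  // visible and the overlapping column `p` takes the partition value.
  checkAnswer(spark.table(tbl_with_col_overlap), Row(1, 1, 1))
}
```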