
[SPARK-31136] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax #27894


Closed
wants to merge 1 commit into from

Conversation

@dongjoon-hyun (Member) commented Mar 12, 2020

What changes were proposed in this pull request?

This is one of the draft PRs we are discussing now to comply with the new rubrics.
This PR (SPARK-31136) was created to see how many tests are affected.

Why are the changes needed?

This aims to comply with the new community policy.

Does this PR introduce any user-facing change?

Yes. This restores the previous behavior to avoid breaking existing user pipelines.
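
As a rough sketch of the behavior in question (illustrative only; the table names are hypothetical, and the default provider is assumed to be parquet):

```sql
-- After SPARK-30098 (the change this PR reverts), CREATE TABLE without a
-- USING clause creates a native datasource table, as if the user had written:
CREATE TABLE t1 (id INT) USING parquet;

-- This revert restores the pre-3.0 behavior, where the same statement
-- creates a Hive serde table instead:
CREATE TABLE t2 (id INT);
```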

How was this patch tested?

Pass the Jenkins.

@SparkQA commented Mar 12, 2020

Test build #119724 has finished for PR 27894 at commit be16bf5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

As expected, it fails. But, unfortunately, the build stopped at the first failure.

@SparkQA commented Mar 12, 2020

Test build #119728 has finished for PR 27894 at commit da991af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2020

Test build #119729 has finished for PR 27894 at commit 7765675.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author) commented Mar 13, 2020

A lot of failures remain:

 org.apache.spark.sql.SQLQueryTestSuite.sql	1.1 sec	1
 org.apache.spark.sql.SQLQueryTestSuite.sql	3.2 sec	1
 org.apache.spark.sql.connector.DataSourceV2SQLSuite.CreateTable: without USING clause	2 ms	1
 org.apache.spark.sql.connector.DataSourceV2SQLSuite.CreateTableAsSelect: without USING clause	4 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.Test CTAS #3	2 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - basic	1 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - with database name	2 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - temporary	2 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - external	2 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - if not exists	2 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - comment	3 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - partitioned columns	3 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - clustered by	2 ms	1
 org.apache.spark.sql.execution.command.DDLParserSuite.create table - properties	1 ms	1
 org.apache.spark.sql.sources.InsertSuite.SPARK-29174 Support LOCAL in INSERT OVERWRITE DIRECTORY to data source

@cloud-fan (Contributor) commented Mar 13, 2020

I agree that we should evaluate the "cost to break", but looking at unit tests may not be a good idea. They heavily rely on internal assumptions, and changing the table format will definitely break a lot of unit tests.

Ideally, the table format only decides how the table is stored and should be purely a performance concern, but Hive tables are a bit different. IMO, the "cost to break" is losing a bit of Hive compatibility: tables created without USING may now not be readable by Hive, and some Hive-specific commands like LOAD TABLE don't work for them.

On the other hand, the "cost to maintain" is losing Spark's perf benefits: many users just run CREATE TABLE as they do in other databases, which creates a Hive table before 3.0. This means none of the features we build for our native readers are available, such as the vectorized reader, nested column pruning, nested field filter pushdown (@dbtsai is working on it), bucketed tables, etc.

I think in this case the "cost to maintain" is more serious, and we should accept the change and not revert it. cc @marmbrus @srowen @maropu @viirya @HyukjinKwon

UPDATED to make my opinion clearer.
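
For illustration (table and column names hypothetical): regardless of the default, users can still opt in to the native path explicitly with a USING clause, while omitting it yields a Hive table before 3.0:

```sql
-- Hive serde table before 3.0: the native-reader features above do not apply.
CREATE TABLE events_hive (id INT, payload STRING);

-- Explicit provider: a native datasource table, so the vectorized reader,
-- nested column pruning, etc. can kick in.
CREATE TABLE events_native (id INT, payload STRING) USING parquet;
```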

@dongjoon-hyun (Member, Author) commented Mar 13, 2020

I'm not looking at the number of failures. Since the unit tests should be independent of global conf changes, we should at least add guards with `withSQLConf`.
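
For the SQLQueryTestSuite golden files, the analogous guard is a `--SET` directive that pins the conf explicitly (a sketch; the conf key is assumed from the `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED` constant and may differ by version):

```sql
--SET spark.sql.legacy.createHiveTableByDefault.enabled=true
-- With the conf pinned, this test no longer depends on the global default:
CREATE TABLE t (id INT);
```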

The following is not a cost at all. Users already know that vectorization is only enabled for native tables, don't they? Some user groups have their own reasons for choosing a vanilla Hive table. As a side note, some users prefer the Hive Parquet/ORC implementations to avoid Spark's CHAR/VARCHAR behavior difference.

On the other hand, the "cost to maintain" is losing Spark's perf benefits:

Please see the old mail thread.

This will also break query results for the CHAR type.
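
The CHAR divergence can be sketched as follows (illustrative only; the padding behavior described in the comments reflects the Hive-vs-native difference discussed in the mail thread, and exact results depend on the Spark version):

```sql
CREATE TABLE t_hive (c CHAR(5)) STORED AS PARQUET;  -- Hive serde table
INSERT INTO t_hive VALUES ('a');
SELECT length(c) FROM t_hive;    -- Hive semantics pad CHAR(5): 'a' reads back as 'a    '

CREATE TABLE t_native (c CHAR(5)) USING parquet;    -- native datasource table
INSERT INTO t_native VALUES ('a');
SELECT length(c) FROM t_native;  -- no padding is applied, so the same query gives a shorter result
```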

@dongjoon-hyun (Member, Author)

Hi, All. I marked SPARK-31136 as a correctness issue and added the example.

@HyukjinKwon (Member)

It looks coherent to revert it per @marmbrus's policy and the reverts made so far, although I personally am not really convinced that we should revert.

Reverting this change discourages users from leveraging Spark's new features and cutting-edge optimizations, and pushes them to stick with Hive. What's your thought, @marmbrus?

@HyukjinKwon (Member)

cc @rxin too

@dongjoon-hyun (Member, Author)

It just became a correctness issue due to another example with the CHAR type.

@SparkQA commented Mar 17, 2020

Test build #119948 has finished for PR 27894 at commit 1a68632.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 18, 2020

Test build #119967 has finished for PR 27894 at commit 9bff392.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

Retest this please

@SparkQA commented Mar 18, 2020

Test build #119971 has finished for PR 27894 at commit 9bff392.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review March 18, 2020 17:11
cloud-fan pushed a commit that referenced this pull request Mar 20, 2020
…TE TABLE test cases

### What changes were proposed in this pull request?

A few `CREATE TABLE` test cases make assumptions about the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes those assumptions explicit on the test-case side.

The configuration change was tested via #27894 while discussing SPARK-31136. This PR contains only the test case part from that PR.

### Why are the changes needed?

This makes our test cases more robust with respect to the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even if we switch the conf value, it will be a one-liner with no test case changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun (Member, Author)

Thank you all! I'm going to close this PR because this has been discussed explicitly.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-31136 branch March 20, 2020 04:37
dongjoon-hyun added a commit that referenced this pull request Mar 20, 2020
…TE TABLE test cases

(cherry picked from commit f1cc867)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…TE TABLE test cases
