[SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources #20266

Closed · wants to merge 7 commits

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jan 14, 2018

What changes were proposed in this pull request?

After SPARK-20682, Apache Spark 2.3 is able to read ORC files with Unicode schemas. Previously, this raised org.apache.spark.sql.catalyst.parser.ParseException.

This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based data sources. Note that the TEXT data source has only a single column with the fixed name 'value'.

How was this patch tested?

Pass the newly added test case.
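A sketch of the round-trip test this PR adds (names and the exact column literal are approximate; `withTempPath`, `checkAnswer`, and `testImplicits` come from Spark's SQL test infrastructure):

```scala
// Sketch, assuming Spark's SQLTestUtils helpers are in scope.
Seq("orc", "parquet", "csv", "json").foreach { format =>
  test(s"SPARK-23072 Write and read back unicode column names - $format") {
    withTempPath { path =>
      val dir = path.getCanonicalPath
      // A DataFrame whose column name contains non-ASCII characters.
      val df = Seq("a").toDF("한글")
      df.write.format(format).option("header", "true").save(dir)
      val answerDf = spark.read.format(format).option("header", "true").load(dir)
      // The Unicode column name must survive the write/read round trip.
      assert(df.schema.sameType(answerDf.schema))
      checkAnswer(df, answerDf)
    }
  }
}
```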

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23072][SQL] Add a Unicode schema test for file-based data sources [SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources Jan 14, 2018
@SparkQA

SparkQA commented Jan 14, 2018

Test build #86122 has finished for PR 20266 at commit f9a35f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

cc @gatorsmile and @cloud-fan .

@@ -2773,4 +2773,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
}
}
}

Seq("orc", "parquet", "csv", "json").foreach { format =>
test(s"Write and read back unicode schema - $format") {
Contributor

Instead of adding more test cases to SQLQuerySuite, shall we create a dedicated test suite for file-based data sources now?

Member Author

+1. That's a great idea. I'll update it accordingly.

@SparkQA

SparkQA commented Jan 15, 2018

Test build #86140 has finished for PR 20266 at commit 60b8e43.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext

@SparkQA

SparkQA commented Jan 15, 2018

Test build #86141 has finished for PR 20266 at commit 144c596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext

Seq("orc", "parquet", "csv", "json", "text").foreach { format =>
test(s"Writing empty datasets should not fail - $format") {
withTempDir { dir =>
Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
Contributor

nit: why add /tmp at the end?

Member Author

Yep. It's fixed by using withTempPath.
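For context, a sketch of the difference (helper names from Spark's `SQLTestUtils`; `withTempPath` hands the test a path that does not yet exist and cleans it up afterwards, so no extra suffix is needed):

```scala
// Before: withTempDir supplies an existing directory, so the save target
// needed a non-existent child path appended.
withTempDir { dir =>
  Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
}

// After: withTempPath supplies a path that does not exist yet,
// so it can be passed to save() directly.
withTempPath { path =>
  Seq("str").toDS.limit(0).write.format(format).save(path.getCanonicalPath)
}
```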

}

// Only New OrcFileFormat supports this
Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName,
Contributor

`spark.sql.orc.impl` is `native` by default; can we just use "orc" here?

Member Author

Yep.
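In other words (a sketch, assuming the Spark 2.3 default configuration): with `spark.sql.orc.impl=native`, the short format name resolves to the new `OrcFileFormat`, so the fully qualified class name is unnecessary.

```scala
// spark.sql.orc.impl defaults to "native" in Spark 2.3+, so the short
// name "orc" resolves to the new vectorized reader/writer; the fully
// qualified class name below is equivalent but redundant.
df.write.format("orc").save(path)
// instead of:
df.write
  .format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat")
  .save(path)
```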

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86159 has finished for PR 20266 at commit 5afaa28.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86162 has finished for PR 20266 at commit 5afaa28.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86178 has started for PR 20266 at commit 5afaa28.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 16, 2018

@mgaido91 . That suite uses SQL syntax and the Hive metastore. Here, only the in-memory catalog is used.

@mgaido91
Contributor

@dongjoon-hyun the test case I referred to (the one related to SPARK-22146) doesn't seem to use either of them. It is only about reading files with special characters.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 16, 2018

Oh, I thought you mentioned the suite, @mgaido91 . Sorry! I agree with you.

@dongjoon-hyun
Member Author

Anyway, Jenkins seems to be out of order now.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86183 has finished for PR 20266 at commit fb708b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86184 has finished for PR 20266 at commit 8fec65b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

LGTM

}

Seq("orc", "parquet", "csv", "json").foreach { format =>
test(s"SPARK-23072 Write and read back unicode schema - $format") {
Member

unicode schema -> unicode column names

Member Author

Yep.

}
}

Seq("orc", "parquet").foreach { format =>
Member

Only these two formats support it? If so, please add a comment saying so.

The same applies to the other test cases. Otherwise, run all the formats in each test case.

You can define a global Seq to include all the built-in file formats we support.

Member Author

Thanks!

  1. Only two support this. I added comments.
  2. For the other test cases, I did.
  3. I added a global Seq, allFileBasedDataSources.
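A sketch of how the suite-level list might look (the name `allFileBasedDataSources` comes from the comment above; the exact contents and surrounding suite are assumed):

```scala
// Sketch of the dedicated suite, assuming Spark's SQL test infrastructure.
class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  // All built-in file-based data sources covered by this suite.
  private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text")

  allFileBasedDataSources.foreach { format =>
    test(s"Writing empty datasets should not fail - $format") {
      withTempPath { dir =>
        Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath)
      }
    }
  }
}
```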

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86227 has finished for PR 20266 at commit c67809c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 17, 2018
…a sources

## What changes were proposed in this pull request?

After [SPARK-20682](#19651), Apache Spark 2.3 is able to read ORC files with Unicode schemas. Previously, this raised `org.apache.spark.sql.catalyst.parser.ParseException`.

This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based data sources. Note that the TEXT data source has only [a single column with the fixed name 'value'](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L71).

## How was this patch tested?

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #20266 from dongjoon-hyun/SPARK-23072.

(cherry picked from commit a0aedb0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in a0aedb0 Jan 17, 2018
@dongjoon-hyun
Member Author

Thank you, @cloud-fan , @gatorsmile , and @mgaido91 !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23072 branch January 17, 2018 07:34
5 participants