[SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources #20266

Closed · wants to merge 7 commits

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jan 14, 2018

What changes were proposed in this pull request?

After SPARK-20682, Apache Spark 2.3 is able to read ORC files with Unicode schemas. Previously, this raised org.apache.spark.sql.catalyst.parser.ParseException.

This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based data sources. Note that the TEXT data source has only a single column with the fixed name 'value'.

How was this patch tested?

Pass the newly added test case.
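A sketch of the round-trip test this PR adds (names and the exact column literal are approximate; `withTempPath`, `checkAnswer`, and `testImplicits` come from Spark's SQL test infrastructure):

```scala
// Sketch, assuming Spark's SQLTestUtils helpers are in scope.
Seq("orc", "parquet", "csv", "json").foreach { format =>
  test(s"SPARK-23072 Write and read back unicode column names - $format") {
    withTempPath { path =>
      val dir = path.getCanonicalPath
      // A DataFrame whose column name contains non-ASCII characters.
      val df = Seq("a").toDF("한글")
      df.write.format(format).option("header", "true").save(dir)
      val answerDf = spark.read.format(format).option("header", "true").load(dir)
      // The Unicode column name must survive the write/read round trip.
      assert(df.schema.sameType(answerDf.schema))
      checkAnswer(df, answerDf)
    }
  }
}
```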

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23072][SQL] Add a Unicode schema test for file-based data sources [SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources Jan 14, 2018
@SparkQA

SparkQA commented Jan 14, 2018

Test build #86122 has finished for PR 20266 at commit f9a35f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

cc @gatorsmile and @cloud-fan .

@@ -2773,4 +2773,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
}
}
}

Seq("orc", "parquet", "csv", "json").foreach { format =>
test(s"Write and read back unicode schema - $format") {
Contributor

Instead of adding more test cases to SQLQuerySuite, shall we create a dedicated test suite for file-based data sources now?

Member Author

+1. That's a great idea. I'll update it accordingly.

@SparkQA

SparkQA commented Jan 15, 2018

Test build #86140 has finished for PR 20266 at commit 60b8e43.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext

@SparkQA

SparkQA commented Jan 15, 2018

Test build #86141 has finished for PR 20266 at commit 144c596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext

Seq("orc", "parquet", "csv", "json", "text").foreach { format =>
test(s"Writing empty datasets should not fail - $format") {
withTempDir { dir =>
Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
Contributor

nit: why add /tmp at the end?

Member Author

Yep. It's fixed by using withTempPath.
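For context, a sketch of the difference (helper names from Spark's `SQLTestUtils`; `withTempPath` hands the test a path that does not yet exist and cleans it up afterwards, so no extra suffix is needed):

```scala
// Before: withTempDir supplies an existing directory, so the save target
// needed a non-existent child path appended.
withTempDir { dir =>
  Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
}

// After: withTempPath supplies a path that does not exist yet,
// so it can be passed to save() directly.
withTempPath { path =>
  Seq("str").toDS.limit(0).write.format(format).save(path.getCanonicalPath)
}
```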

}

// Only New OrcFileFormat supports this
Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName,
Contributor

`spark.sql.orc.impl` is `native` by default; can we just use "orc" here?

Member Author

Yep.
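In other words (a sketch, assuming the Spark 2.3 default configuration): with `spark.sql.orc.impl=native`, the short format name resolves to the new `OrcFileFormat`, so the fully qualified class name is unnecessary.

```scala
// spark.sql.orc.impl defaults to "native" in Spark 2.3+, so the short
// name "orc" resolves to the new vectorized reader/writer; the fully
// qualified class name below is equivalent but redundant.
df.write.format("orc").save(path)
// instead of:
df.write
  .format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat")
  .save(path)
```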

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86159 has finished for PR 20266 at commit 5afaa28.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86162 has finished for PR 20266 at commit 5afaa28.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86178 has started for PR 20266 at commit 5afaa28.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 16, 2018

@mgaido91 . That suite uses SQL syntax and the Hive metastore. Here, only the in-memory catalog is used.

@mgaido91
Contributor

@dongjoon-hyun the test case I referred to (the one related to SPARK-22146) doesn't seem to use either of them. It is only about reading files with special characters.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 16, 2018

Oh, I thought you mentioned the suite, @mgaido91 . Sorry! I agree with you.

@dongjoon-hyun
Member Author

Anyway, Jenkins seems to be out of order now.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86183 has finished for PR 20266 at commit fb708b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86184 has finished for PR 20266 at commit 8fec65b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

LGTM

}

Seq("orc", "parquet", "csv", "json").foreach { format =>
test(s"SPARK-23072 Write and read back unicode schema - $format") {
Member

unicode schema -> unicode column names

Member Author

Yep.

}
}

Seq("orc", "parquet").foreach { format =>
Member

Only these two formats support it? If so, please add a comment saying so.

The same applies to the other test cases. Otherwise, run all the formats in each test case.

You can define a global Seq to include all the built-in file formats we support.

Member Author

Thanks!

  1. Only two support this. I added comments.
  2. For the other test cases, I did.
  3. I added a global Seq, allFileBasedDataSources.
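A sketch of how the suite-level list might look (the name `allFileBasedDataSources` comes from the comment above; the exact contents and surrounding suite are assumed):

```scala
// Sketch of the dedicated suite, assuming Spark's SQL test infrastructure.
class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  // All built-in file-based data sources covered by this suite.
  private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text")

  allFileBasedDataSources.foreach { format =>
    test(s"Writing empty datasets should not fail - $format") {
      withTempPath { dir =>
        Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath)
      }
    }
  }
}
```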

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86227 has finished for PR 20266 at commit c67809c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 17, 2018
…a sources

## What changes were proposed in this pull request?

After [SPARK-20682](#19651), Apache Spark 2.3 is able to read ORC files with Unicode schemas. Previously, this raised `org.apache.spark.sql.catalyst.parser.ParseException`.

This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based data sources. Note that the TEXT data source has only [a single column with the fixed name 'value'](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L71).

## How was this patch tested?

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #20266 from dongjoon-hyun/SPARK-23072.

(cherry picked from commit a0aedb0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in a0aedb0 Jan 17, 2018
@dongjoon-hyun
Member Author

Thank you, @cloud-fan , @gatorsmile , and @mgaido91 !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23072 branch January 17, 2018 07:34
5 participants