[SPARK-23072][SQL][TEST] Add a Unicode schema test for file-based data sources #20266
Conversation
Test build #86122 has finished for PR 20266 at commit
cc @gatorsmile and @cloud-fan.
@@ -2773,4 +2773,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
      }
    }
  }

  Seq("orc", "parquet", "csv", "json").foreach { format =>
    test(s"Write and read back unicode schema - $format") {
Instead of adding more test cases to SQLQuerySuite, shall we create a dedicated test suite for file-based data sources now?
+1. That's the best idea. I'll update it like that.
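To make that concrete, here is a minimal sketch of what such a dedicated suite could look like; the suite name `FileBasedDataSourceSuite` and its exact placement are assumptions based on this thread, not necessarily the merged code:

```scala
package org.apache.spark.sql

import org.apache.spark.sql.test.SharedSQLContext

// A home for tests that run against every built-in file-based data source,
// so SQLQuerySuite does not keep growing.
class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  Seq("orc", "parquet", "csv", "json").foreach { format =>
    test(s"Write and read back unicode schema - $format") {
      // ... same body as the SQLQuerySuite version quoted above ...
    }
  }
}
```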
Test build #86140 has finished for PR 20266 at commit
Test build #86141 has finished for PR 20266 at commit
Seq("orc", "parquet", "csv", "json", "text").foreach { format => | ||
test(s"Writing empty datasets should not fail - $format") { | ||
withTempDir { dir => | ||
Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp") |
nit: why add `/tmp` at the end?
Yep. It's fixed by using `withTempPath`.
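For reference, a sketch of the before/after; `withTempPath` comes from `SQLTestUtils` and hands the body a path that does not exist yet, so nothing needs to be appended:

```scala
// Before: appending a child path so that the save target does not already exist.
withTempDir { dir =>
  Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
}

// After: withTempPath deletes its temporary file up front and passes the
// now-nonexistent path to the body, so it can be used as the save target directly.
withTempPath { path =>
  Seq("str").toDS.limit(0).write.format(format).save(path.getCanonicalPath)
}
```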
}

// Only New OrcFileFormat supports this
Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName,
`spark.sql.orc.impl` is `native` by default, can we just use "orc" here?
Yep.
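So the quoted line can shrink to the registered short name; roughly:

```scala
// Before: the fully qualified class name of the native ORC implementation.
Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName,
  /* ... */).foreach { format =>
  // ...
}

// After: the short name resolves to the same native OrcFileFormat,
// since spark.sql.orc.impl defaults to "native".
Seq("orc", /* ... */).foreach { format =>
  // ...
}
```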
LGTM
Test build #86159 has finished for PR 20266 at commit
retest this please
Test build #86162 has finished for PR 20266 at commit
nit: since we are creating a new test suite, what about also moving https://github.com/dongjoon-hyun/spark/blob/5afaa2836133cfc18a52de38d666817991d62c5d/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala#L1347 there?
Retest this please
Test build #86178 has started for PR 20266 at commit
@mgaido91, that suite is using SQL syntax and the Hive metastore. Here, it's only using the in-memory catalog.
@dongjoon-hyun the test case I referred to (the one related to SPARK-22146) doesn't seem to use either of them. It is only about reading files with special chars.
Oh, I thought you meant the whole suite, @mgaido91. Sorry! I agree with you.
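For context, the SPARK-22146 test in question is a write/read round trip through a path containing special characters. A sketch of how it might look once moved into the new suite; the directory name, format list, and assertions here are assumptions, not the merged code:

```scala
Seq("orc", "parquet", "csv", "json", "text").foreach { format =>
  test(s"SPARK-22146 read files containing special characters using $format") {
    withTempDir { dir =>
      // A child directory whose name must survive URI encoding/decoding
      // on the read path.
      val pathWithSpecialChars = s"${dir.getCanonicalPath}/sp&cial%chars"
      val df = Seq("a", "b").toDF("value")
      df.write.format(format).save(pathWithSpecialChars)
      checkAnswer(spark.read.format(format).load(pathWithSpecialChars), df)
    }
  }
}
```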
Anyway, Jenkins seems to be out of order now.
Test build #86183 has finished for PR 20266 at commit
Test build #86184 has finished for PR 20266 at commit
LGTM
}

Seq("orc", "parquet", "csv", "json").foreach { format =>
  test(s"SPARK-23072 Write and read back unicode schema - $format") {
`unicode schema` -> `unicode column names`
Yep.
  }
}

Seq("orc", "parquet").foreach { format =>
Only these two formats support it? If so, please add a comment; the same applies to the other test cases. Otherwise, add all of the formats to each test case. You can define a global Seq that includes all the built-in file formats we support.
Thanks!
- Only two formats support this; I added comments.
- I did the same for the other test cases.
- I added a global Seq, `allFileBasedDataSources`.
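A sketch of that shared definition as described above; the exact format list mirrors the ones exercised in this thread:

```scala
class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  // All built-in file-based data sources covered by this suite. Tests that
  // only a subset of formats supports (e.g. the unicode column-name test,
  // which excludes "text") use a narrower Seq plus a comment explaining why.
  private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text")

  allFileBasedDataSources.foreach { format =>
    test(s"Writing empty datasets should not fail - $format") {
      // ... body as quoted earlier ...
    }
  }
}
```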
Test build #86227 has finished for PR 20266 at commit
thanks, merging to master/2.3!
Merged to master as commit a0aedb0 and cherry-picked into branch-2.3. Closes #20266 from dongjoon-hyun/SPARK-23072. Author: Dongjoon Hyun <dongjoon@apache.org>. Signed-off-by: Wenchen Fan <wenchen@databricks.com>.
Thank you, @cloud-fan, @gatorsmile, and @mgaido91!
What changes were proposed in this pull request?
After SPARK-20682, Apache Spark 2.3 is able to read ORC files with a Unicode schema. Previously, it raised `org.apache.spark.sql.catalyst.parser.ParseException`. This PR adds a Unicode schema test for the CSV/JSON/ORC/Parquet file-based data sources. Note that the TEXT data source only has a single column with the fixed name 'value'.
How was this patch tested?
Pass the newly added test case.
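Pulling the review feedback together, here is a sketch of the final shape of the added test; the column name, the header option handling, and the assertions are best-effort reconstructions rather than a verbatim copy of the merged code:

```scala
// "text" is excluded: that source always exposes a single column with the
// fixed name "value", so a Unicode column name cannot round-trip through it.
Seq("orc", "parquet", "csv", "json").foreach { format =>
  test(s"SPARK-23072 Write and read back unicode column names - $format") {
    withTempPath { path =>
      val dir = path.getCanonicalPath

      // Any non-ASCII column name exercises the fix; "한글" is one example.
      val df = Seq("a").toDF("한글")

      // The header option matters for CSV: without it, the column name is
      // neither written out nor read back. The other formats ignore it.
      df.write.format(format).option("header", "true").save(dir)
      val answerDf = spark.read.format(format).option("header", "true").load(dir)

      assert(df.schema.sameType(answerDf.schema))
      checkAnswer(df, answerDf)
    }
  }
}
```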