[WIP][SQL] Add DataSourceSuite validating data sources limitations #19060
Conversation
Hi, @gatorsmile.
Test build #81155 has finished for PR 19060 at commit
Test build #81157 has finished for PR 19060 at commit
Do ORC/Parquet have related test cases? Could we just port them?
For ORC, I was able to find some randomly generated float/double tests, but there is no value-limit test. For Parquet, I'm not sure.
For predicate push-down, I already ported the ORC code in this PR.
BTW, as a text-based format, I thought we could include JSON from the beginning. But, unfortunately, I found that Spark currently infers BIGINT for all integer types, DOUBLE for float/double, and STRING for DATE/TIMESTAMP/STRING columns.
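As a rough illustration of that inference behavior (an editorial sketch, not code from this PR; the sample record and field names are made up, and `spark` is assumed to be an active SparkSession):

```scala
import org.apache.spark.sql.Encoders

// Sketch: JSON schema inference widens integers to BIGINT (long), keeps floats as
// DOUBLE, and leaves timestamp-looking strings as STRING.
val sample = spark.createDataset(
  Seq("""{"i": 1, "f": 1.5, "ts": "2017-08-29 00:00:00"}"""))(Encoders.STRING)

spark.read.json(sample).printSchema()
// root
//  |-- f: double (nullable = true)
//  |-- i: long (nullable = true)
//  |-- ts: string (nullable = true)
```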
The ORC test is very specific to the impl. No end-to-end test cases?
In the ORC GitHub repository, to my knowledge, this is the highest-level test.
Actually, I have one idea: if we support unhandledFilters in data source testing via SQLConf, we can test it at the Spark level, like the value-limit case. What do you think about that? May I try that way?
How about Parquet and the others?
Parquet is the same. We can
If you agree, I will try to write more code here as a POC.
I mean, how about Parquet and the others? Do they have end-to-end test cases in their projects?
For Parquet, I can find TestInputOutputFormat.java. Parquet's test case is also very specific to the impl. What do you mean by
The previous Parquet link is broken. The official one is https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/example/TestInputOutputFormat.java
Test build #81298 has finished for PR 19060 at commit
Hi, @gatorsmile.
Test build #81318 has finished for PR 19060 at commit
Hi, @gatorsmile.
Test build #81365 has finished for PR 19060 at commit
.save(dir.getCanonicalPath)

val df = spark.read.format(dataSource).load(dir.getCanonicalPath)
  .where(s"i BETWEEN 1500 AND 1999")
This has been tested in OrcFilterSuite and ParquetFilterSuite, right? What is the goal of this test case?
OrcFilterSuite and ParquetFilterSuite do not test stripe-level filtering within a single file; they only exercise file-level filtering, don't they?
This test suite is designed to prove that PPD on the ORC data source doesn't filter out rows wrongly. Since ORC does not filter row-by-row, this PR checks all the values of the returned stripe, and all values in the stripe come back correctly in this test case.
IIRC, this is what you requested in SPARK-21783 (Turn on ORC filter push-down by default). If OrcFilterSuite is complete, I'm wondering what the reason is to block turning on ORC PPD by default.
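As a rough sketch of that end-to-end check (editorial illustration, not the exact PR code: `withTempDir`/`withSQLConf` are assumed to come from Spark's SQL test utilities, and a real test would also force small stripes so the predicate actually prunes at the stripe level):

```scala
// Sketch: write a known range of values, read it back with a pushed-down range
// predicate, and assert that no matching row was dropped by stripe-level filtering.
withTempDir { dir =>
  withSQLConf("spark.sql.orc.filterPushdown" -> "true") {
    spark.range(0, 2000).toDF("i")
      .write.mode("overwrite").orc(dir.getCanonicalPath)

    val df = spark.read.orc(dir.getCanonicalPath).where("i BETWEEN 1500 AND 1999")

    // Every expected value must survive the push-down path.
    assert(df.collect().map(_.getLong(0)).toSet == (1500L to 1999L).toSet)
  }
}
```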
I do not know why we did not turn it on by default. To me, to turn it on, we need to improve the coverage.
So far, the coverage of OrcFilterSuite and ParquetFilterSuite is not good; they only have very basic checks. Could you improve them?
For example, by adding boundary values on both sides of these pushed-down predicates? We need to ensure that the predicates are pushed down, executed in the underlying data sources, and that the filters work properly (value comparison).
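One way to cover the "filters still work properly" part end to end (a sketch under assumptions: `path` points at ORC data with a BIGINT column `i`, and `spark.sql.orc.filterPushdown` is the flag discussed in this thread; a real suite would restore the conf via `withSQLConf`):

```scala
// Sketch: a boundary predicate must return exactly the same rows whether it is
// evaluated inside the ORC reader (pushdown on) or by Spark itself (pushdown off).
def valuesFor(path: String, predicate: String, pushdown: Boolean): Set[Long] = {
  spark.conf.set("spark.sql.orc.filterPushdown", pushdown.toString)
  spark.read.orc(path).where(predicate).select("i")
    .collect().map(_.getLong(0)).toSet
}

// Boundary values on both sides of the predicate.
val predicate = s"i <= ${Int.MinValue} OR i >= ${Int.MaxValue}"
assert(valuesFor(path, predicate, pushdown = true) ==
  valuesFor(path, predicate, pushdown = false))
```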
I agree with you on improving the coverage, @gatorsmile. So I've started this and have been trying to improve it in order to find the approach you want.
Given that the situation is the same for Parquet and ORC, what I suggested in #18991 is to turn on ORC PPD by default at least before the 2.3.0 release. What do you think about that?
Parquet is the default format. It is used by most of our Spark users. We have already received many related JIRA issues and then fixed/blocked them. You were also involved in some of those PRs.
To avoid repeating the same issues in ORC, we should improve the coverage before turning on ORC predicate pushdown by default.
- Every JIRA issue in Apache Spark should have a test case in order to be resolved. If so, we can focus on ORC test-case parity with Parquet; adding more test cases is beyond that scope.

> We already got many related JIRA issues and then fixed/blocked them.

- Yes. ORC is not the default format. Even worse, although a user tries to use `.orc(...)`, PPD is not used by default either, so we cannot find any issues on PPD. It doesn't form a virtuous cycle at all. To be honest with you, I trust the efforts of the Apache ORC community.
I have a different philosophy. I worked on mission-critical enterprise software; I do not trust anybody's code unless it can pass the corresponding tests.
The initial ORC PPD does not have enough test coverage. To enable it by default, we should improve the test coverage now. The new test framework for PPD should cover all the sources; we can have source-specific tests where something is not applicable to the other sources.
Yep. This PR is based on the following guidelines of yours, and I hope it will continue to evolve. :)
- Handle all data sources (for value ranges, I tested against Spark's min/max instead of the data sources' min/max)
- Use high-level end-to-end test cases
For PPD, Parquet and ORC are the only sources supporting it, aren't they?
What we didn't agree on is only the corresponding tests.
So far, Parquet and ORC are the only built-in sources that support PPD. From a software-development viewpoint, we should make the test framework easily extensible.
Sorry, I do not think it covers what we need. First, to verify that the underlying data source works properly, we need the corresponding (boundary) data inserted into the sources, predicates that use boundary values, and we also need to ensure these predicates are pushed down and not evaluated by Spark.
Today, I've reconsidered this PR from the bottom up. As you recommended, I had better focus more on the individual test suites.
I'll close this PR for now. Thank you for your advice!
What changes were proposed in this pull request?
This test suite aims to test whether the underlying data sources support Spark's data value limits and PPD.
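A condensed sketch of the kind of value-limit round trip the suite targets (editorial illustration only: the column name, the list of formats, and the `withTempDir` helper are assumptions, and `spark` is the test session):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{LongType, StructType}

// Sketch: round-trip Spark's extreme BIGINT values through each data source and
// make sure they come back unchanged.
Seq("orc", "parquet", "json", "csv").foreach { dataSource =>
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    val values = Seq(Long.MinValue, 0L, Long.MaxValue)

    spark.createDataset(values)(Encoders.scalaLong).toDF("i")
      .write.mode("overwrite").format(dataSource).save(path)

    // The explicit schema matters for the text formats; otherwise JSON/CSV
    // would re-infer the column types on read.
    val readBack = spark.read.format(dataSource)
      .schema(new StructType().add("i", LongType))
      .load(path)

    assert(readBack.collect().map(_.getLong(0)).toSet == values.toSet)
  }
}
```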
How was this patch tested?
Pass the Jenkins.