[WIP][SQL] Add DataSourceSuite validating data sources limitations #19060


Closed · wants to merge 5 commits

Conversation

dongjoon-hyun (Member)

What changes were proposed in this pull request?

This test suite aims to verify whether the underlying data sources support Spark's data value limits and predicate push-down (PPD).

How was this patch tested?

Pass the Jenkins tests.

dongjoon-hyun (Member Author)

Hi, @gatorsmile .
This is the POC on the new ORC data source. Please note that the old ORC data source uses different packages and APIs.

dongjoon-hyun changed the title [WIP][SQL] Add DataSource suite validating data sources limitations → [WIP][SQL] Add DataSourceSuite validating data sources limitations (Aug 26, 2017)
SparkQA commented Aug 26, 2017

Test build #81155 has finished for PR 19060 at commit 0bf9ad2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 26, 2017

Test build #81157 has finished for PR 19060 at commit 16031ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member)

Do ORC/Parquet have related test cases? Could we just port them?

dongjoon-hyun (Member Author)

For ORC, I was able to find some randomly generated float/double tests, but there is no value-limit test. For Parquet, I'm not sure.

dongjoon-hyun (Member Author)

For predicate push-down, I have already ported the ORC code in this PR.

dongjoon-hyun (Member Author)

BTW, among the text-based formats, I initially thought we could include JSON. Unfortunately, I found that Spark currently infers BIGINT for all integral types, DOUBLE for float/double, and STRING for DATE/TIMESTAMP/STRING, so checkValue simply fails. I think we can cover only Parquet and ORC.
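
For reference, a minimal sketch of the schema-inference behavior described above (assuming a local SparkSession named `spark`; the field names are invented for illustration):

    import spark.implicits._

    // JSON schema inference widens integral values to LongType (BIGINT),
    // widens fractional values to DoubleType, and leaves date/timestamp-looking
    // strings as StringType, so a strict checkValue comparison fails.
    val ds = Seq("""{"i": 1, "f": 1.5, "d": "2017-08-27"}""").toDS()
    spark.read.json(ds).printSchema()
    // root
    //  |-- d: string (nullable = true)
    //  |-- f: double (nullable = true)
    //  |-- i: long (nullable = true)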

gatorsmile (Member)

The ORC test is very specific to the implementation. Are there no end-to-end test cases?

dongjoon-hyun (Member Author)

In the ORC GitHub repository, to my knowledge, this is the highest-level API:

val reader = new OrcInputFormat[OrcStruct]().createRecordReader(split, attemptContext)
while (reader.nextKeyValue()) {
  val row = reader.getCurrentValue
  // e.g., read the first field of the current record as an int
  row.getFieldValue(0).asInstanceOf[IntWritable].get
}

Actually, I have one idea. If we can control unhandledFilters in data-source testing via SQLConf, we can test at the Spark level, as in the value-limit case. What do you think about that? May I try that approach?

gatorsmile (Member)

How about Parquet and the others?

dongjoon-hyun (Member Author) commented Aug 27, 2017

Parquet is the same: we can implement unhandledFilters to remove the Spark-side filter in PPD testing.
I think the other text-based data sources (TEXT/CSV/JSON) don't support PPD.
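
As a rough sketch of the unhandledFilters idea (a hypothetical test relation, not Spark's built-in ORC/Parquet classes): if a relation reports every pushed filter as handled, Spark drops its own defensive Filter node, so an incorrect push-down in the source surfaces directly as wrong query results instead of being masked.

    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}

    // Claim every pushed filter is fully handled by the source, so Spark
    // does not re-evaluate the predicates on its side.
    abstract class PushdownTestRelation extends BaseRelation with PrunedFilteredScan {
      override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
        Array.empty
    }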

dongjoon-hyun (Member Author)

If you agree, I will try to write more code here as a POC.

gatorsmile (Member)

I mean, how about Parquet and the others? Do they have end-to-end test cases in their own projects?

dongjoon-hyun (Member Author)

For Parquet, I can find TestInputOutputFormat.java; Parquet's test case is also very specific to the implementation.

What exactly do you mean by an e2e test case that is not specific to the implementation? Sorry, but could you provide an example?


SparkQA commented Aug 31, 2017

Test build #81298 has finished for PR 19060 at commit 104f24c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member Author)

Hi, @gatorsmile.
I made it use a higher-level API in test("orc - predicate push down").
What do you think about this approach?

SparkQA commented Sep 1, 2017

Test build #81318 has finished for PR 19060 at commit 4466325.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member Author) commented Sep 3, 2017

Hi, @gatorsmile.
I believe I understand your advice correctly this time.
Could you take a look at this data source tester PR?

SparkQA commented Sep 3, 2017

Test build #81365 has finished for PR 19060 at commit fb72c89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review thread on the following test code:

    .save(dir.getCanonicalPath)

    val df = spark.read.format(dataSource).load(dir.getCanonicalPath)
      .where(s"i BETWEEN 1500 AND 1999")
gatorsmile (Member)
This has been tested in OrcFilterSuite and ParquetFilterSuite, right? What is the goal of this test case?

dongjoon-hyun (Member Author) commented Sep 5, 2017

OrcFilterSuite and ParquetFilterSuite do not test stripe-level filtering within a single file. They only test file-level filtering, don't they?

This test suite is designed to prove that PPD on the ORC data source doesn't filter rows out incorrectly. Since ORC does not filter row-by-row, this PR checks all the values of the returned stripe; this test case verifies that every value in the stripe is present and correct.

IIRC, this is what you requested in SPARK-21783 (Turn on ORC filter push-down by default). If OrcFilterSuite were complete, I wonder what reason would remain to block turning on ORC PPD by default.
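
A minimal sketch of the stripe-level validation described here (assuming the `withSQLConf`/`withTempPath` helpers from Spark's SQL test utilities and the ORC push-down conf key of this era, `spark.sql.orc.filterPushdown`):

    withSQLConf("spark.sql.orc.filterPushdown" -> "true") {
      withTempPath { dir =>
        // One file, so the pushed filter is evaluated at stripe granularity.
        spark.range(0, 10000).toDF("i").repartition(1)
          .write.orc(dir.getCanonicalPath)
        val values = spark.read.orc(dir.getCanonicalPath)
          .where("i BETWEEN 1500 AND 1999")
          .collect().map(_.getLong(0)).sorted
        // ORC may return whole stripes; after Spark's residual filter the
        // result must be exactly the requested range, with nothing lost.
        assert(values === (1500L to 1999L).toArray)
      }
    }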

gatorsmile (Member)

I do not know why we did not turn it on by default. To me, turning it on requires improving the coverage.

So far, the coverage of OrcFilterSuite and ParquetFilterSuite is not good. They only have very basic checks. Could you improve them?

For example, how about adding boundary values on both sides of these pushed-down predicates? We need to ensure that the predicates are pushed down, executed in the underlying data sources, and that the filters work properly (value comparison).
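
A hedged sketch of such a boundary-value check (assuming the `withTempPath`/`checkAnswer` helpers from Spark's test utilities and a SparkSession named `spark`):

    import org.apache.spark.sql.Row
    import spark.implicits._

    Seq("orc", "parquet").foreach { format =>
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // Write the extreme values of IntegerType.
        Seq(Int.MinValue, Int.MaxValue).toDF("i").write.format(format).save(path)
        val df = spark.read.format(format).load(path)
        // Predicates at both boundaries must neither drop nor invent rows.
        checkAnswer(df.where(s"i <= ${Int.MinValue}"), Row(Int.MinValue))
        checkAnswer(df.where(s"i >= ${Int.MaxValue}"), Row(Int.MaxValue))
      }
    }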

dongjoon-hyun (Member Author)

I agree with you on improving the coverage, @gatorsmile. So I've started this and have been trying to improve it in order to find the approach you want.

Given that the situation is the same for Parquet and ORC, what I suggested in #18991 is to turn on ORC PPD by default at least before the 2.3.0 release. What do you think about that?

gatorsmile (Member)

Parquet is the default format and is used by most of our Spark users. We have already received many related JIRA issues and fixed/blocked them. You were also involved in some of those PRs.

To avoid repeating the same issues in ORC, we should improve the coverage before turning on ORC predicate push-down by default.

dongjoon-hyun (Member Author)

  1. Every JIRA issue in Apache Spark should have a test case before it is resolved. If so, we can focus on ORC test-case parity with Parquet; adding more test cases is beyond that scope.

    We have already received many related JIRA issues and fixed/blocked them.

  2. Yes, ORC is not the default format. Even worse, although a user may try to use `.orc(...)`, PPD is not enabled by default either, so we cannot surface any PPD issues. It doesn't form a virtuous cycle at all. To be honest with you, I trust the efforts of the Apache ORC community.

gatorsmile (Member)

I have a different philosophy. I have worked on mission-critical enterprise software, and I do not trust anybody's code unless it passes the corresponding tests.

The initial ORC PPD does not have enough test coverage. To enable it by default, we should improve the test coverage now. The new test-case framework for PPD should cover all the sources; we can have source-specific ones where a test is not applicable to the other sources.

dongjoon-hyun (Member Author)

Yep. This PR is based on the following guidelines of yours, and I hope it will continue to evolve. :)

  • Handle all data sources (for value ranges, I tested against Spark min/max instead of data-source min/max)
  • Use high-level end-to-end test cases

For PPD, Parquet and ORC are the only sources that support it, aren't they?

dongjoon-hyun (Member Author)

The only point we didn't agree on is the corresponding tests.

gatorsmile (Member)

So far, Parquet and ORC are the only built-in sources that support PPD. From a software-development viewpoint, we should make the test framework easily extensible.

Sorry, I do not think it covers what we need. First, to verify that the underlying data source works properly, we need the corresponding (boundary) data inserted into the sources and predicates that use those boundary values, and we also need to ensure these predicates are pushed down and not evaluated by Spark.
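
One possible way to assert that a predicate is pushed down and not re-evaluated by Spark (a sketch; `df` is assumed to be the filtered DataFrame from a test like the ones above):

    import org.apache.spark.sql.execution.FilterExec

    val plan = df.queryExecution.executedPlan
    // File-source scans of this era report pushed predicates in their metadata.
    assert(plan.toString.contains("PushedFilters: ["))
    // No residual Spark-side Filter node should remain; this only holds when
    // the source reports the filter as fully handled via unhandledFilters.
    assert(plan.collect { case f: FilterExec => f }.isEmpty)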

dongjoon-hyun (Member Author) commented Sep 10, 2017

Today, I've reconsidered this PR from the ground up.
To sum up, I realized that I've been wasting committers' review time. I'm especially sorry for your effort, @gatorsmile!

As you recommended, I had better focus more on the individual test suites.

    So far, the coverage of OrcFilterSuite and ParquetFilterSuite is not good.
    They only have very basic checks. Could you improve them?

I'll close this PR for now. Thank you for giving me your advice!

dongjoon-hyun deleted the SPARK-DS-SUITE branch on September 10, 2017 at 00:50