[WIP][SQL] Add DataSourceSuite validating data sources limitations #19060
Conversation
Hi, @gatorsmile.
Test build #81155 has finished for PR 19060 at commit
Test build #81157 has finished for PR 19060 at commit
Do ORC/Parquet have related test cases? Could we just port them?
For ORC, I was able to find some randomly generated float/double tests, but there is no value-limit test. For Parquet, I'm not sure.
For predicate push-down, I already ported the ORC code in this PR.
BTW, as a text-based format, I thought we could include JSON from the beginning. But, unfortunately, I found that Spark currently infers BIGINT for all integer types, DOUBLE for float/double, and STRING for DATE/TIMESTAMP/STRING columns.
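As a rough illustration of that inference behavior (an editorial sketch, not code from this PR; the sample record and field names are made up, and `spark` is assumed to be an active SparkSession):

```scala
import org.apache.spark.sql.Encoders

// Sketch: JSON schema inference widens integers to BIGINT (long), keeps floats as
// DOUBLE, and leaves timestamp-looking strings as STRING.
val sample = spark.createDataset(
  Seq("""{"i": 1, "f": 1.5, "ts": "2017-08-29 00:00:00"}"""))(Encoders.STRING)

spark.read.json(sample).printSchema()
// root
//  |-- f: double (nullable = true)
//  |-- i: long (nullable = true)
//  |-- ts: string (nullable = true)
```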
The ORC test is very specific to the impl. No end-to-end test cases?
In the ORC GitHub repository, to my knowledge, this is the highest-level test.
Actually, I have one idea: if we support unhandledFilters in data source testing via SQLConf, we can test it at the Spark level, like the value-limit case. What do you think about that? May I try that way?
How about Parquet and the others?
Parquet is the same. We can
If you agree, I will try to write more code here as a POC.
I mean, how about Parquet and the others? Do they have end-to-end test cases in their projects?
For Parquet, I can find TestInputOutputFormat.java. Parquet's test case is also very specific to the impl. What do you mean by
The previous Parquet link is broken. The official one is https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/example/TestInputOutputFormat.java
Test build #81298 has finished for PR 19060 at commit
Hi, @gatorsmile.
Test build #81318 has finished for PR 19060 at commit
Hi, @gatorsmile.
Test build #81365 has finished for PR 19060 at commit
.save(dir.getCanonicalPath)

val df = spark.read.format(dataSource).load(dir.getCanonicalPath)
  .where(s"i BETWEEN 1500 AND 1999")
This has been tested in OrcFilterSuite and ParquetFilterSuite, right? What is the goal of this test case?
OrcFilterSuite and ParquetFilterSuite do not test stripe-level filtering within a single file; they only exercise file-level filtering, don't they?
This test suite is designed to prove that PPD on the ORC data source doesn't filter out rows wrongly. Since ORC does not filter row-by-row, this PR checks all the values of the returned stripe, and all values in the stripe come back correctly in this test case.
IIRC, this is what you requested in SPARK-21783 (Turn on ORC filter push-down by default). If OrcFilterSuite is complete, I'm wondering what the reason is to block turning on ORC PPD by default.
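As a rough sketch of that end-to-end check (editorial illustration, not the exact PR code: `withTempDir`/`withSQLConf` are assumed to come from Spark's SQL test utilities, and a real test would also force small stripes so the predicate actually prunes at the stripe level):

```scala
// Sketch: write a known range of values, read it back with a pushed-down range
// predicate, and assert that no matching row was dropped by stripe-level filtering.
withTempDir { dir =>
  withSQLConf("spark.sql.orc.filterPushdown" -> "true") {
    spark.range(0, 2000).toDF("i")
      .write.mode("overwrite").orc(dir.getCanonicalPath)

    val df = spark.read.orc(dir.getCanonicalPath).where("i BETWEEN 1500 AND 1999")

    // Every expected value must survive the push-down path.
    assert(df.collect().map(_.getLong(0)).toSet == (1500L to 1999L).toSet)
  }
}
```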
I do not know why we did not turn it on by default. To me, to turn it on, we need to improve the coverage.
So far, the coverage of OrcFilterSuite and ParquetFilterSuite is not good; they only have very basic checks. Could you improve them?
For example, by adding boundary values on both sides of these pushed-down predicates? We need to ensure that the predicates are pushed down, executed in the underlying data sources, and that the filters work properly (value comparison).
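One way to cover the "filters still work properly" part end to end (a sketch under assumptions: `path` points at ORC data with a BIGINT column `i`, and `spark.sql.orc.filterPushdown` is the flag discussed in this thread; a real suite would restore the conf via `withSQLConf`):

```scala
// Sketch: a boundary predicate must return exactly the same rows whether it is
// evaluated inside the ORC reader (pushdown on) or by Spark itself (pushdown off).
def valuesFor(path: String, predicate: String, pushdown: Boolean): Set[Long] = {
  spark.conf.set("spark.sql.orc.filterPushdown", pushdown.toString)
  spark.read.orc(path).where(predicate).select("i")
    .collect().map(_.getLong(0)).toSet
}

// Boundary values on both sides of the predicate.
val predicate = s"i <= ${Int.MinValue} OR i >= ${Int.MaxValue}"
assert(valuesFor(path, predicate, pushdown = true) ==
  valuesFor(path, predicate, pushdown = false))
```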
I agree with you on improving the coverage, @gatorsmile. So I've started this and have been trying to improve it in order to find the approach you want.
Given that the situation is the same for Parquet and ORC, what I suggested in #18991 is to turn on ORC PPD by default at least before the 2.3.0 release. What do you think about that?
Parquet is the default format. It is used by most of our Spark users. We have already received many related JIRA issues and then fixed/blocked them. You were also involved in some of those PRs.
To avoid repeating the same issues in ORC, we should improve the coverage before turning on ORC predicate pushdown by default.
- Every JIRA issue in Apache Spark should have a test case in order to be resolved. If so, we can focus on ORC test-case parity with Parquet; adding more test cases is beyond that scope.

> We already got many related JIRA issues and then fixed/blocked them.

- Yes. ORC is not the default format. Even worse, although a user tries to use `.orc(...)`, PPD is not used by default either, so we cannot find any issues on PPD. It doesn't form a virtuous cycle at all. To be honest with you, I trust the efforts of the Apache ORC community.
I have a different philosophy. I worked on mission-critical enterprise software; I do not trust anybody's code unless it can pass the corresponding tests.
The initial ORC PPD does not have enough test coverage. To enable it by default, we should improve the test coverage now. The new test framework for PPD should cover all the sources; we can have source-specific tests where something is not applicable to the other sources.
Yep. This PR is based on the following guidelines of yours, and I hope it will continue to evolve. :)
- Handle all data sources (for value ranges, I tested against Spark's min/max instead of the data sources' min/max)
- Use high-level end-to-end test cases
For PPD, Parquet and ORC are the only sources supporting it, aren't they?
What we didn't agree on is only the corresponding tests.
So far, Parquet and ORC are the only built-in sources that support PPD. From a software-development viewpoint, we should make the test framework easily extensible.
Sorry, I do not think it covers what we need. First, to verify that the underlying data source works properly, we need the corresponding (boundary) data inserted into the sources, predicates that use boundary values, and we also need to ensure these predicates are pushed down and not evaluated by Spark.
Today, I've reconsidered this PR from the bottom up. As you recommended, I had better focus more on the individual test suites.
I'll close this PR for now. Thank you for your advice!
What changes were proposed in this pull request?
This test suite aims to test whether the underlying data sources support Spark's data value limits and PPD.
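A condensed sketch of the kind of value-limit round trip the suite targets (editorial illustration only: the column name, the list of formats, and the `withTempDir` helper are assumptions, and `spark` is the test session):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{LongType, StructType}

// Sketch: round-trip Spark's extreme BIGINT values through each data source and
// make sure they come back unchanged.
Seq("orc", "parquet", "json", "csv").foreach { dataSource =>
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    val values = Seq(Long.MinValue, 0L, Long.MaxValue)

    spark.createDataset(values)(Encoders.scalaLong).toDF("i")
      .write.mode("overwrite").format(dataSource).save(path)

    // The explicit schema matters for the text formats; otherwise JSON/CSV
    // would re-infer the column types on read.
    val readBack = spark.read.format(dataSource)
      .schema(new StructType().add("i", LongType))
      .load(path)

    assert(readBack.collect().map(_.getLong(0)).toSet == values.toSet)
  }
}
```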
How was this patch tested?
Pass the Jenkins.