[SPARK-21783][SQL] Turn on ORC filter push-down by default #18991

dongjoon-hyun · 2017-08-18T07:31:48Z

What changes were proposed in this pull request?

ORC filter push-down is disabled by default from the beginning, SPARK-2883

Now, Apache Spark starts to depend on Apache ORC 1.4.0. For Apache Spark 2.3, this PR turns on ORC filter push-down by default like Parquet (SPARK-9207) as a part of SPARK-20901, "Feature parity for ORC with Parquet".

How was this patch tested?

Pass the Jenkins with the existing tests.

SparkQA · 2017-08-18T10:07:30Z

Test build #80834 has finished for PR 18991 at commit 2bc2b17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-08-18T19:28:41Z

Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin .
Could you review this? This will help our ORC transition much more for next 3 months.
If you don't want this, you can turn off this back before Apache ORC 2.3 release.

dongjoon-hyun · 2017-08-21T18:05:39Z

Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin , @mridulm .
Could you review this?

dongjoon-hyun · 2017-08-21T18:06:00Z

Retest this please.

SparkQA · 2017-08-21T20:53:12Z

Test build #80934 has finished for PR 18991 at commit 2bc2b17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-08-23T12:40:54Z

Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin , @mridulm .
Could you reivew this ORC predicate pushdown configuration PR when you have sometime?
Thank you in advance!

dongjoon-hyun · 2017-08-25T03:14:45Z

Retest this please.

dongjoon-hyun · 2017-08-25T03:20:51Z

Hi, @cloud-fan, @gatorsmile , @sameeragarwal , @rxin , @mridulm .
Could you reivew this one liner PR about ORC PPD configuration?

SparkQA · 2017-08-25T05:50:57Z

Test build #81114 has finished for PR 18991 at commit 2bc2b17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-08-25T16:04:33Z

Hi, @gatorsmile .
Could you review this ORC PPD default configuration? Our data source doesn't trust any data sources including Parquet/ORC. I think ORC PPD do no harm on Spark.

gatorsmile · 2017-08-25T16:11:24Z

If ORC incorrectly filters out the extra rows, we might get incorrect results. In addition, we do not know whether the push down could get the performance gain. We saw the performance regression when we push down the filters to Parquet in some cases.

To ensure the code quality and result correctness, we need to port the end-to-end test cases from Apache ORC/Parquet. This will also help the community in the long term. If you have a bandwidth, you can make an attempt. These tests should cover most built-in data sources. I will review them at first.

Also cc @cloud-fan @sameeragarwal

dongjoon-hyun · 2017-08-25T16:15:11Z

Thank you for the comments and directions. Definitely, I'll try!
Since we depends on Apache Spark 1.4.0, I think I can add raw level test case somewhere for evaluation purpose only.

gatorsmile · 2017-08-25T16:23:06Z

Since I saw you are also working on the enhancement of ORC reader/writer, we need to check all the limits (e.g., value ranges). I am not sure how good Apache ORC/Parquet did in their test case coverage. Hopefully, they already have good enough end-to-end test cases, and then we can directly import them. Otherwise, we have to build our own framework for it.

For examples, below is the limit of DB2 z/OS.
https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_limits.html

dongjoon-hyun · 2017-08-25T16:26:00Z

Wow. It's real commercial spec. Thank you! I understand.
I'm looking ORC Github first.

gatorsmile · 2017-08-25T16:36:38Z

Yes. The commercial DBMS products have a very good/comprehensive test coverage. So far, it is missing in Apache Spark. Basically, we simply trust the underlying data sources, which are maintained by separate communities. Some of them are good, but the others might not be stable enough.

Without a good enough test case coverage, anything we did is risky, especially when we upgrading the releases of these built-in data sources. If the bug is from Apache Spark, we can simply fix it. However, if ORC/Parquet has a serious bug, we are unable to fix them for our users. Thus, a comprehensive test case coverage improvement is a must to have for Apache Spark.

dongjoon-hyun · 2017-08-25T16:41:06Z

+1, I cannot agree anymore.

dongjoon-hyun · 2017-08-26T06:30:55Z

Hi, @gatorsmile . orc_create.q and orc_people_create.txt seems to be used before for some end-to-end test.

Do you think SQLQueryTestSuite is suitable for the datasource end-to-end test in general? Of course, Predicate Push Down should be handled differently.

dongjoon-hyun · 2017-08-26T19:27:44Z

Hi, @gatorsmile .
I made #19060 to tap on the direction. Could you review that?

gatorsmile · 2017-08-27T06:37:11Z

orc_create.q and orc_people_create.txt are from Hive.

Writing test cases is pretty time consuming. I still hope we can get the test cases from the other open source project, instead of writing all of them by ourselves.

dongjoon-hyun · 2017-09-01T16:42:56Z

Hi, @gatorsmile , @cloud-fan, @rxin , and @omalley .

#19060 shows that the behavior of Apache ORC 1.4.0 predicate push-down is correct. #19060 will add more test cases for data source certification. Especially, if you want me to add more test cases on ORC predicate push-down, please let me know.

So, back to the original this issue, I'm not aware of the old case which old ORC incorrectly filters out the extra rows, but new Apache ORC 1.4.0 looks ready for this now. Can we turn on ORC predicate push-down by default in Apache Spark?

Enabling by default will give more opportunity for users to test it before Apache Spark 2.3.0 (on December). I'm sure that Apache ORC community will help us to ensure this feature and to get benefits from this feature.

dongjoon-hyun · 2017-09-05T16:57:13Z

Hi, @liancheng , @gatorsmile , @cloud-fan , @rxin , and @omalley .
I think you are the best people about ORC predicate pushdown issue.
Could you review this PR to turn on ORC PPD by default?

gatorsmile · 2017-09-05T17:28:35Z

Left a few comments in the another PR: #19060 (comment).

I think it is a right time to improve the test case coverage before turning ORC PPD on by default. We can target the ORC PPD to 2.3.0.

dongjoon-hyun · 2017-09-05T17:32:05Z

The test case coverage parity between Parquet and ORC should be the criteria for this, right?

dongjoon-hyun · 2017-09-05T17:33:31Z

To be more clear, I mean the existing Parquet coverage in Apache Spark code base.

gatorsmile · 2017-09-05T17:39:16Z

To avoid duplicating the efforts, we should have a unified testing framework for covering the PPD of all the sources. Parquet and ORC should be part of it. In the future, when we add the other sources with PPD capability, we can directly plug them in.

dongjoon-hyun · 2017-09-05T17:42:45Z

Is the plan aligned with the ongoing Data Source V2 ?

gatorsmile · 2017-09-05T17:49:20Z

If you want to hold, we can wait for the completion of data source API v2. Otherwise, we can start it now and change it if needed.

Conceptually, the test coverage improvement should be not related to how we implement the source APIs.

dongjoon-hyun · 2017-09-05T17:59:50Z

Of course, I want to proceed in any part of ORC!
As you know, I tried many trials to get a chance to be reviewed.
Some PR gets it, but the other ORC PR like #18953 didn't get a feedback for recent two weeks.
I'm also preparing Data Source V2, but I'm concerning about the big picture/roadmap/priority of Apache Spark PMC members. ORC looks too small to get attentions.

dongjoon-hyun · 2017-09-06T08:04:09Z

Maybe, this seems not a scope on Apache Spark 2.3.0 because it's a debut of Apache ORC 1.4.0. I'll close this PR. Thank you all for giving advice on this PR.

dongjoon-hyun · 2018-01-09T19:11:19Z

I reopen it to re-test the master branch with this option before Apache Spark 2.3.

SparkQA · 2018-01-09T22:36:04Z

Test build #85868 has finished for PR 18991 at commit 2bc2b17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-01-10T03:31:31Z

why it's still WIP?

dongjoon-hyun · 2018-01-10T03:41:54Z

Ur, originally, it's not accepted by @gatorsmile due to lack of test cases.
So, Today, I reopen it for testing purpose.
Do you think we can enable it? I think we can.

gatorsmile · 2018-01-10T03:54:38Z

I expect we can port more test cases to Spark, instead of relying on quality assurance of external data sources.

gatorsmile · 2018-01-10T03:57:46Z

Any perf number ?

cloud-fan · 2018-01-10T07:06:19Z

I'd expect orc has same test coverage as parquet, is it true?

dongjoon-hyun · 2018-01-10T15:36:44Z

Yes, @cloud-fan . I added the same test coverage for ORC in Apache Spark.
Sorry, @gatorsmile . I always turned on PPD, so there is no perf number for PPD=false.

gatorsmile · 2018-01-10T16:44:09Z

What is the performance number when turning it on, compared with the off mode? Do we have a micro-benchmark suite?

dongjoon-hyun · 2018-01-10T17:47:03Z

@gatorsmile . I don't have any numbers for PPD=false.
For ORCReadBenchmark, we use PPD=true here. You can turn off it to false, but generally it's not designed for PPD perf benchmark.

Do you want to add some?

cloud-fan · 2018-01-11T10:20:58Z

Yea let's add some, I'm curious to see how well PPD works in ORC, since for parquet PPD doesn't work well and we disable record level filtering for parquet.

dongjoon-hyun · 2018-01-11T17:25:25Z

Ur, it's not record-level filtering. Maybe, it's because I explained it too abstractly here. It's stripe-level. So, the current ORC in Spark works in the same way with the current Parquet's behavior in Spark. Spark's assumption is just giving a hint to underlying data formats with spark.sql.(orc|parquet).filterPushdown and do the filtering later inside Spark.

If both of you requires that, let's revisit later in 2.4 timeframe. When I reopened this two days ago, the purpose is just to make it sure for that option in Apache.

cloud-fan · 2018-01-12T16:19:19Z

I think we still have time for 2.3? I'm not worried about correctness, but we should show people how much it improves.

dongjoon-hyun · 2018-01-12T18:41:27Z

Oh, do we have time for 2.3?

cloud-fan · 2018-01-13T02:29:02Z

conventionally rc1 would fail so we still have time :)

gatorsmile · 2018-01-13T18:19:59Z

Yes. Please work on the perf tests and the benchmark test suite.

I think the priority of this PR is much higher than the test-only PR you are working on

dongjoon-hyun · 2018-01-14T08:24:32Z

I see. I opened a new PR, #20265, for that.

[SPARK-21783][SQL] Turn on ORC filter push-down by default

2bc2b17

dongjoon-hyun mentioned this pull request Aug 20, 2017

[SPARK-21783][SQL]Turn on ORC filter push-down by default #19007

Closed

dongjoon-hyun mentioned this pull request Sep 5, 2017

[WIP][SQL] Add DataSourceSuite validating data sources limitations #19060

Closed

dongjoon-hyun closed this Sep 6, 2017

dongjoon-hyun deleted the SPARK-21783 branch September 6, 2017 08:04

dongjoon-hyun changed the title ~~[SPARK-21783][SQL] Turn on ORC filter push-down by default~~ [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by default Jan 9, 2018

dongjoon-hyun restored the SPARK-21783 branch January 9, 2018 19:10

dongjoon-hyun reopened this Jan 9, 2018

dongjoon-hyun changed the title ~~[SPARK-21783][SQL][WIP] Turn on ORC filter push-down by default~~ [SPARK-21783][SQL] Turn on ORC filter push-down by default Jan 10, 2018

dongjoon-hyun closed this Jan 11, 2018

dongjoon-hyun deleted the SPARK-21783 branch January 11, 2018 17:25

[SPARK-21783][SQL] Turn on ORC filter push-down by default #18991

[SPARK-21783][SQL] Turn on ORC filter push-down by default #18991

Uh oh!

Conversation

dongjoon-hyun commented Aug 18, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Aug 18, 2017

Uh oh!

dongjoon-hyun commented Aug 18, 2017

Uh oh!

dongjoon-hyun commented Aug 21, 2017

Uh oh!

dongjoon-hyun commented Aug 21, 2017

Uh oh!

SparkQA commented Aug 21, 2017

Uh oh!

dongjoon-hyun commented Aug 23, 2017

Uh oh!

dongjoon-hyun commented Aug 25, 2017

Uh oh!

dongjoon-hyun commented Aug 25, 2017

Uh oh!

SparkQA commented Aug 25, 2017

Uh oh!

dongjoon-hyun commented Aug 25, 2017

Uh oh!

gatorsmile commented Aug 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Aug 25, 2017

Uh oh!

gatorsmile commented Aug 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Aug 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gatorsmile commented Aug 25, 2017

Uh oh!

dongjoon-hyun commented Aug 25, 2017

Uh oh!

dongjoon-hyun commented Aug 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Aug 26, 2017

Uh oh!

gatorsmile commented Aug 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Sep 1, 2017

Uh oh!

dongjoon-hyun commented Sep 5, 2017

Uh oh!

gatorsmile commented Sep 5, 2017

Uh oh!

dongjoon-hyun commented Sep 5, 2017

Uh oh!

dongjoon-hyun commented Sep 5, 2017

Uh oh!

gatorsmile commented Sep 5, 2017

Uh oh!

dongjoon-hyun commented Sep 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gatorsmile commented Sep 5, 2017

Uh oh!

dongjoon-hyun commented Sep 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Sep 6, 2017

Uh oh!

dongjoon-hyun commented Jan 9, 2018

Uh oh!

SparkQA commented Jan 9, 2018

Uh oh!

cloud-fan commented Jan 10, 2018

Uh oh!

dongjoon-hyun commented Aug 18, 2017 •

edited

Loading

gatorsmile commented Aug 25, 2017 •

edited

Loading

gatorsmile commented Aug 25, 2017 •

edited

Loading

dongjoon-hyun commented Aug 25, 2017 •

edited

Loading

dongjoon-hyun commented Aug 26, 2017 •

edited

Loading

gatorsmile commented Aug 27, 2017 •

edited

Loading

dongjoon-hyun commented Sep 5, 2017 •

edited

Loading

dongjoon-hyun commented Sep 5, 2017 •

edited

Loading

gatorsmile commented Jan 10, 2018 •

edited

Loading

gatorsmile commented Jan 10, 2018 •

edited

Loading

dongjoon-hyun commented Jan 10, 2018 •

edited

Loading

dongjoon-hyun commented Jan 14, 2018 •

edited

Loading