[SPARK-21783][SQL] Turn on ORC filter push-down by default #20265

Closed · wants to merge 4 commits

Conversation

@dongjoon-hyun (Member) commented Jan 14, 2018

What changes were proposed in this pull request?

ORC filter push-down has been disabled by default from the beginning (SPARK-2883).

Now that Apache Spark depends on Apache ORC 1.4.1, this PR turns on ORC filter push-down by default for Apache Spark 2.3, matching Parquet (SPARK-9207), as part of SPARK-20901, "Feature parity for ORC with Parquet".
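For reference, the flag being flipped is spark.sql.orc.filterPushdown (SQLConf.ORC_FILTER_PUSHDOWN_ENABLED). A minimal sketch of toggling it per session; the path and predicate are illustrative only, not from this PR:

// With this PR, the default is already "true"; set explicitly here for clarity.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.read.orc("/tmp/orc_table").filter("id < 1").count()  // reader can skip stripes via min/max stats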

How was this patch tested?

Pass the existing tests.

@dongjoon-hyun (Member, Author):

cc @cloud-fan, @gatorsmile.

@SparkQA commented Jan 14, 2018

Test build #86116 has finished for PR 20265 at commit dda5bdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -483,6 +484,64 @@ object OrcReadBenchmark {
}
}

def filterPushDownBenchmark(values: Int, width: Int): Unit = {
@dongjoon-hyun (Member, Author) · Jan 14, 2018:

Filter push-down depends on various properties of the data and the predicates. This is just one example of filter push-down performance, to show some of the benefits.

Member:

Have you seen any workload where predicate push-down could be slower?

Member (Author):

Theoretically, useless predicates (100% selectivity) only add extra computation for both Parquet and ORC.
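To illustrate with a hypothetical example (not from this PR): given a non-negative id column, the predicate below matches every row, so the reader still evaluates per-stripe (ORC) or per-row-group (Parquet) min/max statistics but never skips any data.

// 100% selectivity: the pushed-down filter prunes nothing and only adds stat checks.
spark.read.orc("/tmp/t").filter("id > -1").count()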

Member:

Could you add a test case for useless predicates too?

Member (Author):

Do you expect there will be much difference in some cases?
In the most common cases, it will be slightly slower, as we would expect.

withTempTable("t1", "nativeOrcTable", "hiveOrcTable") {
import spark.implicits._
val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
val whereExpr = (1 to width).map(i => s"NOT c$i LIKE '%not%exist%'").mkString(" AND ")
Member:

Why not use a simple predicate, like col > 5?

Member (Author):

Is it important for this config PR?
Let’s focus on the original purpose of this PR.

Contributor:

This is kind of the best case for PPD, since the data is sorted. I'm fine with it, but let's add some more cases, at least == and >. We should follow the other benchmarks in this file to make it complete.

Member (Author):

@cloud-fan and @gatorsmile:

The best case for PPD is when Spark needs to do lots of processing on the returned rows while the ORC reader returns only one stripe with minimal CPU cost.

So, I designed this benchmark in order to show the difference clearly.

  1. The pushed-down predicate is only uniqueID = 0 (minimal). We can change that into uniqueID == or uniqueID >.
  2. The LIKE predicate is chosen because it is not pushed down and makes Spark do more processing. It's just one example of that kind of operation; you can ignore those predicates. We could choose some UDFs instead.

Member (Author):

I mean LIKE '%not%exist%' will not be optimized by LikeSimplification.
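For context, a sketch of how Catalyst's LikeSimplification rule treats patterns (the data here is illustrative):

// 'abc%' -> StartsWith, '%abc' -> EndsWith, '%abc%' -> Contains, 'abc' -> EqualTo.
// '%not%exist%' has interior wildcards, so it is left as a full LIKE evaluation.
spark.range(10).selectExpr("CAST(id AS STRING) AS c1")
  .filter("c1 LIKE '%not%exist%'")
  .explain(true)  // the optimized plan still contains a Like predicate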

Contributor:

Oh sorry, I missed the uniqueID part. So the LIKE operation is just there to make the difference larger? We don't need that; a simple predicate like col = 1 or col < 1 is enough to show how much PPD normally improves performance.

Member:

The goal of this benchmark is not to show the best case of PPD. We just want to see the perf difference in the most common cases.

Member (Author):

I see, @cloud-fan and @gatorsmile.
For the most common cases, I have wondered the same about Parquet, too.


Filter Pushdown: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Native ORC MR (Pushdown=false) 16169 / 16193 0.3 3084.0 1.0X
@cloud-fan (Contributor) · Jan 15, 2018:

Let's focus on PPD for this benchmark and not disable the vectorized reader, e.g.:

col < 3
col < 3 (Pushdown)
col = 3
col = 3 (Pushdown)
...

Member (Author):

Yep, I see. Focusing on PPD with the best reader.

@dongjoon-hyun (Member, Author):
Hi, @cloud-fan and @gatorsmile.
Your questions are valid for all PPD cases. Based on the comments, I added the following expressions (positive and negative) for both ORC and Parquet.

+    // Positive cases: Select one or no rows
+    Seq("id = 0", "id == 0", "id <= 0", "id < 1", "id IS NULL").foreach { expr =>
+      filterPushDownBenchmark(1024 * 1024 * 1, 20, expr)
+    }
+
+    // Negative cases: Select all rows which means the predicate is always true.
+    Seq("id > -1", "id != -1", "id IS NOT NULL").foreach { expr =>
+      filterPushDownBenchmark(1024 * 1024 * 1, 20, expr)
+    }

Filter Pushdown (id > -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized 8346 / 8516 0.1 7959.8 1.0X
Parquet Vectorized (Pushdown) 8611 / 8630 0.1 8212.4 1.0X
Member (Author):

@gatorsmile, this shows the case you asked about. It happens here for Parquet, and at line 169 for ORC.

Member:

Thanks for your work! The benchmark suite is pretty useful.


def main(args: Array[String]): Unit = {
// Positive cases: Select one or no rows
Seq("id = 0", "id == 0", "id <= 0", "id < 1", "id IS NULL").foreach { expr =>
Member:

"id == 0", "id <= 0", "id < 1" -> "id <= 1024 * 500", "id < 1024 * 500", "id > 1024 * 499 and id < 1024 * 500"

Contributor:

maybe we can use a 10% selectivity predicate?

Member (Author):

I'll try to embrace both requests.
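For illustration, assuming a sequential id column over [0, numRows), a ~10% selectivity predicate could be built like this (a hypothetical sketch, not the final code):

// Roughly 10% of rows satisfy the filter.
val whereExpr = s"id < ${numRows / 10}"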

def main(args: Array[String]): Unit = {
// Positive cases: Select one or no rows
Seq("id = 0", "id == 0", "id <= 0", "id < 1", "id IS NULL").foreach { expr =>
filterPushDownBenchmark(1024 * 1024 * 1, 20, expr)
Member:

1024 * 1024 * 1 -> 1024 * 1024

Member (Author):

Yep.

}

def main(args: Array[String]): Unit = {
// Positive cases: Select one or no rows
@gatorsmile (Member) · Jan 16, 2018:

Split the positive case into multiple cases, as suggested above. We need to see the perf for different predicate types.

df.createOrReplaceTempView("t1")
prepareTable(dir, spark.sql("SELECT * FROM t1"))

Seq(false, true).foreach { value =>
Member:

value -> pushDownEnabled

Member (Author):

Done.


// Set default configs. Individual cases will change them if necessary.
spark.conf.set(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key, "true")
spark.conf.set(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key, "true")
Member:

Do we need it?

Contributor:

I think it's fine, then we don't need to care about what the default value is.

Contributor:

Ah, we don't need it; we always set them in the benchmark cases.

filterPushDownBenchmark(1024 * 1024 * 1, 20, expr)
}

// Negative cases: Select all rows which means the predicate is always true.
Member:

This is not a negative case, conceptually.

Contributor:

maybe good cases vs bad cases?

Filter Pushdown (id != -1): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized 8088 / 8297 0.1 7713.2 1.0X
Parquet Vectorized (Pushdown) 7110 / 8674 0.1 6780.8 1.1X
Member:

The difference between the best and the average is big. We need to increase minNumIters.

Member (Author):

I'll increase from 2 to 5.
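For reference, a sketch of the knob being discussed, assuming Spark's org.apache.spark.util.Benchmark utility (other constructor arguments omitted):

// Run each case at least 5 times so the best and average times converge.
val benchmark = new Benchmark(s"Filter Pushdown ($whereExpr)", numRows, minNumIters = 5)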

Parquet Vectorized 2267 / 2287 0.5 2162.0 1.0X
Parquet Vectorized (Pushdown) 735 / 803 1.4 701.1 3.1X
Native ORC Vectorized 1708 / 1718 0.6 1629.1 1.3X
Native ORC Vectorized (Pushdown) 83 / 88 12.7 79.0 27.4X
Contributor:

This is amazing. Any idea why ORC is so much faster than Parquet in this case?

Native ORC Vectorized 1708 / 1718 0.6 1629.1 1.3X
Native ORC Vectorized (Pushdown) 83 / 88 12.7 79.0 27.4X

Filter Pushdown (id == 0): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
Contributor:

what's the difference between id = 0 and id == 0? Do you want id <=> 0?

Member (Author):

Oops. Yes. <=>.
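For readers unfamiliar with the operator: <=> is Spark SQL's null-safe equality. A quick illustration:

spark.sql("SELECT NULL = 0").show()    // NULL: ordinary equality propagates nulls
spark.sql("SELECT NULL <=> 0").show()  // false: null-safe equality never returns NULL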

@cloud-fan (Contributor):
I think we need to make sure the Parquet row group size and the ORC stripe size are the same, to make this benchmark fair.
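A hedged sketch of one way to align them; the keys below are the writers' standard Hadoop configuration settings, not something this PR changes:

// Use 128 MB for both the Parquet row group size and the ORC stripe size.
val bytes = 128L * 1024 * 1024
spark.sparkContext.hadoopConfiguration.setLong("parquet.block.size", bytes)
spark.sparkContext.hadoopConfiguration.setLong("orc.stripe.size", bytes)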

@SparkQA commented Jan 16, 2018

Test build #86143 has finished for PR 20265 at commit 440f76b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author):
I'll update the PR tomorrow.

@dongjoon-hyun (Member, Author):
I updated the PR (except one RowGroupSize/OrcStripeSize part).

@SparkQA commented Jan 17, 2018

Test build #86200 has finished for PR 20265 at commit 87af693.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


def main(args: Array[String]): Unit = {
val numRows = 1024 * 1024
val width = 20
Contributor:

I think 20 is not a common table width; how about 5?

Member (Author):

No problem. I'll set it to 5.

@cloud-fan (Contributor):
LGTM except one comment. Let's worry about row group/stripe size later; since both Parquet and ORC use default settings, I think it's still fair.

}

def main(args: Array[String]): Unit = {
val numRows = 1024 * 1024
Contributor:

shall we increase the number of rows?

Contributor:

I'm afraid the resulting parquet/orc file is too small to benchmark PPD.

Member (Author):

Yep. I'll increase to 1024 * 1024 * 15.

filterPushDownBenchmark(numRows, title, whereExpr)
}

val selectExpr = (1 to width).map(i => s"LENGTH(c$i)").mkString("SUM(", "+", ")")
@dongjoon-hyun (Member, Author) · Jan 17, 2018:

Since the data size increased, I used SUM(LENGTH(c1)+...) instead of * for the following cases.

Contributor:

maybe simply max(c1), max(c2), ...?
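The suggested alternative, sketched with the benchmark's width variable (one MAX per column still forces every column to be read while keeping the result tiny):

val selectExpr = (1 to width).map(i => s"MAX(c$i)").mkString(", ")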

Member (Author):

I see. I'll update and rerun.

@gatorsmile (Member):
ORC performs even better when the number of columns is small. Maybe also add test cases to show this observation?

@dongjoon-hyun (Member, Author):
@gatorsmile, the number of rows also changed. Why do you think so?

@dongjoon-hyun (Member, Author) commented Jan 17, 2018:

There might be many more questions about ORC (or Parquet) performance benchmarks; we can address those later. We cannot enumerate every case at once, and users can run benchmarks for their own workloads. In fact, Apache Spark didn't show this kind of benchmark when it turned on PPD for Parquet. If a Parquet benchmark had existed, this PR would have been a piece of cake.

I think this PR is enough to show the benefit of ORC PPD and to justify enabling the config.

@SparkQA commented Jan 17, 2018

Test build #86251 has finished for PR 20265 at commit a556169.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2018

Test build #86257 has finished for PR 20265 at commit eb7035d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
Thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 17, 2018
## What changes were proposed in this pull request?

ORC filter push-down has been disabled by default from the beginning ([SPARK-2883](aa31e43#diff-41ef65b9ef5b518f77e2a03559893f4dR149)).

Now that Apache Spark depends on Apache ORC 1.4.1, this PR turns on ORC filter push-down by default for Apache Spark 2.3, matching Parquet ([SPARK-9207](https://issues.apache.org/jira/browse/SPARK-9207)), as part of [SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901), "Feature parity for ORC with Parquet".

## How was this patch tested?

Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #20265 from dongjoon-hyun/SPARK-21783.

(cherry picked from commit 0f8a286)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit closed this in 0f8a286 on Jan 17, 2018.
@dongjoon-hyun (Member, Author):
Thank you so much, @cloud-fan and @gatorsmile !

@dongjoon-hyun deleted the SPARK-21783 branch on January 17, 2018 at 16:11.