[SPARK-13511][SQL] Add wholestage codegen for limit #11391

Closed
viirya wants to merge 10 commits into apache:master from viirya:wholestage-limit

Conversation

viirya
Member

@viirya viirya commented Feb 26, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13511

What changes were proposed in this pull request?

The current limit operators don't support wholestage codegen. This PR adds support for it.

In the doConsume of GlobalLimit and LocalLimit, we use a count term to count the processed rows. Once the row count reaches the limit, we set the stopEarly variable of BufferedRowIterator (newly added in this PR) to true, indicating that we want to stop processing the remaining rows. When the wholestage codegen framework then checks shouldStop(), it stops processing the row iterator.
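The mechanism above can be sketched in plain Java (illustrative only; the class and method names below are hypothetical stand-ins, not Spark's actual generated code, though `shouldStop()` mirrors the real BufferedRowIterator hook):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of how a count-based early stop interacts with the framework's
// shouldStop() polling. SketchLimitIterator is a hypothetical name.
class SketchLimitIterator {
    private boolean stopEarly = false; // flipped once the limit is reached
    private int count = 0;
    private final int limit;
    private final List<Integer> out = new ArrayList<>();

    SketchLimitIterator(int limit) { this.limit = limit; }

    // Mirrors BufferedRowIterator.shouldStop(): the processing loop polls
    // this between rows and exits once it returns true.
    private boolean shouldStop() { return stopEarly; }

    List<Integer> processAll(Iterator<Integer> input) {
        while (input.hasNext() && !shouldStop()) {
            int row = input.next();
            // Corresponds to the body emitted by doConsume: count the row,
            // consume it downstream, and stop early at the limit.
            count += 1;
            out.add(row);
            if (count >= limit) {
                stopEarly = true;
            }
        }
        return out;
    }
}
```

With a limit of 3 and a five-element input, only the first three rows are consumed; the rest of the iterator is never touched.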

Before this, the executed plan for a query sqlContext.range(N).limit(100).groupBy().sum() is:

TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L])
+- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L])
   +- GlobalLimit 100
      +- Exchange SinglePartition, None
         +- LocalLimit 100
            +- Range 0, 1, 1, 524288000, [id#5L]

After adding wholestage codegen support:

WholeStageCodegen
:  +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L])
:     +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L])
:        +- GlobalLimit 100
:           +- INPUT
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- LocalLimit 100
      :     +- Range 0, 1, 1, 524288000, [id#40L]

How was this patch tested?

A test is added into BenchmarkWholeStageCodegen.

@SparkQA

SparkQA commented Feb 26, 2016

Test build #52046 has finished for PR 11391 at commit a213ed1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode with CodegenSupport
    • trait BaseLimit extends UnaryNode with CodegenSupport

@viirya
Member Author

viirya commented Feb 26, 2016

Just realized this implementation is wrong. Will fix it later.

@hvanhovell
Copy link
Contributor

Mark it as WIP?

| $countTerm += 1;
| ${consume(ctx, ctx.currentVars)}
| }
| if (true) return;
Contributor

This is special?
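(Editorial note on the `if (true) return;` being asked about: this is a common idiom in Java code generators. The compiler rejects any statement that follows a bare `return;` as unreachable code, but code after `if (true) return;` is still treated as reachable, so the generator can append further statements without tracking whether a `return` was already emitted. A minimal sketch:)

```java
// Why generated code emits `if (true) return;` rather than `return;`:
// javac's reachability analysis flags statements after an unconditional
// `return` as a compile error, but does not do so after `if (true) return;`.
class ReturnTrick {
    static int withTrick() {
        if (true) return 1;
        return 2; // compiles: javac still treats this line as reachable
    }
    // static int withoutTrick() {
    //     return 1;
    //     return 2; // would NOT compile: "unreachable statement"
    // }
}
```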

@viirya viirya changed the title [SPARK-13511][SQL] Add wholestage codegen for limit [SPARK-13511][SQL][WIP] Add wholestage codegen for limit Feb 26, 2016
@SparkQA

SparkQA commented Feb 26, 2016

Test build #52053 has finished for PR 11391 at commit fa38e7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 26, 2016

Can you update the pr description to describe briefly how you are supporting it?

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52107 has finished for PR 11391 at commit 8bac699.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52110 has finished for PR 11391 at commit 9c072aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode

@viirya
Member Author

viirya commented Feb 27, 2016

The failed test ParquetHadoopFsRelationSuite is due to the lack of short type support in UnsafeRowParquetRecordReader. I submitted another PR #11412 to fix it. The change is also included here to show that the tests pass.

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52118 has finished for PR 11391 at commit 8e69d8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-13511][SQL][WIP] Add wholestage codegen for limit [SPARK-13511][SQL] Add wholestage codegen for limit Feb 28, 2016
@viirya
Member Author

viirya commented Feb 29, 2016

cc @davies

@@ -765,6 +765,9 @@ private void readIntBatch(int rowId, int num, ColumnVector column) throws IOExce
} else if (DecimalType.is64BitDecimalType(column.dataType())) {
defColumn.readIntsAsLongs(
num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
} else if (column.dataType() == DataTypes.ShortType) {
Contributor

Could you pull this out as a separate PR, so we can merge it quickly?

Member Author

Ah, it has been merged, let me sync with it.

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52213 has finished for PR 11391 at commit c887cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -35,6 +35,8 @@
// used when there is no column in output
protected UnsafeRow unsafeRow = new UnsafeRow(0);

protected boolean stopEarly = false;
Contributor

Since stopEarly is only accessed from generated functions, we don't need this anymore.

Contributor

We could use addMutableState

Member Author

yeah. I am updating it.
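(Editorial note: the suggestion above is to register `stopEarly` via Spark's `CodegenContext.addMutableState`, which collects per-operator field declarations and initializers to be emitted into the generated class, instead of putting the field on the shared BufferedRowIterator base class. A hypothetical miniature of that idea, in Java rather than Spark's actual Scala API:)

```java
import java.util.ArrayList;
import java.util.List;

// MiniCodegenContext is a made-up illustration of the addMutableState
// pattern: operators register mutable fields, and the code generator later
// splices the collected declarations and init statements into the
// generated class, so state like stopEarly lives in generated code.
class MiniCodegenContext {
    private final List<String> declarations = new ArrayList<>();
    private final List<String> inits = new ArrayList<>();

    void addMutableState(String javaType, String name, String initCode) {
        declarations.add("private " + javaType + " " + name + ";");
        inits.add(initCode);
    }

    String declareMutableStates() { return String.join("\n", declarations); }
    String initMutableStates() { return String.join("\n", inits); }
}
```

For example, `ctx.addMutableState("boolean", "stopEarly", "stopEarly = false;")` would yield the field declaration `private boolean stopEarly;` plus its initializer in the generated class.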


assert(plan.find(p =>
p.isInstanceOf[WholeStageCodegen] &&
p.asInstanceOf[WholeStageCodegen].plan.isInstanceOf[Sort] &&
Contributor

The sort is not related to limit; could you remove it from this PR? (We may revert the commit for sort.)

Member Author

Yeah, since we can't leave limit as the last operator (otherwise it gets transformed into a collect limit), I added a sort here. I will remove it once I am back at my laptop (in a few hours).

Contributor

These kinds of tests are easy to break; we may not need this.

Member Author

Agreed. Let me remove this later.

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52219 has finished for PR 11391 at commit 8d254d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52222 has finished for PR 11391 at commit b64e52d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52237 has finished for PR 11391 at commit 3d1e397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Mar 1, 2016

LGTM, merging this into master, thanks!

@asfgit asfgit closed this in c43899a Mar 1, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#11391 from viirya/wholestage-limit.
@viirya viirya deleted the wholestage-limit branch December 27, 2023 18:33