[SPARK-32934][SQL] Improve the performance for NTH_VALUE and refactor the OffsetWindowFunction #29800

Closed
wants to merge 63 commits into from

Contributor

@beliefer beliefer commented Sep 18, 2020

What changes were proposed in this pull request?

Spark SQL supports window functions such as NTH_VALUE.
If the window frame is UNBOUNDED PRECEDING AND CURRENT ROW or UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, we can eliminate some calculations.
For example, consider the SQL shown below:

SELECT NTH_VALUE(col, 2) OVER (
         ORDER BY rank
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;

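For every row whose row number is greater than 1, the output is the same fixed value (the value of col in the second row of the partition); otherwise the output is NULL. So we only need to calculate the value once and check whether the row number is less than 2.
The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is simpler: every row in the partition returns the same value.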
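
The shortcut described above can be sketched outside Spark with plain Scala collections. This is only an illustration under stated assumptions, not the PR's implementation: `NthValueSketch`, `naive`, `optimized` and `unbounded` are hypothetical names, `None` stands in for SQL NULL, and the sample rows are borrowed from the test output quoted later in this conversation.

```scala
object NthValueSketch {
  // Naive evaluation: for every row, rebuild the frame
  // (UNBOUNDED PRECEDING .. CURRENT ROW) and look up its n-th element.
  def naive[A](partition: Vector[A], n: Int): Vector[Option[A]] =
    partition.indices.toVector.map { i =>
      val frame = partition.take(i + 1)
      frame.lift(n - 1)
    }

  // Shortcut for UNBOUNDED PRECEDING .. CURRENT ROW: the n-th row of the
  // partition is looked up once; each row only compares its 1-based row
  // number with n.
  def optimized[A](partition: Vector[A], n: Int): Vector[Option[A]] = {
    val nth = partition.lift(n - 1) // computed once per partition
    partition.indices.toVector.map { i =>
      if (i + 1 < n) None else nth
    }
  }

  // Shortcut for UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING: every row sees
  // the whole partition, so every row gets the same answer.
  def unbounded[A](partition: Vector[A], n: Int): Vector[Option[A]] = {
    val nth = partition.lift(n - 1)
    partition.map(_ => nth)
  }
}
```

With `Vector("Larry Bott", "Gerard Bondur", "Pamela Castillo")` and `n = 2`, `naive` and `optimized` agree on `None, Some("Gerard Bondur"), Some("Gerard Bondur")`, while `naive` does work proportional to the frame size for each row and `optimized` does a single lookup.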

Why are the changes needed?

Improve the performance for NTH_VALUE, FIRST_VALUE and LAST_VALUE.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

Jenkins test.

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34920/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34924/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34926/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34924/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34926/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34927/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34929/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34929/

@SparkQA

SparkQA commented Oct 27, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34927/

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130322 has finished for PR 29800 at commit ca574d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130324 has finished for PR 29800 at commit fd59f6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130326 has finished for PR 29800 at commit 9c35ddd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2020

Test build #130325 has finished for PR 29800 at commit 72e7805.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* 1. [[FrameLessOffsetWindowFunction]] returns the value of the input column offset by a number
* of rows according to the current row.
* 2. [[UnboundedOffsetWindowFunctionFrame]] and [[UnboundedPrecedingOffsetWindowFunctionFrame]]
* returns the value of the input column offset by a number of rows within the partition.
Contributor

within the partition -> within the frame

Contributor Author

Thanks!

if (inputIterator.hasNext) inputIterator.next()
inputIndex += 1
}
if (inputIndex >= 0 && inputIndex < input.length) {
Contributor

inputIndex >= 0 seems always true

Contributor Author

Yes.
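
To illustrate the reviewer's point, here is a minimal sketch of the "drain the first rows, then read one" pattern, assuming a counter that starts at 0 and is only ever incremented. The enclosing loop, the object `DrainSketch` and the helper `nthOrNone` are hypothetical and are not taken from the PR; only the two statements inside the loop mirror the quoted excerpt.

```scala
object DrainSketch {
  // Hypothetical helper: skip the first (offset - 1) elements of `input`
  // and return the next one, or None if the input is too short.
  def nthOrNone[A](input: IndexedSeq[A], offset: Int): Option[A] = {
    require(offset >= 1, "offset is 1-based")
    val inputIterator = input.iterator
    var inputIndex = 0 // starts at 0 and is only ever incremented
    while (inputIndex < offset - 1) {
      if (inputIterator.hasNext) inputIterator.next()
      inputIndex += 1
    }
    // inputIndex >= 0 holds by construction, so only the upper bound
    // needs to be checked before reading the next element.
    if (inputIndex < input.length) Some(inputIterator.next()) else None
  }
}
```

Under these assumptions a lower-bound check such as `inputIndex >= 0` can never fail, which is why it can be dropped without changing behavior.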

-- !query output
Larry Bott 11798 NULL
Gerard Bondur 11472 Gerard Bondur
Pamela Castillo 11303 Gerard Bondur
Contributor

Not related to this PR. We should fix the test framework so that the result is always aligned.

Contributor Author

The output comes from hiveResultString

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34950/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34950/

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130348 has finished for PR 29800 at commit 1c0e82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3c3ad5f Oct 28, 2020
@beliefer
Contributor Author

@cloud-fan Thanks for your help!

cloud-fan pushed a commit that referenced this pull request Nov 12, 2020
### What changes were proposed in this pull request?
#29800 provides a performance improvement for `NTH_VALUE`.
`FIRST_VALUE` could also use the `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame`.

### Why are the changes needed?
Improve the performance for `FIRST_VALUE`.

### Does this PR introduce _any_ user-facing change?
 'No'.

### How was this patch tested?
Jenkins test.

Closes #30178 from beliefer/SPARK-33278.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
3 participants