Spark 32443 debug #14

HyukjinKwon · 2020-07-27T08:44:21Z

No description provided.

…onicalized expressions

### What changes were proposed in this pull request?

Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example:

```
from pyspark.sql.functions import col, avg, udf

pythonUDF = udf(lambda x: x).asNondeterministic()

spark.range(10)\
    .selectExpr("id", "id % 3 as value")\
    .groupBy(pythonUDF(col("value")))\
    .agg(avg("id"), pythonUDF(col("value")))\
    .explain(extended=True)
```

Currently results in a plan like this:

```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```

and then it throws:

```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```

- how canonicalized fixes this:
  - nondeterministic PythonUDF expressions always have distinct resultIds per udf
  - The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions.
  - for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected

### Why are the changes needed?

- the output of the query with the fix applied still makes sense
- the nondeterministic UDF is invoked only once, in the project.

### Does this PR introduce _any_ user-facing change?

Yes, it's additive, it enables queries to run that previously threw errors.

### How was this patch tested?

- added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.

Authored-by: Ben Hurdelhey <ben.hurdelhey@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
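
The dedup-by-canonical-form idea described in the commit message above can be illustrated with a toy model. This is only a sketch and not Spark's implementation: `UdfCall`, `pull_out`, and the `_nondeterministic_N` alias format are hypothetical names made up for the illustration. The one detail taken from the description is that canonicalizing resets the result id to -1, so two calls that differ only in their result id compare equal.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UdfCall:
    """Hypothetical stand-in for a nondeterministic PythonUDF expression."""
    name: str
    arg: str
    result_id: int

    def canonicalized(self) -> "UdfCall":
        # As in the description: canonicalizing sets the result id to -1.
        return replace(self, result_id=-1)


def pull_out(exprs: List[UdfCall]) -> Tuple[Dict[UdfCall, str], List[str]]:
    """Assign one alias per distinct canonical expression (toy version)."""
    aliases: Dict[UdfCall, str] = {}
    rewritten: List[str] = []
    for expr in exprs:
        key = expr.canonicalized()
        if key not in aliases:
            aliases[key] = f"_nondeterministic_{len(aliases)}"
        rewritten.append(aliases[key])
    return aliases, rewritten


# The same UDF call appears in the grouping list and in the aggregate list,
# each with its own result id.
grouping = UdfCall("pythonUDF", "value", result_id=7)
aggregate = UdfCall("pythonUDF", "value", result_id=8)

# Compared directly, the two calls are unequal, so no dedup would happen.
assert grouping != aggregate

# Matched on the canonical form, both resolve to the same alias.
aliases, rewritten = pull_out([grouping, aggregate])
print(rewritten)  # ['_nondeterministic_0', '_nondeterministic_0']
```

In this model both occurrences are rewritten to the single projected alias, which corresponds to the UDF being invoked only once, in the project.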

dongjoon-hyun and others added 11 commits July 27, 2020 18:59

[SPARK-32443][CORE] Use POSIX compatible command -v in testCommandAvailable

5a8fbbe

test

643c683

Check pythonExec

46a670a

a

042720f

fix typo

e5a1ad1

remove extra assumes

99a1375

always python3

1c9d9e2

test

49b9515

clean up

458da38

Use the old behavior for Windows

5164b4f

Update core/src/main/scala/org/apache/spark/TestUtils.scala

b46d67e

HyukjinKwon force-pushed the SPARK-32443-debug branch 3 times, most recently from 980279a to c3ee806 Compare July 27, 2020 13:04

Debug 29241

859046c

HyukjinKwon force-pushed the SPARK-32443-debug branch from c3ee806 to 859046c Compare July 27, 2020 13:39

HyukjinKwon added 3 commits July 27, 2020 23:14

another debug

01710b7

Use runtime instead

cef6b5d

Use sh

033f467

HyukjinKwon mentioned this pull request Jul 27, 2020

[SPARK-32443][CORE] Use POSIX-compatible command -v in testCommandAvailable apache/spark#29241

Closed
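
For readers unfamiliar with the technique named in the linked PR title, here is a minimal Python sketch of an availability check built on the POSIX `command -v` builtin, run through `sh`. It is not the code in `TestUtils.scala`. The function name `command_available` is hypothetical, and the sketch assumes a POSIX `sh` is on the path, so it does not cover Windows.

```python
import subprocess


def command_available(command: str) -> bool:
    """Return True if a POSIX shell can resolve `command` (hypothetical helper)."""
    # The name is passed as the positional parameter $1 instead of being
    # spliced into the script text, which avoids quoting problems.
    result = subprocess.run(
        ["sh", "-c", 'command -v "$1"', "sh", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # `command -v` exits with status 0 only when the name can be resolved.
    return result.returncode == 0
```

A call such as `command_available("python3")` only reports whether the name resolves; it does not run the command itself.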

HyukjinKwon closed this Jul 27, 2020

HyukjinKwon deleted the SPARK-32443-debug branch December 7, 2020 02:06


3 participants
