Skip to content

[SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs. #32523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented May 12, 2021

What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)

Before:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+

After:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+

Why are the changes needed?

To produce the correct results.

Does this PR introduce any user-facing change?

Yes, it fixes the results to be correct as mentioned above.

How was this patch tested?

Added a unit test as well as manually.

@ueshin
Copy link
Member Author

ueshin commented May 12, 2021

cc @maropu @zero323 @HyukjinKwon

@ueshin ueshin marked this pull request as draft May 12, 2021 19:18
@SparkQA
Copy link

SparkQA commented May 12, 2021

Test build #138470 has finished for PR 32523 at commit 0e66385.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42991/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42991/

@maropu
Copy link
Member

maropu commented May 13, 2021

The fix itself looks fine. Thank you, @ueshin ~

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good (except separating GA fix)

HyukjinKwon pushed a commit that referenced this pull request May 13, 2021
### What changes were proposed in this pull request?

This PR allows the PR source branch to include slashes.

### Why are the changes needed?

There are PRs whose source branches include slashes, like `issues/SPARK-35119/gha` here or #32523.

Before the fix, the PR build fails in `Sync the current branch with the latest in Apache Spark` phase.
For example, at #32523, the source branch is `issues/SPARK-35382/nested_higher_order_functions`:

```
...
fatal: couldn't find remote ref nested_higher_order_functions
Error: Process completed with exit code 128.
```

(https://github.com/ueshin/apache-spark/runs/2569356241)

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

This PR source branch includes slashes and #32525 doesn't.

Closes #32524 from ueshin/issues/SPARK-35119/gha.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@SparkQA
Copy link

SparkQA commented May 13, 2021

Test build #138483 has finished for PR 32523 at commit a8f169e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43003/

@SparkQA
Copy link

SparkQA commented May 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43003/

@ueshin ueshin marked this pull request as ready for review May 13, 2021 05:27
@ueshin ueshin requested a review from HyukjinKwon May 13, 2021 05:36
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.1.

HyukjinKwon pushed a commit that referenced this pull request May 13, 2021
…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as #32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 17b59a9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…rame functions in Python APIs

### What changes were proposed in this pull request?

This PR fixes the same issue as apache#32424.

```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```

**Before:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

**After:**

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Added a unit test as well as manually.

Closes apache#32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 17b59a9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
zhengruifeng added a commit that referenced this pull request Jan 17, 2023
…bda functions

### What changes were proposed in this pull request?
1, #39068 reused the `UnresolvedAttribute` for the `UnresolvedNamedLambdaVariable`, but then `Column('x')` and `UnresolvedNamedLambdaVariable('x')` are mixed in `lambda x: x + cdf.x` (since we use `x/y/z` as augment names); this PR adds the `UnresolvedNamedLambdaVariable` back to distinguish between `Column('x')` and `UnresolvedNamedLambdaVariable('x')`;

2, the `refreshVarName` logic in PySpark was added in #32523 to address similar issue in PySpark's Lambda Function, this PR adds a similar function in the Python Client to avoid rewriting the function expression in the server side, which is unnecessary and prone to error .

### Why are the changes needed?
before this PR, the nested lambda function doesn't work properly

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
enabled UT and added UT

Closes #39619 from zhengruifeng/connect_fix_nested_lambda.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants