Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49366][CONNECT] Treat Union node as leaf in dataframe column resolution #47853

Closed
wants to merge 6 commits into from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Aug 23, 2024

What changes were proposed in this pull request?

Treat Union node as leaf in column resolution

Why are the changes needed?

bug fix:

from pyspark.sql.functions import concat, lit, col
df1 = spark.range(10).withColumn("value", lit(1))
df2 = df1.union(df1)
df1.join(df2, df1.id == df2.id, "left").show()

fails with AMBIGUOUS_COLUMN_REFERENCE

resolveExpressionByPlanChildren: e = '`==`('id, 'id)
resolveExpressionByPlanChildren: q =
'[id=63]Join LeftOuter, '`==`('id, 'id)
:- [id=61]Project [id#550L, 1 AS value#553]
:  +- Range (0, 10, step=1, splits=Some(12))
+- [id=62]Union false, false
   :- [id=61]Project [id#564L, 1 AS value#565]
   :  +- Range (0, 10, step=1, splits=Some(12))
   +- [id=61]Project [id#566L, 1 AS value#567]
      +- Range (0, 10, step=1, splits=Some(12))

'id with id = 61

[id=61]Project [id#564L, 1 AS value#565]
+- Range (0, 10, step=1, splits=Some(12))

[id=61]Project [id#566L, 1 AS value#567]
+- Range (0, 10, step=1, splits=Some(12))


resolved: Vector((Some((id#564L,1)),true), (Some((id#566L,1)),true))

When resolving 'id with id = 61, existing detection fails in the second child.

Does this PR introduce any user-facing change?

yes, bug fix

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

No

@zhengruifeng zhengruifeng changed the title [SPARK-49366][CONNECT] Delay the ambiguous detection in column resolution [WIP][SPARK-49366][CONNECT] Delay the ambiguous detection in column resolution Aug 23, 2024
@zhengruifeng zhengruifeng changed the title [WIP][SPARK-49366][CONNECT] Delay the ambiguous detection in column resolution [SPARK-49366][CONNECT] Delay the ambiguous detection in column resolution Aug 26, 2024
@zhengruifeng zhengruifeng changed the title [SPARK-49366][CONNECT] Delay the ambiguous detection in column resolution [SPARK-49366][CONNECT] Treat Union node as leaf in column resolution Aug 26, 2024
@zhengruifeng zhengruifeng changed the title [SPARK-49366][CONNECT] Treat Union node as leaf in column resolution [SPARK-49366][CONNECT] Treat Union node as leaf in dataframe column resolution Aug 26, 2024
resolveDataFrameColumnByPlanId(u, id, isMetadataAccess, p.children, currentDepth + 1)
val children = p match {
// treat Union node as the leaf node
case _: Union => Seq.empty[LogicalPlan]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not specific to spark connect right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is specific to connect only.
spark classic won't reach this path

@zhengruifeng zhengruifeng deleted the fix_ambgious_union branch August 27, 2024 01:25
@zhengruifeng
Copy link
Contributor Author

merged to master

IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…esolution

### What changes were proposed in this pull request?

Treat Union node as leaf in column resolution

### Why are the changes needed?
bug fix:
```
from pyspark.sql.functions import concat, lit, col
df1 = spark.range(10).withColumn("value", lit(1))
df2 = df1.union(df1)
df1.join(df2, df1.id == df2.id, "left").show()
```
fails with `AMBIGUOUS_COLUMN_REFERENCE`

```
resolveExpressionByPlanChildren: e = '`==`('id, 'id)
resolveExpressionByPlanChildren: q =
'[id=63]Join LeftOuter, '`==`('id, 'id)
:- [id=61]Project [id#550L, 1 AS value#553]
:  +- Range (0, 10, step=1, splits=Some(12))
+- [id=62]Union false, false
   :- [id=61]Project [id#564L, 1 AS value#565]
   :  +- Range (0, 10, step=1, splits=Some(12))
   +- [id=61]Project [id#566L, 1 AS value#567]
      +- Range (0, 10, step=1, splits=Some(12))

'id with id = 61

[id=61]Project [id#564L, 1 AS value#565]
+- Range (0, 10, step=1, splits=Some(12))

[id=61]Project [id#566L, 1 AS value#567]
+- Range (0, 10, step=1, splits=Some(12))

resolved: Vector((Some((id#564L,1)),true), (Some((id#566L,1)),true))
```

When resolving `'id with id = 61`, existing detection fails in the second child.

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47853 from zhengruifeng/fix_ambgious_union.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…esolution

### What changes were proposed in this pull request?

Treat Union node as leaf in column resolution

### Why are the changes needed?
bug fix:
```
from pyspark.sql.functions import concat, lit, col
df1 = spark.range(10).withColumn("value", lit(1))
df2 = df1.union(df1)
df1.join(df2, df1.id == df2.id, "left").show()
```
fails with `AMBIGUOUS_COLUMN_REFERENCE`

```
resolveExpressionByPlanChildren: e = '`==`('id, 'id)
resolveExpressionByPlanChildren: q =
'[id=63]Join LeftOuter, '`==`('id, 'id)
:- [id=61]Project [id#550L, 1 AS value#553]
:  +- Range (0, 10, step=1, splits=Some(12))
+- [id=62]Union false, false
   :- [id=61]Project [id#564L, 1 AS value#565]
   :  +- Range (0, 10, step=1, splits=Some(12))
   +- [id=61]Project [id#566L, 1 AS value#567]
      +- Range (0, 10, step=1, splits=Some(12))

'id with id = 61

[id=61]Project [id#564L, 1 AS value#565]
+- Range (0, 10, step=1, splits=Some(12))

[id=61]Project [id#566L, 1 AS value#567]
+- Range (0, 10, step=1, splits=Some(12))

resolved: Vector((Some((id#564L,1)),true), (Some((id#566L,1)),true))
```

When resolving `'id with id = 61`, existing detection fails in the second child.

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47853 from zhengruifeng/fix_ambgious_union.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants