[SPARK-50489][SQL][PYTHON] Fix self-join after `applyInArrow` #49056

zhengruifeng · 2024-12-04T09:57:59Z

What changes were proposed in this pull request?

Fix self-join after `applyInArrow`. The same issue for `applyInPandas` was fixed in #31429.

Why are the changes needed?

bug fix

before:

In [1]: import pyarrow as pa

In [2]: df = spark.createDataFrame([(1, 1)], ("k", "v"))

In [3]: def arrow_func(key, table):
   ...:     return pa.Table.from_pydict({"x": [2], "y": [2]})
   ...:

In [4]: df2 = df.groupby("k").applyInArrow(arrow_func, schema="x long, y long")

In [5]: df2.show()
24/12/04 17:47:43 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found
+---+---+
|  x|  y|
+---+---+
|  2|  2|
+---+---+


In [6]: df2.join(df2)
...
Failure when resolving conflicting references in Join:
'Join Inner
:- FlatMapGroupsInArrow [k#0L], arrow_func(k#0L, v#1L)#2, [x#3L, y#4L]
:  +- Project [k#0L, k#0L, v#1L]
:     +- LogicalRDD [k#0L, v#1L], false
+- FlatMapGroupsInArrow [k#12L], arrow_func(k#12L, v#13L)#2, [x#3L, y#4L]
   +- Project [k#12L, k#12L, v#13L]
      +- LogicalRDD [k#12L, v#13L], false

Conflicting attributes: "x", "y". SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:79)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:798)

after:

In [6]: df2.join(df2)
Out[6]: DataFrame[x: bigint, y: bigint, x: bigint, y: bigint]

In [7]: df2.join(df2).show()
+---+---+---+---+
|  x|  y|  x|  y|
+---+---+---+---+
|  2|  2|  2|  2|
+---+---+---+---+
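
The failing plan above shows both join sides exposing the same output attributes (`x#3L`, `y#4L`), even though their inputs were already renumbered (`k#0L` on the left, `k#12L` on the right). The sketch below is a toy model of that situation in plain Python, not Spark's Catalyst code: `Attr`, `Plan`, `conflicting`, and `with_fresh_ids` are hypothetical names introduced only for illustration. It assumes a conflict means "the same expression ID appears in the output of both sides" and that it is resolved by giving one side fresh IDs.

```python
from dataclasses import dataclass, replace
from itertools import count


# Toy model only: hypothetical classes, not Spark's Catalyst classes.
@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


@dataclass(frozen=True)
class Plan:
    label: str
    output: tuple  # tuple of Attr


def conflicting(left, right):
    """Names of right-side attributes whose expression IDs also occur on the left."""
    left_ids = {a.expr_id for a in left.output}
    return [a.name for a in right.output if a.expr_id in left_ids]


def with_fresh_ids(plan, ids):
    """Copy of `plan` whose output attributes receive new expression IDs."""
    return replace(plan, output=tuple(Attr(a.name, next(ids)) for a in plan.output))


# Self-join: both sides are the same node, so the outputs share x#3 and y#4.
left = Plan("FlatMapGroupsInArrow", (Attr("x", 3), Attr("y", 4)))
right = left

before = conflicting(left, right)

# Assumed resolution: renumber the right side's output attributes.
ids = count(100)
right_fixed = with_fresh_ids(right, ids)
after = conflicting(left, right_fixed)

print(before, after)  # ['x', 'y'] []
```

In this model the column names stay the same after renumbering, which matches the `DataFrame[x: bigint, y: bigint, x: bigint, y: bigint]` result shown in the "after" session.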

Does this PR introduce any user-facing change?

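Yes, bug fix: a self-join after `applyInArrow` no longer fails with "Failure when resolving conflicting references in Join".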

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no

HyukjinKwon · 2024-12-05T00:41:02Z

Merged to master.

…eRelations#collectConflictPlans`

### What changes were proposed in this pull request?
Add applyInArrow in `DeduplicateRelations#collectConflictPlans`

### Why are the changes needed?
In #49056, I forgot to add `applyInArrow` in `DeduplicateRelations#collectConflictPlans`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests added in #49056

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49069 from zhengruifeng/apply_in_arrow_rule.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

Zand100 · 2024-12-13T21:28:54Z

Hi @zhengruifeng does this pull request fix a bug introduced in #41347 ? We maintain a fork of spark, and we're wondering if we need to cherry-pick this bug fix now. We don't have #41347 in our fork. (If we don't need to cherry-pick this bug fix, we'll get all these commits when we upgrade.) Thank you!

fix

fcb92b5

github-actions bot added SQL PYTHON labels Dec 4, 2024

zhengruifeng requested review from HyukjinKwon and Ngone51 December 4, 2024 10:00

fix lint

8aa3772

HyukjinKwon approved these changes Dec 5, 2024

HyukjinKwon closed this in 7278bc7 Dec 5, 2024

zhengruifeng deleted the fix_arrow_join branch December 5, 2024 00:52

zhengruifeng mentioned this pull request Dec 5, 2024

[SPARK-50489][SQL][PYTHON][FOLLOW-UP] Add applyInArrow in DeduplicateRelations#collectConflictPlans #49069

Closed
