@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Dec 4, 2024

What changes were proposed in this pull request?

Fix self-join after applyInArrow. The same issue for applyInPandas was fixed in #31429.

Why are the changes needed?

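Bug fix: a self-join on a DataFrame produced by applyInArrow fails during analysis with conflicting attribute references.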

before:

In [1]: import pyarrow as pa

In [2]: df = spark.createDataFrame([(1, 1)], ("k", "v"))

In [3]: def arrow_func(key, table):
   ...:     return pa.Table.from_pydict({"x": [2], "y": [2]})
   ...:

In [4]: df2 = df.groupby("k").applyInArrow(arrow_func, schema="x long, y long")

In [5]: df2.show()
24/12/04 17:47:43 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found
+---+---+
|  x|  y|
+---+---+
|  2|  2|
+---+---+


In [6]: df2.join(df2)
...
Failure when resolving conflicting references in Join:
'Join Inner
:- FlatMapGroupsInArrow [k#0L], arrow_func(k#0L, v#1L)#2, [x#3L, y#4L]
:  +- Project [k#0L, k#0L, v#1L]
:     +- LogicalRDD [k#0L, v#1L], false
+- FlatMapGroupsInArrow [k#12L], arrow_func(k#12L, v#13L)#2, [x#3L, y#4L]
   +- Project [k#12L, k#12L, v#13L]
      +- LogicalRDD [k#12L, v#13L], false

Conflicting attributes: "x", "y". SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:79)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:798)

after:

In [6]: df2.join(df2)
Out[6]: DataFrame[x: bigint, y: bigint, x: bigint, y: bigint]

In [7]: df2.join(df2).show()
+---+---+---+---+
|  x|  y|  x|  y|
+---+---+---+---+
|  2|  2|  2|  2|
+---+---+---+---+
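The "Conflicting attributes" failure above arises because both sides of the self-join expose output attributes with identical IDs (x#3L, y#4L). A minimal pure-Python sketch of that idea follows. It is a toy model, not Spark's implementation: `Attr`, `Plan`, `conflicting`, `dedup_right`, and the ID counter are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, replace
from itertools import count

# Hypothetical ID generator standing in for fresh expression IDs.
_next_id = count(100)


@dataclass(frozen=True)
class Attr:
    """Toy output attribute: a column name plus a unique expression ID."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class Plan:
    """Toy stand-in for a plan node that produces output attributes."""
    output: tuple


def conflicting(left: Plan, right: Plan) -> list:
    """Return right-side attributes whose IDs also appear on the left."""
    left_ids = {a.expr_id for a in left.output}
    return [a for a in right.output if a.expr_id in left_ids]


def dedup_right(left: Plan, right: Plan) -> Plan:
    """Give each conflicting right-side attribute a fresh ID."""
    left_ids = {a.expr_id for a in left.output}
    new_output = tuple(
        replace(a, expr_id=next(_next_id)) if a.expr_id in left_ids else a
        for a in right.output
    )
    return Plan(new_output)


# Self-join: both sides are the same plan, so every attribute conflicts.
df2 = Plan((Attr("x", 3), Attr("y", 4)))
print([a.name for a in conflicting(df2, df2)])  # ['x', 'y']

# After deduplication the right side keeps its names but has new IDs.
fixed = dedup_right(df2, df2)
print(conflicting(df2, fixed))  # []
```

In this sketch the column names stay the same after deduplication, which matches the "after" output where the joined DataFrame has two x and two y columns.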

Does this PR introduce any user-facing change?

Yes, this is a bug fix: a self-join after applyInArrow no longer fails.

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no

@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the fix_arrow_join branch December 5, 2024 00:52
HyukjinKwon pushed a commit that referenced this pull request Dec 6, 2024
…eRelations#collectConflictPlans`

### What changes were proposed in this pull request?
Add applyInArrow in `DeduplicateRelations#collectConflictPlans`

### Why are the changes needed?
In #49056, I forgot to add `applyInArrow` in `DeduplicateRelations#collectConflictPlans`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests added in #49056

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49069 from zhengruifeng/apply_in_arrow_rule.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@Zand100

Zand100 commented Dec 13, 2024

Hi @zhengruifeng, does this pull request fix a bug introduced in #41347? We maintain a fork of Spark, and we're wondering whether we need to cherry-pick this bug fix now. We don't have #41347 in our fork. (If we don't need to cherry-pick this bug fix, we'll get all these commits when we upgrade.) Thank you!
