Commit 9ab296e

gatorsmile authored and davies committed
[SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join
After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double-checked the code.

For example, users can do an equi-join like `df.join(df2, 'name', 'outer').select('name', 'height').collect()`

- There is a bug in 1.5 and 1.4: the code ignores the third parameter (the join type) that users pass. The join actually performed is always `Inner`, even if the user specifies another type (e.g., `Outer`).
- After PR #8600, 1.6 no longer has this issue, but the description has not been updated.

I plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using an equi-join.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10477 from gatorsmile/pyOuterJoin.
1 parent 1e97813 commit 9ab296e

python/pyspark/sql/dataframe.py

Lines changed: 4 additions & 1 deletion
@@ -608,13 +608,16 @@ def join(self, other, on=None, how=None):
         :param on: a string for join column name, a list of column names,
             , a join expression (Column) or a list of Columns.
             If `on` is a string or a list of string indicating the name of the join column(s),
-            the column(s) must exist on both sides, and this performs an inner equi-join.
+            the column(s) must exist on both sides, and this performs an equi-join.
         :param how: str, default 'inner'.
             One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
 
         >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
         [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
 
+        >>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
+        [Row(name=u'Tom', height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
+
         >>> cond = [df.name == df3.name, df.age == df3.age]
         >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
         [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
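
The two outer-join doctests differ in one row: joining on the expression `df.name == df2.name` and selecting `df.name` gives `name=None` for the row that exists only in `df2`, while joining on the column name `'name'` gives `name=u'Tom'`. The following pure-Python sketch illustrates that difference. It is not how Spark executes joins; the sample rows and the helper names (`full_outer_pairs`, `join_on_expression`, `join_on_name`) are assumptions chosen to be consistent with the doctest outputs.

```python
# Illustration only: plain lists of dicts stand in for DataFrames.
# The sample rows are assumed; they are not taken from the commit.
people = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]          # plays the role of df
heights = [{"name": "Tom", "height": 80}, {"name": "Bob", "height": 85}]   # plays the role of df2


def full_outer_pairs(left, right, key):
    """Pair rows whose `key` values are equal; unmatched rows pair with None."""
    pairs = []
    matched_right = set()
    for left_row in left:
        hits = [i for i, right_row in enumerate(right) if right_row[key] == left_row[key]]
        if hits:
            for i in hits:
                pairs.append((left_row, right[i]))
                matched_right.add(i)
        else:
            pairs.append((left_row, None))
    for i, right_row in enumerate(right):
        if i not in matched_right:
            pairs.append((None, right_row))
    return pairs


def join_on_expression(left, right, key):
    """Mimics on=(df.name == df2.name) followed by selecting the LEFT name:
    each side keeps its own key column, so right-only rows have name None."""
    return [
        {"name": l[key] if l else None, "height": r["height"] if r else None}
        for l, r in full_outer_pairs(left, right, key)
    ]


def join_on_name(left, right, key):
    """Mimics on='name': the key column appears once in the result,
    filled from whichever side has the row."""
    return [
        {"name": (l or r)[key], "height": r["height"] if r else None}
        for l, r in full_outer_pairs(left, right, key)
    ]
```

Calling `join_on_expression(people, heights, "name")` yields a row with `name` set to `None` and `height` 80, whereas `join_on_name(people, heights, "name")` yields `name` set to `'Tom'` for the same row. Row order in the sketch may differ from the order Spark returns.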