[SPARK-33338][SQL] GROUP BY using literal map should not fail #30246
dongjoon-hyun wants to merge 2 commits into apache:master from dongjoon-hyun:SPARK-33338
Conversation
```scala
case (null, null) => true
case (a: Array[Byte], b: Array[Byte]) => util.Arrays.equals(a, b)
case (a: ArrayBasedMapData, b: ArrayBasedMapData) =>
  a.keyArray == b.keyArray && a.valueArray == b.valueArray
```
GenericArrayData has equals.
Quick question, why we don't have equals in ArrayBasedMapData?
I also considered that way first, but I didn't do that because of this.
```scala
/**
 * This is an internal data representation for map type in Spark SQL. This should not implement
 * `equals` and `hashCode` because the type cannot be used as join keys, grouping keys, or
 * in equality tests. See SPARK-9415 and PR#13847 for the discussions.
 */
abstract class MapData extends Serializable
```
That's the reason why I focused on literal map equality only.
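To make that concrete, here is a small illustrative sketch (not part of the PR diff) of why comparing `keyArray`/`valueArray` works: `ArrayBasedMapData` intentionally has no `equals`, but its key and value arrays are `GenericArrayData` instances, which do have a structural one. The helper `literalValueEquals` and the sample data below are hypothetical.
```scala
import java.util

import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper mirroring the equality branch added in this PR:
// ArrayBasedMapData has no equals, so compare its key/value arrays,
// which are GenericArrayData instances with a structural equals.
def literalValueEquals(a: Any, b: Any): Boolean = (a, b) match {
  case (null, null) => true
  case (x: Array[Byte], y: Array[Byte]) => util.Arrays.equals(x, y)
  case (x: ArrayBasedMapData, y: ArrayBasedMapData) =>
    x.keyArray == y.keyArray && x.valueArray == y.valueArray
  case (x, y) => x == y
}

// Two copies of the literal map('k1', 'v1') built the way Spark represents it internally.
val m1 = new ArrayBasedMapData(
  new GenericArrayData(Array[Any](UTF8String.fromString("k1"))),
  new GenericArrayData(Array[Any](UTF8String.fromString("v1"))))
val m2 = new ArrayBasedMapData(
  new GenericArrayData(Array[Any](UTF8String.fromString("k1"))),
  new GenericArrayData(Array[Any](UTF8String.fromString("v1"))))

assert(m1 != m2)                    // MapData deliberately does not define equals
assert(literalValueEquals(m1, m2))  // field-wise comparison treats the literals as equal
```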
Kubernetes integration test starting
Kubernetes integration test status success
Could you review this, @cloud-fan and @maropu and @viirya?
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130596 has finished for PR 30246 at commit
| sql(s"CREATE TABLE t USING ORC LOCATION '${dir.toURI}' AS SELECT map('k1', 'v1') m, 'k1' k") | ||
| Seq( | ||
| "SELECT map('k1', 'v1')[k] FROM t GROUP BY 1", | ||
| "SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]", |
how about SELECT map('k1', 'v1', 'k2', 'v2')[k] FROM t GROUP BY map('k2', 'v2', 'k1', 'v1')[k]?
Since we don't normalize the literal maps, they are not the same maps, @cloud-fan, so we should not handle that case here.
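For illustration only, a minimal sketch (not from the PR; the `utf8Array` helper is made up) of why the two literals in that question are not treated as the same map: without normalization, their key/value arrays differ element-wise, so the field comparison keeps them distinct.
```scala
import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper to build a GenericArrayData of UTF8 strings.
def utf8Array(ss: String*): GenericArrayData =
  new GenericArrayData(ss.map(s => UTF8String.fromString(s)).toArray[Any])

// map('k1', 'v1', 'k2', 'v2') vs. map('k2', 'v2', 'k1', 'v1'):
// the entries are the same, but the key/value arrays differ element-wise,
// so the field comparison (and therefore semanticEquals) keeps them distinct.
val m1 = new ArrayBasedMapData(utf8Array("k1", "k2"), utf8Array("v1", "v2"))
val m2 = new ArrayBasedMapData(utf8Array("k2", "k1"), utf8Array("v2", "v1"))

assert(m1.keyArray != m2.keyArray && m1.valueArray != m2.valueArray)
```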
Test build #130597 has finished for PR 30246 at commit
Thank you, @HyukjinKwon and @cloud-fan.
### What changes were proposed in this pull request?
This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions whose literal maps are backed by `ArrayBasedMapData` and `GenericArrayData`.
### Why are the changes needed?
This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3
scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```
Apache Spark 2.x ~ 3.0.1 raises a `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```
**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```
**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
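For reference, a sketch of what the added regression test roughly looks like, based on the snippet quoted in the review above. It assumes the standard Spark SQL test helpers (`withTable`, `withTempDir`, `sql`, `checkAnswer`, `Row`) available in a suite such as `SQLQuerySuite`, so the exact suite and placement here are only illustrative.
```scala
test("SPARK-33338 GROUP BY using literal map should not fail") {
  withTable("t") {
    withTempDir { dir =>
      // A table with a map column and a string column holding one of the map's keys.
      sql(s"CREATE TABLE t USING ORC LOCATION '${dir.toURI}' AS SELECT map('k1', 'v1') m, 'k1' k")
      Seq(
        "SELECT map('k1', 'v1')[k] FROM t GROUP BY 1",
        "SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]",
        "SELECT map('k1', 'v1')[k] a FROM t GROUP BY a").foreach { statement =>
        // Each form of the GROUP BY should resolve and return the mapped value.
        checkAnswer(sql(statement), Row("v1"))
      }
    }
  }
}
```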
Closes #30246 from dongjoon-hyun/SPARK-33338.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 42c0b17)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>