[SPARK-47563][SQL] Add map normalization on creation #45721


Closed

Conversation

@stevomitric (Contributor) commented Mar 26, 2024

What changes were proposed in this pull request?

Added normalization of map keys when they are put in ArrayBasedMapBuilder.

Why are the changes needed?

Because map keys must be unique, we need to normalize floating-point keys to prevent building a map that contains both 0.0 and -0.0 as distinct keys, e.g. Map(0.0, -0.0).
This further unblocks the GROUP BY statement for map types, as per this discussion.
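To illustrate the problem the PR fixes, here is a plain-Scala sketch (not Spark code): IEEE 754 says 0.0 == -0.0, but their bit patterns differ, so boxed equals/hashCode treat them as different values and a hash-based map can keep both as separate keys.

```scala
// Plain-Scala illustration (not Spark code) of why -0.0 must be
// normalized to 0.0 before being used as a map key.
object NegativeZeroKeys {
  def main(args: Array[String]): Unit = {
    // IEEE 754 equality says the two zeros are equal:
    println(0.0d == -0.0d)  // true
    // But their bit patterns differ, so boxed equals/hashCode differ too:
    println(java.lang.Double.doubleToLongBits(0.0d) ==
            java.lang.Double.doubleToLongBits(-0.0d))  // false
    // Hence, without normalization, a map keeps two "equal" keys:
    println(Map(0.0d -> "a", -0.0d -> "b").size)  // 2
    // Normalizing -0.0 to 0.0 deduplicates them
    // (norm is a simplified stand-in for Spark's DOUBLE_NORMALIZER):
    def norm(d: Double): Double = if (d == 0.0d) 0.0d else d
    println(Map(norm(0.0d) -> "a", norm(-0.0d) -> "b").size)  // 1
  }
}
```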

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UTs in ArrayBasedMapBuilderSuite

Was this patch authored or co-authored using generative AI tooling?

No

case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
case ArrayType(dt, _) =>
  new GenericArrayData(value.asInstanceOf[GenericArrayData].array.map { element =>
Contributor:

If we have an array of 1 million strings, we will go through each value even though we know strings don't need to be normalized.

What about doing the same as in NormalizeFloatingNumbers and first checking whether we need to perform normalization at all?
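The idea behind that check can be sketched in a few lines. This is a simplified standalone model, not Spark's actual NormalizeFloatingNumbers.needNormalize: a type needs normalization only if a float or double can be reached somewhere inside it, so an array of strings can be skipped without scanning its elements.

```scala
// Sketch of a "do we need to normalize at all?" check, modeled on the
// idea in NormalizeFloatingNumbers.needNormalize. The DataType ADT
// below is a simplified stand-in for Spark's type hierarchy.
object NeedNormalizeSketch {
  sealed trait DataType
  case object FloatType extends DataType
  case object DoubleType extends DataType
  case object StringType extends DataType
  case class ArrayType(elementType: DataType) extends DataType
  case class StructType(fieldTypes: Seq[DataType]) extends DataType

  def needNormalize(dt: DataType): Boolean = dt match {
    case FloatType | DoubleType => true
    case ArrayType(et)          => needNormalize(et)
    case StructType(fts)        => fts.exists(needNormalize)
    case _                      => false
  }

  def main(args: Array[String]): Unit = {
    // An array of 1 million strings never needs normalization, so the
    // per-element scan can be skipped entirely:
    println(needNormalize(ArrayType(StringType)))                    // false
    // A struct is scanned if any field can contain a float/double:
    println(needNormalize(StructType(Seq(StringType, DoubleType))))  // true
  }
}
```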

Contributor Author:

Applied NormalizeFloatingNumbers.needNormalize here.

"mapKeyDedupPolicy" -> "\"spark.sql.mapKeyDedupPolicy\"")
)

val builderStruct = new ArrayBasedMapBuilder(new StructType().add("i", "double"), IntegerType)
Contributor:

Maybe add a case where the array is inside of a struct.

@@ -60,6 +60,40 @@ class ArrayBasedMapBuilderSuite extends SparkFunSuite with SQLHelper {
)
}

test("apply key normalization when creating") {
Contributor:

Add another test for successful normalization.

@stefankandic (Contributor) commented:

Please add info in the description on why the change is needed, i.e. right now we can create a map with both keys -0.0 and 0.0.

case StructType(sf) =>
  new GenericInternalRow(
    value.asInstanceOf[GenericInternalRow].values.zipWithIndex.map { element =>
      normalize(element._1, sf(element._2).dataType)
Contributor:

You could also check whether normalization is needed here, right?

This way we would avoid normalizing all fields of a struct when only one actually needs it.

Contributor Author:

As noted by @cloud-fan below, complex types have been dropped.

@@ -52,18 +54,36 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria

private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)

private lazy val keyNeedNormalize = NormalizeFloatingNumbers.needNormalize(keyType)

def normalize(value: Any, dataType: DataType): Any = dataType match {
Contributor:

we should return a lambda function to do normalization based on the data type, instead of matching the data type per row.

def normalize(value: Any, dataType: DataType): Any = dataType match {
case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
case ArrayType(dt, _) =>
@cloud-fan (Contributor):

No need to handle complex types, as we use a TreeMap for complex-type keys, which should handle floating points well.

@stevomitric requested a review from cloud-fan on March 26, 2024 at 15:42
private lazy val keyNeedNormalize =
keyType.isInstanceOf[FloatType] || keyType.isInstanceOf[DoubleType]

def normalize(dataType: DataType): Any => Any = dataType match {
Contributor:

private lazy val keyNormalizer: Any => Any = keyType match {
  case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER
  case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
  case _ => identity
}

then we can just write

val keyNormalized = keyNormalizer(key)
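This pattern can be exercised end to end in a self-contained sketch. The DataType objects below are simplified stand-ins for Spark's types, and the inline normalizers stand in for FLOAT_NORMALIZER/DOUBLE_NORMALIZER (the real ones also canonicalize NaN); the point is that the function is resolved once from the key type rather than matched per row.

```scala
// Standalone sketch of the suggested keyNormalizer pattern: resolve
// the normalizing function once from the key type, then apply it to
// every key. Simplified stand-ins for Spark's internals throughout.
object KeyNormalizerSketch {
  sealed trait DataType
  case object FloatType extends DataType
  case object DoubleType extends DataType
  case object StringType extends DataType

  def keyNormalizer(keyType: DataType): Any => Any = keyType match {
    // -0.0 == 0.0 under IEEE 754, so this maps -0.0 (and 0.0) to +0.0.
    case FloatType  => v => { val f = v.asInstanceOf[Float];  if (f == 0.0f) 0.0f else f }
    case DoubleType => v => { val d = v.asInstanceOf[Double]; if (d == 0.0d) 0.0d else d }
    case _          => identity
  }

  def main(args: Array[String]): Unit = {
    val normalize = keyNormalizer(DoubleType)
    // Both zeros normalize to the same key, so the map keeps one entry:
    val m = Map(normalize(0.0d) -> "a", normalize(-0.0d) -> "b")
    println(m.size)  // 1
  }
}
```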

@cloud-fan closed this in 87449c3 on Mar 27, 2024
@cloud-fan (Contributor) commented:

thanks, merging to master!

sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024

Closes apache#45721 from stevomitric/stevomitric/fix-map-dup.

Authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>