[SPARK-30790][SQL] The dataType of map() should be map<null,null> #27542


Closed
wants to merge 4 commits

Conversation

iRakson
Contributor

@iRakson iRakson commented Feb 11, 2020

What changes were proposed in this pull request?

`spark.sql("select map()")` returns `{}`.

After these changes, it will return `map<null,null>`.

Why are the changes needed?

After the changes introduced in #27521, it is important to maintain consistency when using `map()`.

Does this PR introduce any user-facing change?

Yes. `map()` will now give `map<null,null>` instead of `{}`.

How was this patch tested?

A unit test was added, and the migration guide was updated as well.

[SPARK-30790]map() should return map<null,null>
@cloud-fan
Contributor

ok to test

@@ -218,6 +218,8 @@ license: |

- Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.arrayDefaultToStringType.enabled` to `true`.

- Since Spark 3.0, when the `map` function is called without any parameters, it returns an empty map of `NullType`. In Spark version 2.4 and earlier, it returns an empty map of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.mapDefaultToStringType.enabled` to `true`.
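For illustration, the legacy behavior described above could be restored in SQL like this (a sketch using the config name as it appears in this diff; note the config is renamed later in this PR):

```sql
-- Restore the Spark 2.4 behavior: map() returns an empty map of string type.
SET spark.sql.legacy.mapDefaultToStringType.enabled=true;
SELECT map();
```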

an empty map with NullType as key/value type.

val LEGACY_MAP_DEFAULT_TO_STRING =
buildConf("spark.sql.legacy.mapDefaultToStringType.enabled")
.internal()
.doc("When set to true, it returns an empty map of string type when the `map` " +
ditto

var expectedSchema = new StructType()
.add("x", MapType(StringType, StringType, valueContainsNull = false), nullable = false)
assert(ds.select(map().as("x")).schema == expectedSchema)
test("SPARK-30790: Empty map of <NullType,NullType> for map function with no arguments") {

empty map with NullType as key/value type


can we move it closer to the array test?

@@ -218,6 +218,8 @@ license: |

- Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.arrayDefaultToStringType.enabled` to `true`.

- Since Spark 3.0, when the `map` function is called without any parameters, it returns an empty map of `NullType`. In Spark version 2.4 and earlier, it returns an empty map of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.mapDefaultToStringType.enabled` to `true`.

Maybe we should say that map's keys and values have either NullType or StringType.

Comment on lines 3506 to 3510
val schema = spark.range(1).select(map()).schema
assert(schema.nonEmpty && schema.head.dataType.isInstanceOf[MapType])
val actualKeyType = schema.head.dataType.asInstanceOf[MapType].keyType
val actualValueType = schema.head.dataType.asInstanceOf[MapType].valueType
assert(actualKeyType === expectedType && actualValueType === expectedType)

nit:

    val schema = spark.range(1).select(map()).schema
    val StructType(Array(StructField(_, MapType(keyType, valueType, _), _, _))) = schema
    assert(keyType === expectedType)
    assert(valueType === expectedType)
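The suggested rewrite leans on Scala pattern matching to destructure the nested schema in one step. A minimal, self-contained sketch of the same idiom, using hypothetical stand-in case classes (not Spark's real `StructType`/`StructField`/`MapType`, which carry more fields):

```scala
// Hypothetical stand-ins for Spark's schema classes, just to show the idiom.
case class MapType(keyType: String, valueType: String, valueContainsNull: Boolean)
case class StructField(name: String, dataType: MapType, nullable: Boolean)
case class StructType(fields: Array[StructField])

object DestructureDemo {
  def main(args: Array[String]): Unit = {
    val schema = StructType(Array(
      StructField("map()", MapType("null", "null", valueContainsNull = true), nullable = false)))
    // A single pattern-match val extracts the key and value types from the nested schema,
    // replacing the repeated asInstanceOf casts in the original test.
    val StructType(Array(StructField(_, MapType(keyType, valueType, _), _))) = schema
    assert(keyType == "null" && valueType == "null")
    println(s"keyType=$keyType, valueType=$valueType")
  }
}
```

If the pattern did not match, the `val` definition would throw a `MatchError` at runtime, which in a test is an acceptable failure mode.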

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118254 has finished for PR 27542 at commit 395eee8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Feb 12, 2020

Just to check: do we need independent legacy configs for arrays and maps?

@iRakson
Contributor Author

iRakson commented Feb 12, 2020

Just to check: do we need independent legacy configs for arrays and maps?

That is exactly my doubt as well. In my opinion, we should have a single config for both of them.

@cloud-fan
Contributor

How about a single config, `spark.sql.legacy.createEmptyCollectionUsingStringType`?

@iRakson
Contributor Author

iRakson commented Feb 12, 2020

How about a single config, `spark.sql.legacy.createEmptyCollectionUsingStringType`?

Seems fine to me. I will update the code.


private lazy val keyToIndex = keyType match {
// Binary type data is `byte[]`, which can't use `==` to check equality.
case _: AtomicType | _: CalendarIntervalType if !keyType.isInstanceOf[BinaryType] =>
new java.util.HashMap[Any, Int]()
case _: AtomicType | _: CalendarIntervalType | _: NullType

This is done to handle this error:
`scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)`
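The `byte[]` caveat in the code comment above can be seen without Spark at all. A small stand-alone sketch (plain Scala, not Spark code) of why binary keys cannot rely on default equality in a `java.util.HashMap`, which is why the match above excludes `BinaryType` from the HashMap path:

```scala
object BinaryKeyDemo {
  def main(args: Array[String]): Unit = {
    val a = Array[Byte](1, 2, 3)
    val b = Array[Byte](1, 2, 3)
    // JVM arrays compare by reference, so == is false here...
    println(a == b)                        // false
    // ...even though the contents are equal.
    println(java.util.Arrays.equals(a, b)) // true
    // Consequently a plain HashMap treats a and b as distinct keys,
    // so duplicate binary keys would go undetected.
    val m = new java.util.HashMap[Any, Int]()
    m.put(a, 0)
    println(m.containsKey(b))              // false
  }
}
```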

@iRakson iRakson requested a review from cloud-fan February 12, 2020 09:50
- Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.arrayDefaultToStringType.enabled` to `true`.
- Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.createEmptyCollectionUsingStringType.enabled` to `true`.

- Since Spark 3.0, when the `map` function is called without any parameters, it returns an empty map with `NullType` as key/value type. In Spark version 2.4 and earlier, it returns an empty map of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.createEmptyCollectionUsingStringType.enabled` to `true`.
can we combine the 2 items?

Since Spark 3.0, when the `array`/`map` function is called without any parameters, it returns an empty collection with `NullType` as element type. ...

Migration guide is updated.

@SparkQA

SparkQA commented Feb 12, 2020

Test build #118291 has finished for PR 27542 at commit 3319d3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2020

Test build #118296 has finished for PR 27542 at commit 1a066e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val LEGACY_ARRAY_DEFAULT_TO_STRING =
buildConf("spark.sql.legacy.arrayDefaultToStringType.enabled")
val LEGACY_CREATE_EMPTY_COLLECTION_USING_STRING_TYPE =
buildConf("spark.sql.legacy.createEmptyCollectionUsingStringType.enabled")
I've sent a config naming proposal to the dev list, and `createEmptyCollectionUsingStringType` is a verb phrase, so I think we don't need the `.enabled` postfix.

.doc("When set to true, it returns an empty array of string type when the `array` " +
"function is called without any parameters. Otherwise, it returns an empty " +
"array of `NullType`")
.doc("When set to true, it returns an empty array of string type and an empty map with " +
let's match the migration guide:

When set to true, Spark returns an empty collection with `StringType` as element type
if the `array`/`map` function is called without any parameters. Otherwise, Spark returns
an empty collection with `NullType` as element type.

@iRakson iRakson requested a review from cloud-fan February 12, 2020 18:06
@SparkQA

SparkQA commented Feb 12, 2020

Test build #118315 has finished for PR 27542 at commit e873527.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Feb 12, 2020

retest this please

@SparkQA

SparkQA commented Feb 13, 2020

Test build #118322 has finished for PR 27542 at commit e873527.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30790]The dataType of map() should be map<null,null> [SPARK-30790][SQL] The dataType of map() should be map<null,null> Feb 13, 2020
@cloud-fan cloud-fan closed this in 926e3a1 Feb 13, 2020
@cloud-fan
Contributor

thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Feb 13, 2020
Closes #27542 from iRakson/SPARK-30790.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 926e3a1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Mar 17, 2020
### What changes were proposed in this pull request?

After #27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps.

The reason is that the `TypeCoercion` rules and `Cast` consider it illegal to cast a null-type map key to another type, since doing so makes the key nullable; but it is actually legal. This PR fixes that.

### Why are the changes needed?

To avoid breaking queries.

### Does this PR introduce any user-facing change?

Yes, now some queries that work in 2.x can work in 3.0 as well.

### How was this patch tested?

new test

Closes #27926 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Mar 17, 2020
(cherry picked from commit d7b97a1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
7 participants