Skip to content

[SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull. #1889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented Aug 11, 2014

Make ScalaReflection be able to handle like:

  • Seq[Int] as ArrayType(IntegerType, containsNull = false)
  • Seq[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
  • Map[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
  • Map[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true)

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1889. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18307/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1889:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18307/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1889. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18310/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1889:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18310/consoleFull

@@ -372,7 +372,7 @@ object MapType {
* The `valueContainsNull` is true.
*/
def apply(keyType: DataType, valueType: DataType): MapType =
MapType(keyType: DataType, valueType: DataType, true)
MapType(keyType: DataType, valueType: DataType, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to line up with the scala doc above. Why did you change this? /cc @yhuai

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I forgot to change the doc.

I think the default value should be false because non nullable by default is more natural.
For example, when we think Map[Int, Long] variable, it can't contain null value.
If we want to add null value to the map, we have to declare the variable as Map[Int, Any] or Map[Int, java.lang.Long] or something like that.
It is the same way when thinking about the data type.

And this is the same as ArrayType.containsNull's default value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that actually makes me think that we should change ArrayTypes default value as well. Here is my reason: Marking something as not nullable is purely an optimization as adding null checking logic will never cause the answer to be incorrect. If the user wants that optimization great, but it seems dangerous to assume that we can apply it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I agree with the dangerousness.
I'll revert the change.
And should I change the ArrayType's default value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1889. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18363/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1889:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18363/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1889. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18366/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1889:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18366/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1889. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18368/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1889:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18368/consoleFull

@yhuai
Copy link
Contributor

yhuai commented Aug 12, 2014

@ueshin If we are changing the default value of ArrayType.containsNull to true, we may also need to add logic to set that value based on data sources. In https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L78, currently, there is no pattern for ArrayType.containsNull=true. If we do not add logic to set ArrayType.containsNull or support null values in arrays in our Parquet support (I am not sure if Parquet can do this, but seems people are working on it for Hive https://issues.apache.org/jira/browse/HIVE-6994), seems some existing working cases will break.

@marmbrus
Copy link
Contributor

It would be great to get this change in for 1.1 since its kinda of an API change. However, for now I'm okay if trying to store nulls into parquet arrays throws exceptions at runtime since that has never worked. We'll probably need to handle the case where the schema says there could be nulls and there aren't actually any though.

@ueshin
Copy link
Member Author

ueshin commented Aug 13, 2014

I noticed that currently Parquet support can't handle MapType containing null value.
There is a following difference, though:

  • writing and reading of ArrayType throw exception when containsNull is true even if no null value is contained.
  • writing and reading of MapType can do if the map doesn't have null value regardless of valueContainsNull.

Should I modify Parquet support to handle ArrayType as the same as MapType, i.e. do writing and reading regardless of containsNull for now? (if contains null values, it throws runtime exception.)
And handling null value for both ArrayType and MapType would be the next issue?

@ueshin ueshin changed the title [SPARK-2969][SQL] Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull. [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull. Aug 13, 2014
@ueshin
Copy link
Member Author

ueshin commented Aug 13, 2014

Or do we have to make Parquet support be able to handle null values now?
Parquet format will change to apply changes.

@marmbrus
Copy link
Contributor

@ueshin, thanks a lot for investigating this further! This is super important and I have been meaning to get to it for awhile. Here are my thoughts:

  • We probably shouldn't throw an exception when trying to store arrays or maps just because containsNull/valueContainsNull=true. This is because those values mean "could contain null" not "do contain null", and due to hive semantics we are often very conservative is stating nullable=false.
  • It would be great if you could explain how the format is going to change to handle null values. Is there consensus in the parquet community about how to encode this? Will the change be backwards incompatible?
  • If its going to be backwards incompatible then it would be really good to make the change before 1.1. Please open a blocker JIRA targeted at 1.1 if that is the case. If we don't need to make backwards incompatible changes then this is more a "very nice to have" for 1.1. I'm okay throwing exceptions saying "not supported" when people try to store null values into arrays or maps (though this is less than ideal obviously).

Thanks again!

@ueshin
Copy link
Member Author

ueshin commented Aug 14, 2014

Hi @marmbrus, I filed 2 issues on JIRA for MapType and ArrayType.

Could you check them?

@ueshin
Copy link
Member Author

ueshin commented Aug 14, 2014

I have to say that I'm not sure there is consensus in the parquet community.
I just compared to Hive implementation.
But I think these are reasonable ways to handle null values.

@marmbrus
Copy link
Contributor

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 1889 at commit 24f1c5c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have finished for PR 1889 at commit 24f1c5c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 98c2bb0 Aug 26, 2014
asfgit pushed a commit that referenced this pull request Aug 26, 2014
…ntainsNull and MapType.valueContainsNull.

Make `ScalaReflection` be able to handle like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.

(cherry picked from commit 98c2bb0)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@marmbrus
Copy link
Contributor

Thanks! I've merged this to master and 1.1.

kayousterhout pushed a commit to kayousterhout/spark-1 that referenced this pull request Aug 27, 2014
…ntainsNull and MapType.valueContainsNull.

Make `ScalaReflection` be able to handle like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes apache#1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
asfgit pushed a commit that referenced this pull request Aug 27, 2014
…alue support to Parquet.

JIRA:
- https://issues.apache.org/jira/browse/SPARK-3036
- https://issues.apache.org/jira/browse/SPARK-3037

Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`:

```
message root {
  optional group a (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required int32 key;
      optional int32 value;
    }
  }
}
```

for `ArrayType` when `containsNull` is `true`:

```
message root {
  optional group a (LIST) {
    repeated group bag {
      optional int32 array;
    }
  }
}
```

We have to think about compatibilities with older version of Spark or Hive or others I mentioned in the JIRA issues.

Notice:
This PR is based on #1963 and #1889.
Please check them first.

/cc marmbrus, yhuai

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits:

4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet.
013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet.
62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037
8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037

(cherry picked from commit 727cb25)
Signed-off-by: Michael Armbrust <michael@databricks.com>
asfgit pushed a commit that referenced this pull request Aug 27, 2014
…alue support to Parquet.

JIRA:
- https://issues.apache.org/jira/browse/SPARK-3036
- https://issues.apache.org/jira/browse/SPARK-3037

Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`:

```
message root {
  optional group a (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required int32 key;
      optional int32 value;
    }
  }
}
```

for `ArrayType` when `containsNull` is `true`:

```
message root {
  optional group a (LIST) {
    repeated group bag {
      optional int32 array;
    }
  }
}
```

We have to think about compatibilities with older version of Spark or Hive or others I mentioned in the JIRA issues.

Notice:
This PR is based on #1963 and #1889.
Please check them first.

/cc marmbrus, yhuai

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits:

4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet.
013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet.
62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037
8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…ntainsNull and MapType.valueContainsNull.

Make `ScalaReflection` be able to handle like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes apache#1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…alue support to Parquet.

JIRA:
- https://issues.apache.org/jira/browse/SPARK-3036
- https://issues.apache.org/jira/browse/SPARK-3037

Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`:

```
message root {
  optional group a (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required int32 key;
      optional int32 value;
    }
  }
}
```

for `ArrayType` when `containsNull` is `true`:

```
message root {
  optional group a (LIST) {
    repeated group bag {
      optional int32 array;
    }
  }
}
```

We have to think about compatibilities with older version of Spark or Hive or others I mentioned in the JIRA issues.

Notice:
This PR is based on apache#1963 and apache#1889.
Please check them first.

/cc marmbrus, yhuai

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes apache#2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits:

4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet.
013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet.
62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037
8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037
asfgit pushed a commit that referenced this pull request Sep 13, 2014
…value of containsNull in an ArrayType

After #1889, the default value of `containsNull` in an `ArrayType` is `true`.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2374 from yhuai/containsNull and squashes the following commits:

dc609a3 [Yin Huai] Update the SQL programming guide to show the correct default value of containsNull in an ArrayType (the default value is true instead of false).
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants