
[SPARK-6052][SQL]In JSON schema inference, we should always set containsNull of an ArrayType to true #4806


Closed
wants to merge 1 commit

Conversation

yhuai
Contributor

@yhuai yhuai commented Feb 27, 2015

Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on the records we scanned, we may miss arrays with null values during sampling. Also, because future data can contain arrays with null values, always setting `containsNull = true` is the more robust choice if we convert JSON data to Parquet.

JIRA: https://issues.apache.org/jira/browse/SPARK-6052
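
For context, a minimal Scala sketch of what this flag controls (illustrative only; the types come from `org.apache.spark.sql.types`, and Spark's JSON inference maps integer literals to `LongType`):

```scala
import org.apache.spark.sql.types._

object ContainsNullSketch extends App {
  // With containsNull = false, the schema asserts that array elements are
  // never null, so a record like {"a": [1, null]} would violate a schema
  // inferred from a sample that only contained {"a": [1, 2]}.
  val risky = ArrayType(LongType, containsNull = false)

  // The behavior this PR adopts: always report containsNull = true, which
  // admits both [1, 2] and [1, null] no matter what the sample held.
  val safe = ArrayType(LongType, containsNull = true)

  println(risky.json) // {"type":"array","elementType":"long","containsNull":false}
  println(safe.json)  // {"type":"array","elementType":"long","containsNull":true}
}
```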

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28050 has started for PR 4806 at commit 05eab9d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28050 has finished for PR 4806 at commit 05eab9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28050/

@viirya
Member

viirya commented Feb 27, 2015

@yhuai is this PR also addressing one of the problems reported in #4729? It looks like it is. If so, can we solve the problem in the original PR, or leave suggestions on it, instead of splitting each of its sub-problems into a new PR such as #4782 and this one? That makes the original PR hard to manage and modify. Thanks!

@viirya
Member

viirya commented Feb 27, 2015

Besides, I think it is odd to manually force `containsNull` in JSON schema inference. Sampling should not be an issue, because by the same argument we could also miss arrays with different element types.

So the main point is still the problem of inserting JSON data into a Parquet data source table. What I did in #4729 was simply copy the schema of the JSON data, modify its `containsNull`, and use that copy for the insertion, without actually modifying the schema of the JSON data.

Both solutions pass the unit test. @liancheng @yhuai you can decide which one is more appropriate.
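
For comparison, a rough sketch of that alternative (not the actual code from #4729; `relaxContainsNull` is a hypothetical helper): derive a relaxed copy of the inferred schema for the insertion path only, leaving the original schema untouched.

```scala
import org.apache.spark.sql.types._

object RelaxNullability {
  // Hypothetical helper: recursively widen every ArrayType in a schema to
  // containsNull = true, returning a new schema rather than mutating dt.
  def relaxContainsNull(dt: DataType): DataType = dt match {
    case ArrayType(elementType, _) =>
      ArrayType(relaxContainsNull(elementType), containsNull = true)
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = relaxContainsNull(f.dataType))))
    case MapType(keyType, valueType, valueContainsNull) =>
      MapType(relaxContainsNull(keyType), relaxContainsNull(valueType), valueContainsNull)
    case other => other
  }
}
```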

@liancheng
Contributor

@viirya Making complex types in JSON relations always nullable is more robust, and makes more sense for the most common use cases. We don't want to end up with a schema that has the wrong nullability just because we happened to sample only records without nulls. So this problem should be fixed anyway, regardless of SPARK-5950.

@liancheng
Contributor

Merging to master and branch-1.3, thanks!

@asfgit asfgit closed this in 3efd8bb Mar 2, 2015
asfgit pushed a commit that referenced this pull request Mar 2, 2015
[SPARK-6052][SQL] In JSON schema inference, we should always set containsNull of an ArrayType to true

Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on the records we scanned, we may miss arrays with null values during sampling. Also, because future data can contain arrays with null values, always setting `containsNull = true` is the more robust choice if we convert JSON data to Parquet.

JIRA: https://issues.apache.org/jira/browse/SPARK-6052

Author: Yin Huai <yhuai@databricks.com>

Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits:

05eab9d [Yin Huai] Change containsNull to true.

(cherry picked from commit 3efd8bb)
Signed-off-by: Cheng Lian <lian@databricks.com>
@viirya
Member

viirya commented Mar 2, 2015

I think your suggestion to completely remove nullability might be worth considering if it turns out not to be useful at all.

@liancheng
Contributor

Yeah, we introduced it for potential optimizations, but it seems to be causing more trouble than it's worth. We decided to ignore nullability in the Parquet and JSON data sources because that makes more sense for most scenarios, especially when dealing with "dirty" datasets.

However, completely ignoring nullability in Spark SQL also means that we lose part of the schema information, which affects data sources like Avro, Protocol Buffers, and Thrift. Not quite sure whether this is a good idea for now...
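
As a point of reference, a small sketch of where nullability lives in a Spark SQL schema (assumed current `org.apache.spark.sql.types` API; the field names are made up): ignoring nullability in a source amounts to treating every slot as nullable on read.

```scala
import org.apache.spark.sql.types._

object NullabilitySketch extends App {
  // Nullability is carried in three places: StructField.nullable,
  // ArrayType.containsNull, and MapType.valueContainsNull.
  val declared = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("tags", ArrayType(StringType, containsNull = false), nullable = false)
  ))

  // A reader that ignores nullability effectively sees everything as nullable:
  val asRead = StructType(Seq(
    StructField("id", LongType, nullable = true),
    StructField("tags", ArrayType(StringType, containsNull = true), nullable = true)
  ))

  println(declared.treeString)
  println(asRead.treeString)
}
```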
