[SPARK-6052][SQL] In JSON schema inference, we should always set containsNull of an ArrayType to true #4806
Conversation
Test build #28050 has started for PR 4806 at commit.

Test build #28050 has finished for PR 4806 at commit.

Test PASSed.
@yhuai is this PR also dealing with one of the problems reported in #4729? It looks like it is. If so, can we solve the problem in the original PR, or comment there, instead of splitting each sub-problem into a new PR such as #4782 and this one? That makes the original PR hard to manage and modify. Thanks!
Besides, I think it is weird to manually set up the `containsNull` flag. So the main point is still the problem of inserting JSON data into a Parquet data source table. What I did in #4729 was just copy the schema of the JSON data and modify its `containsNull` (see the sketch below). Both solutions pass the unit test. @liancheng @yhuai you can decide which one is more proper.
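For reference, a minimal sketch of that schema-rewriting approach; `forceContainsNull` is a hypothetical helper written for illustration, not part of Spark's API:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: rewrite a schema so every ArrayType (and MapType
// value) is marked as possibly containing nulls, recursing into structs.
def forceContainsNull(dt: DataType): DataType = dt match {
  case ArrayType(elementType, _) =>
    ArrayType(forceContainsNull(elementType), containsNull = true)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = forceContainsNull(f.dataType))))
  case MapType(keyType, valueType, _) =>
    MapType(forceContainsNull(keyType), forceContainsNull(valueType), valueContainsNull = true)
  case other => other
}
```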
@viirya Making complex types in JSON relations always nullable is more robust, and makes more sense for most common use cases. We don't want to end up with a schema that has the wrong nullability just because we happened to sample only records without nulls. So this problem should be fixed anyway, regardless of SPARK-5950.
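To make the failure mode concrete, here is a minimal sketch (Spark 1.3-era API, assuming an existing `SparkContext` named `sc`) of why inferring `containsNull` from sampled records is fragile:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`

// The first record has no null array elements; only the second one does.
// If the sampler only saw the first record, it would wrongly infer
// containsNull = false for column `a`.
val records = sc.parallelize(
  """{"a": [1, 2, 3]}""" :: """{"a": [1, null, 3]}""" :: Nil)

val df = sqlContext.jsonRDD(records)
df.printSchema()
// With this patch the element type is unconditionally nullable:
// root
//  |-- a: array (nullable = true)
//  |    |-- element: long (containsNull = true)
```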
Merging to master and branch-1.3, thanks!
…insNull of an ArrayType to true

Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to Parquet, always setting `containsNull = true` is a more robust way to go.

JIRA: https://issues.apache.org/jira/browse/SPARK-6052

Author: Yin Huai &lt;yhuai@databricks.com&gt;

Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits:

05eab9d [Yin Huai] Change containsNull to true.

(cherry picked from commit 3efd8bb)
Signed-off-by: Cheng Lian &lt;lian@databricks.com&gt;
I think your suggestion to completely remove nullability is worth considering if the flag turns out to be of no use at all.
Yeah, we introduced it for potential optimizations, but it seems to be causing more trouble than it's worth. We decided to ignore nullability in the Parquet and JSON data sources because that makes more sense for most scenarios, especially when dealing with "dirty" datasets. However, completely ignoring nullability in Spark SQL also means that we lose part of the schema information, which affects data sources like Avro, Protocol Buffers, and Thrift. Not quite sure whether this is a good idea for now...
Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to Parquet, always setting `containsNull = true` is a more robust way to go.

JIRA: https://issues.apache.org/jira/browse/SPARK-6052
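As a usage note, here is a hedged sketch of the JSON-to-Parquet conversion the description refers to (Spark 1.3-era API; the paths are placeholders):

```scala
// Infer the schema from JSON and persist it as Parquet. Because the
// inferred ArrayType always has containsNull = true, later JSON batches
// whose arrays do contain nulls still match the Parquet schema.
val events = sqlContext.jsonFile("/tmp/events.json")
events.saveAsParquetFile("/tmp/events.parquet")
events.printSchema()
```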