-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct #30467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/core/src/test/scala/org/apache/spark/sql/ComplexTypesSuite.scala
Outdated
Show resolved
Hide resolved
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala
Outdated
Show resolved
Hide resolved
ok to test |
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala
Outdated
Show resolved
Hide resolved
cc @viirya and @gatorsmile |
I remember the related discussion in #29645 Btw, other systems support this resolution? |
Hmm, yeah, I remember the discussion with @cloud-fan. IIUC, it is not supported explicitly in Spark. |
Test build #131616 has finished for PR 30467 at commit
|
Google BigQuery and Amazon Athena are supposed to support nested arrays, but instead of analysing the nested data directly, they use utdf to unfold the data for analysis. Like this select c_column, d_column from table lateral view unnest(a.b.c, a.b.d) as c_column, d_column However, the spark sql syntax analysis phase is exceptional because no extractor for nested arrays could be found. |
Does this mean I need to close the pr? |
cc @cloud-fan |
Test build #131680 has finished for PR 30467 at commit
|
@guiyanakuang, can we make it matched the examples you mentioned? Also, just to double check, we don't have any reference for the support nested array support, right? |
@HyukjinKwon Are you referring to this example ? select c_column, d_column from table lateral view unnest(a.b.c, a.b.d) as c_column, d_column Yes, we can implement this example, it's just that Unnest utdf does not yet exist for spark, and we can use explore instead. val jsonStr1 = """{"a": [{"b": [{"c": [1,2]}]}]}"""
val jsonStr2 = """{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}"""
val df = spark.read.json(Seq(jsonStr1, jsonStr2).toDS())
df.createTempView("nested_arrays")
sql(
"""
|select
| c_u_u_u
|from
| nested_arrays
| lateral view explode(a.b.c) as c_u
| lateral view explode(c_u) as c_u_u
| lateral view explode(c_u_u) as c_u_u_u
|""".stripMargin).show()
+-------+
|c_u_u_u|
+-------+
| 1|
| 2|
| 1|
| 2|
+-------+ In fact, my company, Tencent, has implemented unnest in Spark, which is also referenced in Google's BigQuery. |
I don't think this is the right direction, as people will ask for more in the future, like supporting 3-level nested array. At the end, you can't even tell what |
I would like to explain a little more select c_u_u_u from nested_arrays lateral view unnest(a.b.c) as c_u_u_u This use case is currently solved by spark in this way select
c_u_u_u
from
nested_arrays
lateral view explode(a.b) as c_u
lateral view explode(c_u) as c_u_u
lateral view explode(c_u_u.c) as c_u_u_u The explode approach does also work, but when it comes to solving multiple nested array fields, the writing style is a headache. I know, it is difficult to understand what This pr is a prerequisite for my solution
|
I'm not sure |
I think it's a good idea to add an internal expression to extract nested arrays, but it's only available to internal functions and not directly exposed. I need some time to adapt the code. Thanks for the advice. I will close this now. |
What changes were proposed in this pull request?
For data nest.json
run with
will got error
Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match GetArrayItem, extraction ("c") treat as an ordinal.
Why are the changes needed?
Spark sql cannot analyse nested arrays, it is very common to analyse this type of data, especially in the field of advertising!
Does this PR introduce any user-facing change?
Users can query nested arrays directly using sql and pruning is supported.
How was this patch tested?
Added UT
ComplexTypesSuite
Added tests show that querying nested arrays is possible.
NestArraySchemaPruningSuite
Test extraction of nested arrays while supporting schema pruning.