Skip to content

[SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct #30467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

guiyanakuang
Copy link
Member

What changes were proposed in this pull request?

For data nest.json

{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}

run with

val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

will got error

org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;
'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]
+- SubqueryAlias `nest_table`
+- Relationa#6 json

Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match GetArrayItem, extraction ("c") treat as an ordinal.

Why are the changes needed?

Spark sql cannot analyse nested arrays, it is very common to analyse this type of data, especially in the field of advertising!

Does this PR introduce any user-facing change?

Users can query nested arrays directly using sql and pruning is supported.

How was this patch tested?

Added UT

ComplexTypesSuite
Added tests show that querying nested arrays is possible.

NestArraySchemaPruningSuite
Test extraction of nested arrays while supporting schema pruning.

@github-actions github-actions bot added the SQL label Nov 23, 2020
@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

cc @viirya and @gatorsmile

@maropu
Copy link
Member

maropu commented Nov 24, 2020

Spark sql cannot analyse nested arrays, it is very common to analyse this type of data, especially in the field of advertising!

I remember the related discussion in #29645 Btw, other systems support this resolution?

@viirya
Copy link
Member

viirya commented Nov 24, 2020

Hmm, yeah, I remember the discussion with @cloud-fan. IIUC, it is not supported explicitly in Spark.

@SparkQA
Copy link

SparkQA commented Nov 24, 2020

Test build #131616 has finished for PR 30467 at commit 6cfa034.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@guiyanakuang
Copy link
Member Author

I remember the related discussion in #29645 Btw, other systems support this resolution?

Google BigQuery and Amazon Athena are supposed to support nested arrays, but instead of analysing the nested data directly, they use utdf to unfold the data for analysis. Like this

select  c_column, d_column from table lateral view unnest(a.b.c, a.b.d) as c_column, d_column

However, the spark sql syntax analysis phase is exceptional because no extractor for nested arrays could be found.

@guiyanakuang
Copy link
Member Author

Hmm, yeah, I remember the discussion with @cloud-fan. IIUC, it is not supported explicitly in Spark.

Does this mean I need to close the pr?

@viirya
Copy link
Member

viirya commented Nov 24, 2020

cc @cloud-fan

@SparkQA
Copy link

SparkQA commented Nov 24, 2020

Test build #131680 has finished for PR 30467 at commit b18c03a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@guiyanakuang, can we make it matched the examples you mentioned? Also, just to double check, we don't have any reference for the support nested array support, right?

@guiyanakuang
Copy link
Member Author

guiyanakuang commented Dec 7, 2020

@guiyanakuang, can we make it matched the examples you mentioned? Also, just to double check, we don't have any reference for the support nested array support, right?

@HyukjinKwon Are you referring to this example ?

select  c_column, d_column from table lateral view unnest(a.b.c, a.b.d) as c_column, d_column

Yes, we can implement this example, it's just that Unnest utdf does not yet exist for spark, and we can use explore instead.

val jsonStr1 = """{"a": [{"b": [{"c": [1,2]}]}]}"""
val jsonStr2 = """{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}"""
val df = spark.read.json(Seq(jsonStr1, jsonStr2).toDS())
df.createTempView("nested_arrays")
sql(
  """
    |select
    |    c_u_u_u
    |from
    |    nested_arrays
    |    lateral view explode(a.b.c) as c_u
    |    lateral view explode(c_u)   as c_u_u
    |    lateral view explode(c_u_u) as c_u_u_u
    |""".stripMargin).show()

+-------+
|c_u_u_u|
+-------+
|      1|
|      2|
|      1|
|      2|
+-------+

In fact, my company, Tencent, has implemented unnest in Spark, which is also referenced in Google's BigQuery.
Our advertising analysts rely heavily on it.

@cloud-fan
Copy link
Contributor

I don't think this is the right direction, as people will ask for more in the future, like supporting 3-level nested array. At the end, you can't even tell what a.b means, as a can be an arbitrarily nested array. I like the unnest from BigQuery as it has a clear semantic. The explode approach also works so I don't see why we need this feature. I'm fine to add unnest to Spark as an alternative to solve this use case.

@guiyanakuang
Copy link
Member Author

I would like to explain a little more
I also like the clearly defined unnest, which addresses this use case in my company

select c_u_u_u from nested_arrays lateral view unnest(a.b.c) as c_u_u_u

This use case is currently solved by spark in this way

select
    c_u_u_u
from
    nested_arrays 
    lateral view explode(a.b) as c_u 
    lateral view explode(c_u) as c_u_u
    lateral view explode(c_u_u.c) as c_u_u_u

The explode approach does also work, but when it comes to solving multiple nested array fields, the writing style is a headache. I know, it is difficult to understand what a.b/a.b.c means when a is a nested array. The good news is that analysts just need the values of all the leaves and unnest(a.b.c) makes it easy for us to solve the problem without having to understand the middle structure.

This pr is a prerequisite for my solution

  1. This pr implements the corresponding extractor, which supports nesting at any level, not just 3. I understand that this flexibility can lead to ambiguity and I would not recommend using a.b.c directly, but in combination with unnest.

  2. This pr can be combined very well with pushdown. In the current way only a.b can be extracted using GetArrayStructFields, which filters out the c brothers in memory. Leads to amplification of read data io. This pr guarantees that only the corresponding data will be read, based on Dremel storage, similar to what can be done with parquet.

@cloud-fan
Copy link
Contributor

I'm not sure a.b can be pushed down if a is a deeply nested array. I'm not against adding a new internal expression to extract field from nested array, if it's needed by unnest. I'm -1 to overload a.b to support arbitrary levels of nested arrays.

@guiyanakuang
Copy link
Member Author

I think it's a good idea to add an internal expression to extract nested arrays, but it's only available to internal functions and not directly exposed. I need some time to adapt the code. Thanks for the advice. I will close this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants