Skip to content

[SPARK-40292][SQL] Fix column names in "arrays_zip" function when arrays are referenced from nested structs #37833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

sadikovi
Copy link
Contributor

@sadikovi sadikovi commented Sep 8, 2022

What changes were proposed in this pull request?

This PR fixes an issue in arrays_zip function where a field index was used as a column name in the resulting schema which was a regression from Spark 3.1. With this change, the original behaviour is restored: a corresponding struct field name will be used instead of a field index.

Example:

with q as (
  select
    named_struct(
      'my_array', array(1, 2, 3),
      'my_array2', array(4, 5, 6)
    ) as my_struct
)
select
  arrays_zip(my_struct.my_array, my_struct.my_array2)
from
  q

would return schema:

root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- 0: integer (nullable = true)
 |    |    |-- 1: integer (nullable = true)

which is somewhat inaccurate. PR adds handling of GetStructField expression to return the struct field names like this:

root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- my_array: integer (nullable = true)
 |    |    |-- my_array2: integer (nullable = true)

Why are the changes needed?

Does this PR introduce any user-facing change?

Yes, arrays_zip function returns struct field names now as in Spark 3.1 instead of field indices.
Some users might have worked around this issue so this patch would affect them by bringing back the original behaviour.

How was this patch tested?

Existing unit tests. I also added a test case that reproduces the problem.

@sadikovi sadikovi changed the title [SPARK-40292] Fix column names in "arrays_zip" function when arrays are referenced from nested structs [SPARK-40292][SQL] Fix column names in "arrays_zip" function when arrays are referenced from nested structs Sep 8, 2022
@github-actions github-actions bot added the SQL label Sep 8, 2022

val fieldNames = res.schema.head.dataType.asInstanceOf[ArrayType]
.elementType.asInstanceOf[StructType].fieldNames
assert(fieldNames.toSeq === Seq("arr_1", "arr_2", "arr_3"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the fix, this would return Seq("0", "1", "2").

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@MaxGekk
Copy link
Member

MaxGekk commented Sep 10, 2022

which was a regression from Spark 3.1

@sadikovi To where should we backport this? Up to branch-3.1 inclusively?

@sadikovi
Copy link
Contributor Author

Thanks @MaxGekk. Spark 3.1 produces the correct result so I think we can just backport to Spark 3.2 and Spark 3.3.

@MaxGekk
Copy link
Member

MaxGekk commented Sep 12, 2022

+1, LGTM. Merging to master/3.3/3.2.
Thank you, @sadikovi.

MaxGekk pushed a commit that referenced this pull request Sep 12, 2022
…ays are referenced from nested structs

### What changes were proposed in this pull request?

This PR fixes an issue in `arrays_zip` function where a field index was used as a column name in the resulting schema which was a regression from Spark 3.1. With this change, the original behaviour is restored: a corresponding struct field name will be used instead of a field index.

Example:
```sql
with q as (
  select
    named_struct(
      'my_array', array(1, 2, 3),
      'my_array2', array(4, 5, 6)
    ) as my_struct
)
select
  arrays_zip(my_struct.my_array, my_struct.my_array2)
from
  q
```

would return schema:
```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- 0: integer (nullable = true)
 |    |    |-- 1: integer (nullable = true)
```

which is somewhat inaccurate. PR adds handling of `GetStructField` expression to return the struct field names like this:
```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- my_array: integer (nullable = true)
 |    |    |-- my_array2: integer (nullable = true)
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, `arrays_zip` function returns struct field names now as in Spark 3.1 instead of field indices.
Some users might have worked around this issue so this patch would affect them by bringing back the original behaviour.

### How was this patch tested?

Existing unit tests. I also added a test case that reproduces the problem.

Closes #37833 from sadikovi/SPARK-40292.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 443eea9)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk MaxGekk closed this in 443eea9 Sep 12, 2022
MaxGekk pushed a commit that referenced this pull request Sep 12, 2022
…ays are referenced from nested structs

### What changes were proposed in this pull request?

This PR fixes an issue in `arrays_zip` function where a field index was used as a column name in the resulting schema which was a regression from Spark 3.1. With this change, the original behaviour is restored: a corresponding struct field name will be used instead of a field index.

Example:
```sql
with q as (
  select
    named_struct(
      'my_array', array(1, 2, 3),
      'my_array2', array(4, 5, 6)
    ) as my_struct
)
select
  arrays_zip(my_struct.my_array, my_struct.my_array2)
from
  q
```

would return schema:
```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- 0: integer (nullable = true)
 |    |    |-- 1: integer (nullable = true)
```

which is somewhat inaccurate. PR adds handling of `GetStructField` expression to return the struct field names like this:
```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- my_array: integer (nullable = true)
 |    |    |-- my_array2: integer (nullable = true)
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, `arrays_zip` function returns struct field names now as in Spark 3.1 instead of field indices.
Some users might have worked around this issue so this patch would affect them by bringing back the original behaviour.

### How was this patch tested?

Existing unit tests. I also added a test case that reproduces the problem.

Closes #37833 from sadikovi/SPARK-40292.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 443eea9)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
HyukjinKwon pushed a commit that referenced this pull request Sep 16, 2022
…rays_zip" function

### What changes were proposed in this pull request?

This is a follow-up for #37833.

The PR fixes column names in `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details).

Before the patch, the column names would be indexes or an AnalysisException would be thrown in the case of `GetArrayStructFields` example.

### Why are the changes needed?

Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests that reproduce the issue and confirmed that the patch fixes them.

Closes #37911 from sadikovi/SPARK-40470.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Sep 16, 2022
…rays_zip" function

### What changes were proposed in this pull request?

This is a follow-up for #37833.

The PR fixes column names in `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details).

Before the patch, the column names would be indexes or an AnalysisException would be thrown in the case of `GetArrayStructFields` example.

### Why are the changes needed?

Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests that reproduce the issue and confirmed that the patch fixes them.

Closes #37911 from sadikovi/SPARK-40470.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9b0f979)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Sep 16, 2022
…rays_zip" function

### What changes were proposed in this pull request?

This is a follow-up for #37833.

The PR fixes column names in `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details).

Before the patch, the column names would be indexes or an AnalysisException would be thrown in the case of `GetArrayStructFields` example.

### Why are the changes needed?

Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests that reproduce the issue and confirmed that the patch fixes them.

Closes #37911 from sadikovi/SPARK-40470.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9b0f979)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…rays_zip" function

### What changes were proposed in this pull request?

This is a follow-up for apache#37833.

The PR fixes column names in `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details).

Before the patch, the column names would be indexes or an AnalysisException would be thrown in the case of `GetArrayStructFields` example.

### Why are the changes needed?

Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests that reproduce the issue and confirmed that the patch fixes them.

Closes apache#37911 from sadikovi/SPARK-40470.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…ays are referenced from nested structs

### What changes were proposed in this pull request?

This PR fixes an issue in `arrays_zip` function where a field index was used as a column name in the resulting schema which was a regression from Spark 3.1. With this change, the original behaviour is restored: a corresponding struct field name will be used instead of a field index.

Example:
```sql
with q as (
  select
    named_struct(
      'my_array', array(1, 2, 3),
      'my_array2', array(4, 5, 6)
    ) as my_struct
)
select
  arrays_zip(my_struct.my_array, my_struct.my_array2)
from
  q
```

would return schema:
```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- 0: integer (nullable = true)
 |    |    |-- 1: integer (nullable = true)
```

which is somewhat inaccurate. PR adds handling of `GetStructField` expression to return the struct field names like this:
```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- my_array: integer (nullable = true)
 |    |    |-- my_array2: integer (nullable = true)
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, `arrays_zip` function returns struct field names now as in Spark 3.1 instead of field indices.
Some users might have worked around this issue so this patch would affect them by bringing back the original behaviour.

### How was this patch tested?

Existing unit tests. I also added a test case that reproduces the problem.

Closes apache#37833 from sadikovi/SPARK-40292.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 443eea9)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…rays_zip" function

### What changes were proposed in this pull request?

This is a follow-up for apache#37833.

The PR fixes column names in `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details).

Before the patch, the column names would be indexes or an AnalysisException would be thrown in the case of `GetArrayStructFields` example.

### Why are the changes needed?

Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests that reproduce the issue and confirmed that the patch fixes them.

Closes apache#37911 from sadikovi/SPARK-40470.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9b0f979)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants