[SPARK-25407][SQL] Allow nested access for non-existent field for Parquet file when nested pruning is enabled
## What changes were proposed in this pull request?
As part of schema clipping in `ParquetReadSupport.scala`, we add fields that are present in the Catalyst requested schema but missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires that we ignore unrequested field data when reading from a Parquet file. Therefore we pass two schemas to `ParquetRecordMaterializer`: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two.
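To illustrate what "reconciling the differences" means, here is a minimal conceptual sketch (the object and helper names are hypothetical, and this is not the `ParquetRecordMaterializer` code itself): fields requested by Catalyst but absent from the file schema are filled with `null`.
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Conceptual sketch only: pad a row read under the pruned file schema so it
// matches the wider Catalyst requested schema, producing null for fields the
// Parquet file does not contain.
object ReconcileSketch {
  def padRow(row: Row, fileSchema: StructType, requestedSchema: StructType): Row =
    Row.fromSeq(requestedSchema.fieldNames.map { name =>
      val idx = fileSchema.fieldNames.indexOf(name)
      if (idx >= 0) row.get(idx) else null // missing from the file -> null
    }.toSeq)

  def main(args: Array[String]): Unit = {
    val fileSchema = StructType(Seq(StructField("address", StringType)))
    val requested = StructType(Seq(
      StructField("name", StructType(Seq(StructField("middle", StringType)))),
      StructField("address", StringType)))
    // Only the fields present in the file were read; the rest become null.
    println(padRow(Row("123 Main St"), fileSchema, requested)) // [null,123 Main St]
  }
}
```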
Aside from checking whether schema pruning is enabled, there is an additional complication in constructing the Parquet requested schema: Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema in different ways. Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema, and it includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file.
Spark's parquet-mr based reader supports reading Parquet files with any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, when we use the parquet-mr reader, we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that is set in the `ReadContext`.
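The intent of that intersection can be sketched as follows. This is a hypothetical helper operating on Catalyst `StructType`s rather than the Parquet `MessageType`s that `ParquetReadSupport` actually works with; it only illustrates the idea of keeping requested fields that also exist in the file, recursing into nested structs.
```scala
import org.apache.spark.sql.types._

// Hypothetical sketch, not the actual ParquetReadSupport code: keep only the
// requested fields that also exist in the file schema, recursing into structs.
object IntersectSketch {
  def intersect(requested: StructType, file: StructType): StructType =
    StructType(requested.fields.flatMap { rf =>
      file.fields.find(_.name.equalsIgnoreCase(rf.name)).flatMap { ff =>
        (rf.dataType, ff.dataType) match {
          case (r: StructType, f: StructType) =>
            val pruned = intersect(r, f)
            if (pruned.isEmpty) None else Some(rf.copy(dataType = pruned))
          case _ => Some(rf)
        }
      }
    })

  def main(args: Array[String]): Unit = {
    val file = new StructType()
      .add("id", IntegerType)
      .add("name", new StructType().add("first", StringType).add("last", StringType))
      .add("address", StringType)
    val clipped = new StructType()
      .add("name", new StructType().add("middle", StringType))
      .add("address", StringType)
    // name.middle is absent from the file, so the whole name group drops out.
    println(intersect(clipped, file).treeString)
  }
}
```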
_Additional description (by HyukjinKwon):_
Let's suppose that we have a Parquet schema as below:
```
message spark_schema {
  required int32 id;
  optional group name {
    optional binary first (UTF8);
    optional binary last (UTF8);
  }
  optional binary address (UTF8);
}
```
Currently, the clipped schema is as follows:
```
message spark_schema {
  optional group name {
    optional binary middle (UTF8);
  }
  optional binary address (UTF8);
}
```
Parquet MR does not support access to a nested non-existent field (`name.middle`).
To work around this, this PR removes the `name.middle` request from the schema sent to the Parquet reader entirely, as below:
```
Parquet requested schema:
message spark_schema {
optional binary address (UTF8);
}
```
and still produces the missing field (`name.middle`) as `null` in the output records, conforming to the requested Catalyst schema:
```
root
 |-- name: struct (nullable = true)
 |    |-- middle: string (nullable = true)
 |-- address: string (nullable = true)
```
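For reference, a minimal end-to-end reproduction of the scenario might look like the sketch below. The paths, example data, and session setup are assumptions made for illustration; `spark.sql.optimizer.nestedSchemaPruning.enabled` is the nested schema pruning flag discussed above. A user-specified read schema stands in for schema evolution across files, so that the Catalyst requested schema contains `name.middle` while the Parquet file does not.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object NestedPruningExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // Write a file whose schema has name.first/name.last but no name.middle.
    Seq(("John", "Doe", "123 Main St"))
      .toDF("first", "last", "address")
      .selectExpr("named_struct('first', first, 'last', last) AS name", "address")
      .write.mode("overwrite").parquet("/tmp/spark-25407-example")

    // Read it back with a schema that *does* include name.middle, mimicking
    // schema evolution across files, and select the missing subfield.
    val requestedSchema = new StructType()
      .add("name", new StructType()
        .add("first", StringType)
        .add("middle", StringType)
        .add("last", StringType))
      .add("address", StringType)

    spark.read.schema(requestedSchema).parquet("/tmp/spark-25407-example")
      .select($"name.middle", $"address")
      .show()
    // Expected with this PR: name.middle is null, address is returned as usual.

    spark.stop()
  }
}
```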
I think, technically, this is something the Parquet library should support, since Parquet made a design decision to produce `null` for non-existent fields, IIRC. This PR works around it.
## How was this patch tested?
A previously ignored test case which exercises the failure scenario this PR addresses has been enabled.
This closes #22880.

Closes #24307 from dongjoon-hyun/SPARK-25407.
Lead-authored-by: Michael Allman <msa@allman.ms>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>