Describe the bug
See #246 and commit 6a65543; there are some notes referring to this issue in that PR.
The issue is that different Parquet implementations handle non-nullable structs (and possibly lists) differently.
Spark doesn't seem to provide a way to declare non-nullable struct schemas, so structs are nullable by default. If one writes a non-nullable struct with nullable children, PySpark won't read the file.
The C++ implementation reads this back fine, perhaps because there's a good mapping to Arrow data.
The Rust implementation will write the file, but won't read it back.
I'm also somewhat uncertain whether a non-nullable parent with a nullable child is logically correct or compliant with the Arrow specification.
To Reproduce
- Create a RecordBatch that has a non-nullable struct with a nullable child (see the sketch below).
- Write that batch to Parquet.
- Read the Parquet file with Spark.
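For reference, a minimal sketch of the write step plus the Rust read-back mentioned above, written against the arrow and parquet crate APIs from around the time this issue was filed; the file name, values, and batch size are just illustrative, and the Spark read step would happen separately:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StructArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A nullable child inside a struct field that is declared non-nullable.
    let child_field = Field::new("child", DataType::Int32, true);
    let child: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));
    let struct_array = StructArray::from(vec![(child_field.clone(), child)]);

    // The struct itself carries no nulls, only its child does.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "s",
        DataType::Struct(vec![child_field]),
        false, // non-nullable parent
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(struct_array)])?;

    // Writing succeeds.
    let file = File::create("non_null_struct.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Reading the file back with the Rust implementation is where the
    // failure described above shows up.
    let file = File::open("non_null_struct.parquet")?;
    let file_reader = SerializedFileReader::new(file)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    for maybe_batch in arrow_reader.get_record_reader(1024)? {
        println!("{:?}", maybe_batch?);
    }
    Ok(())
}
```

The crux is the combination of nullability flags: `false` on the parent struct field together with `true` on its child is exactly the schema whose handling differs across implementations.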
Expected behavior
There should be some clearly defined behaviour, and it should be documented.
Additional context
See commit 6a65543, specifically the comments added around the tests.