
Fix compatibility quirks between arrow and parquet structs #245

@nevi-me

Description

Describe the bug

See #246 and 6a65543; that PR contains some notes referring to this issue.

The issue is that different Parquet implementations handle non-null structs (and possibly lists) differently.
Spark doesn't seem to have a facility for creating non-null struct schemas, so its structs are nullable by default. If one creates a non-null struct with nullable children, pyspark won't read the resulting file.
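
For concreteness, a non-null struct with a nullable child maps to a Parquet group that is `required` while its leaf is `optional`, roughly like this (names are illustrative):

```
message schema {
  required group s {
    optional int32 child;
  }
}
```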

The C++ implementation reads this back fine, perhaps because there's a good mapping to Arrow data.
The Rust implementation will write the file, but won't read it back.

I'm also uncertain whether a non-null parent with a nullable child is logically correct, or whether it complies with the Arrow specification.

To Reproduce

  • Create a RecordBatch that has a non-null struct with a nullable child.
  • Write it to Parquet.
  • Read the Parquet file back with Spark (a sketch of the first two steps follows this list).

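For reference, here is a minimal Rust sketch of the first two steps (the field names and output path are arbitrary, and the exact arrow/parquet signatures vary between crate versions):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StructArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A nullable child column for the struct.
    let child_field = Field::new("child", DataType::Int32, true);
    let child: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));

    // Assemble the struct array from its (field, column) pairs.
    let struct_array = StructArray::from(vec![(child_field.clone(), child)]);

    // Declare the struct itself as non-nullable in the schema.
    let struct_field = Field::new("s", DataType::Struct(vec![child_field]), false);
    let schema = Arc::new(Schema::new(vec![struct_field]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(struct_array)])?;

    // Write the batch to Parquet; step 3 is reading this file back with Spark.
    let file = File::create("non_null_struct.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Reading `non_null_struct.parquet` back with pyspark (e.g. `spark.read.parquet`) is where the failure shows up.
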
Expected behavior

The behaviour should be clearly defined and documented.

Additional context

See the commit 6a65543, specifically the comments added around the tests.

Labels

bug, parquet (Changes to the parquet crate)
