
Arrow schema is missing from the parquet metadata, for files written by ParquetSink. #11770

@wiedld


Describe the bug

We have been using two parquet writers: ArrowWriter and ParquetSink (parallelized writes). We discovered that ArrowWriter includes the arrow schema in the parquet metadata on write by default, whereas DataFusion's ParquetSink does not include the arrow schema in the file metadata. The missing arrow schema metadata matters because its inclusion aids later reading.

To Reproduce

  1. Write parquet with ParquetSink.
  2. Write parquet with ArrowWriter (default options).
  3. Attempt to read the arrow schema from the parquet metadata, using the below/linked APIs:

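use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::metadata::FileMetaData;

// Obtain the file-level metadata through the parquet reader API.
let file_metadata: &FileMetaData = <get from file per API>;

// Convert the parquet schema, using any embedded arrow schema as a hint.
let arrow_schema = parquet_to_arrow_schema(
    file_metadata.schema_descr(),
    file_metadata.key_value_metadata(),
);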

  4. An error is returned for parquet written by ParquetSink.
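
A quick way to tell the two kinds of files apart is to check the file's key-value metadata for the `ARROW:schema` entry, which is the key arrow-rs uses when ArrowWriter embeds the serialized arrow schema. The sketch below is only an illustration under assumptions: `KeyValue` here is a hypothetical stand-in that mirrors the shape of the parquet crate's key-value entries, and `has_arrow_schema` is a helper name invented for this example, not part of any library.

```rust
/// Key under which arrow-rs stores the serialized arrow schema.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Hypothetical stand-in for a parquet key-value metadata entry
/// (assumption: a string key with an optional string value).
#[derive(Debug, Clone)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Returns true if the metadata carries a non-empty `ARROW:schema` entry.
/// The argument has the same shape as what `key_value_metadata()` yields.
fn has_arrow_schema(key_value_metadata: Option<&Vec<KeyValue>>) -> bool {
    key_value_metadata
        .map(|entries| {
            entries
                .iter()
                .any(|kv| kv.key == ARROW_SCHEMA_META_KEY && kv.value.is_some())
        })
        .unwrap_or(false)
}
```

With real files, the same check against `file_metadata.key_value_metadata()` would be expected to return true for ArrowWriter output (default options) and false for ParquetSink output, per the behavior described in this report.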

Expected behavior

Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.

Additional context

No response


Labels: bug (Something isn't working)
