Describe the bug
We have been using two parquet writers: ArrowWriter and ParquetSink (parallelized writes). We discovered that ArrowWriter includes the arrow schema in the parquet metadata on write (by default), whereas datafusion's ParquetSink does not include the arrow schema in the file metadata. The missing arrow schema metadata matters because its inclusion aids with later reading.
To Reproduce
- Write parquet with ParquetSink.
- Write parquet with ArrowWriter (default options).
- Attempt to read the arrow schema from the parquet metadata, using the APIs below:
let file_metadata: FileMetaData = <get from file per API>;
let arrow_schema = parquet_to_arrow_schema(
    file_metadata.schema_descr(),
    file_metadata.key_value_metadata(),
);
- An error is returned for parquet written by ParquetSink.
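A quick way to tell the two files apart is to look for the `ARROW:schema` key in the footer key/value metadata, which is the key the Rust parquet crate uses when it embeds the arrow schema. Below is a minimal, std-only sketch of that check. The `KeyValue` struct here is a hypothetical stand-in that mirrors the shape of the crate's key/value entry, and `has_embedded_arrow_schema` is an illustrative helper name, not an existing API.

```rust
/// Key under which the Rust parquet crate stores the serialized arrow schema.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Hypothetical stand-in for one footer key/value entry (assumption:
/// same shape as the entries returned by `key_value_metadata()`).
#[derive(Debug, Clone)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Returns true when the key/value metadata carries an embedded arrow schema.
/// Takes `Option<&Vec<_>>` to mirror the shape of `key_value_metadata()`.
fn has_embedded_arrow_schema(kv: Option<&Vec<KeyValue>>) -> bool {
    match kv {
        // No key/value metadata at all: nothing can be embedded.
        None => false,
        // Look for the schema key with a non-empty payload slot.
        Some(entries) => entries
            .iter()
            .any(|e| e.key == ARROW_SCHEMA_META_KEY && e.value.is_some()),
    }
}
```

Per the behavior described above, this check would be expected to return `true` for the file written by ArrowWriter (default options) and `false` for the file written by ParquetSink.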
Expected behavior
Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.
Additional context
No response