ParquetRecordBatchStream Should Return the Projected Schema #4023

@msalib

Description

Describe the bug

Say you're asynchronously reading a Parquet file from S3, and that file has metadata (like "created by"). There's an inconsistency:

  • ParquetRecordBatchStream::schema produces a Schema object that includes that metadata.
  • But the RecordBatches yielded by ParquetRecordBatchStream have schemas that do not include it.

The problem is that if you create an ArrowWriter using the first schema and then try to write batches from the stream to it, the schemas won't match (the writer expects metadata, but each batch carries a schema without it).
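
A minimal reproduction sketch (reading a local file rather than S3 to keep it self-contained; any AsyncFileReader shows the same behavior; "input.parquet" and "output.parquet" are placeholder paths and the exact error text may vary by release):

```rust
use futures::StreamExt;
use parquet::arrow::{ArrowWriter, ParquetRecordBatchStreamBuilder};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A local file stands in for an S3-backed AsyncFileReader here.
    let input = File::open("input.parquet").await?;
    let mut stream = ParquetRecordBatchStreamBuilder::new(input).await?.build()?;

    // The stream reports a schema that carries the file's key/value metadata.
    let stream_schema = stream.schema().clone();
    println!("stream schema metadata: {:?}", stream_schema.metadata());

    // Build a writer against that schema.
    let output = std::fs::File::create("output.parquet")?;
    let mut writer = ArrowWriter::try_new(output, stream_schema, None)?;

    while let Some(batch) = stream.next().await {
        let batch = batch?;
        // Each yielded batch has a schema without that metadata...
        println!("batch schema metadata: {:?}", batch.schema().metadata());
        // ...so this write fails with a schema-mismatch error.
        writer.write(&batch)?;
    }
    writer.close()?;
    Ok(())
}
```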

Expected behavior

I'd expect that either:

  • ParquetRecordBatchStream::schema produces a Schema without metadata, or
  • the RecordBatches produced by ParquetRecordBatchStream have the exact same schema as what ::schema returns, or
  • ArrowWriter tolerates its supplied schema differing in metadata from the schemas of the batches passed to write().
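
Until one of those changes lands, a possible caller-side workaround is to build the ArrowWriter against a metadata-free copy of the stream's schema, so it matches what the batches actually carry. A sketch, assuming the writer only rejects batches because of the metadata difference (strip_metadata is a hypothetical helper, not part of the crate):

```rust
use std::sync::Arc;

use arrow::datatypes::{Schema, SchemaRef};

// Hypothetical helper: rebuild a schema from its fields only, dropping the
// key/value metadata so it equals the schemas attached to the yielded batches.
fn strip_metadata(schema: &SchemaRef) -> SchemaRef {
    Arc::new(Schema::new(schema.fields().to_vec()))
}
```

With that, constructing the writer as `ArrowWriter::try_new(output, strip_metadata(stream.schema()), None)?` should accept the batches from the reproduction above.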
