ParquetRecordBatchStream Should Return the Projected Schema #4023

@msalib

Description

Describe the bug

Say you're asynchronously reading a Parquet file from S3, and that file has metadata (like "created by"). There's an inconsistency:

  • ParquetRecordBatchStream::schema produces a Schema object that includes that metadata.
  • But the RecordBatches yielded by ParquetRecordBatchStream have schemas that do not include it.

The problem is that if you create an ArrowWriter using the first schema and then try to write batches from the stream to it, the schemas won't match (the writer expects metadata, but each batch carries a schema without it).
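
A minimal reproduction sketch (reading a local file rather than S3 to keep it self-contained; any AsyncFileReader shows the same behavior; "input.parquet" and "output.parquet" are placeholder paths and the exact error text may vary by release):

```rust
use futures::StreamExt;
use parquet::arrow::{ArrowWriter, ParquetRecordBatchStreamBuilder};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A local file stands in for an S3-backed AsyncFileReader here.
    let input = File::open("input.parquet").await?;
    let mut stream = ParquetRecordBatchStreamBuilder::new(input).await?.build()?;

    // The stream reports a schema that carries the file's key/value metadata.
    let stream_schema = stream.schema().clone();
    println!("stream schema metadata: {:?}", stream_schema.metadata());

    // Build a writer against that schema.
    let output = std::fs::File::create("output.parquet")?;
    let mut writer = ArrowWriter::try_new(output, stream_schema, None)?;

    while let Some(batch) = stream.next().await {
        let batch = batch?;
        // Each yielded batch has a schema without that metadata...
        println!("batch schema metadata: {:?}", batch.schema().metadata());
        // ...so this write fails with a schema-mismatch error.
        writer.write(&batch)?;
    }
    writer.close()?;
    Ok(())
}
```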

Expected behavior

I'd expect that either:

  • ParquetRecordBatchStream::schema produces a Schema without metadata, or
  • the RecordBatches produced by ParquetRecordBatchStream have the exact same schema as what ::schema returns, or
  • ArrowWriter tolerates its supplied schema differing in metadata from the schemas of the batches passed to write().
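
Until one of those changes lands, a possible caller-side workaround is to build the ArrowWriter against a metadata-free copy of the stream's schema, so it matches what the batches actually carry. A sketch, assuming the writer only rejects batches because of the metadata difference (strip_metadata is a hypothetical helper, not part of the crate):

```rust
use std::sync::Arc;

use arrow::datatypes::{Schema, SchemaRef};

// Hypothetical helper: rebuild a schema from its fields only, dropping the
// key/value metadata so it equals the schemas attached to the yielded batches.
fn strip_metadata(schema: &SchemaRef) -> SchemaRef {
    Arc::new(Schema::new(schema.fields().to_vec()))
}
```

With that, constructing the writer as `ArrowWriter::try_new(output, strip_metadata(stream.schema()), None)?` should accept the batches from the reproduction above.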
