Avoid reading entire stream to determine schema of arrow file #6368

@jonmmease

Follow-up to #6337.

Currently, when reading an Arrow file from a stream, the entire stream is parsed as a file just to determine the schema:

https://github.com/apache/arrow-datafusion/blob/8a47c42096311cf9b6191cfb9d96e2d9ba3a630d/datafusion/core/src/datasource/file_format/arrow.rs#L60-L63

As a result, the stream is parsed twice: once to determine the schema, and again later to actually build RecordBatches from it.

Can we be more efficient here by only looking as far into the stream as necessary to read the schema?
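To make the idea concrete, here is a minimal sketch (not the DataFusion implementation) of reading only the leading bytes that carry the schema. It assumes the Arrow IPC file layout: the `ARROW1` magic padded to 8 bytes, followed by an encapsulated message with an optional `0xFFFFFFFF` continuation marker and a little-endian 32-bit metadata length. The function name `read_schema_message_bytes` is hypothetical, and decoding the returned flatbuffer bytes into a `Schema` is left to the Arrow IPC crate.

```rust
use std::io::{self, Read};

/// Magic bytes at the start of an Arrow IPC file (followed by 2 padding bytes).
const ARROW_MAGIC: &[u8; 6] = b"ARROW1";
/// Continuation marker that precedes the length prefix in newer IPC versions.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: reads just enough of `reader` to return the raw
/// flatbuffer bytes of the first message, which is expected to be the schema.
/// Nothing past the end of that message is consumed.
fn read_schema_message_bytes<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    // 6 magic bytes plus 2 bytes of padding to reach an 8-byte boundary.
    let mut prefix = [0u8; 8];
    reader.read_exact(&mut prefix)?;
    if !prefix.starts_with(ARROW_MAGIC) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing ARROW1 magic",
        ));
    }

    // Either the continuation marker or, in the legacy layout, the length itself.
    let mut word = [0u8; 4];
    reader.read_exact(&mut word)?;
    if word == CONTINUATION_MARKER {
        reader.read_exact(&mut word)?;
    }

    let meta_len = i32::from_le_bytes(word);
    if meta_len <= 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "invalid schema message length",
        ));
    }

    // Read exactly the metadata bytes; record batches further on stay unread.
    let mut meta = vec![0u8; meta_len as usize];
    reader.read_exact(&mut meta)?;
    Ok(meta)
}
```

With an approach like this, schema inference touches only the header and the first message, so the cost no longer grows with the number of record batches in the stream. An async byte stream would need the same logic applied to buffered chunks rather than a blocking `Read`.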

    Labels

    enhancement (New feature or request), performance (Make DataFusion faster)