Loading from Arrow files #5594

@andrewthad

This is not necessarily a feature request. Rather, it's a request either for a feature to be added or for improved documentation clarifying that the feature is not available. To my understanding, DataFusion (at the CLI at least) cannot read from Apache Arrow files. There are two different kinds of Arrow files: the .arrow file (which has a footer with metadata about block positions) and the .arrows "streaming" file (which lacks the footer). I've tried out several CREATE EXTERNAL TABLE invocations:

CREATE EXTERNAL TABLE foo stored as ARROW LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as ARROWS LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as FEATHER LOCATION foo.arrow

They all give an "Unable to find factory for ..." error. After looking through more of the documentation for a while and paying attention to what wasn't explicitly said, I realized that Arrow files are not supported as a form of input. I think that, if this is the case, it should be mentioned explicitly in the documentation. DataFusion's documentation is misleading on this point: it does not make clear that Arrow is an internal implementation detail, not an external-facing way to communicate with a producer of data. From the readme on GitHub:

Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem

I read this as meaning that DataFusion can consume data in the Arrow format. Flight is a tool specifically for the purpose of shuffling Arrow-formatted data around on a network, so it's hard to interpret this as meaning anything else. Perhaps this was a goal at some point, or maybe it's possible to do this, but it's undocumented.

Here are three mutually exclusive possibilities for improving this situation:

  • Document that Arrow files are not supported.
  • Document that Arrow files are supported (maybe they are and I couldn't figure it out!).
  • Support Arrow files as a source of data.
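
For anyone unsure which of the two kinds of Arrow file they have, the difference is visible in the framing bytes. Below is a minimal sketch in Python (standard library only), assuming the layout described in the Arrow IPC format documentation: the file format starts and ends with the magic string ARROW1, while a stream starts directly with an encapsulated message, which in current versions begins with the 4-byte continuation marker 0xFFFFFFFF. The helper name sniff_arrow_kind is made up for illustration; it only looks at a few bytes and does not validate the contents.

```python
import os

# Magic string that opens and closes an Arrow IPC *file* (the footer variant).
ARROW_MAGIC = b"ARROW1"
# Continuation marker that opens each encapsulated message in an IPC *stream*.
# Assumption: streams written by very old Arrow versions lack this marker
# and will be reported as "unknown" by this sketch.
CONTINUATION = b"\xff\xff\xff\xff"


def sniff_arrow_kind(path):
    """Guess whether `path` is an Arrow IPC file, a stream, or neither.

    Hypothetical helper: it inspects framing bytes only.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(len(ARROW_MAGIC))
        tail = b""
        # Only look for the trailing magic if the file is long enough
        # to hold both the leading and the trailing copy.
        if size >= 2 * len(ARROW_MAGIC):
            f.seek(-len(ARROW_MAGIC), os.SEEK_END)
            tail = f.read(len(ARROW_MAGIC))
    if head == ARROW_MAGIC and tail == ARROW_MAGIC:
        return "file"
    if head[:4] == CONTINUATION:
        return "stream"
    return "unknown"
```

Calling sniff_arrow_kind("foo.arrow") returns one of "file", "stream", or "unknown". This says nothing about whether DataFusion can read either kind; it only helps when reporting which one was tried.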

Labels: enhancement (New feature or request)