
[DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc)  #13456


Description

@alamb

Is your feature request related to a problem or challenge?

I personally think making it easy to use DataFusion with the "open data lake" stack is very important over the next few months.

@julienledem wrote up a very nice piece describing "The advent of the Open Data Lake".

The high-level idea is to make it really easy for people to build systems that query (quickly!) Parquet files stored on remote object stores, including tables managed by Apache Iceberg, Delta Lake, Hudi, etc.

You can already use DataFusion (and datafusion-cli) to query such data, but it takes non-trivial effort to configure and tune for good performance. My idea is to make it easier to do so / make DataFusion work better out of the box.
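
For illustration, here is a minimal sketch of the kind of setup that is required today (the bucket name, table name, and path are hypothetical; it assumes the `object_store` crate with its `aws` feature and AWS credentials in the environment):

```rust
use std::sync::Arc;

use datafusion::prelude::*;
use object_store::aws::AmazonS3Builder;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = SessionContext::new();

    // Hypothetical bucket name -- substitute your own.
    let bucket = "my-data-lake";

    // Build an S3 object store from AWS_* environment variables
    // (credentials, region, etc.).
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .build()?;

    // Register the store so s3://my-data-lake/... URLs resolve.
    let url = Url::parse(&format!("s3://{bucket}"))?;
    ctx.register_object_store(&url, Arc::new(s3));

    // Expose the remote Parquet data as a table and query it with SQL.
    ctx.register_parquet(
        "events",                          // hypothetical table name
        &format!("s3://{bucket}/events/"), // hypothetical path
        ParquetReadOptions::default(),
    )
    .await?;

    ctx.sql("SELECT count(*) FROM events").await?.show().await?;
    Ok(())
}
```

Making this work well also means tuning things like credential handling, request concurrency, and metadata caching, which today are left to the user.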

With that as a building block, people could build applications and systems targeting specific use cases.

I don't yet fully understand where we currently stand on this goal, but I wanted to start the discussion.

Describe the solution you'd like

In my mind, the specific work entails things like:

Describe alternatives you've considered

One specific item, brought up by @MrPowers, would be to try DataFusion on the "10 billion row challenge" described in https://dataengineeringcentral.substack.com/p/10-billion-row-challenge-duckdb-vs .

I suspect the results would be non-ideal at first, but trying it to figure out what the challenges are would help us focus our efforts.

Additional context

No response

Labels

enhancement (New feature or request) · help wanted (Extra attention is needed)
