[DISCUSSION] Challenge: Make DataFusion the fastest engine in ClickBench with custom file format #13448

@alamb

Is your feature request related to a problem or challenge?

This is a crazy idea

Now that DataFusion is the fastest engine for Parquet in ClickBench, a natural follow-on question is “what would it take to make it the fastest overall engine?”

Describe the solution you'd like

TLDR is that I think it needs a special file format. A custom format is fine and consistent with other systems in ClickBench which use various proprietary formats.

So, as a fascinating experiment / academic project, someone could design / hack up a “ClickBench” file format and DataFusion TableProvider, specifically designed for getting the fastest ClickBench results.

While I suspect this format would not be particularly general purpose, I think it would show how easy it is to build custom formats for particular use cases with DataFusion (you don’t have to worry about the rest of the query engine machinery).
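
To make the division of labor concrete, here is a minimal, std-only Rust sketch of the idea. It is not DataFusion's API: `ColumnSource`, `ToyColumnarTable`, `Column`, and `sum_int_column` are all hypothetical names invented for illustration, and the column names used with them are made up. In DataFusion itself the corresponding hook is the `TableProvider` trait, whose `scan` method produces an execution plan over Arrow data; the sketch only mirrors the shape of that contract (the format returns the columns a query asks for, and generic engine code does the rest).

```rust
/// A single column of values. A real format would hand back Arrow arrays;
/// plain vectors keep this sketch free of external crates.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

/// Hypothetical stand-in for the role a table provider plays: the custom
/// format only describes its columns and returns the ones requested.
pub trait ColumnSource {
    fn column_names(&self) -> Vec<String>;
    /// `projection` lists the column indices the query needs; `None` means all.
    fn scan(&self, projection: Option<&[usize]>) -> Vec<Column>;
}

/// Toy columnar table that keeps every column in its own buffer, so a scan
/// can skip columns the query never touches.
pub struct ToyColumnarTable {
    names: Vec<String>,
    columns: Vec<Column>,
}

impl ToyColumnarTable {
    pub fn new(cols: Vec<(&str, Column)>) -> Self {
        let mut names = Vec::new();
        let mut columns = Vec::new();
        for (name, column) in cols {
            names.push(name.to_string());
            columns.push(column);
        }
        Self { names, columns }
    }
}

impl ColumnSource for ToyColumnarTable {
    fn column_names(&self) -> Vec<String> {
        self.names.clone()
    }

    fn scan(&self, projection: Option<&[usize]>) -> Vec<Column> {
        match projection {
            // Only materialize the requested columns.
            Some(indices) => indices.iter().map(|&i| self.columns[i].clone()).collect(),
            None => self.columns.clone(),
        }
    }
}

/// Stand-in for "the rest of the query engine": a generic aggregation that
/// works against any `ColumnSource` and knows nothing about the format.
pub fn sum_int_column(source: &dyn ColumnSource, name: &str) -> Option<i64> {
    let idx = source.column_names().iter().position(|n| n == name)?;
    let projection = [idx];
    let mut cols = source.scan(Some(&projection[..]));
    match cols.pop()? {
        Column::Int64(values) => Some(values.iter().sum()),
        Column::Utf8(_) => None,
    }
}
```

Building a `ToyColumnarTable` with an integer column and a string column and calling `sum_int_column` on the integer column reads only that one column. The format author writes `scan`; everything above it is shared code, which is the property the experiment would rely on.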

Describe alternatives you've considered

No response

Additional context

This was inspired by @pauldix talking about using DataFusion to innovate at “the edges” of database design https://twitter.com/pauldix/status/1855330035974160483
