Skip to content

Support any table nesting level in SQL queries (i.e SELECT * FROM one.two.three.four.five) #13822

@phillipleblanc

Description

@phillipleblanc

Is your feature request related to a problem or challenge?

Currently DataFusion only support queries with 3-levels of nesting, i.e. SELECT * FROM catalog.schema.table. Many catalog providers (i.e. Iceberg) allow any arbitrary level of nesting, i.e.:

.
├── benchmarks
│   └── tpcds
│       ├── foo
│       └── bar
├── spice
│   ├── tpch
│   │   ├── orders
│   │   └── customers
│   ├── info
│   └── extra
│       └── tpch_orders_metadata
└── one
    └── two
        └── three
            └── four
                └── five

Attempting to represent this in DataFusion is tricky and several of the alternatives I considered (see below) are poor UX.

Describe the solution you'd like

I would like to be able to write a catalog provider that allows users to select any of the tables in the Iceberg catalog with a natural dot separated syntax. sqlparser-rs already supports parsing this, creating an ObjectName which is a Vec<Ident>.

There is a function in DataFusion idents_to_table_reference that is responsible for transforming the Vec<Ident> into a TableReference.

Its current implement looks like:

pub(crate) fn idents_to_table_reference(
    idents: Vec<Ident>,
    enable_normalization: bool,
) -> Result<TableReference> {
    let mut taker = IdentTaker::new(idents, enable_normalization);

    match taker.len() {
        1 => {
            let table = taker.take();
            Ok(TableReference::bare(table))
        }
        2 => {
            let table = taker.take();
            let schema = taker.take();
            Ok(TableReference::partial(schema, table))
        }
        3 => {
            let table = taker.take();
            let schema = taker.take();
            let catalog = taker.take();
            Ok(TableReference::full(catalog, schema, table))
        }
        _ => plan_err!(
            "Unsupported compound identifier '{}'. Expected 1, 2 or 3 parts, got {}",
            taker,
            taker.len()
        ),
    }
}

Instead of erroring on >3 idents, I propose that we concatenate all of the "middle" namespaces into the schema part:

        // This won't compile as is, but demonstrates the idea
        _ => {
            // Concatenate all of the middle identifiers with a dot and put into the "schema" field
            let table = taker.take(enable_normalization);
            let schema = idents
                .iter()
                .skip(1)
                .take(idents.len() - 2)
                .map(|ident| {
                    IdentNormalizer::new(enable_normalization).normalize(ident.clone())
                })
                .collect::<Vec<_>>()
                .join(".");
            let catalog = taker.take(enable_normalization);
            Ok(TableReference::full(catalog, schema, table))
        }

That would allow us to do SELECT * FROM one.two.three.four.five and have it be converted into a DF TableReference with:
catalog: one
schema: two.three.four
table: five

And any catalog provider implementations can know to create "schemas" to match this format to support arbitrarily nested namespaces.

Describe alternatives you've considered

I've considered in my implementation when integrating with a catalog provider to concatenate the middle namespaces with a _ in the schema part so that SELECT * FROM one.two.three.four.five would become SELECT * FROM one.two_three_four.five.

I've also considered concatenating the middle namspaces with . and not changing DataFusion, but that would require doing SELECT * FROM one."two.three.four".five which is also not an ideal UX.

Another alternative if we didn't want this to be the default, is to allow users to customize the behavior of the object_name_to_table_reference function somehow.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions