-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
Currently DataFusion only support queries with 3-levels of nesting, i.e. SELECT * FROM catalog.schema.table. Many catalog providers (i.e. Iceberg) allow any arbitrary level of nesting, i.e.:
.
├── benchmarks
│ └── tpcds
│ ├── foo
│ └── bar
├── spice
│ ├── tpch
│ │ ├── orders
│ │ └── customers
│ ├── info
│ └── extra
│ └── tpch_orders_metadata
└── one
└── two
└── three
└── four
└── fiveAttempting to represent this in DataFusion is tricky and several of the alternatives I considered (see below) are poor UX.
Describe the solution you'd like
I would like to be able to write a catalog provider that allows users to select any of the tables in the Iceberg catalog with a natural dot separated syntax. sqlparser-rs already supports parsing this, creating an ObjectName which is a Vec<Ident>.
There is a function in DataFusion idents_to_table_reference that is responsible for transforming the Vec<Ident> into a TableReference.
Its current implement looks like:
pub(crate) fn idents_to_table_reference(
idents: Vec<Ident>,
enable_normalization: bool,
) -> Result<TableReference> {
let mut taker = IdentTaker::new(idents, enable_normalization);
match taker.len() {
1 => {
let table = taker.take();
Ok(TableReference::bare(table))
}
2 => {
let table = taker.take();
let schema = taker.take();
Ok(TableReference::partial(schema, table))
}
3 => {
let table = taker.take();
let schema = taker.take();
let catalog = taker.take();
Ok(TableReference::full(catalog, schema, table))
}
_ => plan_err!(
"Unsupported compound identifier '{}'. Expected 1, 2 or 3 parts, got {}",
taker,
taker.len()
),
}
}Instead of erroring on >3 idents, I propose that we concatenate all of the "middle" namespaces into the schema part:
// This won't compile as is, but demonstrates the idea
_ => {
// Concatenate all of the middle identifiers with a dot and put into the "schema" field
let table = taker.take(enable_normalization);
let schema = idents
.iter()
.skip(1)
.take(idents.len() - 2)
.map(|ident| {
IdentNormalizer::new(enable_normalization).normalize(ident.clone())
})
.collect::<Vec<_>>()
.join(".");
let catalog = taker.take(enable_normalization);
Ok(TableReference::full(catalog, schema, table))
}That would allow us to do SELECT * FROM one.two.three.four.five and have it be converted into a DF TableReference with:
catalog: one
schema: two.three.four
table: five
And any catalog provider implementations can know to create "schemas" to match this format to support arbitrarily nested namespaces.
Describe alternatives you've considered
I've considered in my implementation when integrating with a catalog provider to concatenate the middle namespaces with a _ in the schema part so that SELECT * FROM one.two.three.four.five would become SELECT * FROM one.two_three_four.five.
I've also considered concatenating the middle namspaces with . and not changing DataFusion, but that would require doing SELECT * FROM one."two.three.four".five which is also not an ideal UX.
Another alternative if we didn't want this to be the default, is to allow users to customize the behavior of the object_name_to_table_reference function somehow.
Additional context
No response