Skip to content

Conversation

@returnString
Copy link
Contributor

@returnString returnString commented Mar 21, 2021

This is an implementation of catalog and schema providers to support table namespacing (see the design doc).

I'm creating this draft PR as a supporting implementation for the proposal, to prove out that the work can be done whilst minimising API churn and still allowing for use cases that don't care at all about the notion of catalogs or schemas; in this new setup, the default namespace is datafusion.public, which will be created automatically with the default execution context config and allow for table registration.

Highlights

  • Datasource map removed in execution context state, replaced with catalog map
  • Execution context allows for registering new catalog providers
  • Catalog providers can be queried for their constituent schema providers
  • Schema providers can be queried for table providers, similarly to the old datasource map
  • Includes basic implementations of CatalogProvider and SchemaProvider backed by hashmaps
  • New TableReference enum maps to various ways of referring to a table in sql
    • Bare: my_table
    • Partial: schema.my_table
    • Full: catalog.schema.my_table
  • Given a default catalog and schema, TableReference instances of any variant can be converted to a ResolvedTableReference, which always include all three components

@github-actions
Copy link

@Dandandan
Copy link
Contributor

This is truly awesome @returnString !
Apart from some tests & remaining cleanup (e.g. of the "quick hack") I really like the direction this is going!

@andygrove
Copy link
Member

Thanks @returnString and I appreciate the well-written design doc to explain the PR. I didn't go through the code in great detail but I like the design. I noted a couple of unwraps in the code. It would be good to document why they are safe or consider having those methods return Result if they are not safe.

@returnString
Copy link
Contributor Author

returnString commented Mar 21, 2021

Thanks both :)

I noted a couple of unwraps in the code. It would be good to document why they are safe or consider having those methods return Result if they are not safe.

I've just audited all the unwraps that are introduced in these commits, and they fall into two categories:

  • test cases that aren't already set up to return Result<()>
  • unwrapping guard objects for Mutex/RwLock - afaik these only error if the lock is poisoned and this usage is consistent with existing lock usage e.g. state access in the ExecutionContext

Apart from some tests & remaining cleanup (e.g. of the "quick hack") I really like the direction this is going!

I think the quick hack (context: retrieving a list of table names from the execution ctx, assuming a flat namespace) can probably be removed outright, as ctx.catalog("my_db")?.schema("my_schema")?.table_names() would be a much more explicit way of doing this going forward. As far as I can tell, ExecutionContext::tables isn't referenced at all in the codebase internally, but this would introduce a bit more API breakage.

As for tests, yes, the coverage so far is basically "didn't break anything" and "all three types of reference to a given table in the default catalog/schema resolve correctly", so I'll try and plan out some more in-depth testing covering e.g. tables of the same name in multiple schemas/catalogs.

@codecov-io
Copy link

codecov-io commented Mar 21, 2021

Codecov Report

Merging #9762 (03048c6) into master (29feea0) will increase coverage by 0.00%.
The diff coverage is 80.36%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #9762    +/-   ##
========================================
  Coverage   82.59%   82.59%            
========================================
  Files         248      252     +4     
  Lines       58294    59004   +710     
========================================
+ Hits        48149    48737   +588     
- Misses      10145    10267   +122     
Impacted Files Coverage Δ
rust/datafusion/examples/flight_server.rs 0.00% <0.00%> (ø)
rust/datafusion/examples/simple_udaf.rs 0.00% <ø> (ø)
rust/datafusion/src/datasource/datasource.rs 100.00% <ø> (ø)
rust/datafusion/src/datasource/memory.rs 85.15% <ø> (ø)
rust/datafusion/src/execution/dataframe_impl.rs 89.10% <0.00%> (-1.03%) ⬇️
rust/datafusion/src/optimizer/constant_folding.rs 92.30% <0.00%> (-0.36%) ⬇️
...datafusion/src/optimizer/hash_build_probe_order.rs 53.60% <0.00%> (-1.72%) ⬇️
...t/datafusion/src/optimizer/projection_push_down.rs 98.66% <ø> (ø)
rust/datafusion/src/physical_plan/mod.rs 88.00% <ø> (ø)
rust/parquet/src/basic.rs 88.59% <ø> (+1.36%) ⬆️
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81d6724...03048c6. Read the comment docs.

@returnString returnString marked this pull request as ready for review March 21, 2021 22:02
@returnString
Copy link
Contributor Author

Whilst I still want to try and conjure up a few more test cases, I think this PR is now in a state where I'm happy to receive actual code reviews, so I've removed the draft status :)

@alamb
Copy link
Contributor

alamb commented Mar 22, 2021

I plan to review this PR later today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all this looks great. Thank you @returnString !

I went through it fairly carefully and I have some suggestions but nothing that would prevent us from merging this in my opinion.

start.elapsed().as_millis()
);
ctx.register_table(table, Arc::new(memtable));
ctx.register_table(*table, Arc::new(memtable))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to make register_table fallible is a breaking change, but a very reasonable one I think.

// specific language governing permissions and limitations
// under the License.

//! Describes the interface and built-in implementations of catalogs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fn as_any(&self) -> &dyn Any;

/// Retrieves the list of available schema names in this catalog.
fn schema_names(&self) -> Vec<String>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about returning something more like Vec<&str> to prevent requiring a copy?

Ideally it would be nice if we could do something like

fn schema_names(&self) -> impl Iterator<Item=&str>

As all uses of the results here will need to iterate over the names I suspect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I iniitally tried to implement these returning &[&str] but with the threading requirements I struggled to get anything working that way. I think this would require a larger refactoring to enable, but I'm not 100% sure on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec<&str> might be possible

}

/// Simple in-memory implementation of a catalog.
pub struct MemoryCatalogProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some other PR perhaps we can put the concrete implementations into their own modules. I don't think this one is big enough to warrant that yet, however; I just wanted to point it out

impl<'a> TryFrom<&'a sqlparser::ast::ObjectName> for TableReference<'a> {
type Error = DataFusionError;

fn try_from(value: &'a sqlparser::ast::ObjectName) -> Result<Self, Self::Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

/// Selects a name for the default catalog and schema
pub fn with_default_catalog_and_schema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this API. 👍

I wonder if we need an API for getting default_catalog and default_schema?

df.collect().await.expect_err("query not supported");
Ok(())
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are excellent tests . 👍

&mut ctx,
"SELECT cat, SUM(i) AS total FROM (
SELECT i, 'a' AS cat FROM catalog_a.schema_a.table_a
UNION ALL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan UNION is already being used!

for table_ref in &[
"nonexistentschema.aggregate_test_100",
"nonexistentcatalog.public.aggregate_test_100",
"way.too.many.namespaces.as.ident.prefixes.aggregate_test_100",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Mar 23, 2021

@returnString do you plan anything else for this PR? I think it is ready to merge but I wanted to check to see if you think there is anything else left to do?

@returnString
Copy link
Contributor Author

@returnString do you plan anything else for this PR? I think it is ready to merge but I wanted to check to see if you think there is anything else left to do?

I was looking to see if I could improve the test setup any further, but on reflection I'm actually pretty happy with the state of things; I don't think we're missing any use cases in the coverage (at least, ones I can conceive of right now). So from my side, happy with things as they are :)

@alamb alamb closed this in 14441db Mar 23, 2021
@alamb
Copy link
Contributor

alamb commented Mar 23, 2021

Thanks again @returnString ! This is a great feature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants