-
Couldn't load subscription status.
- Fork 3.9k
ARROW-12037: [Rust] [DataFusion] Support catalogs and schemas for table namespacing #9762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
All tests passing with fallible table management
|
This is truly awesome @returnString ! |
|
Thanks @returnString and I appreciate the well-written design doc to explain the PR. I didn't go through the code in great detail but I like the design. I noted a couple of unwraps in the code. It would be good to document why they are safe or consider having those methods return |
|
Thanks both :)
I've just audited all the unwraps that are introduced in these commits, and they fall into two categories:
I think the quick hack (context: retrieving a list of table names from the execution ctx, assuming a flat namespace) can probably be removed outright, as As for tests, yes, the coverage so far is basically "didn't break anything" and "all three types of reference to a given table in the default catalog/schema resolve correctly", so I'll try and plan out some more in-depth testing covering e.g. tables of the same name in multiple schemas/catalogs. |
Codecov Report
@@ Coverage Diff @@
## master #9762 +/- ##
========================================
Coverage 82.59% 82.59%
========================================
Files 248 252 +4
Lines 58294 59004 +710
========================================
+ Hits 48149 48737 +588
- Misses 10145 10267 +122
Continue to review full report at Codecov.
|
|
Whilst I still want to try and conjure up a few more test cases, I think this PR is now in a state where I'm happy to receive actual code reviews, so I've removed the draft status :) |
|
I plan to review this PR later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All in all this looks great. Thank you @returnString !
I went through it fairly carefully and I have some suggestions but nothing that would prevent us from merging this in my opinion.
| start.elapsed().as_millis() | ||
| ); | ||
| ctx.register_table(table, Arc::new(memtable)); | ||
| ctx.register_table(*table, Arc::new(memtable))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change to make register_table fallible is a breaking change, but a very reasonable one I think.
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| //! Describes the interface and built-in implementations of catalogs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| fn as_any(&self) -> &dyn Any; | ||
|
|
||
| /// Retrieves the list of available schema names in this catalog. | ||
| fn schema_names(&self) -> Vec<String>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about returning something more like Vec<&str> to prevent requiring a copy?
Ideally it would be nice if we could do something like
fn schema_names(&self) -> impl Iterator<Item=&str>
As all uses of the results here will need to iterate over the names I suspect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I iniitally tried to implement these returning &[&str] but with the threading requirements I struggled to get anything working that way. I think this would require a larger refactoring to enable, but I'm not 100% sure on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vec<&str> might be possible
| } | ||
|
|
||
| /// Simple in-memory implementation of a catalog. | ||
| pub struct MemoryCatalogProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some other PR perhaps we can put the concrete implementations into their own modules. I don't think this one is big enough to warrant that yet, however; I just wanted to point it out
| impl<'a> TryFrom<&'a sqlparser::ast::ObjectName> for TableReference<'a> { | ||
| type Error = DataFusionError; | ||
|
|
||
| fn try_from(value: &'a sqlparser::ast::ObjectName) -> Result<Self, Self::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| } | ||
|
|
||
| /// Selects a name for the default catalog and schema | ||
| pub fn with_default_catalog_and_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this API. 👍
I wonder if we need an API for getting default_catalog and default_schema?
| df.collect().await.expect_err("query not supported"); | ||
| Ok(()) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are excellent tests . 👍
| &mut ctx, | ||
| "SELECT cat, SUM(i) AS total FROM ( | ||
| SELECT i, 'a' AS cat FROM catalog_a.schema_a.table_a | ||
| UNION ALL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan UNION is already being used!
| for table_ref in &[ | ||
| "nonexistentschema.aggregate_test_100", | ||
| "nonexistentcatalog.public.aggregate_test_100", | ||
| "way.too.many.namespaces.as.ident.prefixes.aggregate_test_100", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
@returnString do you plan anything else for this PR? I think it is ready to merge but I wanted to check to see if you think there is anything else left to do? |
I was looking to see if I could improve the test setup any further, but on reflection I'm actually pretty happy with the state of things; I don't think we're missing any use cases in the coverage (at least, ones I can conceive of right now). So from my side, happy with things as they are :) |
|
Thanks again @returnString ! This is a great feature |
This is an implementation of catalog and schema providers to support table namespacing (see the design doc).
I'm creating this draft PR as a supporting implementation for the proposal, to prove out that the work can be done whilst minimising API churn and still allowing for use cases that don't care at all about the notion of catalogs or schemas; in this new setup, the default namespace is
datafusion.public, which will be created automatically with the default execution context config and allow for table registration.Highlights
CatalogProviderandSchemaProviderbacked by hashmapsTableReferenceenum maps to various ways of referring to a table in sqlmy_tableschema.my_tablecatalog.schema.my_tableTableReferenceinstances of any variant can be converted to aResolvedTableReference, which always include all three components