-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Custom / Dynamic table provider factories #3311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@andygrove @alamb @houqp This was a quick and dirty implementation, but I thought I would throw it out there to collect your thoughts on it. If something like this is deemed acceptable and merged, we'd extend Ballista to use it as well, then either:
|
datafusion/sql/src/parser.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually I'd like to un-hard-code the built-in tables and just have a ListingTableFactory in the map so all these behave the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'd love to use the postgres with options syntax to pass an arbitrary hashmap for credentials, host & port, etc.
https://www.postgresql.org/docs/current/sql-createforeigntable.html
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
{ column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ]
On second thought, credentials could be controversial... forget I said that :)
|
I think this looks like a great start. Thanks @avantgardnerio |
|
Oh, also related to #2025 |
|
I guess I should mention @matthewmturner since I linked that ticket. |
|
@avantgardnerio This is very cool. Unfortunately I haven't been able to work on datafusion lately but I could definitely see integrating this into datafusion tui again when I start again. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also really like where this PR is heading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API is very cool -- I like it a lot 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, affirmation of general direction was what I was looking for, I'll clean it up and get it submitted, ty!
Codecov Report
@@ Coverage Diff @@
## master #3311 +/- ##
==========================================
+ Coverage 85.48% 85.51% +0.02%
==========================================
Files 294 294
Lines 54115 54120 +5
==========================================
+ Hits 46259 46279 +20
+ Misses 7856 7841 -15
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great -- thank you @avantgardnerio
It does contain the changes from #3333 as well, so I think we should probably merge #3333 first and then rebase this one.
I had several small comments / style suggestions, but overall I think this PR is basically ready to go. I didn't "approve" it as it has the #3333 changes in it as well and wanted to give you a chance to respond to suggestions
Something I suggest as well is a datafusion-example in https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples/examples
The rationale for doing so is:
- Serves as end-to-end test (that uses DataFusion's public API only) so we don't inadvertently break something if code is moved around (we did that in the past when we moved something and it became non-
pub🤦 ) as the scoping / visibility rules are different in crate / out of crate. - Serves as a documentation / starting point for others to actually use it.
Such an end to end can be done as a follow on PR (or we could just file a ticket to track the work)
.github/workflows/rust.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a left over from #3333 -- but I also don't think it hurts to leave it in this PR
datafusion/proto/src/logical_plan.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to avoid having these four strings hardcoded in several places, so that if we add another format, we don't have to make changes all over the place.
Maybe we could put it into a struct like BuiltInTableProviders or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I 100% agree, but I was hoping to do it in a follow-on PR, as there seems to be a lot of action going on with TableProviders right now, and I was hoping to not keep this one long-standing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential way to write this (and maybe reduce the diff) would be to add a new enum variant like "Dynamic":
pub enum FileType {
/// Newline-delimited JSON
NdJson,
...
/// Custom factory
Dynamic(String),
}This strategy might be more "idiomatic" as it encodes more information in the type system rather than a String
I don't feel strongly about this, I just wanted to offer it as an option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with that solution, but I think it makes more sense to go:
- enum (as it was)
- string (with hard-coded values)
- everything-as-a-TableProviderFactory (string with no hard-coded values)
If this PR is accepted, I will move immediately onto step #3.
319619b to
761f384
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Thanks @avantgardnerio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should create be async?
The TableProvider will have to know the Schema which will most likely have to be inferred by reading from the ObjectStore (or involve a network call otherwise if there is some sort of metastore involved).
|
Benchmark runs are scheduled for baseline = b175f9a and contender = bb08d31. bb08d31 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3310.
Rationale for this change
I would like to allow users to register custom table types at runtime.
What changes are included in this PR?
TableProviderFactoryFileType::Custom(String)to allow the parser to pass information about these tablesHashMapofTableProviderFactorys on theSessionContextto allow them to be registered programmatically but used dynamicallyAre there any user-facing changes?
They can now register custom tables.