-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat(datafusion-proto): allow TableSource to be serialized #16750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
f02ef93
to
ceeab4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @colinmarc -- the code here looks good
Can you please add a test to make sure we don't accidentally break this feature in the future?
9cdffc3
to
0dab84f
Compare
I added a test! Let me know if that seems like enough. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @colinmarc -- this is looking good now and the test is great.
I have a question about unwrap_provider
vs source_as_provider
Otherwise since this PR changes the API and will require existing users to upgrade, I think the only other thing missing in this PR is a note in the upgrade guide. Basically it can show a before/after of the code (you can copy/paste from this example)
Let me know if I can help or you run into trouble
@@ -45,6 +45,26 @@ impl DefaultTableSource { | |||
pub fn new(table_provider: Arc<dyn TableProvider>) -> Self { | |||
Self { table_provider } | |||
} | |||
|
|||
/// Attempt to downcast a TableSource to DefaultTableSource and access the | |||
/// TableProvider. This will only work with a TableSource created by DataFusion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need this method when we already have source_as_provider
.
If we do need the method, I think it would help to give some more context / pointers on how to use it and why it is different than the souce_as_provider
/// TableProvider. This will only work with a TableSource created by DataFusion. | |
/// TableProvider. This will only work with a TableSource created by [`provider_as_source`] | |
/// | |
/// This is commonly used to downcast the result of [`provider_as_source`] back to the TableProvider. |
Bonus points if we could figure out how to do an doc example showing how to use unwrap_provider
on the output of provider_as_source
(though I think the setup would be pretty tough)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not strictly necessary to keep it. My original plan was to remove source_as_provider
and replace it with this method. I thought that it should be associated with DefaultTableSource
, since that's actually what's being used to wrap/unwrap here.
Then I noticed that doing both downcasts in one function avoided adding an extra indentation level in the to_proto
logic, e.g.
if let Some(provider) = source_as_provider(source) {
if let Some(listing_table) = provider.downcast_ref::<ListingTable>() {
// ...
}
}
So I kept both. If you'd like I can do the following:
- Get rid of
source_as_provider
(it's only used in one place, where an explicit error might be better - trying to convert a logical scan of a TableSource into a physical plan should be an error) - Change calls to
provider_as_source(x)
toArc::new(DefaultTableSource::new(x))
- Document both with testable examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to keep everything consistent as the dichotomy between TableSource and TableProvider is already confusing enough
How about this for a proposal:
- We keep the new DefaultTableSource method
- We add the upgrade guide note on the changes to API
- We file a follow on ticket to improve the API (I can help here, and I am sure someone would do the code change if we file the ticket -- it doesn't have to be you)
As a follow on PR we can then discuss consolidating / removing source_as_provider
Get rid of source_as_provider (it's only used in one place, where an explicit error might be better - trying to convert a logical scan of a TableSource into a physical plan should be an error)
Can we please deprecate it instead like https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
Change calls to provider_as_source(x) to Arc::new(DefaultTableSource::new(x))
Sounds good -- or maybe we could make a method on DefaultTableSource
like DefaultTableSource::provider_as_source
to people didn't have to wrap it in an arc 🤔
Likewise, if you make this change please rename provider_as_source
Document both with testable examples.
Yes please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I pushed it as a second commit, happy to open a separate PR if you prefer.
@@ -87,7 +107,7 @@ impl TableSource for DefaultTableSource { | |||
} | |||
} | |||
|
|||
/// Wrap TableProvider in TableSource | |||
/// Wrap a TableProvider as a TableSource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Wrap a TableProvider as a TableSource. | |
/// Wrap a TableProvider as a TableSource. | |
/// | |
/// See [`source_as_provider`] to retrieve the original | |
/// TableSource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skipped this since I deprecated the method instead.
@@ -117,18 +117,18 @@ pub trait LogicalExtensionCodec: Debug + Send + Sync { | |||
|
|||
fn try_encode(&self, node: &Extension, buf: &mut Vec<u8>) -> Result<()>; | |||
|
|||
fn try_decode_table_provider( | |||
fn try_decode_table_source( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes sense to me to have the logical plan serialization be in terms of a TableSource (the logical part) rather than a Table Provider
https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.TableSource.html
I realize there are no existing docs on this method, but given we are changing it I think it would be helpful if we added some hints for other readers
Something like
fn try_decode_table_source( | |
/// Attempts to decode a [`TableSource`] | |
/// | |
/// Note that the default codec will serialize [`TableProvider`] as a [`DefaultTableProvider`] | |
/// you can use [`DefaultTableSource::unwrap_provider`] to recover the original `TableProvider`. | |
fn try_decode_table_source( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tweaked your suggestion, let me know if you're happy with it (the default codec isn't relevant here - this code only gets called with custom table sources / providers).
Somewhat unrelated, but maybe something like the following would be cleaner (untested)? trait TableProvider: TableSource {
// only scan and get_table_definition
} It would be a breaking change, but it's not a difficult migration (you just add an |
I agree this would be better still -- I suggest we look into this as a follow on PR (and file a ticket to prevent it from being lost) |
a8ca798
to
78ceeba
Compare
Done, thanks for the quick review! |
3734478
to
70d5e9f
Compare
dc6f885
to
e30dfe4
Compare
The doc tests were fiddly, sorry about that. I got them to pass locally now. |
Currently, only instances of `TableProvider` are considered by `LogicalExtensionCodec`, and are automatically wrapped in a `DefaultTableSource` when deserializing. That doesn't work with custom extensions. Fixes apache#16749
These replace `provider_as_source` and `source_as_provider`, and are somewhat more explicit.
e30dfe4
to
6b7d21b
Compare
Currently, only instances of
TableProvider
are considered byLogicalExtensionCodec
, and are automatically wrapped in aDefaultTableSource
when deserializing. That doesn't work with custom extensions.Which issue does this PR close?
TableSource
implementations fails #16749.Rationale for this change
It's currently impossible to serialize/deserialize custom
TableSource
implementations withLogicalExtensionCodec
.What changes are included in this PR?
LogicalExtensionCodec
is changed to have atry_{en,de}code_table_source
instead oftable_provider
. This is a breaking change.Are these changes tested?
This is covered by existing serde tests.
Are there any user-facing changes?
Yes, implementors of
LogicalExtensionCodec
downstream will be affected.