Skip to content

feat(datafusion-proto): allow TableSource to be serialized #16750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

colinmarc
Copy link

Currently, only instances of TableProvider are considered by LogicalExtensionCodec, and are automatically wrapped in a DefaultTableSource when deserializing. That doesn't work with custom extensions.

Which issue does this PR close?

Rationale for this change

It's currently impossible to serialize/deserialize custom TableSource implementations with LogicalExtensionCodec.

What changes are included in this PR?

LogicalExtensionCodec is changed to have a try_{en,de}code_table_source instead of table_provider. This is a breaking change.

Are these changes tested?

This is covered by existing serde tests.

Are there any user-facing changes?

Yes, implementors of LogicalExtensionCodec downstream will be affected.

@github-actions github-actions bot added core Core DataFusion crate catalog Related to the catalog crate proto Related to proto crate labels Jul 11, 2025
@colinmarc colinmarc force-pushed the serialize-table-source branch from f02ef93 to ceeab4f Compare July 11, 2025 15:11
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @colinmarc -- the code here looks good

Can you please add a test to make sure we don't accidentally break this feature in the future?

@colinmarc colinmarc force-pushed the serialize-table-source branch 2 times, most recently from 9cdffc3 to 0dab84f Compare July 11, 2025 21:43
@colinmarc
Copy link
Author

I added a test! Let me know if that seems like enough.

@alamb alamb added the api change Changes the API exposed to users of the crate label Jul 12, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @colinmarc -- this is looking good now and the test is great.

I have a question about unwrap_provider vs source_as_provider

Otherwise since this PR changes the API and will require existing users to upgrade, I think the only other thing missing in this PR is a note in the upgrade guide. Basically it can show a before/after of the code (you can copy/paste from this example)

https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md#datafusion-4900

Let me know if I can help or you run into trouble

@@ -45,6 +45,26 @@ impl DefaultTableSource {
pub fn new(table_provider: Arc<dyn TableProvider>) -> Self {
Self { table_provider }
}

/// Attempt to downcast a TableSource to DefaultTableSource and access the
/// TableProvider. This will only work with a TableSource created by DataFusion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need this method when we already have source_as_provider.

If we do need the method, I think it would help to give some more context / pointers on how to use it and why it is different than the souce_as_provider

Suggested change
/// TableProvider. This will only work with a TableSource created by DataFusion.
/// TableProvider. This will only work with a TableSource created by [`provider_as_source`]
///
/// This is commonly used to downcast the result of [`provider_as_source`] back to the TableProvider.

Bonus points if we could figure out how to do an doc example showing how to use unwrap_provider on the output of provider_as_source (though I think the setup would be pretty tough)

Copy link
Author

@colinmarc colinmarc Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not strictly necessary to keep it. My original plan was to remove source_as_provider and replace it with this method. I thought that it should be associated with DefaultTableSource, since that's actually what's being used to wrap/unwrap here.

Then I noticed that doing both downcasts in one function avoided adding an extra indentation level in the to_proto logic, e.g.

if let Some(provider) = source_as_provider(source) {
    if let Some(listing_table) = provider.downcast_ref::<ListingTable>() {
        // ...
    }
}

So I kept both. If you'd like I can do the following:

  • Get rid of source_as_provider (it's only used in one place, where an explicit error might be better - trying to convert a logical scan of a TableSource into a physical plan should be an error)
  • Change calls to provider_as_source(x) to Arc::new(DefaultTableSource::new(x))
  • Document both with testable examples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to keep everything consistent as the dichotomy between TableSource and TableProvider is already confusing enough

How about this for a proposal:

  1. We keep the new DefaultTableSource method
  2. We add the upgrade guide note on the changes to API
  3. We file a follow on ticket to improve the API (I can help here, and I am sure someone would do the code change if we file the ticket -- it doesn't have to be you)

As a follow on PR we can then discuss consolidating / removing source_as_provider

Get rid of source_as_provider (it's only used in one place, where an explicit error might be better - trying to convert a logical scan of a TableSource into a physical plan should be an error)

Can we please deprecate it instead like https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines

Change calls to provider_as_source(x) to Arc::new(DefaultTableSource::new(x))

Sounds good -- or maybe we could make a method on DefaultTableSource like DefaultTableSource::provider_as_source to people didn't have to wrap it in an arc 🤔

Likewise, if you make this change please rename provider_as_source

Document both with testable examples.

Yes please

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I pushed it as a second commit, happy to open a separate PR if you prefer.

@@ -87,7 +107,7 @@ impl TableSource for DefaultTableSource {
}
}

/// Wrap TableProvider in TableSource
/// Wrap a TableProvider as a TableSource.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Wrap a TableProvider as a TableSource.
/// Wrap a TableProvider as a TableSource.
///
/// See [`source_as_provider`] to retrieve the original
/// TableSource.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skipped this since I deprecated the method instead.

@@ -117,18 +117,18 @@ pub trait LogicalExtensionCodec: Debug + Send + Sync {

fn try_encode(&self, node: &Extension, buf: &mut Vec<u8>) -> Result<()>;

fn try_decode_table_provider(
fn try_decode_table_source(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense to me to have the logical plan serialization be in terms of a TableSource (the logical part) rather than a Table Provider

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.TableSource.html

I realize there are no existing docs on this method, but given we are changing it I think it would be helpful if we added some hints for other readers

Something like

Suggested change
fn try_decode_table_source(
/// Attempts to decode a [`TableSource`]
///
/// Note that the default codec will serialize [`TableProvider`] as a [`DefaultTableProvider`]
/// you can use [`DefaultTableSource::unwrap_provider`] to recover the original `TableProvider`.
fn try_decode_table_source(

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tweaked your suggestion, let me know if you're happy with it (the default codec isn't relevant here - this code only gets called with custom table sources / providers).

@colinmarc
Copy link
Author

colinmarc commented Jul 12, 2025

Somewhat unrelated, but maybe something like the following would be cleaner (untested)?

trait TableProvider: TableSource {
     // only scan and get_table_definition
}

It would be a breaking change, but it's not a difficult migration (you just add an impl TableSource for MyTableProvider and move the other methods there).

@alamb
Copy link
Contributor

alamb commented Jul 14, 2025

Somewhat unrelated, but maybe something like the following would be cleaner (untested)?

trait TableProvider: TableSource {
     // only scan and get_table_definition
}

It would be a breaking change, but it's not a difficult migration (you just add an impl TableSource for MyTableProvider and move the other methods there).

I agree this would be better still -- I suggest we look into this as a follow on PR (and file a ticket to prevent it from being lost)

@github-actions github-actions bot added documentation Improvements or additions to documentation substrait Changes to the substrait crate labels Jul 15, 2025
@colinmarc colinmarc force-pushed the serialize-table-source branch from a8ca798 to 78ceeba Compare July 15, 2025 12:06
@colinmarc
Copy link
Author

Otherwise since this PR changes the API and will require existing users to upgrade, I think the only other thing missing in this PR is a note in the upgrade guide. Basically it can show a before/after of the code (you can copy/paste from this example)

Done, thanks for the quick review!

@colinmarc colinmarc force-pushed the serialize-table-source branch 3 times, most recently from 3734478 to 70d5e9f Compare July 17, 2025 11:04
@colinmarc colinmarc requested a review from alamb July 17, 2025 11:06
@colinmarc colinmarc force-pushed the serialize-table-source branch 3 times, most recently from dc6f885 to e30dfe4 Compare July 18, 2025 11:10
@colinmarc
Copy link
Author

The doc tests were fiddly, sorry about that. I got them to pass locally now.

Currently, only instances of `TableProvider` are considered by
`LogicalExtensionCodec`, and are automatically wrapped in a
`DefaultTableSource` when deserializing. That doesn't work with custom
extensions.

Fixes apache#16749
These replace `provider_as_source` and `source_as_provider`, and are
somewhat more explicit.
@colinmarc colinmarc force-pushed the serialize-table-source branch from e30dfe4 to 6b7d21b Compare July 18, 2025 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate catalog Related to the catalog crate core Core DataFusion crate documentation Improvements or additions to documentation proto Related to proto crate substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Serializing custom TableSource implementations fails
2 participants