Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[draft] Add LogicalType, try to support user-defined types #11160

Closed
wants to merge 14 commits into from

Conversation

notfilippo
Copy link
Contributor

@notfilippo notfilippo commented Jun 28, 2024

Proposal

The proposal now lives here: #11513, the original draft PR follows:


Which issue does this PR close?

Closes #7923
Follows up on #8143, which is stale.

In the current state the PR is a draft implementation to validate the idea based on the discussion from #7421.

New additions

  • LogicalType enum.
  • ExtensionType trait. Abstraction for extension types.
  • TypeSignature struct. Uniquely identifies a data type.
  • LogicalSchema & LogicalField, equivalent to arrow's Schema and Field
    • note: this was mostly a shortcut to be able to refactor somewhat easily
      without changing much of the logic of of DFSchema. In next iterations DFSchema and LogicalSchema could potentially merge.

Major changes

  • DFSchema uses LogicalSchema & LogicalField.
  • ExprSchemable and ExprSchema now use LogicalType.
  • ast to logical plan conversion now uses LogicalType.

To be implemented

  • Registering extension types through a ContextProvider.

To be determined

Note

Most of these open questions remain similar to the initial PR.

  • Should ScalarValue use LogicalType or arrow DataType?
    • LogicalType
    • DataType
  • Should TableSource return LogicalSchema or arrow Schema?
    • Schema
    • LogicalSchema (for now)

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait labels Jun 28, 2024
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lot's to absorb here, so I'll be looking closer soon. But left some initial observations.

pub mod fields;

#[derive(Clone, Debug)]
pub enum LogicalType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a discussion to be had about what level we want logical type to be here. We previously discussed this, but I can't find that discussion anymore. (@alamb might remember.)

Some notes:

  • IMO the Large list/binary/string variants should be considered encodings of their respective logical types. IMO considering these different types is one of the biggest sources of friction in the Arrow ecosystem and one I would hope would be solved by a
  • There's probably a good argument for signed and unsigned integers, as well as decimals, to be grouped together as more general types. That is, DataType::Int8 and DataType::Int64 could both be an encoded version of LogicalType::Int.
  • Date32 and Date64 might be consolidated. Date64 is a weird type that exists for Pandas compatibility, and doesn't actually provide more precision.
  • Types where the choice provides meaningful differences in precision should be preserved as logical types. For example, I think Float64 and Float16 shouldn't be considered the same logical type. Similarly for timestamp precisions.

Copy link
Contributor Author

@notfilippo notfilippo Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree with all the points discussed and this matches my intuition.

The current state of the PR mostly reflects the state for the original proposal in #8143 as I didn't want to include any additional changes without discussing them first. I'm not too familiar on how these changes should be discussed and if these points need to be also circulated in the mailing list. Let me know what's best :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a discussion to be had about what level we want logical type to be here. We previously discussed this, but I can't find that discussion anymore. (@alamb might remember.)

I also remember but can't find this discussion

As I recall, the conclusion was that types that can store different ranges of values shoudl be different logical types.

  • So that would mean Int32 should be different than UInt32
  • I don't recall consensus about String vs LargeString, but I agree that treating these as the same logical type with different phsical encodings would be the easiest to reason about (even though theoretically LargeString can represent strings larger than 2GB that can not be represented in String)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added support for treating large types as their non-large logical type. I agree that types that can store different ranges of values should be different logical types

pub trait ExtensionType: std::fmt::Debug {
fn display_name(&self) -> &str;
fn type_signature(&self) -> TypeSignature;
fn physical_type(&self) -> DataType;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this mean we need to have a variant of each extension type for each physical storage type?

A good built-in example is string. Right now LogicalType::String always returns DataType::String, but arguably could be DataType::StringView or DataType::LargeString.

Copy link
Contributor Author

@notfilippo notfilippo Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Any LogicalType which has different physical representations should probably be implemented through multiple ExtensionTypes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the code a bit to keep a reference to the physical type.

@alamb
Copy link
Contributor

alamb commented Jun 30, 2024

I plan to check this out shortly -- thanks @notfilippo

use std::sync::Arc;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TypeSignature {
Copy link
Contributor

@jayzhan211 jayzhan211 Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have existing TypeSignature for function

pub enum TypeSignature {

Therefore, I think what we need is rewriting existing signature with LogicalType, instead of introducing another similar concept here. Or we create a similar one and gradually replace the existing signature.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @notfilippo for reviving this PR.

I think this PR does a pretty good job of sketching out the structures required for logical types.

One thing I think would help me understand how complete / not complete this proposal is would be to show how we would use this code to create a user defined type. Would it be possible to sketch out an example, perhaps in https://github.com/apache/datafusion/tree/main/datafusion-examples, showing how we would use LogicalType to create a User Defined Type?

Obviously if we pursue this approach, there will be substantial churn for all downstream crates.

cc @yukkit @lewiszlw @samuelcolvin and @jayzhan211

Also this could be related to StringViewArray #10918 / @XiangpengHao

pub mod fields;

#[derive(Clone, Debug)]
pub enum LogicalType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a discussion to be had about what level we want logical type to be here. We previously discussed this, but I can't find that discussion anymore. (@alamb might remember.)

I also remember but can't find this discussion

As I recall, the conclusion was that types that can store different ranges of values shoudl be different logical types.

  • So that would mean Int32 should be different than UInt32
  • I don't recall consensus about String vs LargeString, but I agree that treating these as the same logical type with different phsical encodings would be the easiest to reason about (even though theoretically LargeString can represent strings larger than 2GB that can not be represented in String)

type Error = DataFusionError;

/// Create a Null instance of ScalarValue for this datatype
fn try_from(data_type: &LogicalType) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tend to agree that ScalarValue should follow LogicalType -- in fact we have had several issues when scalar value types don't quite match (like a logical string encoded as ScalarValue::String was compared to a DictionaryArray and was treated as a different type

Copy link
Contributor Author

@notfilippo notfilippo Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense. I'll add a note to refactor ScalarValues at some point

@notfilippo
Copy link
Contributor Author

notfilippo commented Jul 2, 2024

@alamb I've added an example with some comments and TODOs remarking my open questions.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool @notfilippo

In addition to supporting user defined types, I think adding the notion of a LogicalType would make it easier for DataFusion to take better advantage of different encodings (specifically RunArray, StringViewArray and DictionaryArray)

We already have ok support for DictionaryArray but it is somewhat unnatural and required adding special DataType::Dictionary support all over the codebase and had various bugs like #11145 and #11032 that resulted when we missed support in some non obvious place

As we begin adding support for StringViewArray in #10918 the same pattern is emerging

That is all to say, I think we should pursue this idea

The biggest challenge of making this kind of change I think will be to manage the rollout and migration with downstream crates / make the transition as smooth as possible

So my suggestions for next steps are:

  1. Try to update some downstream dependency to see what the change might require. Perhaps we can try with https://github.com/apache/datafusion-comet as that is open source
  2. Write a more detailed plan (the proposal seems similar to what is in [Proposal] Support User-Defined Types (UDT) #7923) for what we are going to change (with a specific section "What this would mean for downstream crates")

I think I can find time to help with the writeup and communication, and maybe the code. I want to finish up some other inflight work first (specifically finishing porting aggregate functions, cleaning up parquet statistics, and the user defined sql parser)

cc @yukkit @samuelcolvin

let df = ctx.sql("SELECT * FROM example").await?;
let records = df.collect().await?;

println!("{}", pretty_format_batches(&records)?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the primary usecase of the LogicalType so that we could write functions that take the logical type, or define custom behavior?

Like I wonder is the idea that we could now create a ScalarFunctionImpl whose signature refers to LogicalType rather than PhysicalType 🤔 Or somehow plan a binary operation on logical types?

BTW I think the work we are doing with @samuelcolvin @dharanad and @jayzhan211 in #11207 would make it straightforward to implement a custom comparsion (via a function) for this magical type

That might be a good thing to show too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I wonder is the idea that we could now create a ScalarFunctionImpl whose signature refers to LogicalType rather than PhysicalType 🤔 Or somehow plan a binary operation on logical types?

Potentially both. The motivation behind this change is to simplify the interaction between standard types represented by different encodings (like RunArray, the various Views and DictionaryArray but potentially also user defined ones via Arrow Extension Types).

@notfilippo
Copy link
Contributor Author

The biggest challenge of making this kind of change I think will be to manage the rollout and migration with downstream crates / make the transition as smooth as possible.

Completely agree. I will try to experiment with the user facing APIs (e.g. what's returned by the schema() method of structs implementing TableSource / TableProvider). I also think that most of the usage of DataType in physical and logical plans should be replaced with TypeRelation. I'll try to open a draft of what that would look like in comet in the coming days.

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

Thanks @notfilippo -- I will try and get the other projects I have under way to a better state so I can more fully help plan / communicate / coordinate this one.

@notfilippo
Copy link
Contributor Author

So I can more fully help plan / communicate / coordinate this one.

Sounds good! Feel free to follow up here / on slack / on discord 😄

I've also noticed that this potential change could greatly benefit the substrait encoding / decoding of the logical plan. Its current implementation has troubles dealing with dictionaries. I'll look into that as well while waiting for further instructions.

@alamb alamb mentioned this pull request Jul 12, 2024
@github-actions github-actions bot added documentation Improvements or additions to documentation development-process Related to development process of DataFusion labels Jul 17, 2024
@notfilippo
Copy link
Contributor Author

notfilippo commented Jul 17, 2024

@alamb, @wjones127, @jayzhan211 -- I've found some time to finally draft a proposal: [Proposal] Decouple logical from physical types

(this draft PR was updated to DataFusion v40 in order to test datafusion-comet)

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Sep 17, 2024
@notfilippo notfilippo closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate development-process Related to development process of DataFusion documentation Improvements or additions to documentation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Proposal] Support User-Defined Types (UDT)
5 participants