Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(logical-types): add NativeType and LogicalType #12853

Merged
merged 17 commits into from
Nov 3, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Support TypeSignatures
  • Loading branch information
notfilippo committed Oct 15, 2024
commit 42179705fb071627fc498cec6ef198c4b7658545
39 changes: 0 additions & 39 deletions datafusion/common/src/types/builtin.rs

This file was deleted.

78 changes: 72 additions & 6 deletions datafusion/common/src/types/logical.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,17 +20,46 @@ use std::{cmp::Ordering, hash::Hash, sync::Arc};

use super::NativeType;

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum TypeSignature<'a> {
/// Represents a built-in native type.
Native(&'a NativeType),
/// Represents an arrow-compatible extension type.
/// (https://arrow.apache.org/docs/format/Columnar.html#extension-types)
///
/// The `name` should contain the same value as 'ARROW:extension:name'.
Extension {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add the comment why the extension is needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because arrow has extension type too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got there is a link to extension Arrow types above, it should be enough

name: &'a str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be owned String, so that TypeSignature can live independently on its own

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeSignature uses lifetimes just to make pattern matching easier when dealing with parameters (or lack thereof).

pub fn important_stuff(logical_type: &dyn LogicalType) {
    match logical_type.signature() { 
        TypeSignature::Native(NativeType::Utf8) => todo!(),
        TypeSignature::Extension {
            name: "JSON", // <-- changing from &'a str to String leads to: expected `String`, but found `&str`,
            parameters: &[]
        } => todo!(),
        _ => unimplemented!()
    }
} 

parameters: &'a [TypeParameter<'a>],
},
}

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum TypeParameter<'a> {
Type(TypeSignature<'a>),
Number(i128),
}
Comment on lines +39 to +43
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone unfamiliar with the entire discussion of this feature, it's not clear to me from reading the above link to the extension-types of arrow how these type parameters map to the metadata described in the link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally everything could be encoded in the name using ARROW:extension:name but we might leave the option handle parameters in a more structured manner to allow better matching when executing.


/// A reference counted [`LogicalType`]
pub type LogicalTypeRef = Arc<dyn LogicalType>;

pub trait LogicalType: fmt::Debug {
pub trait LogicalType: Sync + Send {
fn native(&self) -> &NativeType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I previously propose can_decode_to(DataType) -> bool, so given logical type and DataType, we can know whether they are paired.

How can we do the equivalent check by the current design?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given say arrow Int64 data, i want to know whether these is numbers, timestamp, time, date or something else (eg user-defined enum). The fact that any of these hypothetical logical types could be stored as Int64 doesn't help me know. Asking logical type "could you please decode this arrow type?" doesn't help me know.
Thus, going from arrow type to logical type is not an option. We simply need to know what logical type this should be.

Copy link
Contributor

@jayzhan211 jayzhan211 Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is that we have LogicalType already. In logical level, they are either LogicalNumber, LogicalTimestamp or LogicalDate, and we can differ them in logical level. They can also decode as i64, i32 in physical level. So asking logical type "could you please decode this arrow type?" is to tell the relationship between logical type and physical type. We don't need to know whether the arrow i64 is number or timestamp, because we already know that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I can follow. @jayzhan211 -- can you write a small practical example? I want to make sure I understand the use case. Thanks :)

Copy link
Contributor

@jayzhan211 jayzhan211 Oct 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

impl From<DataType> for NativeType is enough for native type since we can know whether the ArrayRef matches the LogicalType we have. But for LogicalType::UserDefined, I think we need to define what kind of DataType it could be decoded to.

We can figure this out if we meet any practical usage.

Copy link
Contributor Author

@notfilippo notfilippo Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For any user defined logical type you still know the backing native type (via the native() method), so you should be able to use the same logic to know if your DataType can represent that logical type.

fn name(&self) -> Option<&str>;
fn signature(&self) -> TypeSignature<'_>;
}

impl fmt::Debug for dyn LogicalType {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_tuple("LogicalType")
.field(&self.signature())
.field(&self.native())
.finish()
}
}

impl PartialEq for dyn LogicalType {
fn eq(&self, other: &Self) -> bool {
self.native().eq(other.native()) && self.name().eq(&other.name())
self.native().eq(other.native()) && self.signature().eq(&other.signature())
}
}

Expand All @@ -44,15 +73,52 @@ impl PartialOrd for dyn LogicalType {

impl Ord for dyn LogicalType {
fn cmp(&self, other: &Self) -> Ordering {
self.name()
.cmp(&other.name())
self.signature()
.cmp(&other.signature())
.then(self.native().cmp(other.native()))
}
}

impl Hash for dyn LogicalType {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.name().hash(state);
self.signature().hash(state);
self.native().hash(state);
}
}

#[cfg(test)]
mod test {
#![allow(dead_code)]

use super::{LogicalType, TypeParameter, TypeSignature};
use crate::types::NativeType;

struct MagicalType {}

impl LogicalType for MagicalType {
fn native(&self) -> &NativeType {
&NativeType::Utf8
}

fn signature(&self) -> TypeSignature<'_> {
TypeSignature::Extension {
name: "MagicalType",
parameters: &[TypeParameter::Type(TypeSignature::Native(
&NativeType::Boolean,
))],
}
}
}

fn test(logical_type: &dyn LogicalType) {
match logical_type.signature() {
TypeSignature::Extension {
name: "MagicalType",
parameters:
[TypeParameter::Type(TypeSignature::Native(NativeType::Boolean))],
} => {}
TypeSignature::Native(NativeType::Binary) => todo!(),
_ => unimplemented!(),
};
}
}
2 changes: 0 additions & 2 deletions datafusion/common/src/types/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,8 @@
// specific language governing permissions and limitations
// under the License.

mod builtin;
mod logical;
mod native;

pub use builtin::*;
pub use logical::*;
pub use native::*;
12 changes: 12 additions & 0 deletions datafusion/common/src/types/native.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ use std::sync::Arc;

use arrow_schema::{DataType, Field, Fields, IntervalUnit, TimeUnit, UnionFields};

use super::{LogicalType, TypeSignature};

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct NativeField {
name: String,
Expand Down Expand Up @@ -190,6 +192,16 @@ pub enum NativeType {
Map(NativeFieldRef),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in #12853 (comment)

Map should be something like Map(key_type, value_type)

value_type should be any type, not necessarily native.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. It should be LogicalTypeRef here and in the NativeField type then (which I guess will become the LogicalField type).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there is a bit of a problem translating DataType::Map to NativeType::Map since there is no way of retrieving the key and the value because of the "but this is not enforced." regarding the naming of the struct field types.

}

impl LogicalType for NativeType {
fn native(&self) -> &NativeType {
self
}

fn signature(&self) -> TypeSignature<'_> {
TypeSignature::Native(self)
}
}

// The following From<DataType>, From<Field>, ... implementations are temporary
// mapping solutions to provide backwards compatibility while transitioning from
// the purely physical system to a logical / physical system.
Expand Down
Loading