feat(logical-types): add NativeType and LogicalType #12853

Merged
merged 17 commits into from
Nov 3, 2024
Add documentation
notfilippo committed Oct 17, 2024
commit 5b5f4c13beb61f37f2d21ea0ec747275dc4a5123
30 changes: 29 additions & 1 deletion datafusion/common/src/types/logical.rs
@@ -40,9 +40,37 @@ pub enum TypeParameter<'a> {
    Number(i128),
}

-/// A reference counted [`LogicalType`]
+/// A reference counted [`LogicalType`].
pub type LogicalTypeRef = Arc<dyn LogicalType>;

/// Representation of a logical type with its signature and its native backing
/// type.
///
/// The logical type is meant to be used during the DataFusion logical planning
/// phase in order to reason about logical types without worrying about their
/// underlying physical implementation.
///
/// ### Extension types
///
/// [`LogicalType`] is a trait in order to allow the possibility of declaring
/// extension types:
///
/// ```
/// struct JSON {}
///
/// impl LogicalType for JSON {
///     fn native(&self) -> &NativeType {
///         &NativeType::Utf8
///     }
///
///     fn signature(&self) -> TypeSignature<'_> {
///         TypeSignature::Extension {
///             name: "JSON",
///             parameters: &[],
///         }
///     }
/// }
/// ```
pub trait LogicalType: Sync + Send {
    fn native(&self) -> &NativeType;
Contributor

I previously proposed `can_decode_to(DataType) -> bool` so that, given a logical type and a `DataType`, we can tell whether they are paired.

How can we do the equivalent check with the current design?
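
A minimal sketch of what such a method might look like, to make the question concrete. Everything here is an assumption for illustration: `DataType` and `NativeType` are cut-down stand-ins for the real enums, the trait omits `signature()` and the `Sync + Send` bounds, and `can_decode_to` is the hypothetical method from the earlier proposal, not something this PR adds.

```rust
// Cut-down stand-in for arrow_schema::DataType (assumption: only four variants).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum DataType {
    Int64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

// Cut-down stand-in for the PR's NativeType.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum NativeType {
    Int64,
    Utf8,
}

// Simplified trait: the real one also has `signature()` and is `Sync + Send`.
trait LogicalType {
    fn native(&self) -> &NativeType;

    // Hypothetical method from the earlier proposal; not part of this PR.
    fn can_decode_to(&self, data_type: &DataType) -> bool;
}

// Illustrative extension type backed by strings.
struct Json;

impl LogicalType for Json {
    fn native(&self) -> &NativeType {
        &NativeType::Utf8
    }

    // The type itself lists every physical encoding it accepts.
    fn can_decode_to(&self, data_type: &DataType) -> bool {
        matches!(
            data_type,
            DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View
        )
    }
}
```

With this shape, each logical type answers the pairing question itself: `Json.can_decode_to(&DataType::LargeUtf8)` holds, while `Json.can_decode_to(&DataType::Int64)` does not.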

Member

Given, say, Arrow Int64 data, I want to know whether these are numbers, timestamps, times, dates or something else (e.g. a user-defined enum). The fact that any of these hypothetical logical types could be stored as Int64 doesn't help me know that. Asking a logical type "could you please decode this Arrow type?" doesn't help either.
Thus, going from Arrow type to logical type is not an option. We simply need to know what logical type this should be.

Contributor

@jayzhan211 Oct 14, 2024

I think the idea is that we already have the LogicalType. At the logical level, values are either LogicalNumber, LogicalTimestamp or LogicalDate, and we can distinguish them at that level. At the physical level they can be decoded as i64 or i32. So asking a logical type "could you please decode this Arrow type?" establishes the relationship between the logical type and the physical type. We don't need to work out whether the Arrow i64 is a number or a timestamp, because we already know that.

Contributor Author

I'm not sure I can follow. @jayzhan211 -- can you write a small practical example? I want to make sure I understand the use case. Thanks :)

Contributor

@jayzhan211 Oct 19, 2024

`impl From<DataType> for NativeType` is enough for native types, since it lets us check whether an ArrayRef matches the LogicalType we have. But for LogicalType::UserDefined, I think we need to define which DataTypes it can be decoded to.

We can figure this out once we meet a practical use case.

Contributor Author

@notfilippo Oct 21, 2024

For any user-defined logical type you still know the backing native type (via the native() method), so you should be able to use the same logic to check whether your DataType can represent that logical type.
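
As a rough illustration of that logic, here is a sketch under stated assumptions: `DataType` and `NativeType` are cut-down stand-ins, the `From<DataType> for NativeType` mapping is the conversion discussed above written out by hand for four variants, the trait omits `signature()` and the `Sync + Send` bounds, and `can_represent` is a hypothetical helper name, not an API from this PR.

```rust
// Cut-down stand-in for arrow_schema::DataType (assumption: only four variants).
#[derive(Debug, Clone, PartialEq, Eq)]
enum DataType {
    Int64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

// Cut-down stand-in for the PR's NativeType.
#[derive(Debug, Clone, PartialEq, Eq)]
enum NativeType {
    Int64,
    Utf8,
}

// Several physical encodings collapse onto one native type.
impl From<DataType> for NativeType {
    fn from(data_type: DataType) -> Self {
        match data_type {
            DataType::Int64 => NativeType::Int64,
            DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => NativeType::Utf8,
        }
    }
}

// Simplified trait: the real one also has `signature()` and is `Sync + Send`.
trait LogicalType {
    fn native(&self) -> &NativeType;
}

// Illustrative user-defined logical type backed by strings.
struct Json;

impl LogicalType for Json {
    fn native(&self) -> &NativeType {
        &NativeType::Utf8
    }
}

// Hypothetical helper: a DataType can represent a logical type when it maps
// to the same native type that the logical type reports via `native()`.
fn can_represent(logical: &dyn LogicalType, data_type: &DataType) -> bool {
    NativeType::from(data_type.clone()) == *logical.native()
}
```

Under these assumptions `can_represent(&Json, &DataType::Utf8View)` holds and `can_represent(&Json, &DataType::Int64)` does not, without the extension type having to enumerate physical encodings itself.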

    fn signature(&self) -> TypeSignature<'_>;
45 changes: 41 additions & 4 deletions datafusion/common/src/types/native.rs
@@ -15,32 +15,69 @@
// specific language governing permissions and limitations
// under the License.

-use std::sync::Arc;
+use std::{ops::Deref, sync::Arc};

use arrow_schema::{DataType, Field, Fields, IntervalUnit, TimeUnit, UnionFields};

use super::{LogicalType, TypeSignature};

/// A record of a native type, its name and its nullability.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct NativeField {
    name: String,
    native_type: NativeType,
    nullable: bool,
}

impl NativeField {
    pub fn name(&self) -> &str {
        &self.name
    }

    pub fn native_type(&self) -> &NativeType {
        &self.native_type
    }

    pub fn nullable(&self) -> bool {
        self.nullable
    }
}

/// A reference counted [`NativeField`].
pub type NativeFieldRef = Arc<NativeField>;

/// A cheaply cloneable, owned collection of [`NativeFieldRef`].
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct NativeFields(Arc<[NativeFieldRef]>);

impl Deref for NativeFields {
    type Target = [NativeFieldRef];

    fn deref(&self) -> &Self::Target {
        self.0.as_ref()
    }
}

/// A cheaply cloneable, owned collection of [`NativeFieldRef`] and their
/// corresponding type ids.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct NativeUnionFields(Arc<[(i8, NativeFieldRef)]>);

impl Deref for NativeUnionFields {
    type Target = [(i8, NativeFieldRef)];

    fn deref(&self) -> &Self::Target {
        self.0.as_ref()
    }
}

/// Representation of a type that DataFusion can handle natively. It is a subset
/// of the physical variants in Arrow's native [`DataType`].
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum NativeType {
    /// Null type
    Null,
-    /// A boolean datatype representing the values `true` and `false`.
+    /// A boolean type representing the values `true` and `false`.
    Boolean,
    /// A signed 8-bit integer.
    Int8,
@@ -162,9 +199,9 @@ pub enum NativeType {
    List(NativeFieldRef),
    /// A list of some logical data type with fixed length.
    FixedSizeList(NativeFieldRef, i32),
-    /// A nested datatype that contains a number of sub-fields.
+    /// A nested type that contains a number of sub-fields.
    Struct(NativeFields),
-    /// A nested datatype that can represent slots of differing types.
+    /// A nested type that can represent slots of differing types.
    Union(NativeUnionFields),
    /// Decimal value with precision and scale
    ///