
arrow-ord: support partitioning on nested types #7130

Open
@asubiotto

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Using DataFusion to run a window function partitioned by a column of a nested data type fails during execution with a nested comparison error:

InvalidArgumentError("Nested comparison: Struct([Field { name: \"f1\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) IS DISTINCT FROM Struct([Field { name: \"f1\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) (hint: use make_comparator instead)")

This is a feature request to add nested partitioning support to the partition kernel:

pub fn partition(columns: &[ArrayRef]) -> Result<Partitions, ArrowError> {

Describe the solution you'd like
partition delegates to distinct, which does not support nested comparisons:

/// Nested types, such as lists, are not supported as the null semantics are not well-defined.
/// For comparisons involving nested types see [`crate::ord::make_comparator`]
pub fn distinct(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<BooleanArray, ArrowError> {

My proposal is to add a check for nested type columns and use make_comparator to check for value distinctness instead.
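To illustrate the idea, here is a minimal std-only sketch, not the actual arrow-ord code. `RowComparator` is a stand-in for the boxed closure that make_comparator returns, and `StructRow`, `struct_comparator` and `partition_ranges` are hypothetical names made up for this example. It assumes two null rows should land in the same partition, i.e. IS DISTINCT FROM semantics.

```rust
use std::cmp::Ordering;
use std::ops::Range;

/// Stand-in (assumption) for the boxed comparator closure produced by
/// `make_comparator`: compares row `i` with row `j` of one column.
type RowComparator = Box<dyn Fn(usize, usize) -> Ordering>;

/// Hypothetical nested row: a nullable struct with a single nullable
/// Int64 field `f1`. The outer `None` is a null struct, the inner `None`
/// is a null field inside a valid struct.
type StructRow = Option<Option<i64>>;

/// Builds a comparator over one struct column. `Option`'s `Ord` impl makes
/// two nulls compare as equal, so they are treated as not distinct.
fn struct_comparator(col: Vec<StructRow>) -> RowComparator {
    Box::new(move |i, j| col[i].cmp(&col[j]))
}

/// Computes partition ranges over `num_rows` rows: a new partition starts
/// whenever any column's comparator reports that adjacent rows differ.
fn partition_ranges(num_rows: usize, comparators: &[RowComparator]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    if num_rows == 0 {
        return ranges;
    }
    let mut start = 0;
    for row in 1..num_rows {
        let distinct = comparators
            .iter()
            .any(|c| c(row - 1, row) != Ordering::Equal);
        if distinct {
            ranges.push(start..row);
            start = row;
        }
    }
    ranges.push(start..num_rows);
    ranges
}
```

In the real kernel, only the nested columns would need this path; non-nested columns could keep going through distinct, and the per-column boundaries could be combined as they are today.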

Describe alternatives you've considered

  • Expanding nested array fields to primitive arrays. This seems costly.
  • Allowing nested comparisons in compare_op for certain op types where null ordering semantics don't matter (which is the case here I think). This is another option, but it seems like the proposed approach is a more general solution which can be swapped out if performance becomes an issue.
