Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Using DataFusion to run a window function partitioned by a column of a nested data type fails during execution with a nested comparison error:
InvalidArgumentError("Nested comparison: Struct([Field { name: \"f1\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) IS DISTINCT FROM Struct([Field { name: \"f1\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) (hint: use make_comparator instead)")
This is a feature request to add nested partitioning support to the `partition` kernel:
arrow-rs/arrow-ord/src/partition.rs, line 126 in d4b9482
Describe the solution you'd like
`partition` delegates to `distinct`, which does not support nested comparisons:
Lines 179 to 181 in d4b9482
My proposal is to add a check for nested type columns and use `make_comparator` to check for value distinctness instead.
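To illustrate the idea, here is a minimal sketch that uses only the standard library rather than the real arrow-rs code. As I understand it, `make_comparator` returns a comparator over pairs of row indices, so the sketch models that with a plain closure. `Row`, `row_comparator`, and `partition_ranges` are hypothetical names made up for this example and are not part of arrow-rs. Two adjacent rows start a new partition whenever the comparator returns anything other than `Equal`, and two nulls compare as equal, which matches the `IS DISTINCT FROM` semantics in the error above.

```rust
use std::cmp::Ordering;
use std::ops::Range;

/// Hypothetical stand-in for one row of a struct column with a single
/// nullable Int64 field `f1` (the type from the error message above).
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub f1: Option<i64>,
}

/// Builds a comparator over row indices. This only mimics the shape of an
/// index-based comparator; it is an assumption for illustration, not the
/// arrow-rs implementation. `None` compares equal to `None`.
pub fn row_comparator(rows: &[Row]) -> impl Fn(usize, usize) -> Ordering + '_ {
    move |i: usize, j: usize| rows[i].f1.cmp(&rows[j].f1)
}

/// Splits `0..len` into ranges of consecutive rows that compare as equal.
/// A boundary is placed between rows `i - 1` and `i` whenever they differ.
pub fn partition_ranges(
    len: usize,
    cmp: impl Fn(usize, usize) -> Ordering,
) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    if len == 0 {
        return ranges;
    }
    let mut start = 0;
    for i in 1..len {
        if cmp(i - 1, i) != Ordering::Equal {
            ranges.push(start..i);
            start = i;
        }
    }
    // Close the final partition.
    ranges.push(start..len);
    ranges
}
```

In the real kernel the closure would come from `make_comparator` applied to the nested column, and the non-nested columns could keep using the existing `distinct` path.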
Describe alternatives you've considered
- Expanding nested array fields to primitive arrays. This seems costly.
- Allowing nested comparisons in `compare_op` for certain op types where null ordering semantics don't matter (which I think is the case here). This is another option, but the proposed approach seems like a more general solution that can be swapped out if performance becomes an issue.