Int64 as default type for make_array function empty or null case#10790
Int64 as default type for make_array function empty or null case#10790jayzhan211 merged 2 commits intoapache:mainfrom
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
| (arrow_cast(make_array([[1,2]], [[4, 4]]), 'FixedSizeList(2, List(List(Int64)))'), arrow_cast(make_array([1,2,3], [1]), 'FixedSizeList(2, List(Int64))')), | ||
| (arrow_cast(make_array([[1], [2]], []), 'FixedSizeList(2, List(List(Int64)))'), arrow_cast(make_array([2], [3]), 'FixedSizeList(2, List(Int64))')), | ||
| (arrow_cast(make_array([[1], [2]], []), 'FixedSizeList(2, List(List(Int64)))'), arrow_cast(make_array([1], [2]), 'FixedSizeList(2, List(Int64))')), | ||
| (arrow_cast(make_array([[1], [2]], [[]]), 'FixedSizeList(2, List(List(Int64)))'), arrow_cast(make_array([2], [3]), 'FixedSizeList(2, List(Int64))')), |
There was a problem hiding this comment.
Now they should have same dimension unlike null type
| select array_concat(make_array(make_array(1, 2), make_array(3, 4)), make_array(make_array())); | ||
| ---- | ||
| [[1, 2], [3, 4]] | ||
| [[1, 2], [3, 4], []] |
There was a problem hiding this comment.
Consistent with duckdb
| query ? | ||
| select array_sort([]); | ||
| ---- | ||
| [] |
There was a problem hiding this comment.
free fix because of changing default type to i64
| query ? | ||
| select array_concat([]); | ||
| ---- | ||
| [] |
There was a problem hiding this comment.
free fix because of changing default type to i64
alamb
left a comment
There was a problem hiding this comment.
Looks like a reasonable change to me. Thanks @jayzhan211
BTW I tested in duckdb and it seems like the default type is actually int32 but I think int64 is close enough
D select [];
┌───────────────────┐
│ main.list_value() │
│ int32[] │
├───────────────────┤
│ [] │
└───────────────────┘
Yes, they use i32. The reason I choose i64 is that the default value in datafusion is mostly i64 so we can avoid the cast for most of the case. We can easily convert it to i32 anytime if there is any need |
|
Thanks @alamb |
…che#10790) * set default type i64 Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
|
it seems counter-intuitive to me to infer Int64 where it was not provided by a user nor schema of tables a query operates on. It might be transparent to the user, in which case it's probably fine, but it's also likely that this will transpire to some error message, when an array made form make_array is used in some later processing. Also, it's likely to affect coercion rules in the future. |
I think any type is coercible for Null. Int64(Null) has no problem to be AnyType(Null) because it is |
| // Handle empty array at rhs case | ||
| // array_union(arr, []) -> arr; | ||
| // array_intersect(arr, []) -> []; | ||
| if r.value_length(0).is_zero() { | ||
| if set_op == SetOp::Union { | ||
| return Ok(Arc::new(l.clone()) as ArrayRef); | ||
| } else { | ||
| return Ok(Arc::new(r.clone()) as ArrayRef); | ||
| } | ||
| } |
There was a problem hiding this comment.
This will panic when r is empty and return wrong results when the first value in the r array is an empty list but the rest are not empty.
There was a problem hiding this comment.
Let's find such query first, since the case you mentioned might be handled already before entering to this line
There was a problem hiding this comment.
diff --git a/datafusion/sqllogictest/test_files/array.slt b/datafusion/sqllogictest/test_files/array.slt
index 3b7f12960..17598be8c 100644
--- a/datafusion/sqllogictest/test_files/array.slt
+++ b/datafusion/sqllogictest/test_files/array.slt
@@ -4371,7 +4371,8 @@ select array_union(arrow_cast([1, 2, 3, 4], 'LargeList(Int64)'), arrow_cast([5,
statement ok
CREATE TABLE arrays_with_repeating_elements_for_union
AS VALUES
- ([1], [2]),
+ ([0, 1, 1], []),
+ ([1, 1], [2]),
([2, 3], [3]),
([3], [3, 4])
;
@@ -4379,6 +4380,7 @@ AS VALUES
query ?
select array_union(column1, column2) from arrays_with_repeating_elements_for_union;
----
+[0, 1]
[1, 2]
[2, 3]
[3, 4]
@@ -4386,6 +4388,7 @@ select array_union(column1, column2) from arrays_with_repeating_elements_for_uni
query ?
select array_union(arrow_cast(column1, 'LargeList(Int64)'), arrow_cast(column2, 'LargeList(Int64)')) from arrays_with_repeating_elements_for_union;
----
+[0, 1]
[1, 2]
[2, 3]
[3, 4]
There was a problem hiding this comment.
[SQL] select array_union(column1, column2) from arrays_with_repeating_elements_for_union;
[Diff] (-expected|+actual)
- [0, 1]
- [1, 2]
+ [0, 1, 1]
+ [1, 1]
[2, 3]
- [3, 4]
+ [3]
at test_files/array.slt:4380
| // Empty array is a special case that is useful for many other array functions | ||
| pub(super) fn empty_array_type() -> DataType { | ||
| DataType::List(Arc::new(Field::new("item", DataType::Int64, true))) | ||
| } |
There was a problem hiding this comment.
I don't get the rationale behind that. From type-theoretical point of view, an empty array has type Array<Null> because you can coerce Null to any other type.
There was a problem hiding this comment.
You can think of this a type that is coerced from null to i64. You can't run the kernel function with null type, you need to coerce it to other type, and i64 is the default type we define
There was a problem hiding this comment.
You can't run the kernel function with null type
That's the real issue. Coercing to i64 is just a band-aid that creates other problems.
There was a problem hiding this comment.
continue on #15149 (comment), let's discussion about the issue it has
Which issue does this PR close?
Closes #10789 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?