Replies: 3 comments 8 replies
-
|
@alamb I would like to ask a few questions.
For example, most of ideas from DuckDB, ClickHouse, Vertica or Azure can make an array a more complete data type. |
Beta Was this translation helpful? Give feedback.
-
|
Hi @izveigor -- I read through the description again -- while I would not say I followed everything what I did made sense to me. Thank you for writing it. I think the biggest thing that is unclear to me is what are you proposing. Is it the current (postgres based) set of functions / operators for working with Or perhaps you mean this document to serve as the intellectual foundation for a coherent design for List support in DataFusion and not really a change at all. BTW I have a blog post in the works about the various types of structured arrays supported by Arrow (ListArray, StructArray, MapArray) that can hopefully help provide some more background for this discussion too. |
Beta Was this translation helpful? Give feedback.
-
|
Thanks for this @izveigor. I found your syntax summary and the comparison matrix very useful. I think we should have an array subpage in Datafusion docs that presents our syntax along with your comparison matrix -- I'm sure others will find it useful too. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Foreword
The article is about arrays in DataFusion. I decided to write it due to this topic become more popular among users and developers.
Active development is currently underway, dedicated to the formatio formation of an array as a full-fledged tool for working with information. I can not call the development itself successful, I think the main reason is the absence of an unified development concept. So, the main goal of this article is to draw up a general concept and give meaning to the technology for the user.
Backstory
Let's start from the beginning and how exactly the idea of implementing arrays was born. Prior to my introduction to DataFusion there were one array function (
make_array) and one internal function (unnest_columns). After that, there was a lull in this topic until a user with a nickname @bubbajoe brought this topic up again. During the discussions, we came to the conclusion that it is necessary to implement some array functions, and the relational database PostgreSQL was chosen as a reference. Most of the features have been implemented in PR: #6384. After the merge, we begin to realize that nested data types are not simple, since many DataFusion mechanisms are not adapted to work with them.The naive implementation of array functions brought a lot of problems and ambiguities. It is related to the fact that the most developers (including me) do not understand how arrays work, which is why they cannot solve the problem correctly. I see the solution to the problem in the use of some concepts of development ethics, because the development of such a voluminous topic should be introduced reasonably.
About development ethics
Let's start with the first mistake that has been made in naive implementation - not having a plan. The development was introduced incrementally, any feature was made taking into the account by individual need. This led to an increase in the cost of development, the lack of communication between individual features. Therefore, in my opinion, further development should go according to the plan. It will also helps to avoid the parallelism work that constantly arises during the wrong order of implementation of specific features.
New development plan
This part will be devoted to developing a plan for solving array problems.
This part is not complete, so more information will be added here over time.
The main postulates DataFusion
In order to form the principles by which arrays will work, it is necessary to understand the areas of application of this technology. Let's start with the purpose of the existence of DataFusion technology. @alamb repeatedly cited similar reasoning in his discussions. Key ideas include:
About array
Basic postulates:
ListArray(Apache Arrow);FixedSizeListArray,LargeListandListArraytransform toListArray;Listobjects are nested;Thus we can conclude that the main characteristics of the array are:
About array functions:
Basic postulates:
Why are arrays needed?
Now let's see what is the main point of using an array from the user's point of view. The main points of use are:
So, an array can be represented as:
About the features of the user query language
DataFusion adheres to the principles of SQL databases. Every command in SQL is atomic. The set of SQL command results forms a query language.
A language can be composed of many other languages.
Now let's figure out which languages will arrange work with an array:
Language DML
Let's start with the data manipulation language. For effective. Its principles:
Set the characteristics of the array:
In quantity:
By quality:
Based on the characteristics of the array, we obtain the corresponding functions.
Corresponding index table:
array_elementarray_insertSET array[i] = valuearray_deletearray_slicearray_insertSET array[i:j] = valuearray_deleteCorresponding table by values:
array_position(array, [i])array_resize(x, 1)array_replace(array, from, to)array_remove(array, element)array_positions(array)[:i]array_resize(x, n)array_replace(array, from, to, max)array_remove(array, element, max)array_positions(array)array_resize(x, array_length(array))array_replaces(array, from, to)array_removes(array, element)Corresponding table by borders:
array_firstarray_appendSET array[0] = valuearray_pop_frontarray_lastarray_prependSET array[array_length(array) - 1] = valuearray_pop_backarray_slice(0, n)array_insert(0, n)SET array[0:n] = valuearray_delete(0, n)array_slice(n, array_length(array))array_insert(n, array_length(array))SET array[n:array_length(array) - 1] = valuearray_delete(n, array_length(array))Corresponding table by predicates:
array_filterarray_transformarray_filterSet Language
An array can be viewed not only as a vector, but also as a set.
Basic set operations
array_concatorarray_union)array_intersector&&)array_except)All subsequent operations can be derived from these concepts.
Cartesian product:
array_zipExample (ClickHouse):
Checking if a subset belongs to a set
array_has_allChecking if any element of another set belongs to a set
array_has_anyChecking if an element belongs to a set
array_hasorarray_containsAbout aliases
My postulate regarding aliases is simple: an alias is admissible under conditions:
So, I propose to introduce the alias
list_for each function as a mutable object andarray_as an immutable object.Other aliases of functions should be assigned based on other technologies (for instance, the PostgreSQL function
array_catis an alias forarray_concat).Sort
An array can be sorted in two ways: left to right (
array_sort) and right to left (array_reverse_sort).About aggregate functions
Based on the postulate - convinience from the user's point of view, all aggregate functions must be adapted to the array.
Two approaches can be distinguished:
unnest)list_sum)list_aggregate)I think all these ways should be used.
Working with Array Characteristics
Characteristics can be divided into categories:
Quality:
Quantity:
For qualitative categories, the following behavior corresponds:
For quantitative categories, the following behavior corresponds:
Data type
array_length);CAST);Length
array_length);array_element(array, 0));cardinality);array_dims);Emptiness
empty);Dimension
array_ndims);flatten);Homogeneity
NOT (array_filter(array, x -> x IS NULL)));array_filter(array, x -> x IS NOT NULL));Uniqueness
array_distinct(array) == array);array_distinct);Documentation
After the implementation of each function, the corresponding documentation must be written.
Documentation should have the following features:
Other applications
For other areas (such as optimization, hashing, and so on), should be consistent with the general concept, but it is considered redundant to write about it here.
Criticism of various array implementations among other databases
PostgreSQL
Pros:
array_ndims,array_dims...);Cons:
It is diffucult to represent an array as a set in PostgreSQL, so its language is limited in the following operations:
Other restrictions include:
DuckDB
Pros:
Cons:
ClickHouse
Pros:
Cons:
Comparison
Here we will consider three binary relations: an equivalence relation, a parial order relation, and a strict order relation.
With order relationships (partial and strict), the results are based on the first other pair of elements, not on the sizes of the arrays.
With equivalence relations, results are based on exact element-by-element comparison (
=,!=).Operators
@>=array_contains(first_array, second_array);<@=array_contains(second_array, first_array);&&=array_intersect(first_array, second_array);||=array_concat(first_array, second_array)array[i]=array_element(array, i);array[i:j]=array_slice(array, i, j);array[i] = value=array_update(array, value, i);array[i:j] = values=array_update_slice(array, values, i, j);<,<=,=,!=,>,>=(See: Comparison);[a1,...,an]=make_array(a1,...,an);Reference: Examples of Difficult Scenarios
Example with stones
Let's say in some country there is a vote for the approval of some reform. Every citizen can vote for and against. The color of the stones will be used as the subject of the vote, namely black (against) and white (for). From the point of view of the array, this reconstruction will look like this, we have an array consisting of 0 (black color) and 1 (white color). However, a situation may arise the elections. For example, 2 white balls and 3 black balls were dishonestly counted in the vote. Having noticed the falsification, the state decides to remove or replace these stones with NULL.
Now let's try to solve these problems using different databases:
Initial list: [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
My decision
Other databases
There are no solutions using Turing incomplete declarative language (classic SQL) for databases like DuckDB, PostgreSQL.
List of array functions
List of array functions and their aliases
make_arraymake_listarray_elementlist_element,array[i]array_insertarray_insert,list_insertarray_deletelist_deletearray_slicearray[i:j],list_slicearray_positionlist_positionarray_repeatlist_repeatarray_replacelist_replacearray_removelist_removearray_positionslist_positionsarray_replaceslist_replacesarray_removeslist_removesarray_firstlist_firstarray_appendlist_appendarray_pop_frontlist_pop_frontarray_lastlist_lastarray_prependlist_prependarray_pop_backlist_pop_backarray_filterlist_filterarray_transformlist_transformarray_concatlist_concat,array_cat,list_cat,array_union,list_unionarray_intersect&&,list_intersectarray_exceptlist_exceptarray_ziplist_ziparray_hasarray_contains,list_contains,list_hasarray_has_anylist_has_anyarray_has_alllist_has_allarray_sortlist_sortarray_reverse_sortlist_reverse_sortunnestarray_aggregatelist_aggr,list_aggregate,array_aggrarray_lengthlist_lengthcardinalityarray_dimslist_dimsemptyarray_ndimslist_ndimsflattenarray_distinctlist_distinctList of array functions and their counterparts from other databases
make_arrayarray[]arraylist_valuearrayarray_elementarray[i]array[i]list_element(array, i)arrayElementarray_insertarray_insertarray_slicelist_slicearray_positionarray_positionarray_positionarray_positionindexOfarray_repeatarray_fillarray_repeatarrayResizearray_replacearray_removearray_positionsarray_positionsarray_replacesarray_replacearray_replacearray_transformarray_transformarray_removesarray_removearray_removearray_filterarray_filterarray_firstarray[0]array[0]array[0]arrayFirstarray_appendarray_appendarray_appendarray_appendarrayPushBackarray_pop_frontarray_pop_frontarrayPopFrontarray_lastarray[array_length(array) - 1]array[array_length(array) - 1]array[array_length(array) - 1]array[length(array) - 1]array_prependarray_prependarray_prependarrayPushFrontarray_pop_backarray_pop_backarrayPopBackarray_filterarray_filterarrayFilterarray_transformarray_transformarrayTransformarray_concatarray_concatarray_unionarray_concatarrayConcatarray_intersectarray_intersectarray_intersectarrayIntersectarray_exceptarray_exceptarray_exceptarray_ziparrays_ziparray_ziparray_hasarray_containsarray_hashasarray_has_any&&arrays_overlaparray_has_anyhasAnyarray_has_all@>or<@array_has_allhasAllarray_sortarray_sortarray_sortarraySortarray_reverse_sortarray_sortarray_reverse_sortarrayReverseSortunnestunnestunnestarrayJoinarray_aggregatelist_aggregatearray_lengtharray_lengtharray_lengtharray_lengthlengthcardinalitycardinalityarray_dimsarray_dimsemptyEmptyarray_ndimsarray_ndimsflattenflattenarrayFlattenarray_distinctarray_distinctarray_distinctarrayDistinctSources:
Articles:
https://www.usenix.org/system/files/login/articles/login_winter18_08_khurana.pdf
Videos:
https://www.youtube.com/watch?v=kHpy64smpYQ - From flat files to deconstructed database The evolution and future of the big data ecosyst
Documentations:
Pull Requests:
#6384
Issues:
#6119
Discussions:
#6441
Beta Was this translation helpful? Give feedback.
All reactions