ARROW-13530: [C++] Implement cumulative sum compute function #12460
Conversation
I thought we can't implement this until we decide how to handle ordering within the query engine? CC @westonpace, I may not have the context.

Oh, or it can just be a vector function, which is fine, though not usable within the query engine.

I think we can write the scalar function for users calling into compute directly. I agree it would be somewhat meaningless today if applied in, for example, a project expression. Can we have it take a starting value as one of the function options? Then we can use this whenever we get around to implementing the feature in the execution engine.
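The proposed starting-value option can be sketched in plain Python. This is only an illustration of the intended semantics; the `cumulative_sum` name, the `start` parameter, and the null-propagation behavior shown here are assumptions for the sketch, not the final Arrow API:

```python
def cumulative_sum(values, start=0):
    """Return the running sum of `values`, seeded with `start`.

    A None (null) input propagates to all subsequent outputs; this
    mirrors one plausible behavior for a null-sensitive kernel and is
    an assumption of this sketch.
    """
    out = []
    total = start
    for v in values:
        if total is None or v is None:
            total = None
        else:
            total += v
        out.append(total)
    return out

cumulative_sum([1, 2, 3], start=10)  # [11, 13, 16]
```

With a `start` option, resuming a sum mid-stream is just a matter of seeding the next call with the last emitted value, which is what makes the option useful once the execution engine supports this function.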
@JabariBooker The following are the main actions to complete when adding a compute function to C++, including Python bindings:
@westonpace Why a An

This would be a vector function but we already have this with

Isn't this a scalar function? I may be misunderstanding the term. My current interpretation is "each row has a single output value".

@westonpace Oops! I forgot that

Given that definition I agree this is a vector function. I was thinking only of the shape of the return value and not the statefulness of the function.

...well, I just noticed that my definition is not exactly the definition stated in the source code, so it would be great to have other opinions on this.

This sounds like a function "whose behavior depends on the values of the entire arrays passed", no?

@lidavidm I agree.
From a consumption standpoint the concept of a "scalar expression" is pretty important (see, for example, Oracle's definition: https://docs.oracle.com/cd/B19306_01/server.102/b14200/expressions010.htm). Only a scalar expression can be used in a project expression. These are contrasted with "table expressions", which can return multiple rows for each execution. If Arrow wants concepts of "scalar function" and "vector function" that differ from this, I think that is "ok but confusing". For example, to validate a plan, we can inspect the return value of the function and ensure that it is scalar, to decide whether an expression is a "scalar expression", regardless of whether it is a "scalar function".

We already do that verification:

So this particular function can be a vector function because it would be somewhat meaningless to use in a project expression anyway. Is there any example where a vector function might want to be used in a project expression? Maybe it's a moot point and we can decide "scalar expression" means "single result AND stateless".

I almost wonder if cumulative sum and other such functions should be their own (sub)class, though, since they can be processed incrementally with some state (and that state can be represented easily as an Arrow scalar), at least for the ones Pandas supports (min/max/sum/product).

And then we could have an exec node that knows how to handle them appropriately.
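The incremental, state-carrying evaluation described above can be sketched as a small class. The class name and structure are hypothetical; the point is only that the carried state between batches is a single scalar, so an exec node could feed such a kernel batch by batch (the sum/product/min/max set matches the Pandas operations mentioned earlier):

```python
import operator

class CumulativeKernel:
    """Hypothetical sketch: a cumulative kernel whose inter-batch
    state is one scalar, so it can be driven incrementally."""

    _OPS = {
        "sum": (operator.add, 0),
        "product": (operator.mul, 1),
        "min": (min, float("inf")),
        "max": (max, float("-inf")),
    }

    def __init__(self, op):
        self._fn, self._state = self._OPS[op]

    def consume(self, batch):
        """Process one batch, updating the carried scalar state."""
        out = []
        for v in batch:
            self._state = self._fn(self._state, v)
            out.append(self._state)
        return out

k = CumulativeKernel("sum")
# State carries across batches, so two calls equal one big call.
first = k.consume([1, 2])   # [1, 3]
second = k.consume([3, 4])  # [6, 10]
```

Because the state is one scalar per column, checkpointing or restarting such a node is cheap, which is part of the appeal of giving these functions their own (sub)class.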
Ok. I agree with you both. "Scalar" implies stateless. This function is not stateless and thus not scalar. Whether we want to call it vector or something else we can figure out later. I suspect whatever pattern we end up needing will be useful when implementing window functions.

Hmm, I can imagine ones like fill_null, drop_null might be desirable. drop_null is an example of something which isn't stateful, but which is not scalar. And fill_null_forward sort of fits into a 'cumulative' function (not backward, unless you want to support directionality). I guess these don't really fit as projections, but rather as postprocessing steps (it doesn't really make sense to say
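fill_null_forward does fit the same stateful-scan shape described above: the carried state is just the last non-null value seen. A plain-Python sketch (not the Arrow implementation; the `last` parameter is an assumption added to show resumability across batches):

```python
def fill_null_forward(values, last=None):
    """Replace each null (None) with the most recent non-null value.

    `last` is the carried state; passing the previous batch's final
    state resumes the fill across batch boundaries, just like a
    cumulative kernel.
    """
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

fill_null_forward([1, None, None, 4, None])  # [1, 1, 1, 4, 4]
```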
Yeah, for our purposes let's call this vector, and we can see how things go as we add more support in the query engine.
Thank you both for this insightful discussion. |
Thanks for the update @JabariBooker and sorry for the delay. Here are a couple more concerns.
Sorry for the delay @JabariBooker. This looks great to me, and I'm going to merge now. Thank you for contributing this!

Benchmark runs are scheduled for baseline = 931422b and contender = b851392. b851392 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have a high level of regressions.
I'm refactoring all the vector kernels and I observed that the behavior of this kernel is inconsistent between scalar and array inputs: a scalar null returns an array of length 0, while an array with a single null returns an array with one null. I'm not sure this is right -- I can preserve this behavior, but do we want to fix it?

(It's also unclear to me that supporting scalar inputs to this function is useful, but that's a separate question.)

In the exec plan / engine (e.g. everywhere using ExecBatch), scalar columns have a length (part of the reason I'm a fan of treating scalars everywhere as RLE-encoded arrays is to easily express this). For this function, if we are going to allow it to be called as a "scalar function", it must not change the length of incoming arrays. So I would expect a scalar null with length 0 to return an empty array or a scalar null with length 0.

I suppose, if the kernel functions are going to keep having scalars without length, then the correct thing to do would be to always output either a scalar or an array of length 1.
In the Acero execution engine, vector functions won't be supported in expressions aside from window functions (I guess). I'm inclined to simply disable the scalar input path here since it is ill-defined right now. I'll do that in my forthcoming patch.
In ARROW-16577 (which I'm going to tackle within the next week hopefully), I'm going to remove the all-scalar input path from all kernels and up-promote ExecBatch with all scalars to be ExecSpan with arrays of length 1 (and unbox the output to be a scalar again if it's appropriate). |
An ExecBatch with all scalars does not necessarily have a length of 1. |
I believe there has been some desire for a "table UDF" which would require a new TableUdfNode (i.e. it would not use the project node). As far as I can tell there are no rules for a table UDF so it is a free for all. So this could qualify as a "table function". However, table functions are not yet defined / implemented so they are in the same boat as window functions (which is probably the most correct category for this). So +1 to the idea of just removing the path for now. |
I think we're talking about different things — many of the ScalarKernels have two implementations: one for all scalars which returns a scalar, and another for arrays that returns an array. I'm just talking about nixing the first code path such that the scalars -> scalar code path runs through a common implementation rather than a duplicated one. |
I think I see your point. Is the code path for
Yes, definitely. I'm just referring to the implementation of e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_nested.cc#L614 |
Fixes #35180. Can't do the binding to dplyr, as dplyr takes scalar expressions and cumsum (#12460) isn't a scalar expression.

* Closes: #35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
### Rationale for this change

Adds a `pairwise_diff` function similar to pandas' [Series.diff](https://pandas.pydata.org/docs/reference/api/pandas.Series.diff.html); the function computes the first-order difference of an array.

### What changes are included in this PR?

I followed [these instructions](#12460 (comment)). The function is implemented for numerical, temporal and decimal types. Chunked arrays are not yet supported.

### Are these changes tested?

Yes. They are tested in vector_pairwise_test.cc and in python/pyarrow/tests/compute.py.

### Are there any user-facing changes?

Yes, and docs are also updated in this PR.

* Closes: #35786

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
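The described first-order difference can be sketched in plain Python. Only the semantics are shown, not the C++ kernel; the `period` parameter and the rule that positions without a predecessor (or with a null operand) yield null are assumptions borrowed from pandas' Series.diff:

```python
def pairwise_diff(values, period=1):
    """out[i] = values[i] - values[i - period]; positions without a
    valid predecessor, or with a null (None) operand, yield null."""
    n = len(values)
    out = []
    for i in range(n):
        j = i - period
        if j < 0 or j >= n or values[i] is None or values[j] is None:
            out.append(None)
        else:
            out.append(values[i] - values[j])
    return out

pairwise_diff([1, 3, 6, 10])  # [None, 2, 3, 4]
```

Note that the cumulative sum discussed in this thread and this difference are near-inverses: diffing a running sum (ignoring the leading null) recovers the original values.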
Creates a new compute function to perform a cumulative sum on a given array/vector.