ARROW-13530: [C++] Implement cumulative sum compute function #12460
Conversation
I thought we can't implement this until we decide how to handle ordering within the query engine? CC @westonpace, I may not have the context.

Oh, or it can just be a vector function, which is fine, though not usable within the query engine.

I think we can write the scalar function for users calling into compute directly. I agree it would be somewhat meaningless today if applied in, for example, a project expression. Can we have it take a starting value as one of the function options? Then we can use this whenever we get around to implementing the feature in the execution engine.
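The proposed starting-value option can be sketched in plain Python. This is only an illustration of the intended semantics; the `cumulative_sum` name, the `start` parameter, and the null-propagation behavior shown here are assumptions for the sketch, not the final Arrow API:

```python
def cumulative_sum(values, start=0):
    """Return the running sum of `values`, seeded with `start`.

    A None (null) input propagates to all subsequent outputs; this
    mirrors one plausible behavior for a null-sensitive kernel and is
    an assumption of this sketch.
    """
    out = []
    total = start
    for v in values:
        if total is None or v is None:
            total = None
        else:
            total += v
        out.append(total)
    return out

cumulative_sum([1, 2, 3], start=10)  # [11, 13, 16]
```

With a `start` option, resuming a sum mid-stream is just a matter of seeding the next call with the last emitted value, which is what makes the option useful once the execution engine supports this function.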
@JabariBooker The following are the main actions to complete when adding a compute function to C++, including Python bindings:
@westonpace Why a An

This would be a vector function but we already have this with

Isn't this a scalar function? I may be misunderstanding the term. My current interpretation is "each row has a single output value".

@westonpace Oops! I forgot that

Given that definition I agree this is a vector function. I was thinking only of the shape of the return value and not the statefulness of the function.

...well, I just noticed that my definition is not exactly the definition stated in the source code, so it would be great to have other opinions on this.

This sounds like a function "whose behavior depends on the values of the entire arrays passed", no?

@lidavidm I agree.
From a consumption standpoint the concept of a "scalar expression" is pretty important (see, for example, Oracle's definition: https://docs.oracle.com/cd/B19306_01/server.102/b14200/expressions010.htm). Only a scalar expression can be used in a project expression. These are contrasted with "table expressions", which can return multiple rows for each execution. If Arrow wants concepts of "scalar function" and "vector function" that differ from this, I think that is "ok but confusing". For example, to validate a plan, we can inspect the return value of the function and ensure that it is scalar, to decide whether an expression is a "scalar expression", regardless of whether it is a "scalar function".

We already do that verification:

So this particular function can be a vector function because it would be somewhat meaningless to use in a project expression anyway. Is there any example where a vector function might want to be used in a project expression? Maybe it's a moot point and we can decide "scalar expression" means "single result AND stateless".

I almost wonder if cumulative sum and other such functions should be their own (sub)class, though, since they can be processed incrementally with some state (and that state can be represented easily as an Arrow scalar), at least for the ones Pandas supports (min/max/sum/product).

And then we could have an exec node that knows how to handle them appropriately.
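The incremental, state-carrying evaluation described above can be sketched as a small class. The class name and structure are hypothetical; the point is only that the carried state between batches is a single scalar, so an exec node could feed such a kernel batch by batch (the sum/product/min/max set matches the Pandas operations mentioned earlier):

```python
import operator

class CumulativeKernel:
    """Hypothetical sketch: a cumulative kernel whose inter-batch
    state is one scalar, so it can be driven incrementally."""

    _OPS = {
        "sum": (operator.add, 0),
        "product": (operator.mul, 1),
        "min": (min, float("inf")),
        "max": (max, float("-inf")),
    }

    def __init__(self, op):
        self._fn, self._state = self._OPS[op]

    def consume(self, batch):
        """Process one batch, updating the carried scalar state."""
        out = []
        for v in batch:
            self._state = self._fn(self._state, v)
            out.append(self._state)
        return out

k = CumulativeKernel("sum")
# State carries across batches, so two calls equal one big call.
first = k.consume([1, 2])   # [1, 3]
second = k.consume([3, 4])  # [6, 10]
```

Because the state is one scalar per column, checkpointing or restarting such a node is cheap, which is part of the appeal of giving these functions their own (sub)class.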
Ok. I agree with you both. "Scalar" implies stateless. This function is not stateless and thus not scalar. Whether we want to call it vector or something else we can figure out later. I suspect whatever pattern we end up needing will be useful when implementing window functions.

Hmm, I can imagine ones like fill_null, drop_null might be desirable. drop_null is an example of something which isn't stateful, but which is not scalar. And fill_null_forward sort of fits into a 'cumulative' function (not backward, unless you want to support directionality). I guess these don't really fit as projections, but rather as postprocessing steps (it doesn't really make sense to say
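fill_null_forward does fit the same stateful-scan shape described above: the carried state is just the last non-null value seen. A plain-Python sketch (not the Arrow implementation; the `last` parameter is an assumption added to show resumability across batches):

```python
def fill_null_forward(values, last=None):
    """Replace each null (None) with the most recent non-null value.

    `last` is the carried state; passing the previous batch's final
    state resumes the fill across batch boundaries, just like a
    cumulative kernel.
    """
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

fill_null_forward([1, None, None, 4, None])  # [1, 1, 1, 4, 4]
```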
Yeah, for our purposes let's call this vector, and we can see how things go as we add more support in the query engine.
Thank you both for this insightful discussion. |
Thanks for the update @JabariBooker and sorry for the delay. Here are a couple more concerns.
Sorry for the delay @JabariBooker. This looks great to me, and I'm going to merge now. Thank you for contributing this!

Benchmark runs are scheduled for baseline = 931422b and contender = b851392. b851392 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have a high level of regressions.
I'm refactoring all the vector kernels and I observed that the behavior of this kernel is inconsistent between scalar and array inputs: a scalar null returns an array of length 0, while an array with a single null returns an array with one null. I'm not sure this is right -- I can preserve this behavior, but do we want to fix it?

(It's also unclear to me that supporting scalar inputs to this function is useful, but that's a separate question.)

In the exec plan / engine (e.g. everywhere using ExecBatch), scalar columns have a length (part of the reason I'm a fan of treating scalars everywhere as RLE-encoded arrays is to easily express this). For this function, if we are going to allow it to be called as a "scalar function", it must not change the length of incoming arrays. So I would expect a scalar null with length 0 to return an empty array or a scalar null with length 0.

I suppose, if the kernel functions are going to keep having scalars without length, then the correct thing to do would be to always output either a scalar or an array of length 1.
In the Acero execution engine, vector functions won't be supported in expressions aside from window functions (I guess). I'm inclined to simply disable the scalar input path here since it is ill-defined right now. I'll do that in my forthcoming patch.
In ARROW-16577 (which I'm going to tackle within the next week hopefully), I'm going to remove the all-scalar input path from all kernels and up-promote ExecBatch with all scalars to be ExecSpan with arrays of length 1 (and unbox the output to be a scalar again if it's appropriate). |
An ExecBatch with all scalars does not necessarily have a length of 1. |
I believe there has been some desire for a "table UDF" which would require a new TableUdfNode (i.e. it would not use the project node). As far as I can tell there are no rules for a table UDF so it is a free for all. So this could qualify as a "table function". However, table functions are not yet defined / implemented so they are in the same boat as window functions (which is probably the most correct category for this). So +1 to the idea of just removing the path for now. |
I think we're talking about different things — many of the ScalarKernels have two implementations: one for all scalars which returns a scalar, and another for arrays that returns an array. I'm just talking about nixing the first code path such that the scalars -> scalar code path runs through a common implementation rather than a duplicated one. |
I think I see your point. Is the code path for
Yes, definitely. I'm just referring to the implementation of e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_nested.cc#L614 |
Fixes #35180. Can't do the binding to dplyr, as dplyr takes scalar expressions and cumsum (#12460) isn't a scalar expression.

* Closes: #35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
### Rationale for this change

Adds a `pairwise_diff` function similar to pandas' [Series.diff](https://pandas.pydata.org/docs/reference/api/pandas.Series.diff.html); the function computes the first-order difference of an array.

### What changes are included in this PR?

I followed [these instructions](#12460 (comment)). The function is implemented for numerical, temporal and decimal types. Chunked arrays are not yet supported.

### Are these changes tested?

Yes. They are tested in vector_pairwise_test.cc and in python/pyarrow/tests/compute.py.

### Are there any user-facing changes?

Yes, and docs are also updated in this PR.

* Closes: #35786

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
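The described first-order difference can be sketched in plain Python. Only the semantics are shown, not the C++ kernel; the `period` parameter and the rule that positions without a predecessor (or with a null operand) yield null are assumptions borrowed from pandas' Series.diff:

```python
def pairwise_diff(values, period=1):
    """out[i] = values[i] - values[i - period]; positions without a
    valid predecessor, or with a null (None) operand, yield null."""
    n = len(values)
    out = []
    for i in range(n):
        j = i - period
        if j < 0 or j >= n or values[i] is None or values[j] is None:
            out.append(None)
        else:
            out.append(values[i] - values[j])
    return out

pairwise_diff([1, 3, 6, 10])  # [None, 2, 3, 4]
```

Note that the cumulative sum discussed in this thread and this difference are near-inverses: diffing a running sum (ignoring the leading null) recovers the original values.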
Creates a new compute function to perform a cumulative sum on a given array/vector.