Skip to content

Conversation

@JabariBooker
Copy link
Contributor

Creating new compute function to perform a cumulative sum on a given array/vector.

@JabariBooker JabariBooker marked this pull request as draft February 18, 2022 04:54
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Copy link
Member

I thought we can't implement this until we decide how to handle ordering within the query engine? CC @westonpace I may not have the context

@lidavidm
Copy link
Member

Oh, or it can just be a vector function, which is fine, though not usable within the query engine.

@westonpace
Copy link
Member

I think we can write the scalar function for users calling into compute directly. I agree it would be somewhat meaningless today if applied in, for example, a project expression.

Can we have it take a starting value as one of the function options?

Then we can use this whenever we get around to implementing the feature in the execution engine.

@JabariBooker JabariBooker marked this pull request as ready for review March 2, 2022 07:48
@edponce
Copy link
Contributor

edponce commented Mar 3, 2022

@edponce
Copy link
Contributor

edponce commented Mar 3, 2022

I think we can write the scalar function for users calling into compute directly. I agree it would be somewhat meaningless today if applied in, for example, a project expression.

@westonpace Why a ScalarFunction and not a VectorFunction? I understand the ordering issue, but...

An Array can be:

  • summed up into a single total sum
  • cumulative sum resulting in an Array with partial sums

@westonpace
Copy link
Member

westonpace commented Mar 3, 2022

@edponce

@westonpace Why a ScalarFunction and not a VectorFunction? I understand the ordering issue, but...

An Array can be:

* summed up into a single total sum

This would be a vector function but we already have this with sum right?

* cumulative sum resulting in an `Array` with partial sums

Isn't this a scalar function? I may be misunderstanding the term. My current interpretation is "each row has a single output value".

@edponce
Copy link
Contributor

edponce commented Mar 3, 2022

@westonpace Oops! I forgot that sum existed.
A ScalarFunction can operate independently on each element of the Array, but a cumulative sum requires the sum of the previous element to compute the next one (or the entire Array, depends on impl.), so this pattern I think conforms more to a VectorFunction.

@westonpace
Copy link
Member

A ScalarFunction can operate independently on each element of the Array

Given that definition I agree this is a vector function. I was thinking only of the shape of the return value and not the statefull-ness of the function.

@edponce
Copy link
Contributor

edponce commented Mar 3, 2022

...well..., I just noticed that my definition is not exactly the definition stated in the source code, so it would be great to have other opinions on this.
cc @lidavidm @pitrou

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2022

This sounds like a function "whose behavior depends on the values of the entire arrays passed", no?

@edponce
Copy link
Contributor

edponce commented Mar 3, 2022

@lidavidm I agree.
Also, a ScalarFunction should conform to this iteration pattern, and cumulative sum does not, so it is a VectorFunction.

@westonpace
Copy link
Member

From a consumption standpoint the concept of "scalar expression" is pretty important:

Oracle: https://docs.oracle.com/cd/B19306_01/server.102/b14200/expressions010.htm
CockroachDb: https://www.cockroachlabs.com/docs/stable/scalar-expressions.html
Substrait: https://substrait.io/expressions/scalar_functions/

For example, only a scalar expression can be used in a project expression. These are contrasted with "table expressions" which can return multiple rows for each execution.

If Arrow wants to have a concept of "scalar function" and "vector function" that is different then I think that is "ok but confusing". For example, to validate a plan, we can inspect the return value of the function, and ensure that it is scalar, to decide if an expression is a "scalar expression" regardless of whether it is a "scalar function".

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2022

We already do that verification:

return call->function->kind() == compute::Function::SCALAR;

@westonpace
Copy link
Member

So this particular function can be a vector function because it would be somewhat meaningless to use in a project expression anyways. Is there any example where a vector function might want to be used in a project expression? Maybe it's a moot point and we can decide "scalar expression" means "single result AND stateless"

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2022

I almost wonder if cumulative sum and other such functions should be their own (sub)class, though, since they can be processed incrementally with some state (and that state can be represented easily as an Arrow scalar), at least for the ones Pandas supports (min/max/sum/product).

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2022

And then we could have an exec node that knows how to handle them appropriately.

@westonpace
Copy link
Member

Ok. I agree with you both. "scalar" implies stateless. This function is not stateless and thus not scalar. Whether we want to call it vector or something else we can figure out later. I suspect whatever pattern we end up needing will be useful when implementing window functions.

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2022

Is there any example where a vector function might want to be used in a project expression?

Hmm, I can imagine ones like fill_null, drop_null might be desirable. drop_null is an example of something which isn't stateful, but which is not scalar. And fill_null_forward sorta fits into a 'cumulative' function (not backward unless you want to support directionality). I guess these don't really fit as projections, but rather as postprocessing steps (it doesn't really make sense to say drop_null(a) + drop_null(b), but you could have `a + b and then drop_null)

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2022

Yeah for our purposes let's call this vector, and we can see how things to as we add more support in the query engine.

@edponce
Copy link
Contributor

edponce commented Mar 3, 2022

Thank you both for this insightful discussion.
Created ARROW-15832 to capture this discussion (still need to add more details).

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @JabariBooker and sorry for the delay. Here are a couple more concerns.

@JabariBooker JabariBooker requested a review from pitrou May 23, 2022 15:56
@pitrou
Copy link
Member

pitrou commented May 31, 2022

Sorry for the delay @JabariBooker . This looks great to me, and I'm going to merge now. Thank you for contributing this!

@pitrou pitrou closed this in b851392 May 31, 2022
@ursabot
Copy link

ursabot commented May 31, 2022

Benchmark runs are scheduled for baseline = 931422b and contender = b851392. b851392 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.78% ⬆️0.47%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.79% ⬆️0.2%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] b8513920 ec2-t3-xlarge-us-east-2
[Finished] b8513920 test-mac-arm
[Finished] b8513920 ursa-i9-9960x
[Finished] b8513920 ursa-thinkcentre-m75q
[Finished] 931422bb ec2-t3-xlarge-us-east-2
[Failed] 931422bb test-mac-arm
[Finished] 931422bb ursa-i9-9960x
[Finished] 931422bb ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented May 31, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@wesm
Copy link
Member

wesm commented Jun 17, 2022

I'm refactoring all the vector kernels and I observed that the behavior of this kernel is inconsistent between scalar and array inputs:

A scalar null returns an array of length with 0 while an array with a single null returns an array with one null

In [5]: pc.cumulative_sum(pa.array([None], type='int32')[0], skip_nulls=True)
Out[5]: 
<pyarrow.lib.Int32Array object at 0x7fa80d3e7640>
[
  0
]

In [6]: pc.cumulative_sum(pa.array([None], type='int32'), skip_nulls=True)
Out[6]: 
<pyarrow.lib.Int32Array object at 0x7fa80d3e7700>
[
  null
]

I'm not sure this is right -- I can preserve this behavior, but do we want to fix it?

@wesm
Copy link
Member

wesm commented Jun 17, 2022

(it's also unclear to me that supporting scalar inputs to this function is useful, but that's a separate question)

@westonpace
Copy link
Member

In the exec plan / engine (e.g. everywhere using ExecBatch), scalar columns have length (part of the reason I'm a fan of treating scalars everywhere as RLE encoded arrays is to easily express this).

For this function, if we are going to allow it to be called as a "scalar function", it must not change the length of incoming arrays. So I would expect:

A scalar null with length 0 should return an empty array or a scalar null with length 0.
A scalar null with length X should return an array of nulls with length X or a scalar null with length X.

@westonpace
Copy link
Member

I suppose, if the kernel functions are going to keep having scalars without length then the correct thing to do would be to always output either a scalar or an array of length 1.

@wesm
Copy link
Member

wesm commented Jun 17, 2022

In the Acero execution engine, vector functions won't be supported in expressions aside from window functions (I guess). I'm inclined to simply disable the scalar input path here since it is ill-defined right now. I'll do that in my forthcoming patch

@wesm
Copy link
Member

wesm commented Jun 17, 2022

I suppose, if the kernel functions are going to keep having scalars without length then the correct thing to do would be to always output either a scalar or an array of length 1.

In ARROW-16577 (which I'm going to tackle within the next week hopefully), I'm going to remove the all-scalar input path from all kernels and up-promote ExecBatch with all scalars to be ExecSpan with arrays of length 1 (and unbox the output to be a scalar again if it's appropriate).

@westonpace
Copy link
Member

In ARROW-16577 (which I'm going to tackle within the next week hopefully), I'm going to remove the all-scalar input path from all kernels and up-promote ExecBatch with all scalars to be ExecSpan with arrays of length 1 (and unbox the output to be a scalar again if it's appropriate).

An ExecBatch with all scalars does not necessarily have a length of 1.

@westonpace
Copy link
Member

In the Acero execution engine, vector functions won't be supported in expressions aside from window functions (I guess). I'm inclined to simply disable the scalar input path here since it is ill-defined right now. I'll do that in my forthcoming patch

I believe there has been some desire for a "table UDF" which would require a new TableUdfNode (i.e. it would not use the project node). As far as I can tell there are no rules for a table UDF so it is a free for all. So this could qualify as a "table function".

However, table functions are not yet defined / implemented so they are in the same boat as window functions (which is probably the most correct category for this). So +1 to the idea of just removing the path for now.

@wesm
Copy link
Member

wesm commented Jun 17, 2022

An ExecBatch with all scalars does not necessarily have a length of 1.

I think we're talking about different things — many of the ScalarKernels have two implementations: one for all scalars which returns a scalar, and another for arrays that returns an array. I'm just talking about nixing the first code path such that the scalars -> scalar code path runs through a common implementation rather than a duplicated one.

@westonpace
Copy link
Member

I think I see your point. Is the code path for array x scalar remaining (e.g. add(field_ref("x"), 7))?

@wesm
Copy link
Member

wesm commented Jun 17, 2022

Yes, definitely. I'm just referring to the implementation of e.g. f(scalar) -> scalar) or g(scalar, scalar) -> scalar — the ability to perform these operations will remain but the implementation will use the array implementation allowing us to simply delete the all-scalar implementation. See for example what it looks like to have duplicate implementations of a scalar kernel like "map_lookup"

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_nested.cc#L614

thisisnic added a commit that referenced this pull request Apr 26, 2023
Fixes #35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( #12460 ) isn't a scalar expression.
* Closes: #35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
)

Fixes apache#35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( apache#12460 ) isn't a scalar expression.
* Closes: apache#35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
)

Fixes apache#35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( apache#12460 ) isn't a scalar expression.
* Closes: apache#35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
bkietz added a commit that referenced this pull request Jun 29, 2023
### Rationale for this change

Add a `pairwise_diff` function similar to pandas' [Series.Diff](https://pandas.pydata.org/docs/reference/api/pandas.Series.diff.html), the function computes the first order difference of an array.
### What changes are included in this PR?

I followed [these instructions](#12460 (comment)). The function is implemented for numerical, temporal and decimal types. Chuck arrays are not yet supported.

### Are these changes tested?

Yes. They are tested in vector_pairwise_test.cc and in python/pyarrow/tests/compute.py.

### Are there any user-facing changes?

Yes, and docs are also updated in this PR.

* Closes: #35786

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants