Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Wrong and low inefficient expression execution for [if/else, case/when ... etc] expression #41094

Open
ZhangHuiGui opened this issue Apr 9, 2024 · 18 comments

Comments

@ZhangHuiGui
Copy link
Collaborator

ZhangHuiGui commented Apr 9, 2024

Describe the bug, including details regarding any error messages, version, and platform.

A bug is found here. When we execute a similar case when a != 0 then b / a, div 0 exception will be reported. But logically, such a statement should be executed.
The reason why the above execution will report an error is because of the existing ExecuteScalarExpression execution system logic, as shown in the following code:

for (size_t i = 0; i < arguments.size(); ++i) {
     ARROW_ASSIGN_OR_RAISE(
         arguments[i], ExecuteScalarExpression(call->arguments[i], input, exec_context));
     if (arguments[i].is_array()) {
       all_scalar = false;
     }
   }

For the above statement, the expression arguments[0] corresponding to the judgment condition a != 0 will be executed first, and then the arguments[1] corresponding to b / a will be executed. Obviously, logically speaking, b / a should not be fully executed, it should be partially executed based on a != 0

Because if-else related expressions have a special execution order, should we extend ExecuteScalarIfElseExpression to handle such expressions separately? The execution order is as follows:

  1. First execute the condition corresponding to the if statement
  2. Filter the corresponding data based on the execution result of the if statement
  3. Perform the remaining operations

To do this, we obviously need to extend Expression to support if-else related logic and callbacks

laotan332 qinpengxiang@outlook.com

Component(s)

C++

@ZhangHuiGui
Copy link
Collaborator Author

ZhangHuiGui commented Apr 9, 2024

@kou @bkietz @pitrou @lidavidm Any suggestion on this issue? Seems we need to refactor the if/else (and other similar kernel)'s execute logic to solve the problem and also improve it's performance?

@mapleFU
Copy link
Member

mapleFU commented Apr 10, 2024

A bug is found here. When we execute a similar case when a != 0 then b / a, div 0 exception will be reported. But logically, such a statement should be executed.

should not be executed?

I think a Selector would be help here. But I don't know whether it breaks the current design. I'll dive into it later

@qinpengxiang001
Copy link

A bug is found here. When we execute a similar case when a != 0 then b / a, div 0 exception will be reported. But logically, such a statement should be executed.

should not be executed?

I think a Selector would be help here. But I don't know whether it breaks the current design. I'll dive into it later

Corresponding to the case of a = 0, b / a should not be executed, but in fact, the existing logic has already executed

@bkietz
Copy link
Member

bkietz commented Apr 23, 2024

We have talked about adding support for selection vectors but never got around to adding support for them: https://github.com/bkietz/arrow/blob/c25866bc59e30e43e8dc3b05f3973d48074c3594/cpp/src/arrow/compute/exec.h#L199-L204

If you're interested in adding support for them, that would be greatly appreciated

@ZhangHuiGui
Copy link
Collaborator Author

If you're interested in adding support for them, that would be greatly appreciated

Thanks for your suggestion, we'll take a look.

@felipecrv
Copy link
Contributor

felipecrv commented Apr 30, 2024

Because if-else related expressions have a special execution order, should we extend ExecuteScalarIfElseExpression to handle such expressions separately?

The solution should not be specific to if-then-else. Other constructs pose similar challenges (e.g. CASE WHEN):

Velox (a vectorized query engine) calls these constructs "special forms". Lisp uses the same terminology. Although if and cond look like function calls in Lisp [1][2], they are special in that not all arguments are eagerly evaluated:

(if (> den 0) (/ num den) 0)

Other special forms are and, or, and cond:

(and expr ...)
(or expr ...)

(cond [test-expr body ...+]
      ...)

and is special because after the first false you get, you don't have to evaluate more input expressions. or is special because after the first true you get, you... cond is like CASE .. WHEN in SQL — a chain of if-then-else's.

This can't be done properly when if_then_else, case_when, and, or are implemented as functions like in Arrow currently. We need to create special expression fragments that can receive unevaluated expressions as input parameters.

They would look like compute::Expression::Call that can receive compute::Expression instances as parameters [3].

-  using Impl = std::variant<Datum, Parameter, Call>;
+  using Impl = std::variant<Datum, Parameter, Call, Special>;

Vectorized evaluation of special forms

Here I will try to expand on what @bkietz meant by "support for selection vectors".

Let's say we want to eval the following special form:

if bool_expr then list_value_length(a) else list_value_length(b)

bool_expr is always evaluated. When it's evaluated, a Type::BOOL array is produced.

The naive algorithm to evaluate the whole expression would then, be:

for (int64_t i = 0; i < cond.length; i++) {
  if (GetBit(cond.buffers[1].data, i)) {
    out[i] = list_value_length(a, i);
  } else {
    out[i] = list_value_length(b, i);
  }
}

this would be very inefficient since we're dealing with columnar data layouts here.

The vectorization friendly solution is to make every compute function "maskable"/"selectable". This can be made efficient if list_value_length can receive an extra "filter" parameter that selects the positions that list_value_length should read read from input and write to output:

auto cond = Evaluate(bool_expr);
auto then_selection_vec = BoolToSelVec(cond);
auto else_selection_vec = BoolToSelVec(not(cond));
auto out = PreallocateResultArray();
call list_value_length::ExecWithSel(then_selection_vec, a, &out);
call list_value_lenght::ExecWithSel(else_selection_vec, b, &out);

A "selection vector" is an array of indices. GetTakeIndicesFromBitmap [4] produces one from boolean arrays/bitmaps. By passing the selection vectors to the kernels, you allow each kernel to vectorize as is most appropriate to what it's doing: for instance, list_value_length can gather the offsets of the indices in the selection vector, perform all the subtractions to calculate the sizes, and scatter the values into the value buffer in out.

Incremental improvement

It's unrealistic to expect we will specialize all kernel functions to handle the optional selection vector parameter, but we should at least push them all the way down to Function::Execute. If a Function doesn't handle selection vectors, then a call is dispatched for all the values (as it is today) and code similar to Take is used to gather the values that the selection vector selects. The more functions become aware of selection vectors, the less we have to rely on this slow code.

It's not a trivial project.

[1] https://courses.cs.northwestern.edu/325/readings/special-forms.html
[2] https://docs.racket-lang.org/guide/conditionals.html
[3]

using Impl = std::variant<Datum, Parameter, Call>;

[4] https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc#L275

@ZhangHuiGui
Copy link
Collaborator Author

It would be very grateful if you move this forward.

This can't be done properly when if_then_else, case_when, and, or are implemented as functions like in Arrow currently. We need to create special expression fragments that can receive unevaluated expressions as input parameters.

They would look like compute::Expression::Call that can receive compute::Expression instances as parameters [3].

The implementation method you mentioned is more elegant, and it basically does not affect the scheduling of the expression system of other functions.
We could focus on the special cases in expression logic.

And implement a scheduling framework so that relevant special functions can be Incremental pushed into this framework later.

@mapleFU
Copy link
Member

mapleFU commented May 1, 2024

Besides, Velox supports filter reordering, so error might be handled in a kleene logic way. I think introducing a special form firstly is ok

@felipecrv
Copy link
Contributor

The implementation method you mentioned is more elegant, and it basically does not affect the scheduling of the expression system of other functions.
We could focus on the special cases in expression logic.

The starting point is making special forms representable in Expression::Impl, then adding placeholder implementations that don't prevent the full vectorization.

@zanmato1984
Copy link
Collaborator

Hi @felipecrv , I'm interested in this and doing some POC in my local (thanks for the instructive proposal!). One thing that I'm having trouble with is that, to allow the kernels to become "selection vector aware" incrementally:

Incremental improvement

It's unrealistic to expect we will specialize all kernel functions to handle the optional selection vector parameter, but we should at least push them all the way down to Function::Execute. If a Function doesn't handle selection vectors, then a call is dispatched for all the values (as it is today) and code similar to Take is used to gather the values that the selection vector selects. The more functions become aware of selection vectors, the less we have to rely on this slow code.

We not only Take the selected rows to pass to a kernel that has not yet been selection vector aware, but also "scatter" the "partial" output by putting the rows back to the corresponding position to the original input rows. However I couldn't find any existing facility for this scatter operation as handy as Take/Filter for gather. Do you know any? Or do we need a Scatter function, or something more light-weight like Concatenate [1], to be implemented first? Thanks.

[1]

Result<std::shared_ptr<Array>> Concatenate(const ArrayVector& arrays,
MemoryPool* pool = default_memory_pool());

@felipecrv
Copy link
Contributor

felipecrv commented Jun 1, 2024

You're right. It's not just taking/gathering the values, we have to stitch the output array from many different outputs. And you're right — we don't have a scatter facility — but note that for some array types (e.g. list and binary/string), you can't scatter based on destination indices. You will have to append in order from the arrays resulting from each if/else or case/when branch.

This is where the recent additions (list-view and string-view) can shine because you can scatter by index to them in any order as the variable-length data doesn't have to be packed in the same order as the offsets and sizes.

@mapleFU
Copy link
Member

mapleFU commented Jun 2, 2024

This is where the recent additions (list-view and string-view) can shine because you can scatter by index to them in any order as the variable-length data doesn't have to be packed in the same order as the offsets and sizes.

Aha, seems that dictionary, string/binary view, ree, list-view should have some unified place for them to execute fastly :-)

@felipecrv
Copy link
Contributor

felipecrv commented Jun 2, 2024

This is where the recent additions (list-view and string-view) can shine because you can scatter by index to them in any order as the variable-length data doesn't have to be packed in the same order as the offsets and sizes.

Aha, seems that dictionary, string/binary view, ree, list-view should have some unified place for them to execute fastly :-)

It's still preferable that the kernels take selection/mask as an auxiliary parameter. It's only the generic scatter phase that's optimal.

REE's and dictionaries can be handled generically for every scalar kernel without problems though. Just pass the values array and re-use the integer buffer in the REE/DICTIONARY array.

@mapleFU
Copy link
Member

mapleFU commented Jun 3, 2024

It's still preferable that the kernels take selection/mask as an auxiliary parameter. It's only the generic scatter phase that's optimal.

In the implementation, maybe we can first implement gather-scatter? Since Arrow doesn't gurantee a data type supports un-ordered write currently ( which is different from velox )

@zanmato1984
Copy link
Collaborator

In the implementation, maybe we can first implement gather-scatter? Since Arrow doesn't gurantee a data type supports un-ordered write currently ( which is different from velox )

  1. According to @felipecrv , we don't have an existing "scatter" function, so yes we'll have to implement that first.
  2. When this "scatter" function is available, we can build a "naive" evaluation of special form on top of it (as well as "gather") - by doing a centralized "gather the input by the selection vector, pass the selected rows to the dumb kernel, and scatter the kernel's output back to the actual output". This requires NO kernel's awareness of selection vector, making available of an incremental approach of 3.
  3. Gradually make kernel support selection vector as an optional parameter, i.e., selection-vector-aware. Once all done, we don't need 2 any more.

I think both 2 and 3 could potentially benefit from leveraging special attributes of specific data types such as list/string-view, ree and dict, though I'm not exactly sure how. I'm now working on an overall framework, maybe things will become clearer when I get there. I can use some help/comment from you guys then :)

@felipecrv
Copy link
Contributor

I think both 2 and 3 could potentially benefit from leveraging special attributes of specific data types such as list/string-view, ree and dict, though I'm not exactly sure how.

The leverage is always a function of type+kernel. The first kernels that deserve good specialization are array_take (the "gather") and the scatter.

For types that support out-of-order writing (list-view, dict, string-view...) you can scatter incrementally:

scatter(branch0, sel0, &output)
scatter(branch1, sel1, &output);
...
scatter(branchn, seln, &output);

(this assumes all the selection vectors are disjoint)

For types that need in-order appending, you will need all selection vectors and merge them:

selections = MinHeapOfSelections{{branch0, sel0, 0}, {branch1, sel1, 0}, ..., {branchn, seln, 0}};
while (!selections.empty()) {
  i = min_selection(selections);
  output_builder.AppendFrom(selections[i].branch, selections[i].pos);
  selections.ExtractMin(/*hint=*/i);
}

These branches and selection vectors are described in this comment about evaluation of case-when #41453 (comment)

I'm now working on an overall framework, maybe things will become clearer when I get there. I can use some help/comment from you guys then :)

The details will become clear only when you try to draft a big change for sure, but for the sake of review an interesting first step would be the representation of the cond special form in compute::Expression with strict/eager evaluation that you later replace.

@zanmato1984
Copy link
Collaborator

The details will become clear only when you try to draft a big change for sure, but for the sake of review an interesting first step would be the representation of the cond special form in compute::Expression with strict/eager evaluation that you later replace.

Sure, I'll send out a draft PR soon sketching the whole idea after making the code more readable.

@zanmato1984
Copy link
Collaborator

PR #42106 drafted to show the main sketch. I've also put some notes in PR description. @felipecrv would you please to take a look? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants