GH-43495: [C++][Compute] Widen the row offset of the row table to 64-bit #43389
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Could you then also rename the pull request title to match the required format?
Just found that the functions in this file are not used anywhere. I updated them anyway; maybe I'll clean them up or make them useful in the future.
Hmm, do you mean the AVX2 speedups are actually not wired anywhere?
Right, they are not, as far as I can tell from searching the code.
@ursabot please benchmark
Benchmark runs are scheduled for commit a789e04e964dacf02e76df3cfd48a59024843fce. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a789e04e964dacf02e76df3cfd48a59024843fce. There were 5 benchmark results indicating a performance regression. The full Conbench report has more details.
The ASAN failure is because of #43414. Others seem unrelated.
Note this assertion will fire for 32-bit offsets, because there are unreported offset overflow issues in the fixed-length code path as described in #43495, quoted below:

Even for the fixed-length code path, which doesn’t deal with the offset buffer at all and thus is supposed to be less problematic, there are obvious offset overflow issues like [1] and [2] (these issues are currently unreported but observed in my local experiments).

[1]
uint32_t offset_right = irow_right * fixed_length + offset_within_row;
[2] arrow/cpp/src/arrow/compute/row/compare_internal_avx2.cc, lines 243 to 244 at 187197c:
__m256i offset_right = _mm256_mullo_epi32(irow_right, _mm256_set1_epi32(fixed_length));
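To make the overflow concrete, here is a minimal standalone sketch (the row id and row width below are hypothetical, not taken from the PR) showing how a 32-bit multiplication like [1] wraps once the byte offset passes 4GB, and how widening one operand to 64-bit avoids it:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical values: row 70,000,000 in a fixed-length row table with
  // 64-byte rows. The true byte offset is 4,480,000,000, which exceeds
  // UINT32_MAX (4,294,967,295).
  uint32_t irow_right = 70000000;
  uint32_t fixed_length = 64;

  // 32-bit multiplication as in [1]: wraps around modulo 2^32.
  uint32_t offset32 = irow_right * fixed_length;

  // Widening one operand before multiplying keeps the full value.
  uint64_t offset64 = static_cast<uint64_t>(irow_right) * fixed_length;

  std::cout << "32-bit offset: " << offset32 << "\n";  // 185032704 (wrapped)
  std::cout << "64-bit offset: " << offset64 << "\n";  // 4480000000
  return 0;
}
```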
Force-pushed from 0e3a361 to 2da4b07.
row_ids_out.data(), columns, row_table,
    /*are_cols_in_encoding_order=*/true,
    /*out_match_bitvector_maybe_null=*/NULLPTR);
ASSERT_EQ(num_rows_no_match, 0);
So, this means that the rows and columns compared equal, right? I'm trying to understand the API...
Right. It returns the rows not equal to columns.
/*out_num_rows=*/NULLPTR, /*out_sel_left_maybe_same=*/NULLPTR, columns, row_table,
    /*are_cols_in_encoding_order=*/true, match_bitvector.data());
ASSERT_EQ(CountSetBits(match_bitvector.data(), 0, num_rows_to_compare),
          num_rows_to_compare);
Hmm, the docstring says that CompareColumnsToRows returns "a single 16-bit selection vector of rows that failed comparison". If rows are equal to columns, then surely all bits should be 0, not 1? Am I misunderstanding something?
The doc string is misleading. It actually returns the result in either of two ways, depending on whether the row-id output parameter is null: a selection vector of the non-matching row ids (as the doc string says), or a match bit vector (1 for match, 0 otherwise; this is not mentioned in the doc string). This code checks the result via the second way.
I can update the doc string to make it clearer.
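For illustration, here is a small hypothetical stand-in that mimics the dual output convention just described; it is not the real Arrow `CompareColumnsToRows` (whose full signature is not shown in this thread), just a sketch of the two modes:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in: returns non-matching row ids as a selection vector
// when out_sel_no_match is non-null (mode 1), otherwise fills a match bit
// vector with 1 for match and 0 for non-match (mode 2).
void CompareColumnsToRowsLike(const std::vector<bool>& row_matches,
                              uint32_t* out_num_rows_no_match,  // mode 1 output
                              uint16_t* out_sel_no_match,       // mode 1 selector
                              uint8_t* out_match_bitvector) {   // mode 2 output
  const uint32_t num_rows = static_cast<uint32_t>(row_matches.size());
  if (out_sel_no_match != nullptr) {
    // Mode 1: 16-bit selection vector of the rows that failed comparison.
    uint32_t n = 0;
    for (uint32_t i = 0; i < num_rows; ++i) {
      if (!row_matches[i]) out_sel_no_match[n++] = static_cast<uint16_t>(i);
    }
    *out_num_rows_no_match = n;
  } else {
    // Mode 2: match bit vector (the path the test above checks).
    for (uint32_t i = 0; i < num_rows; ++i) {
      if (row_matches[i]) {
        out_match_bitvector[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
      }
    }
  }
}
```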
What a weird API. Yes, updating the doc string would be nice :-)
Updated.
CI failures look unrelated, FTR.
@github-actions crossbow submit -g cpp
Revision: ded3e67 Submitted crossbow builds: ursacomputing/crossbow @ actions-c44c9b7d7c
LGTM.
Fab, a lot of people will be happy with this one. A few questions here:
Thanks for the feedback!

There is unfortunately no update on, nor plan for, that AFAIK. By "crashed code solely because of the data size", if you mean the OOM-kill by the operating system, then yes :)

Great question. I've thought about that too and had a short discussion with other developers in the community meeting. My reasons for not doing this are: 1) code complexity: there is a lot of code that assumes the offset type is concrete, especially in SIMD-specialized functions, and generalizing it is not trivial; 2) runtime overhead: since the data size is very likely unknown before the computation, the bottom line would be to use the 32-bit offset until a certain limit (4GB) is reached as the data accumulates, and at that point switch to the 64-bit offset, incurring a huge memory copy from the 32-bit offset buffer to the 64-bit offset buffer. Thus the decision. (A sketch of this rejected adaptive scheme follows below.)
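To make the rejected alternative concrete, here is a minimal sketch (hypothetical, not Arrow code) of such an adaptive offset buffer, including the one-time widening copy that the comment above warns about:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical adaptive offset buffer: stays 32-bit until an offset would
// exceed UINT32_MAX, then pays a full widening copy into a 64-bit buffer.
struct AdaptiveOffsets {
  std::vector<uint32_t> offsets32;
  std::vector<uint64_t> offsets64;
  bool wide = false;

  void Append(uint64_t offset) {
    if (!wide && offset > UINT32_MAX) {
      // The huge memory copy: every accumulated 32-bit offset must be
      // rewritten into the 64-bit buffer in one go.
      offsets64.assign(offsets32.begin(), offsets32.end());
      offsets32.clear();
      offsets32.shrink_to_fit();
      wide = true;
    }
    if (wide) {
      offsets64.push_back(offset);
    } else {
      offsets32.push_back(static_cast<uint32_t>(offset));
    }
  }
};
```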
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5e68513. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 40 possible false positives for unstable benchmarks that are known to sometimes produce them.
@github-actions crossbow submit test-r-rstudio-r-base-4.1-opensuse155
Revision: ded3e67 Submitted crossbow builds: ursacomputing/crossbow @ actions-947c28ffed
### Rationale for this change
A piece of comment was wrongly introduced in #43389.
### What changes are included in this PR?
### Are these changes tested?
### Are there any user-facing changes?
Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
…alculation for fixed length and null masks (#45336)
### Rationale for this change
#45334
### What changes are included in this PR?
1. An all-mighty test case that can effectively reveal all the bugs mentioned in the issue;
2. Other than directly fixing the bugs (actually simply casting to 64-bit somewhere in the multiplication will do), I did some refinement to the buffer accessors of the row table, in order to eliminate more potential similar issues (which I believe do exist):
   1. `null_masks()` -> `null_masks(row_id)`, which does overflow-safe indexing inside;
   2. `is_null(row_id, col_pos)`, which does overflow-safe indexing and directly gets the bit of the column;
   3. `data(1)` -> `fixed_length_rows(row_id)`, which first asserts the row table being fixed-length, then does overflow-safe indexing inside;
   4. `data(2)` -> `var_length_rows()`, which only asserts the row table being var-length. It is supposed to be paired with `offsets()` (which is already 64-bit by #43389);
   5. the `data(0/1/2)` members are made private.
3. The AVX2 specializations are fixed individually by using 64-bit multiplication and indexing.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
None.
* GitHub Issue: #45334
Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
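The buffer-accessor refinement in point 2 above can be pictured with a minimal sketch. This is a hypothetical simplification (only the accessor names come from the commit message; the class name, fields, and layout are assumed) of a row table whose fixed-length rows are stored contiguously:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification of a row table with overflow-safe accessors.
class RowTableSketch {
 public:
  // data(1) -> fixed_length_rows(row_id): assert fixed-length, then index
  // with a widened (64-bit) byte offset so row_id * fixed_length_ cannot
  // wrap at 4GB the way a 32-bit multiplication would.
  const uint8_t* fixed_length_rows(uint32_t row_id) const {
    assert(is_fixed_length_);
    return rows_ + static_cast<int64_t>(row_id) * fixed_length_;
  }

  // null_masks() -> null_masks(row_id): the same overflow-safe indexing
  // for the per-row null-mask bytes.
  const uint8_t* null_masks(uint32_t row_id) const {
    return null_masks_ + static_cast<int64_t>(row_id) * null_masks_bytes_per_row_;
  }

  // is_null(row_id, col_pos): directly fetch the column's null bit.
  bool is_null(uint32_t row_id, uint32_t col_pos) const {
    return (null_masks(row_id)[col_pos / 8] >> (col_pos % 8)) & 1;
  }

 private:
  // The raw data(0/1/2) pointers stay private, as in the commit message.
  const uint8_t* null_masks_ = nullptr;
  const uint8_t* rows_ = nullptr;
  bool is_fixed_length_ = true;
  uint32_t fixed_length_ = 0;
  uint32_t null_masks_bytes_per_row_ = 0;
};
```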
Rationale for this change
The row table uses `uint32_t` as the row offset within the row data buffer, effectively limiting the row data from growing beyond 4GB. This is quite restrictive, and the impact is described in more detail in #43495. This PR proposes to widen the row offset from 32-bit to 64-bit to address this limitation.
Benefits
Currently, the row table has three major limitations:
This enhancement will eliminate the first limitation. Meanwhile, the second and third limitations are less likely to occur. Thus, this change will enable a significant range of use cases that are currently unsupported.
Overhead
Of course, this will introduce some overhead:
In my opinion, this overhead is justified by the benefits listed above.
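As a back-of-the-envelope illustration of the offset-buffer cost (the row count below is assumed for illustration, not a figure from this PR): widening each row offset from 4 to 8 bytes doubles the offset buffer of a var-length row table.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Assumed workload: 100 million rows in the row table.
  const uint64_t num_rows = 100000000;
  const uint64_t offsets32_bytes = num_rows * sizeof(uint32_t);  // ~381 MiB
  const uint64_t offsets64_bytes = num_rows * sizeof(uint64_t);  // ~763 MiB
  // Prints the extra memory the widened offsets cost: ~381 MiB here.
  std::printf("extra offset-buffer memory: %llu MiB\n",
              static_cast<unsigned long long>(
                  (offsets64_bytes - offsets32_bytes) >> 20));
  return 0;
}
```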
What changes are included in this PR?
Change the row offset of the row table from 32-bit to 64-bit. Related code in row comparison/encoding and the swiss join has been updated accordingly.
Are these changes tested?
Test included.
Are there any user-facing changes?
Users could potentially see higher memory consumption when using Acero's hash join and hash aggregation. On the other hand, certain use cases that used to fail are now able to complete.