-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-45334: [C++][Acero] Fix swiss join overflow issues in row offset calculation for fixed length and null masks #45336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45334: [C++][Acero] Fix swiss join overflow issues in row offset calculation for fixed length and null masks #45336
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
|
|
Hi @pitrou , mind to take a look? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new helpers increase readability a bit, thank you.
I haven't looked at the test, but otherwise looks good in principle. See comments below.
| } | ||
| return NULLPTR; | ||
|
|
||
| inline const uint8_t* null_masks(uint32_t row_id) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that inline is implied when the method definition is inside the class declaration, there's no need to put it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return buffers_[i]->mutable_data(); | ||
| } | ||
| return NULLPTR; | ||
| inline uint8_t* null_masks(uint32_t row_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be called mutable_null_masks, for consistency with other helper methods below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| case 1: | ||
| for (uint32_t i = 0; i < num_rows; ++i) { | ||
| col_base[i] = row_base[i * row_size]; | ||
| col_base[i] = *rows.fixed_length_rows(start_row + i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... we're not adding offset_within_row anymore? Is it deliberate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this was a mistake. Addressed. Thank you for spotting this!
| memcpy(target->mutable_fixed_length_rows(static_cast<uint32_t>(first_target_row_id)), | ||
| source.fixed_length_rows(/*row_id=*/0), fixed_length * num_source_rows); | ||
| } else { | ||
| // Row length must be a multiple of 64-bits due to enforced alignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, but that's an interesting piece of info. Doesn't it blow up memory use if there is a small number of very small columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is true. But IMHO this is acceptable because we also have other auxiliary data structures to aim the hash join so I wouldn't say this is very bad.
| if (!source_rows_permutation) { | ||
| memcpy(target->mutable_data(1) + fixed_length * first_target_row_id, source.data(1), | ||
| fixed_length * num_source_rows); | ||
| DCHECK_LE(first_target_row_id, std::numeric_limits<uint32_t>::max()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were these size constraints already implied but not tested for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We assume row id to be uint32_t (that means no more than 2^32 rows are allowed) almost everywhere, so this is implied. But there are still some places weirdly and unnecessarily using int64_t as row id, here included.
| int64_t num_words_per_row = fixed_length / sizeof(uint64_t); | ||
| for (int64_t i = 0; i < num_source_rows; ++i) { | ||
| int64_t source_row_id = source_rows_permutation[i]; | ||
| DCHECK_LE(source_row_id, std::numeric_limits<uint32_t>::max()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's always the case then are we wasting memory and CPU cache by having a 64-bit permutation array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are. In fact I have another WIP branch trying to clean them up.
| __m256i bit_id_hi = | ||
| _mm256_mul_epi32(irow_right_hi, _mm256_set1_epi64x(null_mask_num_bytes * 8)); | ||
| bit_id_lo = _mm256_add_epi64(bit_id_lo, pos_after_encoding); | ||
| bit_id_hi = _mm256_add_epi64(bit_id_hi, pos_after_encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seem to have already seen this kind of code in other PRs, perhaps we can add helpers to make it more readable (perhaps in a avx2_internal.h header?):
inline std::pair<__m256i, __m256i> WidenInt32to64(__m256 x) {
__m256i x_lo = _mm256_cvtepi32_epi64(_mm256_castsi256_si128(x));
__m256i x_hi =
_mm256_cvtepi32_epi64(_mm256_extracti128_si256(x, 1));
return {x_lo, x_hi};
}
// Compute `x * factor + addend` in the 64-bit domain
inline __m256i MulAddInt64(__m256 x, __m256 factor, __m256 addend) {
return _mm256_add_epi64(addend, _mm256_mul_epi32(x, factor));
}
inline __m256i MulAddInt64(__m256 x, int64_t factor, int64_t addend) {
return MulAddInt64(x, _mm256_set1_epi64x(factor), _mm256_set1_epi64x(addend));
}then the code above could look like:
auto {irow_right_lo, irow_right_hi} = WidenInt32to64(irow_right);
auto bit_id_lo = MulAddInt64(irow_right_lo, null_mask_num_bytes * 8, pos_after_encoding);
auto bit_id_hi = MulAddInt64(irow_right_hi, null_mask_num_bytes * 8, pos_after_encoding);(and in an ideal future, this would be migrated to xsimd :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a lot of legacy code can take advantage of these helpers. But they might be too much to put into this PR. Can I do it in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, definitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the helper in the other comment though, which is independent of the ones suggested in this comment. (I think these ones are more "basic" than that one which is more row table specific.)
| __m128i right_hi = _mm256_i64gather_epi32(reinterpret_cast<const int*>(null_masks), | ||
| _mm256_srli_epi64(bit_id_hi, 3), 1); | ||
| __m256i right = _mm256_set_m128i(right_hi, right_lo); | ||
| right = _mm256_and_si256(right, bit_in_right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also perhaps a more general helper:
// Get null bits at `null_bit_id` as a vector of 32-bit ints
__m256i GetRowNullBitsInt32(const RowTableImpl& rows, uint32_t null_bit_id, __m256 row_index32);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I made the helper you suggested and put it in a common header. Other than this file, there is one more piece of code in swiss_join_avx2.cc can reuse it. Pretty nice.
I also made two helper functions Cmp32/64To8 local in compare_internal_avx2.cc that also save some LOC.
|
Thank you @pitrou . Please hold a bit. The CI failure seems related, meanwhile I'm still working on enhancing the added test to cover not only payload columns. Will update and address your comment soon. |
|
Hi @pitrou . Hopefully the CI should be fixed now. And the test has been enhanced to the extent that can also discover the overflow in key columns (the position of the key in row table depends on the hashing algorithm so it took me a while to tune a matching key value that will locate on the higher address in the row table - note the I'm addressing your comments now. Will update soon. |
|
The R failures seem related. Will take a look. |
Fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one minor suggestion
| uint32_t null32_lo = | ||
| _mm256_movemask_epi8(_mm256_cvtepi32_epi64(_mm256_castsi256_si128(null32))); | ||
| uint32_t null32_hi = | ||
| _mm256_movemask_epi8(_mm256_cvtepi32_epi64(_mm256_extracti128_si256(null32, 1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably reuse Cmp32to8 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Moved Cmp32/64To8 to common header and used it here. Thank you.
|
@github-actions crossbow submit -g cpp |
|
Revision: f7df7a4 Submitted crossbow builds: ursacomputing/crossbow @ actions-6bfabdccd3 |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b818560. There were 8 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…table (#45473) ### Rationale for this change The failure reported in #45393 seems to be caused by a careless parentheses typo introduced in #45336: https://github.com/apache/arrow/blob/e32f56b478171fc4b53dc2042c4cf5d37c97e351/cpp/src/arrow/compute/row/encode_internal.cc#L281-L282 And unfortunately our `Grouper` UT doesn't have cases covering this particular code path (the questioning code path is only triggered in `Grouper` with very restrictive conditions: the row table is fixed-length, a 32-bit key is encoded after some other keys). ### What changes are included in this PR? An UT to reproduce the issue and the fix. ### Are these changes tested? UT included. ### Are there any user-facing changes? None. * GitHub Issue: #45393 Authored-by: Rossi Sun <zanmato1984@gmail.com> Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
…w-compute-row-test (#46635) ### Rationale for this change In #45336 we refined the row table buffer accessors and enforced the validation on who can call the `var_length_rows()` method. However a legacy test `CompareColumnsToRowsOver4GBFixedLength` is leveraging this accessor to assert this buffer being null. ### What changes are included in this PR? We can just check if the row table is fixed length. ### Are these changes tested? Yes. ### Are there any user-facing changes? None. * GitHub Issue: #46623 Authored-by: Rossi Sun <zanmato1984@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
#45334
What changes are included in this PR?
null_masks()->null_masks(row_id)which does overflow-safe indexing inside;is_null(row_id, col_pos)which does overflow-safe indexing and directly gets the bit of the column;data(1)->fixed_length_rows(row_id)which first asserts the row table being fixed-length, then does overflow-safe indexing inside;data(2)->var_length_rows()which only asserts the row table being var-length. It is supposed to be paired by theoffsets()(which is already 64-bit by GH-43495: [C++][Compute] Widen the row offset of the row table to 64-bit #43389 );data(0/1/2)members are made private.Are these changes tested?
Yes.
Are there any user-facing changes?
None.