Skip to content

Conversation

@zanmato1984
Copy link
Contributor

@zanmato1984 zanmato1984 commented Jan 23, 2025

Rationale for this change

#45334

What changes are included in this PR?

  1. An all-mighty test case that can effectively reveal all the bugs mentioned in the issue;
  2. Other than directly fixing the bugs (actually simply casting to 64-bit somewhere in the multiplication will do), I did some refinement to the buffer accessors of the row table, in order to eliminate more potential similar issues (which I believe do exist):
    1. null_masks() -> null_masks(row_id) which does overflow-safe indexing inside;
    2. is_null(row_id, col_pos) which does overflow-safe indexing and directly gets the bit of the column;
    3. data(1) -> fixed_length_rows(row_id) which first asserts the row table being fixed-length, then does overflow-safe indexing inside;
    4. data(2) -> var_length_rows() which only asserts the row table being var-length. It is supposed to be paired by the offsets() (which is already 64-bit by GH-43495: [C++][Compute] Widen the row offset of the row table to 64-bit #43389 );
    5. The data(0/1/2) members are made private.
  3. The AVX2 specializations are fixed individually by using 64-bit multiplication and indexing.

Are these changes tested?

Yes.

Are there any user-facing changes?

None.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984 zanmato1984 changed the title GH45334: [C++][Acero] Fix swiss join overflow issues in row offset calculation for fixed length and null masks GH-45334: [C++][Acero] Fix swiss join overflow issues in row offset calculation for fixed length and null masks Jan 23, 2025
@zanmato1984 zanmato1984 marked this pull request as draft January 23, 2025 03:18
@github-actions
Copy link

⚠️ GitHub issue #45334 has been automatically assigned in GitHub to PR creator.

@zanmato1984 zanmato1984 marked this pull request as ready for review January 23, 2025 04:04
@zanmato1984
Copy link
Contributor Author

Hi @pitrou , mind to take a look? Thanks.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 23, 2025
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new helpers increase readability a bit, thank you.

I haven't looked at the test, but otherwise looks good in principle. See comments below.

}
return NULLPTR;

inline const uint8_t* null_masks(uint32_t row_id) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that inline is implied when the method definition is inside the class declaration, there's no need to put it explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return buffers_[i]->mutable_data();
}
return NULLPTR;
inline uint8_t* null_masks(uint32_t row_id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be called mutable_null_masks, for consistency with other helper methods below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

case 1:
for (uint32_t i = 0; i < num_rows; ++i) {
col_base[i] = row_base[i * row_size];
col_base[i] = *rows.fixed_length_rows(start_row + i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... we're not adding offset_within_row anymore? Is it deliberate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this was a mistake. Addressed. Thank you for spotting this!

memcpy(target->mutable_fixed_length_rows(static_cast<uint32_t>(first_target_row_id)),
source.fixed_length_rows(/*row_id=*/0), fixed_length * num_source_rows);
} else {
// Row length must be a multiple of 64-bits due to enforced alignment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, but that's an interesting piece of info. Doesn't it blow up memory use if there is a small number of very small columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is true. But IMHO this is acceptable because we also have other auxiliary data structures to aim the hash join so I wouldn't say this is very bad.

if (!source_rows_permutation) {
memcpy(target->mutable_data(1) + fixed_length * first_target_row_id, source.data(1),
fixed_length * num_source_rows);
DCHECK_LE(first_target_row_id, std::numeric_limits<uint32_t>::max());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these size constraints already implied but not tested for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assume row id to be uint32_t (that means no more than 2^32 rows are allowed) almost everywhere, so this is implied. But there are still some places weirdly and unnecessarily using int64_t as row id, here included.

int64_t num_words_per_row = fixed_length / sizeof(uint64_t);
for (int64_t i = 0; i < num_source_rows; ++i) {
int64_t source_row_id = source_rows_permutation[i];
DCHECK_LE(source_row_id, std::numeric_limits<uint32_t>::max());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's always the case then are we wasting memory and CPU cache by having a 64-bit permutation array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are. In fact I have another WIP branch trying to clean them up.

__m256i bit_id_hi =
_mm256_mul_epi32(irow_right_hi, _mm256_set1_epi64x(null_mask_num_bytes * 8));
bit_id_lo = _mm256_add_epi64(bit_id_lo, pos_after_encoding);
bit_id_hi = _mm256_add_epi64(bit_id_hi, pos_after_encoding);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to have already seen this kind of code in other PRs, perhaps we can add helpers to make it more readable (perhaps in a avx2_internal.h header?):

inline std::pair<__m256i, __m256i> WidenInt32to64(__m256 x) {
  __m256i x_lo = _mm256_cvtepi32_epi64(_mm256_castsi256_si128(x));
  __m256i x_hi =
      _mm256_cvtepi32_epi64(_mm256_extracti128_si256(x, 1));
  return {x_lo, x_hi};
}

// Compute `x * factor + addend` in the 64-bit domain
inline __m256i MulAddInt64(__m256 x, __m256 factor, __m256 addend) {
  return _mm256_add_epi64(addend, _mm256_mul_epi32(x, factor));
}

inline __m256i MulAddInt64(__m256 x, int64_t factor, int64_t addend) {
  return MulAddInt64(x, _mm256_set1_epi64x(factor), _mm256_set1_epi64x(addend));
}

then the code above could look like:

  auto {irow_right_lo, irow_right_hi} = WidenInt32to64(irow_right);
  auto bit_id_lo = MulAddInt64(irow_right_lo, null_mask_num_bytes * 8, pos_after_encoding);
  auto bit_id_hi = MulAddInt64(irow_right_hi, null_mask_num_bytes * 8, pos_after_encoding);

(and in an ideal future, this would be migrated to xsimd :))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a lot of legacy code can take advantage of these helpers. But they might be too much to put into this PR. Can I do it in another PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the helper in the other comment though, which is independent of the ones suggested in this comment. (I think these ones are more "basic" than that one which is more row table specific.)

__m128i right_hi = _mm256_i64gather_epi32(reinterpret_cast<const int*>(null_masks),
_mm256_srli_epi64(bit_id_hi, 3), 1);
__m256i right = _mm256_set_m128i(right_hi, right_lo);
right = _mm256_and_si256(right, bit_in_right);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also perhaps a more general helper:

// Get null bits at `null_bit_id` as a vector of 32-bit ints
__m256i GetRowNullBitsInt32(const RowTableImpl& rows, uint32_t null_bit_id, __m256 row_index32);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I made the helper you suggested and put it in a common header. Other than this file, there is one more piece of code in swiss_join_avx2.cc can reuse it. Pretty nice.

I also made two helper functions Cmp32/64To8 local in compare_internal_avx2.cc that also save some LOC.

@zanmato1984
Copy link
Contributor Author

Thank you @pitrou . Please hold a bit. The CI failure seems related, meanwhile I'm still working on enhancing the added test to cover not only payload columns.

Will update and address your comment soon.

@zanmato1984
Copy link
Contributor Author

Hi @pitrou . Hopefully the CI should be fixed now. And the test has been enhanced to the extent that can also discover the overflow in key columns (the position of the key in row table depends on the hashing algorithm so it took me a while to tune a matching key value that will locate on the higher address in the row table - note the 289339070 in the test). I tested it by reverting several places of the fix in my local and the overflow did happen.

I'm addressing your comments now. Will update soon.

@zanmato1984
Copy link
Contributor Author

The R failures seem related. Will take a look.

@zanmato1984
Copy link
Contributor Author

The R failures seem related. Will take a look.

Fixed.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one minor suggestion

uint32_t null32_lo =
_mm256_movemask_epi8(_mm256_cvtepi32_epi64(_mm256_castsi256_si128(null32)));
uint32_t null32_hi =
_mm256_movemask_epi8(_mm256_cvtepi32_epi64(_mm256_extracti128_si256(null32, 1)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably reuse Cmp32to8 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Moved Cmp32/64To8 to common header and used it here. Thank you.

@zanmato1984
Copy link
Contributor Author

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: f7df7a4

Submitted crossbow builds: ursacomputing/crossbow @ actions-6bfabdccd3

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@zanmato1984 zanmato1984 merged commit b818560 into apache:main Jan 27, 2025
37 checks passed
@zanmato1984 zanmato1984 removed the awaiting committer review Awaiting committer review label Jan 27, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b818560.

There were 8 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 added a commit that referenced this pull request Feb 10, 2025
…table (#45473)

### Rationale for this change

The failure reported in #45393 seems to be caused by a careless parentheses typo introduced in #45336:
https://github.com/apache/arrow/blob/e32f56b478171fc4b53dc2042c4cf5d37c97e351/cpp/src/arrow/compute/row/encode_internal.cc#L281-L282

And unfortunately our `Grouper` UT doesn't have cases covering this particular code path (the questioning code path is only triggered in `Grouper` with very restrictive conditions: the row table is fixed-length, a 32-bit key is encoded after some other keys).

### What changes are included in this PR?

An UT to reproduce the issue and the fix.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

None.

* GitHub Issue: #45393

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
pitrou pushed a commit that referenced this pull request Jun 2, 2025
…w-compute-row-test (#46635)

### Rationale for this change

In #45336 we refined the row table buffer accessors and enforced the validation on who can call the `var_length_rows()` method. However a legacy test `CompareColumnsToRowsOver4GBFixedLength` is leveraging this accessor to assert this buffer being null.

### What changes are included in this PR?

We can just check if the row table is fixed length.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.

* GitHub Issue: #46623

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants