-
Notifications
You must be signed in to change notification settings - Fork 3.9k
Description
Describe the enhancement requested
We’ve seen several reports about the hash join not working for large inputs (e.g., #34474, #37655, and #36995). The reason turns out to be that the row table (the hash table for the hash join) uses uint32_t to represent the row offset within the row data buffer, effectively preventing the row data from exceeding 4GB.
What makes things worse is that, when this limitation is exceeded, users can barely workaround it by regular methods like "splitting the input into smaller batches," which works for many other issues. Because the row table accumulates all the input data, smaller batches do not change the overall data size.
There are also some other aspects:
- 32-bit row offset also makes the row table implementation error-prone. For example, GH-34474: [C++] Detect and raise an error if a join will need too much key data #35087, GH-41813: [C++] Fix avx2 gather offset larger than 2GB in
CompareColumnsToRows#42188, and GH-43202: [C++][Compute] Detect and explicit error for offset overflow in row table #43226 are fixes to address or detect certain edge cases related to the row offset. Even for the fixed-length code path, which doesn’t deal with the offset buffer at all and thus is supposed to be less problematic, there are obvious offset overflow issues like [1] and [2] (these issues are currently unreported but observed in my local experiments). - We are going to support large-offset data types for the row table eventually, as requested in [C++] Add hash-join support for large-offset types (e.g. large_string) and dictionary encoded types to the new hash-join impl #31622.
Therefore, we should consider widening the row offset of the row table to 64-bit.
[1]
| uint32_t offset_right = irow_right * fixed_length + offset_within_row; |
[2]
arrow/cpp/src/arrow/compute/row/compare_internal_avx2.cc
Lines 243 to 244 in 187197c
| __m256i offset_right = | |
| _mm256_mullo_epi32(irow_right, _mm256_set1_epi32(fixed_length)); |
Component(s)
C++