Conversation

Contributor

@zanmato1984 zanmato1984 commented Jul 23, 2024

Rationale for this change

The row table uses uint32_t for the row offset within the row data buffer, effectively capping the total row data at 4GB. This is quite restrictive, and the impact is described in more detail in #43495. This PR proposes widening the row offset from 32-bit to 64-bit to address this limitation.
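
To make the change concrete, here is a minimal sketch of the affected layout (the names are invented for illustration; this is not the actual RowTableImpl):

```cpp
#include <cstdint>
#include <vector>

// offsets_[i] is where row i starts inside row_data_. With uint32_t offsets, the
// encoded row data cannot grow past 4GB; widening the element type to uint64_t
// removes that cap at the cost of 4 extra bytes per row.
struct RowTableSketch {
  std::vector<uint8_t> row_data_;   // concatenated encoded rows
  std::vector<uint64_t> offsets_;   // effectively uint32_t elements before this PR
};
```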

Benefits

Currently, the row table has three major limitations:

  1. The overall data size cannot exceed 4GB.
  2. The size of a single row cannot exceed 4GB.
  3. The number of rows cannot exceed 2^32.

This enhancement eliminates the first limitation. The second and third limitations are much less likely to be hit in practice. Thus, this change enables a significant range of use cases that are currently unsupported.

Overhead

Of course, this will introduce some overhead:

  1. An extra 4 bytes of memory consumption per row, due to widening the offset from 32-bit to 64-bit (a worked example follows this list).
  2. A wider offset type requires a few more SIMD instructions in each 8-row processing iteration.
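
For scale, with 100 million rows the wider offsets add roughly 400 MB of extra offset storage (100,000,000 rows × 4 bytes).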

In my opinion, this overhead is justified by the benefits listed above.

What changes are included in this PR?

The row offset of the row table is changed from 32-bit to 64-bit. Related code in row comparison/encoding and the swiss join has been updated accordingly.

Are these changes tested?

Tests are included.

Are there any user-facing changes?

Users could potentially see higher memory consumption when using Acero's hash join and hash aggregation. On the other hand, certain use cases that used to fail are now able to complete.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

Contributor Author

Just found that the functions in this file are not used anywhere. I updated them anyway - maybe I'll clean them up or make them useful in the future.

Member

Hmm, do you mean the AVX2 speedups are actually not wired anywhere?

Contributor Author

Right, they are not, as far as I can tell from searching the code.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label on Jul 24, 2024
@zanmato1984 zanmato1984 changed the title from "EXPERIMENT: [C++] Widen the row offset of row table to 64-bit" to "EXPERIMENT: [C++][Compute] Widen the row offset of row table to 64-bit" on Jul 24, 2024
@zanmato1984
Contributor Author

@ursabot please benchmark

@ursabot

ursabot commented Jul 29, 2024

Benchmark runs are scheduled for commit a789e04e964dacf02e76df3cfd48a59024843fce. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a789e04e964dacf02e76df3cfd48a59024843fce.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details.

@zanmato1984 zanmato1984 changed the title from "EXPERIMENT: [C++][Compute] Widen the row offset of row table to 64-bit" to "GH-43495: [C++][Compute] Widen the row offset of row table to 64-bit" on Jul 31, 2024
@zanmato1984 zanmato1984 changed the title from "GH-43495: [C++][Compute] Widen the row offset of row table to 64-bit" to "GH-43495: [C++][Compute] Widen the row offset of the row table to 64-bit" on Jul 31, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review July 31, 2024 09:43
@zanmato1984 zanmato1984 requested a review from westonpace as a code owner July 31, 2024 09:43
@zanmato1984
Contributor Author

The ASAN failure is caused by #43414. The others seem unrelated.

Contributor Author

Note that this assertion would fire for the 32-bit offset, because there are unreported offset overflow issues in the fixed-length code path, as described in #43495 and quoted below:

Even for the fixed-length code path, which doesn’t deal with the offset buffer at all and thus is supposed to be less problematic, there are obvious offset overflow issues like [1] and [2] (these issues are currently unreported but observed in my local experiments).
[1]

    uint32_t offset_right = irow_right * fixed_length + offset_within_row;

[2]

    __m256i offset_right =
        _mm256_mullo_epi32(irow_right, _mm256_set1_epi32(fixed_length));
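
For reference, the scalar overflow in [1] can be avoided by widening before the multiplication. A minimal sketch (names taken from the snippet above; not necessarily the exact fix applied in the code base):

```cpp
#include <cstdint>

// Widen the row id to 64-bit before multiplying, so that
// irow_right * fixed_length cannot wrap around at 4GB.
inline uint64_t FixedLengthRowOffset(uint32_t irow_right, uint32_t fixed_length,
                                     uint32_t offset_within_row) {
  return static_cast<uint64_t>(irow_right) * fixed_length + offset_within_row;
}
```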

@github-actions

⚠️ GitHub issue #43495 has been automatically assigned in GitHub to PR creator.

@felipecrv felipecrv self-requested a review July 31, 2024 17:10
@zanmato1984 zanmato1984 force-pushed the row-table-large-offset branch from 0e3a361 to 2da4b07 on August 3, 2024 08:49

row_ids_out.data(), columns, row_table,
/*are_cols_in_encoding_order=*/true,
/*out_match_bitvector_maybe_null=*/NULLPTR);
ASSERT_EQ(num_rows_no_match, 0);
Member

@pitrou pitrou Aug 15, 2024

So, this means that the rows and columns compared equal, right? I'm trying to understand the API...

Contributor Author

Right. It returns the rows that are not equal to the columns.

/*out_num_rows=*/NULLPTR, /*out_sel_left_maybe_same=*/NULLPTR, columns, row_table,
/*are_cols_in_encoding_order=*/true, match_bitvector.data());
ASSERT_EQ(CountSetBits(match_bitvector.data(), 0, num_rows_to_compare),
num_rows_to_compare);
Member

Hmm, the docstring says that CompareColumnsToRows returns "a single 16-bit selection vector of rows that failed comparison". If rows are equal to columns, then surely all bits should be 0, not 1? Am I misunderstanding something?

Contributor Author

The doc string is misleading. It actually returns the result in either of two ways: a selection vector of the non-matching row ids (as the doc string says), or a match bit vector (1 for match and 0 otherwise, which the doc string does not mention), depending on whether the row-id output parameter is null. This code checks the result the second way.

I can update the doc string to make it more clear.
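
For anyone else puzzled by the same thing, here is a toy model of the two output modes (a hedged sketch with invented names and a much simpler signature than the real CompareColumnsToRows):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// If out_row_ids is non-null, write the ids of the rows that did NOT match and
// count them in *out_num_rows; otherwise set bit i of out_match_bitvector for
// every row i that matches. The caller zero-initializes the outputs.
void CompareToyRows(const std::vector<int>& rows, const std::vector<int>& expected,
                    uint32_t* out_num_rows, uint16_t* out_row_ids,
                    uint8_t* out_match_bitvector) {
  for (size_t i = 0; i < rows.size(); ++i) {
    const bool match = rows[i] == expected[i];
    if (out_row_ids != nullptr) {
      if (!match) out_row_ids[(*out_num_rows)++] = static_cast<uint16_t>(i);
    } else if (match) {
      out_match_bitvector[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}
```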

Member

What a weird API. Yes, updating the doc string would be nice :-)

Contributor Author

Updated.

@pitrou
Member

pitrou commented Aug 15, 2024

CI failures look unrelated, FTR.

@zanmato1984
Contributor Author

@github-actions crossbow submit -g cpp

@github-actions

Revision: ded3e67

Submitted crossbow builds: ursacomputing/crossbow @ actions-c44c9b7d7c

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

Contributor

@felipecrv felipecrv left a comment

LGTM.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label on Aug 16, 2024
@vkhodygo

Fab, a lot of people will be happy with this one.

A few questions here:

  • there was a lengthy discussion (and a document) about larger-than-memory datasets (GH-31769: [C++][Acero] Add spilling for hash join #13669; [C++] Support hash-join on larger than memory datasets #31769) - will there be any progress in this direction? Now that the limit is gone, I expect an influx of reports about code crashing solely because of the dataset size.

  • An extra 4 bytes of memory consumption for each row due to the offset size difference from 32-bit to 64-bit.
    A wider offset type requires a few more SIMD instructions in each 8-row processing iteration.

    Do you think it's possible to add some heuristics and/or an explicit key to keep the old behaviour for reasonably small datasets?

@zanmato1984
Contributor Author

Thanks for the feedback!

There is unfortunately no update or plan about that, AFAIK. If by "crashed code solely because of the data size" you mean the operating system's OOM kill, then yes :)

  • An extra 4 bytes of memory consumption for each row due to the offset size difference from 32-bit to 64-bit.
    A wider offset type requires a few more SIMD instructions in each 8-row processing iteration.

    Do you think it's possible to add some heuristics and/or an explicit key to keep the old behaviour for reasonably small datasets?

Great question. I've thought about that too and had a short discussion with other developers in the community meeting. My reasons for not doing it are: 1) code complexity: a lot of code assumes a concrete offset type, especially in the SIMD-specialized functions, and generalizing it is not trivial; 2) runtime overhead: since the data size is usually unknown before the computation starts, the best we could do is use the 32-bit offset until a certain limit (4GB) is reached as the data accumulates, and then switch to the 64-bit offset - at that point a large memory copy from the 32-bit offset buffer to the 64-bit offset buffer would happen. Hence the decision.
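
To illustrate the runtime-overhead point, the widening step that such a hybrid scheme would need might look like this (a sketch under the assumption that offsets live in plain vectors; the real row table stores them in Arrow buffers):

```cpp
#include <cstdint>
#include <vector>

// Once the accumulated row data approaches the 4GB limit, every 32-bit offset
// written so far would have to be copied into a newly allocated 64-bit buffer.
std::vector<uint64_t> WidenOffsets(const std::vector<uint32_t>& offsets32) {
  return std::vector<uint64_t>(offsets32.begin(), offsets32.end());  // O(n) copy
}
```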

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5e68513.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 40 possible false positives for unstable benchmarks that are known to sometimes produce them.

@kou
Member

kou commented Sep 13, 2024

@github-actions crossbow submit test-r-rstudio-r-base-4.1-opensuse155

@github-actions

Revision: ded3e67

Submitted crossbow builds: ursacomputing/crossbow @ actions-947c28ffed

Task Status
test-r-rstudio-r-base-4.1-opensuse155 Azure

zanmato1984 added a commit that referenced this pull request Dec 25, 2024
### Rationale for this change

A piece of comment was wrongly introduced in #43389.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
zanmato1984 added a commit that referenced this pull request Jan 27, 2025
…alculation for fixed length and null masks (#45336)

### Rationale for this change

#45334 

### What changes are included in this PR?

1. An all-mighty test case that can effectively reveal all the bugs mentioned in the issue;
2. Rather than directly fixing the bugs (actually, simply casting to 64-bit somewhere in the multiplication would do), I refined the buffer accessors of the row table in order to eliminate more potential similar issues (which I believe do exist; a sketch of this pattern follows the commit message):
    1. `null_masks()` -> `null_masks(row_id)`, which does overflow-safe indexing inside;
    2. `is_null(row_id, col_pos)`, which does overflow-safe indexing and directly gets the bit of the column;
    3. `data(1)` -> `fixed_length_rows(row_id)`, which first asserts that the row table is fixed-length, then does overflow-safe indexing inside;
    4. `data(2)` -> `var_length_rows()`, which only asserts that the row table is var-length. It is supposed to be paired with `offsets()` (which is already 64-bit as of #43389);
    5. The `data(0/1/2)` members are made private.
3. The AVX2 specializations are fixed individually by using 64-bit multiplication and indexing.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.

* GitHub Issue: #45334

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
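
As an illustration of the overflow-safe accessor pattern described in the commit message above, here is a minimal sketch (the function names mirror the commit message, but the signatures and layout details are assumptions, not the actual Arrow code):

```cpp
#include <cstdint>

// Widen the row id to 64-bit before multiplying, so that indexing into the
// fixed-length row data cannot overflow 32 bits.
inline const uint8_t* fixed_length_rows(const uint8_t* rows_base, uint32_t row_id,
                                        uint32_t row_length) {
  return rows_base + static_cast<uint64_t>(row_id) * row_length;
}

// Overflow-safe lookup of the null bit for (row_id, col_pos).
inline bool is_null(const uint8_t* null_masks, uint32_t row_id, uint32_t col_pos,
                    uint32_t null_mask_bytes_per_row) {
  const uint64_t bit_id =
      static_cast<uint64_t>(row_id) * null_mask_bytes_per_row * 8 + col_pos;
  return (null_masks[bit_id / 8] >> (bit_id % 8)) & 1;
}
```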