Skip to content

Conversation

@tdavidcl
Copy link
Member

@tdavidcl tdavidcl commented Jan 1, 2026

extracted from #1484

@gemini-code-assist
Copy link

Summary of Changes

Hello @tdavidcl, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new in-place sparse communication algorithm into the shamalgs library, providing a highly efficient method for selective data exchange between processes. It includes robust mechanisms for managing asynchronous MPI requests, such as limiting concurrent operations and handling timeouts, and supports both host and device buffers with considerations for direct GPU MPI capabilities. The changes are thoroughly validated through a new set of unit tests.

Highlights

  • New In-place Sparse Communication Algorithm: Introduces a new sparse_exchange algorithm designed for efficient, in-place data transfer between specific ranks in a distributed environment, optimizing memory usage and communication patterns.
  • Enhanced MPI Request Management: The RequestList class has been augmented with new functionalities, including spin_lock_partial_wait to limit the number of in-flight MPI requests, report_timeout for better error handling, and remain_count_no_test for more flexible status checks.
  • Comprehensive Unit Testing: A dedicated unit test file (sparse_exchange_tests.cpp) has been added, covering various communication scenarios such as self-send, ring communication, and random sender/receiver pairs, ensuring the robustness and correctness of the new algorithm.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new in-place sparse communication algorithm, which is a significant piece of functionality. The implementation includes building a communication table and managing MPI requests. The changes are well-structured and include comprehensive tests. I've found a few issues, including two critical bugs in buffer size validation that could lead to memory corruption, a couple of medium-severity logic and API design issues in the request handling and communication table construction. My detailed comments and suggestions for fixes are provided below.

Comment on lines +232 to +238
if (comm_table.send_total_size < bytebuffer_send.get_size()) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The send total size is greater than the send buffer size\n"
" send_total_size = {}, send_buffer_size = {}",
comm_table.send_total_size,
bytebuffer_send.get_size()));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The buffer size check is incorrect. It should throw an exception if comm_table.send_total_size is greater than bytebuffer_send.get_size(). The current check comm_table.send_total_size < bytebuffer_send.get_size() is for a valid condition where the buffer is larger than needed. The error message correctly describes the condition that should be checked (is greater than), but the if condition is the opposite. This is a critical bug that can lead to buffer overflows.

Suggested change
if (comm_table.send_total_size < bytebuffer_send.get_size()) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The send total size is greater than the send buffer size\n"
" send_total_size = {}, send_buffer_size = {}",
comm_table.send_total_size,
bytebuffer_send.get_size()));
}
if (comm_table.send_total_size > bytebuffer_send.get_size()) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The send total size is greater than the send buffer size\n"
" send_total_size = {}, send_buffer_size = {}",
comm_table.send_total_size,
bytebuffer_send.get_size()));
}

Comment on lines +240 to +246
if (comm_table.recv_total_size < bytebuffer_recv.get_size()) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The recv total size is greater than the recv buffer size\n"
" recv_total_size = {}, recv_buffer_size = {}",
comm_table.recv_total_size,
bytebuffer_recv.get_size()));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Similar to the send buffer check, this size check for the receive buffer is incorrect. It should throw if comm_table.recv_total_size is greater than bytebuffer_recv.get_size(). The current condition is the opposite of what's needed to prevent a buffer overflow when receiving data.

Suggested change
if (comm_table.recv_total_size < bytebuffer_recv.get_size()) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The recv total size is greater than the recv buffer size\n"
" recv_total_size = {}, recv_buffer_size = {}",
comm_table.recv_total_size,
bytebuffer_recv.get_size()));
}
if (comm_table.recv_total_size > bytebuffer_recv.get_size()) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The recv total size is greater than the recv buffer size\n"
" recv_total_size = {}, recv_buffer_size = {}",
comm_table.recv_total_size,
bytebuffer_recv.get_size()));
}

Comment on lines 65 to -66
shamcomm::mpi::Waitall(
shambase::narrow_or_throw<i32>(rqs.size()), rqs.data(), st_lst.data());
ready_count = rqs.size();
is_ready.assign(rqs.size(), true);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

After shamcomm::mpi::Waitall completes, all MPI requests are finished. The internal state of RequestList (ready_count and is_ready) should be updated to reflect this. By removing these lines, other member functions like remain_count_no_test() or all_ready() will return incorrect information if called after wait_all(). This makes the class state inconsistent and can lead to subtle bugs if this class is reused elsewhere. Please restore this state update.

Comment on lines +88 to +120
void spin_lock_partial_wait(size_t max_in_flight, f64 timeout, f64 print_freq) {

if (rqs.size() < max_in_flight) {
return;
}

f64 last_print_time = 0;
size_t in_flight = remain_count();

if (in_flight < max_in_flight) {
return;
}

shambase::Timer twait;
twait.start();
do {
twait.end();
if (twait.elasped_sec() > timeout) {
report_timeout();
}

if (twait.elasped_sec() - last_print_time > print_freq) {
logger::warn_ln(
"SparseComm",
"too many messages in flight :",
in_flight,
"/",
max_in_flight);
last_print_time = twait.elasped_sec();
}
in_flight = remain_count();
} while (in_flight >= max_in_flight);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic in this spin lock can be simplified and made more correct. The current implementation uses a stale value of in_flight for logging because it's updated at the end of the loop. This can be misleading during debugging. Additionally, there are some redundant checks before the loop. A while loop would be cleaner, more efficient, and ensure the logged information is always up-to-date.

        void spin_lock_partial_wait(size_t max_in_flight, f64 timeout, f64 print_freq) {

            if (rqs.size() < max_in_flight) {
                return;
            }

            shambase::Timer twait;
            twait.start();
            f64 last_print_time = 0;
            size_t in_flight;

            while ((in_flight = remain_count()) >= max_in_flight) {
                twait.end();
                if (twait.elasped_sec() > timeout) {
                    report_timeout();
                }

                if (twait.elasped_sec() - last_print_time > print_freq) {
                    logger::warn_ln(
                        "SparseComm",
                        "too many messages in flight :",
                        in_flight,
                        "/",
                        max_in_flight);
                    last_print_time = twait.elasped_sec();
                }
            }
        }

Comment on lines +143 to +154
// the sender shoudl have set the offset for all messages, otherwise throw
auto expected_offset = shambase::get_check_ref(
messages_send.at(send_idx).message_bytebuf_offset_send);

// check that the send offset match for good measure
if (message_info.message_bytebuf_offset_send != expected_offset) {
throw shambase::make_except_with_loc<std::invalid_argument>(shambase::format(
"The sender has not set the offset for all messages, otherwise throw\n"
" expected_offset = {}, actual_offset = {}",
expected_offset,
message_info.message_bytebuf_offset_send.value()));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check verifies that the message_bytebuf_offset_send provided in the input messages_send matches the offset calculated within this function. This makes the API fragile, as it requires the caller to pre-calculate and provide correct send offsets. A more robust design would be for this function to be solely responsible for calculating the offsets, ignoring any that might be provided in the input. This would simplify the caller's responsibility to only providing message sizes, senders, and receivers. Consider removing this check and making the function calculate the offsets authoritatively.

Additionally, there's a typo "shoudl" in the comment on line 143, and the error message on line 150 is unclear.

@github-actions
Copy link

github-actions bot commented Jan 1, 2026

Thanks @tdavidcl for opening this PR!

You can do multiple things directly here:
1 - Comment pre-commit.ci run to run pre-commit checks.
2 - Comment pre-commit.ci autofix to apply fixes.
3 - Add label autofix.ci to fix authorship & pre-commit for every commit made.
4 - Add label light-ci to only trigger a reduced & faster version of the CI (need the full one before merge).
5 - Add label trigger-ci to create an empty commit to trigger the CI.

Once the workflow completes a message will appear displaying informations related to the run.

Also the PR gets automatically reviewed by gemini, you can:
1 - Comment /gemini review to trigger a review
2 - Comment /gemini summary for a summary
3 - Tag it using @gemini-code-assist either in the PR or in review comments on files

@github-actions
Copy link

github-actions bot commented Jan 5, 2026

Workflow report

workflow report corresponding to commit 1174f0a
Commiter email is 41898282+github-actions[bot]@users.noreply.github.com
You are using github private e-mail. This prevent proper tracing of who contributed what, please disable it (see Keep my email addresses private).
GitHub page artifact URL GitHub page artifact link (can expire)

Pre-commit check report

Pre-commit check: ✅

trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check for merge conflicts................................................Passed
check that executables have shebangs.....................................Passed
check that scripts with shebangs are executable..........................Passed
check for added large files..............................................Passed
check for case conflicts.................................................Passed
check for broken symlinks................................................Passed
check yaml...............................................................Passed
detect private key.......................................................Passed
No-tabs checker..........................................................Passed
Tabs remover.............................................................Passed
Validate GitHub Workflows................................................Passed
clang-format.............................................................Passed
black....................................................................Passed
ruff check...............................................................Passed
Check doxygen headers....................................................Passed
Check license headers....................................................Passed
Check #pragma once.......................................................Passed
Check SYCL #include......................................................Passed
No ssh in git submodules remote..........................................Passed

Test pipeline can run.

Clang-tidy diff report

No relevant changes found.
Well done!

You should now go back to your normal life and enjoy a hopefully sunny day while waiting for the review.

Doxygen diff with main

Removed warnings : 10
New warnings : 23
Warnings count : 7622 → 7635 (0.2%)

Detailed changes :
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:25: warning: Compound shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:27: warning: Compound shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:33: warning: Member new_request() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:35: warning: Member new_request() (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:39: warning: Member size() const (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:40: warning: Member is_event_ready(size_t i) const (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:41: warning: Member requests() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:41: warning: Member size() const (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:42: warning: Member is_event_ready(size_t i) const (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:43: warning: Member requests() (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:43: warning: Member test_ready() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:45: warning: Member test_ready() (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:56: warning: Member all_ready() const (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:58: warning: Member all_ready() const (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:58: warning: Member wait_all() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:60: warning: Member wait_all() (function) of class shamalgs::collective::RequestList is not documented.
- src/shamalgs/include/shamalgs/collective/RequestList.hpp:69: warning: Member remain_count() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:69: warning: Member remain_count_no_test() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:71: warning: Member remain_count() (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:76: warning: Member report_timeout() const (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/RequestList.hpp:88: warning: Member spin_lock_partial_wait(size_t max_in_flight, f64 timeout, f64 print_freq) (function) of class shamalgs::collective::RequestList is not documented.
+ src/shamalgs/include/shamalgs/collective/sparse_exchange.hpp:25: warning: Compound shamalgs::collective::CommMessageInfo is not documented.
+ src/shamalgs/include/shamalgs/collective/sparse_exchange.hpp:36: warning: Compound shamalgs::collective::CommTable is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:177: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, const u8 *bytebuffer_send, u8 *bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:224: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, sham::DeviceBuffer< u8, target > &bytebuffer_send, sham::DeviceBuffer< u8, target > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:224: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, sham::DeviceBuffer< u8, target > &bytebuffer_send, sham::DeviceBuffer< u8, target > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:268: warning: Member sparse_exchange< sham::device >(std::shared_ptr< sham::DeviceScheduler > dev_sched, sham::DeviceBuffer< u8, sham::device > &bytebuffer_send, sham::DeviceBuffer< u8, sham::device > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:274: warning: Member sparse_exchange< sham::host >(std::shared_ptr< sham::DeviceScheduler > dev_sched, sham::DeviceBuffer< u8, sham::host > &bytebuffer_send, sham::DeviceBuffer< u8, sham::host > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:32: warning: Member unpack(u64_2 comm_info) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:56: warning: Member build_sparse_exchange_table(const std::vector< CommMessageInfo > &messages_send) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:56: warning: Member build_sparse_exchange_table(const std::vector< CommMessageInfo > &messages_send) (function) of namespace shamalgs::collective is not documented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant