Skip to content

Conversation

@KADichev
Copy link
Collaborator

No description provided.

@KADichev KADichev requested a review from anyzelman December 20, 2024 07:00
@KADichev KADichev linked an issue Jan 17, 2025 that may be closed by this pull request
Copy link
Member

@anyzelman anyzelman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete review my side -- couple of main things:

  1. this MR adds primitives to the core API that should either be encapsulated in a standard lpf_sync OR moved to an LPF extension header (and, as a consequence, some other engines need not be modified);
  2. the reframe & CI additions seem quite site-specific and therefore better hidden from upstream; and
  3. I'm not sure if the new collective should be part of this MR?

Also, at times I'm not sure if I'm looking at an up-to-date MR (e.g. due to lpf_abort here being defined in core.h where I thought this was moved to an extension already?)

@anyzelman
Copy link
Member

argh... most of my review details seems lost -.-

This was referenced Feb 12, 2025
KADichev and others added 24 commits February 12, 2025 18:18
…quests in a queue pair), not just relying on device information about max_qp_wr, but actually trying to create QPs via ibv_create_qp with different max_send_wr until we find the largest still working number (via binary search). This becomes the updated m_maxSrs. Independently, the 100K element test manyPuts needs to be downgraded to 5K for our cluster, as our count is just over 10K, but actually 10K does not work as well (not sure why?)
…so that these could be assigned to different LPF functions (e.g., trigger send early by moving ibv_post_send calls into IBVerbs::put
…(hopefully) through integrating BSC changes to enable both local and remote completion queues, which is key if we want to read the number of messages received or posted.
…slot. This is currently done via imm_data field which carries the memory slot ID of the destination at the sender before it is RDMA written. After a poll finds that a message has been received, the imm_data entry is being read and used as a key for a hash table, where the value is the number of receives (being incremented at each receive at the right key). The lookup at the receiver is then just a lookup of this hash table. There is currently a problem in lines around 840 of mesgqueue.cpp, where the destination ID is being reset to zero. This needs to be solved.

Trying to resolve conflicts between old addition of get received message
count and new abort functionality for tests. For now, removing the get
received functionality, because I am not really convinced we need it.
…dified slot ID if edge buffer is used. The original slot ID is then only used as a key for hashtable with key = slot ID and value = number of received messages
…ut directly calls IBVerbs put, and LPF sync only waits on the local completion of IBVerbs put (via polling that the message has been sent -- but no confirmation exists the message has been received). I still keep one barrier in the IBVerbs::sync for synchronicity, but this barrier should be removed in the future.
…within LPF as I need it. 2) Add get_rcvd_msg_cnt_per_slot besides the more general get_rcvd_msg_cnt, as the counts should be per memory slot. 3) Add a flush_send_sync function, which checks only on sender side that messages are not just posted, but also polled for. But I think this functionality is probably going away again.
…s without (b), finalization crashes. But in the near future, both of these will be removed from the sync for efficiency reasons.
… as this leads to additional data being allreduced in each sync. When the user issues runtime.abort(), the allreduce call is still made to check if everyone has called the abort.
…uce in sync. This is tricky though -- it means all parties synchronously call resize themselves, otherwise a deadlock might occur?
… all messages queued to be sent (via ibv_post_send) are sent out (via ibv_poll_cq). This is a requirement from the HiCR Channels library
Comment the post-install scripts as they fail running stuff for this branch.
…n call with expected sent and expected received messages as parameters. The tagged synchronization call without expected sent and expected received messages is not implemented yet. More testing needed on tagged sync.
…rk and is used by HiCR's fence(tag,key,sent_msgs,recvd_msgs) call. The tagged sync, which relies on syncPerSlot, is currently not finalized. This version only waits on the locally outstanding sends/receives for the slot, which does not mean any synchronization with other peers.
…pment. Now set to 7 / 7 for infinite polling, if needed.
…ky for HiCR, which then needs to do sync explicitly before checking these counters.
@anyzelman
Copy link
Member

anyzelman commented Mar 18, 2025

Refactoring and code review is done at this stage. TODOs:

  • run functests with and without zero-cost
  • look at zero post-install checks
  • determine if need to expand functests (probably yes)

After above boxes are ticked will do a final code review.

@anyzelman anyzelman self-requested a review March 18, 2025 10:36
KADichev and others added 9 commits March 19, 2025 14:38
…unctests option, bring this back into the boostrap get_and_build.sh script for CI. Also, during merge it seems add_gtest_mpi has been used, but this command does not exist anymore - use add_gtest
…cannot server as a common basis for zero and ibverbs. Therefore, we need separate zero.t.cpp tests for zero engine now.
@KADichev
Copy link
Collaborator Author

The latest refactoring has introduced issues, e.g. #58 and #59

KADichev and others added 14 commits March 31, 2025 20:10
…ore.cpp are compiled as non-mangled C symbols
…e quietly ignored, now they are being passed on to the MPI/zero.cpp and used
… This takes away some time during execution. Remove
…ot valid). Also, explicitly allow the scenario passing invalid tag + 0 expected received + 0 expected sent to be a non-blocking progress call. Also, some slightly improved logging at places
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

zero engine fails post-install tests

3 participants