The zero engine used in HiCR is applied on top of latest master #52

KADichev · 2024-12-19T20:38:24Z

No description provided.

anyzelman

Incomplete review on my side. A couple of main points:

this MR adds primitives to the core API that should either be encapsulated in a standard lpf_sync or moved to an LPF extension header (as a consequence, some other engines would not need to be modified);
the reframe and CI additions seem quite site-specific and are therefore better kept out of upstream; and
I'm not sure whether the new collective should be part of this MR.

Also, at times I'm not sure whether I'm looking at an up-to-date MR (e.g., lpf_abort is defined in core.h here, whereas I thought it had already been moved to an extension).

src/hybrid/state.hpp

anyzelman · 2025-02-03T22:00:49Z

Argh... most of my review details seem to have been lost -.-

…quests in a queue pair), not just relying on the device information about max_qp_wr, but actually trying to create QPs via ibv_create_qp with different max_send_wr values until we find the largest number that still works (via binary search). This becomes the updated m_maxSrs. Independently, the 100K-element test manyPuts needs to be downgraded to 5K for our cluster: our count is just over 10K, but 10K itself does not work either (not sure why?)
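The binary search described in this commit message could look roughly like the sketch below. It is an illustration, not the code in this MR: `largestWorking` and `probe` are hypothetical names, and `probe` stands in for an attempt to create a QP via ibv_create_qp with the given max_send_wr (destroying it again on success). The sketch assumes the probe is monotone, i.e. if a value works, every smaller value works too.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Returns the largest v in [lo, hi] for which probe(v) succeeds, or 0 if
// no value in the range succeeds.
// Assumption: probe is monotone (success at v implies success below v).
// In the setting above, probe would wrap an ibv_create_qp attempt with
// cap.max_send_wr = v; here it is a caller-supplied stand-in so that the
// sketch runs without InfiniBand hardware.
inline std::uint32_t largestWorking(
        std::uint32_t lo, std::uint32_t hi,
        const std::function<bool(std::uint32_t)>& probe) {
    assert(lo >= 1 && lo <= hi);
    std::uint32_t best = 0;
    while (lo <= hi) {
        const std::uint32_t mid = lo + (hi - lo) / 2;
        if (probe(mid)) {
            best = mid;            // mid works, look for something larger
            if (mid == hi) break;  // nothing larger left to try
            lo = mid + 1;
        } else {
            hi = mid - 1;          // mid fails; safe because mid >= lo >= 1
        }
    }
    return best;
}
```

The upper bound would naturally be the device-reported max_qp_wr, so the search needs only about log2(max_qp_wr) QP creation attempts.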


…slot. This is currently done via the imm_data field, which carries the memory slot ID of the destination, set at the sender before the RDMA write. After a poll finds that a message has been received, the imm_data entry is read and used as a key into a hash table whose value is the number of receives (incremented at each receive for that key). The lookup at the receiver is then just a lookup in this hash table. There is currently a problem around line 840 of mesgqueue.cpp, where the destination ID is being reset to zero; this needs to be solved. Trying to resolve conflicts between the old addition of the get received message count and the new abort functionality for tests. For now, removing the get received functionality, because I am not really convinced we need it.
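A minimal sketch of the per-slot receive counter described here, assuming the slot ID fits into the 32-bit immediate value. `RcvdMsgCounter`, `onReceive` and `count` are hypothetical names for illustration, not identifiers from this MR.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using SlotID = std::uint32_t;

// Counts received messages per destination memory slot.
class RcvdMsgCounter {
public:
    // Called once per polled receive completion. immData is the immediate
    // value set by the sender, i.e. the destination slot ID. With real
    // verbs the value arrives in network byte order and would need
    // converting first; this sketch assumes host order.
    void onReceive(std::uint32_t immData) { ++m_counts[immData]; }

    // Pure lookup without polling; slots that never received report 0.
    std::size_t count(SlotID slot) const {
        const auto it = m_counts.find(slot);
        return it == m_counts.end() ? 0 : it->second;
    }

private:
    std::unordered_map<SlotID, std::size_t> m_counts;
};
```

Keeping `count` free of side effects means a caller has to drive progress (polling) separately before reading the counters.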

…ut directly calls IBVerbs put, and LPF sync only waits on the local completion of IBVerbs put (by polling until the message has been sent; there is no confirmation that the message has been received). I still keep one barrier in IBVerbs::sync for synchronicity, but this barrier should be removed in the future.

…within LPF as I need it. 2) Add get_rcvd_msg_cnt_per_slot besides the more general get_rcvd_msg_cnt, as the counts should be per memory slot. 3) Add a flush_send_sync function, which checks only on sender side that messages are not just posted, but also polled for. But I think this functionality is probably going away again.

…n call with expected sent and expected received messages as parameters. The tagged synchronization call without expected sent and expected received messages is not implemented yet. More testing needed on tagged sync.

…rk and is used by HiCR's fence(tag,key,sent_msgs,recvd_msgs) call. The tagged sync, which relies on syncPerSlot, is currently not finalized. This version only waits on the locally outstanding sends/receives for the slot, which does not mean any synchronization with other peers.
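The local-only waiting described here could be sketched as follows, under the assumption that a sync call drives a progress routine until the per-slot counters reach the expected values. `SlotCounters`, `countingSync` and the `progress` callback are hypothetical names; the actual syncPerSlot logic in this MR may differ.

```cpp
#include <cstddef>
#include <functional>

// Locally observed message counts for one memory slot.
struct SlotCounters {
    std::size_t sent = 0;
    std::size_t rcvd = 0;
};

// Runs progress passes until the local counters reach the expected values
// and returns the number of passes made. Only local completions are
// observed, so returning says nothing about the state of remote peers.
inline std::size_t countingSync(
        SlotCounters& c,
        std::size_t expectedSent, std::size_t expectedRcvd,
        const std::function<void(SlotCounters&)>& progress) {
    std::size_t passes = 0;
    do {
        progress(c);  // stand-in for polling the completion queues
        ++passes;
    } while (c.sent < expectedSent || c.rcvd < expectedRcvd);
    return passes;
}
```

With both expectations set to zero, the loop makes exactly one progress pass and returns, so in this sketch such a call behaves as a non-blocking progress call.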

anyzelman · 2025-03-18T10:36:35Z

Refactoring and code review are done at this stage. TODOs:

run functests with and without zero-cost
look at zero post-install checks
determine whether the functests need to be expanded (probably yes)

Once the above boxes are ticked, I will do a final code review.

KADichev · 2025-03-28T16:21:55Z

The latest refactoring has introduced issues, e.g. #58 and #59

KADichev requested a review from anyzelman December 20, 2024 07:00

KADichev linked an issue Jan 17, 2025 that may be closed by this pull request

zero engine fails post-install tests #53

Closed

anyzelman requested changes Feb 3, 2025

src/hybrid/state.hpp (outdated)

This was referenced Feb 12, 2025

Lpf collectives enhancement #54

Open

Lpf mutex extension #55

Open

KADichev and others added 24 commits February 12, 2025 18:18

Separate the ibv_post_send and ibv_poll_cq into different functions, …

c5965c4

…so that these could be assigned to different LPF functions (e.g., trigger send early by moving ibv_post_send calls into IBVerbs::put

Extended LPF to expose lpf_get_rcvd_msg_count function. Also halfway …

97de831

…(hopefully) through integrating BSC changes to enable both local and remote completion queues, which is key if we want to read the number of messages received or posted.

ibv_post_recv in new version fails at reconnectQPs

f2f6800

This version completes with HiCR, but still does not register ANY rec…

8899406

…eived events.

Very importantly, remove sleeps in the progress engine, as this leads…

b74af3d

… us to notice new reads/writes too late.

Change IBVerbs::put to accept an original slot ID and the possibly mo…

19fb996

…dified slot ID if edge buffer is used. The original slot ID is then only used as a key for hashtable with key = slot ID and value = number of received messages

Clean up a bit

12e09e4

Minor cleanup

1da6961

For now, bring back the allreduce for a) resize b) abort into sync, a…

b03a5a5

…s without (b), finalization crashes. But in the near future, both of these will be removed from the sync for efficiency reasons.

This commit removes the check on abort from the sync call altogether,…

07136eb

… as this leads to additional data being allreduced in each sync. When the user issues runtime.abort(), the allreduce call is still made to check if everyone has called the abort.

This commit removes the exchange of resize memreg/messages via allred…

ec36eb7

…uce in sync. This is tricky though -- it means all parties synchronously call resize themselves, otherwise a deadlock might occur?

Add the lpf_flush function to LPF, which makes sure for IB verbs that…

55cc751

… all messages queued to be sent (via ibv_post_send) are sent out (via ibv_poll_cq). This is a requirement from the HiCR Channels library
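The flush semantics described in this commit (every posted send must also have been polled for) can be illustrated with a small mock. `MockSendQueue` is a hypothetical stand-in for a send queue and its completion queue; it does not call the verbs API and does not reflect the actual lpf_flush implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a send queue plus its completion queue.
class MockSendQueue {
public:
    // Analogue of ibv_post_send: the request is queued, not yet confirmed.
    void post() { ++m_posted; }

    // Analogue of ibv_poll_cq: reaps at most maxEntries completions and
    // returns how many were reaped.
    std::size_t poll(std::size_t maxEntries) {
        const std::size_t reaped = std::min(maxEntries, outstanding());
        m_completed += reaped;
        assert(m_completed <= m_posted);
        return reaped;
    }

    // Number of posted sends without a local completion yet.
    std::size_t outstanding() const { return m_posted - m_completed; }

    // Flush: keep polling until every posted send has a local completion.
    void flush() {
        while (outstanding() > 0) {
            (void) poll(16);
        }
    }

private:
    std::size_t m_posted = 0;
    std::size_t m_completed = 0;
};
```

In the mock every poll makes progress; against real hardware a poll may return zero completions, in which case the loop simply spins until completions arrive.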

Update CMakeLists.txt

8f624a6

Comment the post-install scripts as they fail running stuff for this branch.

Remove debug msg

96fb88b

Start work on compare and swap

1d5d3ae

The attributes retry_cnt and rnr_retry were set to 6 and 0 for develo…

a111949

…pment. Now set to 7 / 7 for infinite polling, if needed.

Make lookup of message counters pure lookup, no polling. This is tric…

a52233c

…ky for HiCR, which then needs to do sync explicitly before checking these counters.

Some very early documentation of the extensions in lpf/core.h, used i…

731cde7

…n HiCR

anyzelman added 4 commits March 18, 2025 10:41

Code review imp.core.c

70478d2

Code review zero.cpp, pass I

a46e3ab

Code review: dead code removal

411b2cc

Code review zero.cpp pass II

30fb284

anyzelman self-requested a review March 18, 2025 10:36

KADichev and others added 9 commits March 19, 2025 14:38

Bring in gitlab ci and reframe config. Probably needs fixing to run t…

bd2a097

…he pipeline

Fix token

657a559

Try to match with the existing gitlab runners

bb00b73

Revert x86 tag, go back to slurm tag

0151020

Yet another tag. Tired of this crap

bfcf7f3

bootstrap.sh script requires non-interactive agreement string after f…

57eff9a

…unctests option, bring this back into the bootstrap get_and_build.sh script for CI. Also, during merge it seems add_gtest_mpi has been used, but this command does not exist anymore - use add_gtest

Zero engine API is different now from the IBVerbs API. ibverbs.t.cpp …

e3f3e8b

…cannot serve as a common basis for zero and ibverbs. Therefore, we need separate zero.t.cpp tests for zero engine now.

Fix lost CMake change for zero-engine unit tests

39d01d5

Try to fix CI by increasing DISCOVERY_TIMEOUT

9d95a9b

KADichev and others added 14 commits March 31, 2025 20:10

Fix for #58 and #59

a296bfb

Merge ../LPF-gitlab2 into zero_engine_MR

d3139cf

Include correct zero.h header

16a9fdf

Revert "Include correct zero.h header"

f0a256a

This reverts commit 16a9fdf.

Include zero.h LPF core API extension, so that the functions in MPI/c…

3cda2e3

…ore.cpp are compiled as non-mangled C symbols

Add log message for the dynamic tag reallocation, as it probably will…

2125535

… eventually be removed

Changes towards tag-based implementation. Before, some attributes wer…

8451e44

…e quietly ignored, now they are being passed on to the MPI/zero.cpp and used

Develop-stage barrier removed

f5d2621

More sensible debug output

40dacc3

The vector of opcodes in doLocalProgress is populated but never used.…

fc00f6b

… This takes away some time during execution. Remove

A bug fix in countingSyncPerSlot (don't ask for tagActive if tag is n…

6f81f6e

…ot valid). Also, explicitly allow the scenario passing invalid tag + 0 expected received + 0 expected sent to be a non-blocking progress call. Also, some slightly improved logging at places

Partial rollback 6f81f6e

96dc752

Remove trailing spaces

d707143

Closes issue #63

1985894
