prov/efa: detect and cancel in-flight ops on MR close via generation counter#12287
prov/efa: detect and cancel in-flight ops on MR close via generation counter#12287talavr-amazon wants to merge 11 commits into
Conversation
Introduce an ofi_bufpool that will back struct efa_mr allocations. The pool is created in efa_domain_open and destroyed in efa_domain_close. This commit only wires up the pool. No callers allocate from it yet, so behavior is unchanged. A subsequent patch will switch efa_mr_regattr and efa_mr_close to use the pool. The pool backs struct efa_mr so that stale desc pointers captured by in-flight ops remain dereferenceable for the lifetime of the domain; this is groundwork for a generation-based invalidation check that will be added on top. Signed-off-by: Tal Avraham <talavr@amazon.com>
Switch efa_mr_regattr and efa_mr_close to source struct efa_mr instances from efa_domain->mr_pool instead of calloc/free. Slot allocations and frees are serialized by util_domain.lock, which is already used elsewhere for MR-related state. Signed-off-by: Tal Avraham <talavr@amazon.com>
Add two uint32_t fields to struct efa_mr:
- gen: monotonic counter bumped on every close. Preserved across
bufpool slot reuse via efa_mr_reset, which leaves gen untouched.
- lkey: cached copy of ibv_mr->lkey, captured at registration.
Lets the data path read efa_mr->lkey directly without
dereferencing ibv_mr, which can race to NULL with a concurrent
close from another thread. The cached value is not cleared on
close; the gen check is what guards against using a stale lkey.
A subsequent patch will capture both values on an ope at dispatch
and compare them on the data path so in-flight ops targeting a
freed MR can be canceled rather than follow a stale pointer into a
reused slot.
This commit only wires up the fields. Nothing reads them yet, so
behavior is unchanged apart from a new unit test that verifies the
gen advances across a close/reopen cycle on a recycled slot.
Signed-off-by: Tal Avraham <talavr@amazon.com>
Extend efa_mr_pool_create/destroy to also create an mr_rdm_pool backing struct efa_rdm_mr instances. The pool is only created on RDM domains (EFA_INFO_TYPE_IS_RDM(info)); efa-direct and dgram domains skip it. This commit only wires up the pool. No callers allocate from it yet, so behavior is unchanged. A subsequent patch will switch the RDM MR alloc/free sites to use it. Same invariant as mr_pool: struct efa_rdm_mr slot addresses are preserved across the domain lifetime so the generation counter embedded in efa_mr can reliably detect slot reuse. Signed-off-by: Tal Avraham <talavr@amazon.com>
Switch efa_rdm_mr_regattr, efa_rdm_mr_cache_regv (non-cache branch), and efa_rdm_mr_close to source struct efa_rdm_mr instances from efa_domain->mr_rdm_pool instead of calloc/free. Pool alloc/free is serialized by util_domain.lock, matching the pattern used elsewhere for MR state. Introduce efa_rdm_mr_reset to initialize every field of a newly allocated struct efa_rdm_mr except for the embedded efa_mr.gen, which must be preserved across bufpool slot reuse. Signed-off-by: Tal Avraham <talavr@amazon.com>
Bump efa_mr.gen at the points where an MR's identity becomes invalid,
so in-flight ops that captured a stale desc can be detected and
dropped on the data path:
- efa_mr_dereg_impl: the single point that performs ibv_dereg_mr.
Covers every destruction path (efa_mr_close, efa_rdm_mr_close,
efa_rdm_mr_cache_entry_dereg) and reg_impl rollback paths. The
rollback bumps are redundant but harmless. The bump runs after
dereg completes so a gen tick reflects "destruction is done"
rather than "destruction is starting." The data path uses a
cached efa_mr->lkey rather than dereferencing ibv_mr, so
bumping after dereg is safe even with concurrent readers.
- efa_rdm_mr_cache_close: catches the case where the cache retains
the entry on LRU after fi_close. From the user's perspective the
MR is invalid even though no dereg fires.
Tests: rename test_efa_mr_gen_bumps_on_close to
test_efa_direct_mr_gen_bumps_on_close (efa-direct path), add
test_efa_rdm_mr_gen_bumps_on_close (RDM path, uses two buffers to
avoid cache hits).
Signed-off-by: Tal Avraham <talavr@amazon.com>
Add desc_gen[] and desc_lkey[] arrays to struct efa_direct_ope. When direct_ope->desc[] is populated in efa_direct_ope_alloc, snapshot each desc's efa_mr->gen and efa_mr->lkey into these arrays. desc[] holds struct efa_mr * (cast to void *), one per iov entry, since each iov can reference a different MR with a different lkey. A subsequent patch will compare the captured gen against the live efa_mr->gen on the data path and use the captured lkey to construct WRs, so the data path never dereferences ibv_mr in the window between the gen check and lkey use. This commit only wires up the capture. Nothing reads the captured state yet, so behavior is unchanged. Signed-off-by: Tal Avraham <talavr@amazon.com>
Add desc_gen[] and desc_lkey[] arrays to struct efa_rdm_ope. Introduce efa_rdm_ope_capture_desc() to snapshot each desc's efa_mr->gen and efa_mr->lkey, and call it from every site that populates desc[]: - efa_rdm_txe_construct (send/write/read/atomic dispatch) - efa_rdm_ope_try_fill_desc (late-bound provider-registered MRs) - the inject-RMA path that rebinds desc[0] - efa_rdm_srx_update_rxe_with_peer_rxe (fi_recv via SRX) A subsequent patch will compare the captured gen against the live efa_mr->gen on the data path and use the captured lkey to construct WRs, so the data path never dereferences ibv_mr in the window between the gen check and lkey use. This commit only wires up the capture. Nothing reads the captured state yet, so behavior is unchanged. Signed-off-by: Tal Avraham <talavr@amazon.com>
Replace efa_mr->ibv_mr->lkey with efa_mr->lkey at all four sites where the efa-direct path builds WR scatter-gather lists. The cached lkey was set at registration and is never cleared, so a concurrent fi_close that NULLs ibv_mr can no longer cause a NULL deref. If a close races with WR submission, the WR goes out with a stale lkey and the NIC returns a work-completion error, which the existing error handler surfaces to the application. No gen check is needed here because efa-direct posts WRs synchronously (no queue between dispatch and post); the NIC catches stale lkeys within microseconds. Signed-off-by: Tal Avraham <talavr@amazon.com>
Eliminate all ibv_mr->lkey dereferences in the RDM data path by switching to the cached efa_mr->lkey field. This prevents NULL pointer dereferences when a concurrent fi_close clears ibv_mr. Add efa_rdm_mr_gen_check_ope() which compares the live efa_mr->gen against the snapshot captured at dispatch. Call it at every entry point where a queued or in-flight operation is posted to the NIC: - efa_rdm_ope_post_send (all ope posts, fresh or reposted) - efa_rdm_ope_repost_ope_queued_before_handshake - efa_rdm_ope_post_read - efa_rdm_ep_post_queued_pkts (RNR-retransmit write path) On gen mismatch the operation is failed with -FI_ECANCELED, surfacing a clean CQ error to the application instead of a crash. For provider-internal MRs (pkt_entry->mr, payload_mr, pke_vec->mr) the switch to ->lkey is safe because these MRs are endpoint-owned and never subject to user-driven close. Signed-off-by: Tal Avraham <talavr@amazon.com>
Test that efa_rdm_mr_gen_check_ope correctly detects a closed MR. Simulates the queued-before-handshake scenario: dispatch captures desc state, app closes the MR while the ope is queued, and the gen check returns -FI_ECANCELED on the repost path. Signed-off-by: Tal Avraham <talavr@amazon.com>
| int ret; | ||
|
|
||
| /* Only efa-direct and dgram domains allocate struct efa_mr. */ | ||
| if (EFA_INFO_TYPE_IS_RDM(info)) |
There was a problem hiding this comment.
I think we want this change for efa-direct (buffpool yes, gencount no) and efa-proto. efa-direct uses struct efa_mr and efa-proto uses struct efa_rdm_mr. I do not understand how this is working
| (void *)mr_attr->mr_iov->iov_base, | ||
| mr_attr->mr_iov->iov_len, access); | ||
| } | ||
There was a problem hiding this comment.
nit: spacing in new commit (multiple places in this file)
| enum fi_hmem_iface iface; | ||
| }; | ||
|
|
||
| /** |
There was a problem hiding this comment.
nit: We add doyxegen comments to the impl, not the declaration. I guess we are semi-inconsistent... but up to you
| return -FI_ENOMEM; | ||
| } | ||
| memset(efa_mr, 0, sizeof(*efa_mr)); | ||
| efa_mr_reset(efa_mr); |
There was a problem hiding this comment.
Its silly to add something and immediately change it in the next patch... Should just drag this helper up one patch earlier.
| @@ -19,7 +19,7 @@ | |||
| struct dlist_entry g_efa_domain_list; | |||
There was a problem hiding this comment.
I recommend squashing commit 1 with this commit.
| * counter detects slot reuse. | ||
| */ | ||
| struct ofi_bufpool *mr_pool; | ||
| union { |
There was a problem hiding this comment.
NIT: I think this is kind of silly when both mr_pool and mr_rdm_pool are the same type... recommend removing union (and comment in efa_mr_pool_destroy() related to union).
| @@ -122,7 +122,7 @@ static inline ssize_t efa_post_recv(struct efa_base_ep *base_ep, const struct fi | |||
| goto out_err; | |||
| } | |||
| efa_mr = (struct efa_mr *)msg->desc[i]; | |||
There was a problem hiding this comment.
missing zero byte bounce buffer... this is a strict optimization commit to avoid a pointer chase, update the commit message to be correct).
| * that captured a stale desc can detect the invalidation | ||
| * before dereferencing ibv_mr. | ||
| */ | ||
| uint32_t gen; |
There was a problem hiding this comment.
We do not want gen on efa-direct's MR b/c we never use them for efa-direct, it is efa-proto only. Lkey can still be here to avoid a pointer chase b/c this structure is 64 bytes with lkey.
| * avoid dereferencing ibv_mr. | ||
| */ | ||
| uint32_t desc_gen[EFA_RDM_IOV_LIMIT]; | ||
| uint32_t desc_lkey[EFA_RDM_IOV_LIMIT]; |
| remote_buf = rma_context_pkt->remote_buf; | ||
| remote_key = rma_context_pkt->remote_key; | ||
|
|
||
| assert(((struct efa_mr *)desc)->ibv_mr); |
There was a problem hiding this comment.
When we remove asserts... we should have comments as to why in the commit
| @@ -753,7 +753,10 @@ ssize_t efa_rdm_ep_post_queued_pkts(struct efa_rdm_ep *ep, | |||
| switch (efa_rdm_pkt_type_of(pkt_entry)) { | |||
There was a problem hiding this comment.
You should split this commit to do:
- Eliminate all ibv_mr->lkey dereferences in the RDM data path by
switching to the cached efa_mr->lkey field.
a. Can pack the assert removal in here if you explain it in the commit message - if (!efa_rdm_mr_gen_check_ope(pkt_entry->ope))
a-szegel
left a comment
There was a problem hiding this comment.
ooops, accidentally clicked approve instead of request changes :(
When an application closes a memory region while operations referencing it are still queued (e.g., waiting for a peer handshake or RNR retransmit), the provider could crash by dereferencing a NULL ibv_mr pointer. This series fixes the problem by allocating MR structs (efa_mr and efa_rdm_mr) from per-domain bufpools so that slot addresses remain stable across close/reopen cycles, then adding a generation counter and a cached lkey to each MR. The counter is bumped on every close, and the cached lkey (set at registration, never cleared) allows the data path to read the lkey without dereferencing ibv_mr, which can be NULL after a concurrent close from another thread. At dispatch time, each operation entry snapshots the current gen and lkey of its MR descriptors. On the data path, before posting a WR to the NIC, the provider compares the live gen against the snapshot — on mismatch the operation is failed with -FI_ECANCELED and a clean CQ error is delivered to the application. All ibv_mr->lkey dereferences are replaced with reads of the cached field, eliminating NULL-deref crashes even if a close races with WR submission on a different thread.