Skip to content

prov/efa: Fix efa_rx_pkts_held counter drift#12247

Open
gjamesli2126 wants to merge 3 commits into
ofiwg:mainfrom
gjamesli2126:fix-efa_rdm_ep-efa_rx_pkts_held
Open

prov/efa: Fix efa_rx_pkts_held counter drift#12247
gjamesli2126 wants to merge 3 commits into
ofiwg:mainfrom
gjamesli2126:fix-efa_rdm_ep-efa_rx_pkts_held

Conversation

@gjamesli2126
Copy link
Copy Markdown
Contributor

@gjamesli2126 gjamesli2126 commented May 15, 2026

Original Issue:

efa_rdm_ep->efa_rx_pkts_held drifted with FI_EFA_RX_COPY_UNEXP=0 / FI_EFA_RX_COPY_OOO=0.

Fixed with: Track efa_rx_pkts_held with a per-pkt flag

  • Funnel held++/-- through EFA_RDM_PKE_HELD_BY_PROGRESS flag.
  • Add missing held++ in pke_get_unexp() (rx_copy_unexp=0).
  • Add missing held++ in pke_get_ooo_pke() (rx_copy_ooo=0).

Fixed with: Release held rx pkt on local-read teardown

Local-read txe holds pkt only via txe->local_read_pkt_entry; if torn down before read completes, pkt was orphaned.

  • txe_release(): release pkt if non-NULL.
  • txe_construct(): NULL-init.
  • pke_handle_rma_read_completion() success path: NULL the field to avoid double-free.

Validation

Debug build on p5en with FI_PROVIDER=efa FI_EFA_RX_COPY_UNEXP=0 FI_EFA_RX_COPY_OOO=0.
Oracle: assert(held + posted + to_post == rx_pool_size) at efa_rdm_ep_utils.c:963.

Test Path Pre-fix Post-fix
fi_unexpected_msg unexpected-recv abort@963 clean
fi_rdm_tagged_bw OOO + long-msg abort@963 clean
fi_rdm_tagged_bw -D cuda CUDA local-read abort@963 clean

Fix double-free in local-read unit test

txe_release now frees the held pkt on error path, so the test must not release it again.

  • Happy path: NULL txe->local_read_pkt_entry after manual release; teardown's txe_release would otherwise double-free.
  • Unhappy path: drop the manual release entirely (txe_release already handles it).
  • Both paths: verify efa_rx_pkts_held / efa_rx_pkts_to_post deltas + held flag.

Out of scope/TODO: queued blocking copy mid-batch leak

flush_queued_blocking_copy_to_hmem() early-return leaks pkts of other still-alive rxes.
Naive release double-frees with rxe error path.

Copy link
Copy Markdown
Contributor

@sunkuamzn sunkuamzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check CI failures

Comment thread prov/efa/src/rdm/efa_rdm_ope.c Outdated
@sunkuamzn sunkuamzn requested a review from a team May 15, 2026 19:12
@gjamesli2126 gjamesli2126 force-pushed the fix-efa_rdm_ep-efa_rx_pkts_held branch 2 times, most recently from dda8ff8 to 1f201cd Compare May 15, 2026 21:40
@gjamesli2126 gjamesli2126 requested a review from sunkuamzn May 16, 2026 02:31
Comment thread prov/efa/src/rdm/efa_rdm_ope.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_ope.c Outdated
Comment thread prov/efa/test/efa_unit_test_ope.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_pke_nonreq.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_ope.c Outdated
@gjamesli2126 gjamesli2126 force-pushed the fix-efa_rdm_ep-efa_rx_pkts_held branch from 1f201cd to 731dd31 Compare May 21, 2026 00:34
Comment thread prov/efa/src/rdm/efa_rdm_pke_nonreq.c Outdated
@gjamesli2126 gjamesli2126 force-pushed the fix-efa_rdm_ep-efa_rx_pkts_held branch from 731dd31 to 548d5fe Compare May 21, 2026 16:48
shijin-aws
shijin-aws previously approved these changes May 21, 2026
@gjamesli2126 gjamesli2126 requested a review from sunkuamzn May 21, 2026 17:27
@sunkuamzn
Copy link
Copy Markdown
Contributor

bot:aws:retest

2 similar comments
@gjamesli2126
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

@gjamesli2126
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

Comment thread prov/efa/src/rdm/efa_rdm_pke_nonreq.c Outdated
shijin-aws
shijin-aws previously approved these changes May 22, 2026
Comment thread prov/efa/src/rdm/efa_rdm_ope.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_pke.c Outdated
Replace open-coded efa_rx_pkts_held++/-- with EFA_RDM_PKE_HELD_BY_PROGRESS:
- mark_held() sets the flag and does counter++.
- In release_rx(), when it sees the flag, does counter--.
- Every hold pairs with exactly one release through a single funnel.

Fixes two missing counter++ cases the old scheme missed:
- efa_rdm_pke_get_unexp() with rx_copy_unexp=0: rx-pool pkt is parked in the unexp queue (no clone to the unexp pool).
- efa_rdm_pke_get_ooo_pke() with rx_copy_ooo=0: rx-pool pkt is parked in the recv window (no clone to the OOO pool).
The rx-pool pkt is now occupying a waiting-room slot, so the counter must reflect that. The old code didn't.

Signed-off-by: James Li <jameslli@amazon.com>
In efa_rdm_pke_handle_rma_read_completion's local-read branch, NULL
txe->local_read_pkt_entry after efa_rdm_pke_handle_data_copied()
takes ownership. Also NULL-init the field in txe_construct.

Signed-off-by: James Li <jameslli@amazon.com>
The happy-path test simulated post-completion cleanup manually,
hiding the behavior of efa_rdm_pke_handle_rma_read_completion.
Drive efa_rdm_pke_handle_rma_completion instead, so a regression
in the handler would actually fail the test.

In the unhappy path, mirror the caller's release of the held rx pkt
to keep counter assertions balanced.

Signed-off-by: James Li <jameslli@amazon.com>
@gjamesli2126
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants