prov/efa: add signal-triggered RDM endpoint state dumper #12256

Open
shijin-aws wants to merge 1 commit into ofiwg:main from shijin-aws:state_dumper

@shijin-aws
Contributor

Add a configurable state dump facility for debugging EFA RDM hangs. When FI_EFA_STATE_DUMP_SIGNAL is set to a signal number (e.g. 12 for SIGUSR2), the provider installs a handler that dumps EP counters and per-peer state (outstanding TXE/RXE, reorder buffer, RNR backoff, overflow) to stderr on the next CQ progress call.

In debug builds, queued packet headers are additionally printed via efa_rdm_pke_print().
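
The deferral to "the next CQ progress call" suggests the usual async-signal-safe pattern: the handler only sets a flag, and the progress path does the actual printing. Below is a minimal, standalone sketch of that pattern, not the code in this PR. The names `state_dump_requested`, `state_dump_handler`, `state_dump_install_from_env`, and `state_dump_poll` are hypothetical, and the dump body is a placeholder rather than the provider's real counters.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag: set by the handler, consumed by the progress path. */
static volatile sig_atomic_t state_dump_requested = 0;

/* The handler does nothing except record the request (async-signal-safe). */
static void state_dump_handler(int signo)
{
	(void) signo;
	state_dump_requested = 1;
}

/*
 * Read FI_EFA_STATE_DUMP_SIGNAL and install the handler for that signal.
 * Returns the signal number on success, 0 if the variable is unset or
 * empty, and -1 if the value is not a usable signal number.
 */
static int state_dump_install_from_env(void)
{
	const char *val = getenv("FI_EFA_STATE_DUMP_SIGNAL");
	struct sigaction sa;
	char *end;
	long signo;

	if (!val || !*val)
		return 0;

	signo = strtol(val, &end, 10);
	if (*end != '\0' || signo <= 0 || signo > INT_MAX)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = state_dump_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;

	/* sigaction() rejects invalid or uncatchable signals for us. */
	if (sigaction((int) signo, &sa, NULL) != 0)
		return -1;

	return (int) signo;
}

/*
 * Meant to be called from the progress loop. Emits a (placeholder) dump
 * if a signal arrived since the last call. Returns 1 if it dumped, else 0.
 */
static int state_dump_poll(FILE *out)
{
	assert(out != NULL);

	if (!state_dump_requested)
		return 0;

	state_dump_requested = 0;
	fprintf(out, "state dump: EP counters and per-peer state would be printed here\n");
	return 1;
}
```

In this sketch, a hung process started with `FI_EFA_STATE_DUMP_SIGNAL=12` would print its dump the first time `state_dump_poll(stderr)` runs after `kill -12 <pid>`. Keeping the handler down to a single flag write avoids calling non-reentrant functions such as `fprintf` from signal context.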

Signed-off-by: Shi Jin <sjina@amazon.com>
@shijin-aws shijin-aws requested a review from a team May 19, 2026 00:10
@darrylabbate
Member

Currently this accepts a single signal code. Could we modify it so that FI_EFA_STATE_DUMP_SIGNAL accepts a comma-separated list of codes?

@shijin-aws
Contributor Author

> Currently this accepts a single signal code. Could we modify it so that FI_EFA_STATE_DUMP_SIGNAL accepts a comma-separated list of codes?

Currently the dump is triggered via kill -<signal_code>. Would you mind sharing how multiple signal codes would help here?

@alekswn
Contributor

alekswn commented May 22, 2026

aws:bot:retest

3 participants