Assertion `req->send.ep == ucp_ep' failed when endpoint/process terminates #8669

Closed
@pentschev

Describe the bug

An assertion failure occurs when an endpoint closes while the tcp and cuda_copy transports are active (UCX_TLS=tcp,cuda_copy) and the remote endpoint terminates. The issue does not occur with UCX_TLS=rc,cuda_copy. In the simplest reproducer, a ucx_perftest listener is launched, then the client process is launched and left running for 2-3 seconds, after which the user presses CTRL+C to terminate the client. The reverse (CTRL+C on the listener) does not seem to cause the same behavior.

This issue occurs on Dask/UCX-Py as well, but it isn't clear yet whether it happens when the endpoint is terminating or just during some transfer.

Complete output
$ UCX_TLS=tcp,cuda_copy CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda-managed -n 1000 -s 1000000000
[1666902559.828992] [dgx13:34156:0]        perftest.c:921  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 127.0.0.1:58042
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         tag match bandwidth                                                                        |
| Data layout:  (automatic)                                                                                |
| Send memory:  cuda-managed                                                                               |
| Recv memory:  cuda-managed                                                                               |
| Message size: 1000000000                                                                                 |
+----------------------------------------------------------------------------------------------------------+
[dgx13:34156:0:34156]      ucp_ep.c:3410 Assertion `req->send.ep == ucp_ep' failed
==== backtrace (tid:  34156) ====
 0  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_handle_error+0x2d4) [0x7f12e32f6e14]
 1  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_fatal_error_message+0xb8) [0x7f12e32f3ce8]
 2  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7f12e32f3dd1]
 3  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_ep_req_purge+0xb2a) [0x7f12e37c545a]
 4  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_ep_req_purge+0x307) [0x7f12e37c4c37]
 5  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_rndv_send_handle_status_from_pending+0x6b) [0x7f12e383288b]
 6  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(+0xaaeb7) [0x7f12e3829eb7]
 7  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(+0xb03f0) [0x7f12e382f3f0]
 8  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_rndv_receive+0x4a5) [0x7f12e3830e85]
 9  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_tag_rndv_process_rts+0x35a) [0x7f12e3859f1a]
10  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(+0x24e4a) [0x7f12e3557e4a]
11  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(+0x25b74) [0x7f12e3558b74]
12  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(+0x296c0) [0x7f12e355c6c0]
13  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7f12e3302cf1]
14  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(uct_tcp_iface_progress+0x90) [0x7f12e355c7b0]
15  /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7f12e37dc56a]
16  ucx_perftest(+0x8ac1d) [0x564582e22c1d]
17  ucx_perftest(+0x79332) [0x564582e11332]
18  ucx_perftest(+0xd120) [0x564582da5120]
19  ucx_perftest(+0x6edd) [0x564582d9eedd]
20  ucx_perftest(+0x6ffb) [0x564582d9effb]
21  ucx_perftest(+0x4448) [0x564582d9c448]
22  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f12e2cd6c87]
23  ucx_perftest(+0x44fa) [0x564582d9c4fa]
=================================
Aborted (core dumped)

Steps to Reproduce

  • Command line server: UCX_TLS=tcp,cuda_copy CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda-managed -n 1000 -s 1000000000
  • Command line client: UCX_TLS=tcp,cuda_copy CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda-managed -n 1000 -s 1000000000 localhost
  • UCX current master @ ecb0db9
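
The manual steps above (start the listener, start the client, wait 2-3 seconds, press CTRL+C on the client) could be automated with a small script. The following is only a sketch under assumptions: the helper name `interrupt_client_after`, the one-second startup delay, and the timeouts are hypothetical choices, not part of the original report. It sends SIGINT to the client process only, which approximates CTRL+C in a terminal.

```python
import os
import signal
import subprocess
import time

# Commands and environment taken from the reproduction steps above.
PERFTEST = ["ucx_perftest", "-t", "tag_bw", "-m", "cuda-managed",
            "-n", "1000", "-s", "1000000000"]
ENV = {"UCX_TLS": "tcp,cuda_copy", "CUDA_VISIBLE_DEVICES": "0,1"}


def interrupt_client_after(server_cmd, client_cmd, env_overrides,
                           run_seconds=3.0, startup_seconds=1.0,
                           server_timeout=30.0):
    """Start the server, then the client, and SIGINT the client.

    Returns the server's exit code, or None if the server is still
    running after `server_timeout` seconds. A negative value -N means
    the server was killed by signal N (e.g. -signal.SIGABRT for abort()).
    """
    env = dict(os.environ, **env_overrides)
    server = subprocess.Popen(server_cmd, env=env)
    client = None
    try:
        time.sleep(startup_seconds)        # assumed delay for the listener to come up
        client = subprocess.Popen(client_cmd, env=env)
        time.sleep(run_seconds)            # let the transfer run for a few seconds
        client.send_signal(signal.SIGINT)  # approximates CTRL+C on the client
        client.wait(timeout=server_timeout)
        try:
            return server.wait(timeout=server_timeout)
        except subprocess.TimeoutExpired:
            return None
    finally:
        # Clean up whatever is still running so no process is left behind.
        for proc in (client, server):
            if proc is not None and proc.poll() is None:
                proc.kill()
                proc.wait()


# Example call (requires ucx_perftest on PATH, so it is left commented out):
# rc = interrupt_client_after(PERFTEST, PERFTEST + ["localhost"], ENV)
# print("server aborted" if rc == -signal.SIGABRT else f"server exit code: {rc}")
```

Since the failing run above ends with "Aborted (core dumped)", a return value equal to `-signal.SIGABRT` would be the expected sign that the listener hit the assertion. Changing `UCX_TLS` in `ENV` to `rc,cuda_copy` allows comparison with the configuration reported as not affected.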

Setup and versions

  • DGX-1 with 8 x NVIDIA V100
  • Linux dgx13 4.15.0-189-generic #200-Ubuntu SMP Wed Jun 22 19:53:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • MOFED 5.5-1.0.3.2
  • NVIDIA driver: 510.73.08
  • CUDA 11.5
  • nv_peer_mem module loaded
