Closed
Description
Describe the bug
An assertion failure occurs when the remote endpoint terminates while the tcp and cuda_copy transports are active (UCX_TLS=tcp,cuda_copy). The issue does not occur with UCX_TLS=rc,cuda_copy. In the simplest reproducer, a ucx_perftest listener is launched, then the client process is launched and left running for 2-3 seconds, after which the user presses CTRL+C to terminate the client. The reverse (CTRL+C on the listener) does not seem to cause the same behavior.
This issue occurs on Dask/UCX-Py as well, but it isn't clear yet whether it happens when the endpoint is terminating or just during some transfer.
Complete output
$ UCX_TLS=tcp,cuda_copy CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda-managed -n 1000 -s 1000000000
[1666902559.828992] [dgx13:34156:0] perftest.c:921 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 127.0.0.1:58042
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda-managed |
| Recv memory: cuda-managed |
| Message size: 1000000000 |
+----------------------------------------------------------------------------------------------------------+
[dgx13:34156:0:34156] ucp_ep.c:3410 Assertion `req->send.ep == ucp_ep' failed
==== backtrace (tid: 34156) ====
0 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_handle_error+0x2d4) [0x7f12e32f6e14]
1 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_fatal_error_message+0xb8) [0x7f12e32f3ce8]
2 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7f12e32f3dd1]
3 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_ep_req_purge+0xb2a) [0x7f12e37c545a]
4 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_ep_req_purge+0x307) [0x7f12e37c4c37]
5 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_rndv_send_handle_status_from_pending+0x6b) [0x7f12e383288b]
6 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(+0xaaeb7) [0x7f12e3829eb7]
7 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(+0xb03f0) [0x7f12e382f3f0]
8 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_rndv_receive+0x4a5) [0x7f12e3830e85]
9 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_tag_rndv_process_rts+0x35a) [0x7f12e3859f1a]
10 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(+0x24e4a) [0x7f12e3557e4a]
11 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(+0x25b74) [0x7f12e3558b74]
12 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(+0x296c0) [0x7f12e355c6c0]
13 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7f12e3302cf1]
14 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libuct.so.0(uct_tcp_iface_progress+0x90) [0x7f12e355c7b0]
15 /datasets/pentschev/miniconda3/envs/rn-221026/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7f12e37dc56a]
16 ucx_perftest(+0x8ac1d) [0x564582e22c1d]
17 ucx_perftest(+0x79332) [0x564582e11332]
18 ucx_perftest(+0xd120) [0x564582da5120]
19 ucx_perftest(+0x6edd) [0x564582d9eedd]
20 ucx_perftest(+0x6ffb) [0x564582d9effb]
21 ucx_perftest(+0x4448) [0x564582d9c448]
22 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f12e2cd6c87]
23 ucx_perftest(+0x44fa) [0x564582d9c4fa]
=================================
Aborted (core dumped)
Steps to Reproduce
- Command line server:
UCX_TLS=tcp,cuda_copy CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda-managed -n 1000 -s 1000000000
- Command line client:
UCX_TLS=tcp,cuda_copy CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda-managed -n 1000 -s 1000000000 localhost
- UCX current master @ ecb0db9
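The manual steps above could be automated roughly as follows. This is only a sketch, not part of the original report: the helper names (perftest_env, perftest_cmd, reproduce), the one-second startup delay, and the 30-second wait for the listener are assumptions made for illustration. It also assumes ucx_perftest is on PATH and that sending SIGINT to the client process is equivalent to pressing CTRL+C in its terminal.

```python
import os
import signal
import subprocess
import time

# Arguments copied from the server/client command lines in the steps above.
PERFTEST_ARGS = ["-t", "tag_bw", "-m", "cuda-managed", "-n", "1000", "-s", "1000000000"]


def perftest_env(tls="tcp,cuda_copy", devices="0,1"):
    """Return a copy of the current environment with UCX_TLS and CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ)
    env["UCX_TLS"] = tls
    env["CUDA_VISIBLE_DEVICES"] = devices
    return env


def perftest_cmd(server_host=None):
    """Build the ucx_perftest argv; passing a host name makes this the client side."""
    cmd = ["ucx_perftest", *PERFTEST_ARGS]
    if server_host is not None:
        cmd.append(server_host)
    return cmd


def reproduce(tls="tcp,cuda_copy", startup_delay=1.0, run_time=3.0):
    """Start the listener, run the client briefly, then interrupt the client."""
    env = perftest_env(tls)
    server = subprocess.Popen(perftest_cmd(), env=env)
    time.sleep(startup_delay)  # assumed delay so the listener is ready to accept
    client = subprocess.Popen(perftest_cmd("localhost"), env=env)
    time.sleep(run_time)       # the report lets the client run for 2-3 seconds
    client.send_signal(signal.SIGINT)  # stands in for CTRL+C on the client
    client.wait()
    try:
        # Assumed upper bound on how long the listener takes to exit or abort.
        return server.wait(timeout=30)
    except subprocess.TimeoutExpired:
        server.kill()
        server.wait()
        return None
```

Calling reproduce() returns the listener's exit status, or None if the listener had to be killed after the timeout. On POSIX, Python reports a process killed by a signal as a negative return code, so an aborted listener would be expected to show up as -6 (SIGABRT). Calling reproduce(tls="rc,cuda_copy") runs the same sequence with the transport selection that the report says does not trigger the assertion.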
Setup and versions
- DGX-1 with 8 x NVIDIA V100
- Linux dgx13 4.15.0-189-generic #200-Ubuntu SMP Wed Jun 22 19:53:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- MOFED 5.5-1.0.3.2
- NVIDIA driver: 510.73.08
- CUDA 11.5
- nv_peer_mem module loaded