Commit e6b751e

Merge pull request #8966 from jsquyres/pr/fix-tcp-btl-race-condition
v4.1.x: btl tcp: Add workaround for "dropped connection" issue
2 parents 2b3f043 + e1612b0 commit e6b751e

1 file changed: +18 -0 lines changed

opal/mca/btl/tcp/btl_tcp_component.c

Lines changed: 18 additions & 0 deletions
@@ -1291,6 +1291,24 @@ mca_btl_base_module_t** mca_btl_tcp_component_init(int *num_btl_modules,
         }
     }
 
+    /* Avoid a race in wire-up when using threads (progess or user)
+       and multiple BTL modules.  The details of the race are in
+       https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032,
+       but the summary is that the lookup code in
+       component_recv_handler() below assumes that add_procs() is
+       atomic across all active TCP BTL modules, but in multi-threaded
+       code, that isn't guaranteed, because the locking is inside
+       add_procs(), and add_procs() is called once per module.  This
+       isn't a proper fix, but will solve the "dropped connection"
+       problem until we can come up with a more complete fix to how we
+       initialize procs, endpoints, and modules in the TCP BTL. */
+    if (mca_btl_tcp_component.tcp_num_btls > 1 &&
+        (enable_mpi_threads || 0 < mca_btl_tcp_progress_thread_trigger)) {
+        for( i = 0; i < mca_btl_tcp_component.tcp_num_btls; i++) {
+            mca_btl_tcp_component.tcp_btls[i]->super.btl_flags |= MCA_BTL_FLAGS_SINGLE_ADD_PROCS;
+        }
+    }
+
 #if OPAL_CUDA_SUPPORT
     mca_common_cuda_stage_one_init();
 #endif /* OPAL_CUDA_SUPPORT */
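
The guard added in this hunk can be read in isolation as follows. This is a minimal sketch, not the Open MPI code: the types `sketch_module_t`, the constant `SKETCH_FLAG_SINGLE_ADD_PROCS`, and both function names are hypothetical stand-ins, and the flag value is a placeholder rather than the real value of `MCA_BTL_FLAGS_SINGLE_ADD_PROCS`. It only illustrates the condition (more than one module, and either user threads or a progress thread) and the fact that the flag is OR-ed into every module so that existing flags are preserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for MCA_BTL_FLAGS_SINGLE_ADD_PROCS (assumption: the real
   value is defined in Open MPI's BTL headers and is not reproduced here). */
#define SKETCH_FLAG_SINGLE_ADD_PROCS 0x1u

/* Hypothetical, stripped-down module: only the flags word matters here. */
typedef struct {
    uint32_t flags;
} sketch_module_t;

/* Same shape as the condition in the patch: the workaround applies only
   with more than one module AND some form of threading (user threads or
   a progress thread, signalled by a positive trigger value). */
static bool sketch_needs_workaround(size_t num_modules,
                                    bool user_threads,
                                    int progress_thread_trigger)
{
    return num_modules > 1 &&
           (user_threads || 0 < progress_thread_trigger);
}

/* Sets the flag on every module when the condition holds.
   Returns the number of modules that were flagged. */
static size_t sketch_apply_workaround(sketch_module_t *modules,
                                      size_t num_modules,
                                      bool user_threads,
                                      int progress_thread_trigger)
{
    assert(modules != NULL || num_modules == 0);

    if (!sketch_needs_workaround(num_modules, user_threads,
                                 progress_thread_trigger)) {
        return 0;
    }
    for (size_t i = 0; i < num_modules; i++) {
        /* OR-in the flag so any bits already set are left untouched. */
        modules[i].flags |= SKETCH_FLAG_SINGLE_ADD_PROCS;
    }
    return num_modules;
}
```

In this sketch a single-module configuration is never flagged, regardless of threading, which matches the `tcp_num_btls > 1` test in the patch.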
