Hang in ROS2 Fast-RTPS during destruction [3094] #235

Closed
clalancette opened this issue Jun 22, 2018 · 9 comments

@clalancette (Contributor)

I'm still debugging this, so I don't have all of the information. Nonetheless, this problem looks like it may be in Fast-RTPS, so I'm opening this issue to get some visibility and maybe some guidance.

I'm currently debugging a failure in https://github.com/ros2/examples/blob/master/rclcpp/minimal_subscriber/not_composable.cpp; the initial report is here: ros2/examples#209. Running that code as-is causes the error message from that other issue.
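
For reference, here is a simplified sketch of the pattern in that example. This is not the exact contents of not_composable.cpp, and the rclcpp API shown (QoS depth argument, logger name, callback signature) is an assumption that may differ between ROS 2 releases:

#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("minimal_subscriber");

  // The subscription keeps a reference back into the node's internals.
  auto subscription = node->create_subscription<std_msgs::msg::String>(
    "topic", 10,
    [](const std_msgs::msg::String::SharedPtr msg) {
      RCLCPP_INFO(rclcpp::get_logger("minimal_subscriber"),
        "I heard: '%s'", msg->data.c_str());
    });

  rclcpp::spin(node);
  rclcpp::shutdown();

  // Resetting the node here (the equivalent of line 38 in the linked file)
  // destroys it before `subscription` goes out of scope at the end of main.
  node = nullptr;
  return 0;
}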

While looking at the code with @wjwwood, however, we realized that this line is probably the culprit: https://github.com/ros2/examples/blob/master/rclcpp/minimal_subscriber/not_composable.cpp#L38. That is, we force the node to be destroyed before the subscription (which gets destroyed when it goes out of scope). One easy solution is to just remove line 38. However, when I do that, the node hangs when I hit Ctrl-C. After doing some debugging in gdb, I see the following:

Thread 6 (Thread 0x7ffff14fd700 (LWP 31847)):
#0  0x00007ffff5ee49f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff4409848 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, 
    mutex=0x7ffff4409880 <eprosima::fastrtps::rtps::AsyncWriterThread::condition_variable_mutex_>, 
    cond=0x7ffff4409820 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7ffff4409820 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_>, 
    mutex=0x7ffff4409880 <eprosima::fastrtps::rtps::AsyncWriterThread::condition_variable_mutex_>) at pthread_cond_wait.c:655
#3  0x00007ffff6bca620 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff3fbff48 in eprosima::fastrtps::rtps::AsyncWriterThread::run ()
    at /home/ubuntu/ros2_ws/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/AsyncWriterThread.cpp:135
#5  0x00007ffff6bd0733 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5ede6db in start_thread (arg=0x7ffff14fd700) at pthread_create.c:463
#7  0x00007ffff662a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7fcd580 (LWP 31839)):
#0  0x00007ffff5ee4449 in futex_wait (private=<optimized out>, expected=12, 
    futex_word=0x7ffff4409844 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+36>) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=12, 
    futex_word=0x7ffff4409844 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+36>) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7ffff4409820 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_>) at pthread_cond_destroy.c:54
#3  0x00007ffff654c041 in __run_exit_handlers (status=0, listp=0x7ffff68f4718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#4  0x00007ffff654c13a in __GI_exit (status=<optimized out>) at exit.c:139
#5  0x00007ffff652ab9e in __libc_start_main (main=0x5555555665d0 <main(int, char**)>, argc=1, argv=0x7fffffff6528, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff6518) at ../csu/libc-start.c:344
#6  0x00005555555661aa in _start ()

(I've elided the rest of the threads for brevity.) It looks like what is happening is that AsyncWriterThread::run in Thread 6 is still blocked waiting to be woken up on the cv_ condition variable. Thread 1 is running the exit handlers and is trying to destroy that same condition variable, but pthread_cond_destroy waits for the remaining waiters to drain before it returns; since nothing will ever signal cv_, the shutdown deadlocks. I'm still looking into this, but any thoughts or advice are welcome. @richiware FYI.
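
To illustrate the pattern, here is a minimal, Fast-RTPS-independent sketch: a worker thread blocked in wait() on a condition variable with static storage duration while the exit handlers destroy that condition variable. The names are made up to mirror the backtrace, and the hang relies on the glibc behavior visible in Thread 1 above (pthread_cond_destroy blocking until all waiters are gone):

// Minimal sketch (not Fast-RTPS code): a static condition variable that still
// has a waiter when static destructors run at exit. The names cv_ and
// condition_variable_mutex_ are chosen to mirror the backtrace above.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

namespace {
std::mutex condition_variable_mutex_;
std::condition_variable cv_;
bool shutting_down_ = false;
}  // namespace

void run()  // stands in for AsyncWriterThread::run
{
  std::unique_lock<std::mutex> lock(condition_variable_mutex_);
  while (!shutting_down_) {
    cv_.wait(lock);  // Thread 6 sits here; nothing ever signals cv_.
  }
}

int main()
{
  std::thread writer(run);
  writer.detach();  // never joined, never woken

  // Give the worker time to reach cv_.wait() before main returns.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  // Returning runs the exit handlers, which destroy cv_ while it still has a
  // waiter. On the glibc shown in the trace above, pthread_cond_destroy then
  // blocks in a futex wait, and the process hangs just like Thread 1.
  return 0;
}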

@anfrox commented Aug 2, 2018

I see a similar problem without ROS 2 on Windows 10 x64. On Windows 7 the same executable (built from source on the respective platform) did not hang during shutdown. Even when the subscribers are removed before the publishers, removing the (one and only) participant leaves one or both AsyncWriterThread threads active (not suspended/waiting). It looks like the abort messages sent via UDP are not received while the AsyncWriterThread threads are active. Minutes later the abort messages are received, the three blocking receive threads finish, and the join method returns. I monitored the UDP communication with Wireshark, and it shows the UDP messages (length=13) arriving minutes later, right when the join returns.

@richiware (Member)

We cannot reproduce this behavior. The new release 1.7.0 comes with changes in the network transport layers. Can you check whether this behavior still occurs? Thanks.

@richiware richiware changed the title Hang in ROS2 Fast-RTPS during destruction Hang in ROS2 Fast-RTPS during destruction [3094] Dec 5, 2018
@alsora (Contributor) commented Jan 24, 2019

Is this issue solved with the ROS 2 Crystal release?

@clalancette (Contributor, Author)

Sorry, I haven't had time to test this out again yet. I'll try to find some time in the next couple of days.

@ssnover commented Jun 12, 2019

I've just tried to recreate the issue as described in ros2/examples#209 under the ROS 2 Dashing release, and the issue persists.

@MiguelCompany (Member)

@ssnover95 Our develop branch includes PR #540, which fixes a data race when destroying the participant that may be the cause of the problems you are seeing (both the hang and the crash). Could you check using the develop branch of Fast-RTPS?

@MiguelCompany (Member)

@clalancette @ssnover95 In a couple of days we are going to release 1.8.1 (see #574), which includes #569, fixing some hang cases when destroying a participant. Could you check whether this issue is fixed with those changes?

@LuisGP (Contributor) commented Sep 9, 2019

@clalancette @ssnover95 Can we close this issue?

@clalancette (Contributor, Author)

I'm going to close this out, as I haven't seen this particular issue in a while. If it comes up again, I'll re-open. Thanks.
