Hang in ROS2 Fast-RTPS during destruction [3094] #235

Closed
clalancette opened this issue Jun 22, 2018 · 9 comments

@clalancette (Contributor)

I'm still debugging this, so I don't have all of the information. Nonetheless, this problem looks like it may be in Fast-RTPS, so I'm opening this issue to get some visibility and maybe some guidance.

I'm currently debugging a failure in https://github.com/ros2/examples/blob/master/rclcpp/minimal_subscriber/not_composable.cpp; the initial report is here: ros2/examples#209. Running that code as-is causes the error message from that other issue.
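
For reference, here is a simplified sketch of the pattern in that example. This is not the exact contents of not_composable.cpp, and the rclcpp API shown (QoS depth argument, logger name, callback signature) is an assumption that may differ between ROS 2 releases:

#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("minimal_subscriber");

  // The subscription keeps a reference back into the node's internals.
  auto subscription = node->create_subscription<std_msgs::msg::String>(
    "topic", 10,
    [](const std_msgs::msg::String::SharedPtr msg) {
      RCLCPP_INFO(rclcpp::get_logger("minimal_subscriber"),
        "I heard: '%s'", msg->data.c_str());
    });

  rclcpp::spin(node);
  rclcpp::shutdown();

  // Resetting the node here (the equivalent of line 38 in the linked file)
  // destroys it before `subscription` goes out of scope at the end of main.
  node = nullptr;
  return 0;
}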

While looking at the code with @wjwwood, however, we realized that this line is probably the culprit: https://github.com/ros2/examples/blob/master/rclcpp/minimal_subscriber/not_composable.cpp#L38. That is, we force the node to be destroyed before the subscription (which gets destroyed when it goes out of scope). One easy solution is to just remove line 38. However, when I do that, the node hangs when I hit Ctrl-C. After doing some debugging in gdb, I see the following:

Thread 6 (Thread 0x7ffff14fd700 (LWP 31847)):
#0  0x00007ffff5ee49f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff4409848 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, 
    mutex=0x7ffff4409880 <eprosima::fastrtps::rtps::AsyncWriterThread::condition_variable_mutex_>, 
    cond=0x7ffff4409820 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7ffff4409820 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_>, 
    mutex=0x7ffff4409880 <eprosima::fastrtps::rtps::AsyncWriterThread::condition_variable_mutex_>) at pthread_cond_wait.c:655
#3  0x00007ffff6bca620 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff3fbff48 in eprosima::fastrtps::rtps::AsyncWriterThread::run ()
    at /home/ubuntu/ros2_ws/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/AsyncWriterThread.cpp:135
#5  0x00007ffff6bd0733 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5ede6db in start_thread (arg=0x7ffff14fd700) at pthread_create.c:463
#7  0x00007ffff662a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7fcd580 (LWP 31839)):
#0  0x00007ffff5ee4449 in futex_wait (private=<optimized out>, expected=12, 
    futex_word=0x7ffff4409844 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+36>) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=12, 
    futex_word=0x7ffff4409844 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+36>) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7ffff4409820 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_>) at pthread_cond_destroy.c:54
#3  0x00007ffff654c041 in __run_exit_handlers (status=0, listp=0x7ffff68f4718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#4  0x00007ffff654c13a in __GI_exit (status=<optimized out>) at exit.c:139
#5  0x00007ffff652ab9e in __libc_start_main (main=0x5555555665d0 <main(int, char**)>, argc=1, argv=0x7fffffff6528, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff6518) at ../csu/libc-start.c:344
#6  0x00005555555661aa in _start ()

(I've elided the rest of the threads for brevity.) It looks like what is happening is that AsyncWriterThread::run in Thread 6 is still blocked waiting to be woken up on the cv_ condition variable. Thread 1 is running the exit handlers and is trying to destroy that same condition variable, but pthread_cond_destroy waits for the remaining waiters to drain before it returns; since nothing will ever signal cv_, the shutdown deadlocks. I'm still looking into this, but any thoughts or advice are welcome. @richiware FYI.
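
To illustrate the pattern, here is a minimal, Fast-RTPS-independent sketch: a worker thread blocked in wait() on a condition variable with static storage duration while the exit handlers destroy that condition variable. The names are made up to mirror the backtrace, and the hang relies on the glibc behavior visible in Thread 1 above (pthread_cond_destroy blocking until all waiters are gone):

// Minimal sketch (not Fast-RTPS code): a static condition variable that still
// has a waiter when static destructors run at exit. The names cv_ and
// condition_variable_mutex_ are chosen to mirror the backtrace above.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

namespace {
std::mutex condition_variable_mutex_;
std::condition_variable cv_;
bool shutting_down_ = false;
}  // namespace

void run()  // stands in for AsyncWriterThread::run
{
  std::unique_lock<std::mutex> lock(condition_variable_mutex_);
  while (!shutting_down_) {
    cv_.wait(lock);  // Thread 6 sits here; nothing ever signals cv_.
  }
}

int main()
{
  std::thread writer(run);
  writer.detach();  // never joined, never woken

  // Give the worker time to reach cv_.wait() before main returns.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  // Returning runs the exit handlers, which destroy cv_ while it still has a
  // waiter. On the glibc shown in the trace above, pthread_cond_destroy then
  // blocks in a futex wait, and the process hangs just like Thread 1.
  return 0;
}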

@anfrox commented Aug 2, 2018

I see a similar problem without ROS 2 on Windows 10 x64. On Windows 7 the same executable (built from source on the respective platform) did not hang during shutdown. Even when the subscribers are removed before the publishers, removing the (one and only) participant leaves one or both AsyncWriterThread threads active (not suspended/waiting). It looks like the abort messages sent via UDP are not received while the AsyncWriterThread threads are active. Minutes later the abort messages are received, the three blocking receive threads finish, and the join method returns. I monitored the UDP communication with Wireshark, and it shows the UDP messages (length=13) arriving minutes later, right when the join returns.

@richiware (Member)

We cannot reproduce this behavior. The new release 1.7.0 comes with changes in the network transport layers. Can you check whether this behavior still occurs? Thanks.

@richiware richiware changed the title Hang in ROS2 Fast-RTPS during destruction Hang in ROS2 Fast-RTPS during destruction [3094] Dec 5, 2018
@alsora (Contributor) commented Jan 24, 2019

Is this issue solved with the ROS 2 Crystal release?

@clalancette (Contributor, Author)

Sorry, I haven't had time to test this out again yet. I'll try to find some time in the next couple of days.

@ssnover commented Jun 12, 2019

I've just tried to recreate the issue as described in ros2/examples#209 under the ROS 2 Dashing release, and the issue persists.

@MiguelCompany (Member)

@ssnover95 Our develop branch includes PR #540, which fixes a data race when destroying the participant that may be the cause of the problems you are seeing (both the hang and the crash). Could you check using the develop branch of Fast-RTPS?

@MiguelCompany (Member)

@clalancette @ssnover95 In a couple of days we are going to release 1.8.1 (see #574), which includes #569, fixing some hang cases when destroying a participant. Could you check whether this issue is fixed with those changes?

@LuisGP (Contributor) commented Sep 9, 2019

@clalancette @ssnover95 Can we close this issue?

@clalancette (Contributor, Author)

I'm going to close this out, as I haven't seen this particular issue in a while. If it comes up again, I'll re-open. Thanks.
