
executor.spin_once crashes after destroy_subscription called in separate thread #255

Closed
brawner opened this issue Dec 19, 2018 · 7 comments · Fixed by #318
Labels
bug Something isn't working

Comments

brawner (Contributor) commented Dec 19, 2018

This is a follow-up to ros-visualization/rqt#182. Destroying subscriptions is not thread-safe. Per this comment in ros-visualization/rqt#181, I suppose rmw is not designed to be thread-safe. It would be very helpful for overall stability if there were a thread-management solution in the client libraries.

Bug report

Required Info:

  • Operating System:
    • MacOS 10.13.6
  • Installation type:
    • Source
  • Version or commit hash:
  • DDS implementation:
    • Default (Fast-RTPS)

Steps to reproduce issue

A minimal example is available here: https://github.com/brawner/crashing_subscription_destroyer

> python3 publisher.py &
> python3 subscriber.py

Change the code in subscriber.py to see an example that does not crash:

if __name__ == '__main__':
    dont_segfault()
    #segfault()

Expected behavior

The node continues to spin and the subscription is cleanly destroyed.

Actual behavior

The process segfaults (see the stacktrace below); locking or other thread management is required to safely destroy a subscription.
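The race can be sketched with plain threading (no rclpy; all names here are illustrative, not rclpy API): a spinner thread repeatedly polls a set of entities the way the executor's wait set does, while another thread destroys one of them. Taking the same lock on both sides is the kind of coordination the client library would need.

```python
import threading
import time

class FakeSubscription:
    """Stand-in for a subscription backed by a native rmw handle."""
    def __init__(self):
        self.handle = object()  # stands in for the underlying rmw handle
    def destroy(self):
        self.handle = None      # stands in for freeing the rmw handle

subscriptions = [FakeSubscription()]
lock = threading.Lock()
stop = threading.Event()

def spin():
    # Analogue of the executor calling rcl_wait() on its wait set.
    while not stop.is_set():
        with lock:  # without this lock, destroy() can race with the check
            for sub in subscriptions:
                assert sub.handle is not None, 'use-after-free analogue'
        time.sleep(0.01)

t = threading.Thread(target=spin)
t.start()
time.sleep(0.05)
with lock:  # destruction must take the same lock as the spinner
    subscriptions[0].destroy()
    del subscriptions[0]
time.sleep(0.05)
stop.set()
t.join()
```

Without the `with lock:` around the destroy, the spinner can observe the freed handle mid-iteration, which is the Python-level analogue of the crash in `SubListener::hasData()` below.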

Stacktrace

    Thread 6 Crashed:: RclpySpinner
0   librmw_fastrtps_shared_cpp.dylib	0x0000000104de81ae SubListener::hasData() + 126
1   librmw_fastrtps_shared_cpp.dylib	0x0000000104de7f43 check_wait_set_for_data(rmw_subscriptions_t const*, rmw_guard_conditions_t const*, rmw_services_t const*, rmw_clients_t const*) + 115
2   librmw_fastrtps_shared_cpp.dylib	0x0000000104de97c7 rmw_fastrtps_shared_cpp::__rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_wait_set_t*, rmw_time_t const*)::$_0::operator()() const + 39
3   librmw_fastrtps_shared_cpp.dylib	0x0000000104de983f bool std::__1::condition_variable::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> >, rmw_fastrtps_shared_cpp::__rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_wait_set_t*, rmw_time_t const*)::$_0>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > > const&, rmw_fastrtps_shared_cpp::__rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_wait_set_t*, rmw_time_t const*)::$_0) + 95
4   librmw_fastrtps_shared_cpp.dylib	0x0000000104de8d81 rmw_fastrtps_shared_cpp::__rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_wait_set_t*, rmw_time_t const*) + 2561
5   librmw_fastrtps_cpp.dylib     	0x0000000104d4558d rmw_wait + 61
6   librmw_implementation.dylib   	0x0000000103a83c74 rmw_wait + 132
7   librcl.dylib                  	0x00000001039651fa rcl_wait + 2410
8   _rclpy.cpython-37m-darwin.so  	0x000000010392eaa8 rclpy_wait + 200
9   org.python.python             	0x0000000102ece20f _PyMethodDef_RawFastCallKeywords + 236
10  org.python.python             	0x0000000102ecd8af _PyCFunction_FastCallKeywords + 44
11  org.python.python             	0x0000000102f63b2b call_function + 636
12  org.python.python             	0x0000000102f5c771 _PyEval_EvalFrameDefault + 7016
13  org.python.python             	0x0000000102ed94bf gen_send_ex + 242
14  org.python.python             	0x0000000102f59111 builtin_next + 99
15  org.python.python             	0x0000000102ece313 _PyMethodDef_RawFastCallKeywords + 496
16  org.python.python             	0x0000000102ecd8af _PyCFunction_FastCallKeywords + 44
17  org.python.python             	0x0000000102f63b2b call_function + 636
18  org.python.python             	0x0000000102f5c80f _PyEval_EvalFrameDefault + 7174
19  org.python.python             	0x0000000102f64432 _PyEval_EvalCodeWithName + 1835
20  org.python.python             	0x0000000102ecd874 _PyFunction_FastCallKeywords + 225
21  org.python.python             	0x0000000102f63ba0 call_function + 753
22  org.python.python             	0x0000000102f5c8b5 _PyEval_EvalFrameDefault + 7340
23  org.python.python             	0x0000000102f64432 _PyEval_EvalCodeWithName + 1835
24  org.python.python             	0x0000000102ecd874 _PyFunction_FastCallKeywords + 225
25  org.python.python             	0x0000000102f63ba0 call_function + 753
26  org.python.python             	0x0000000102f5c8b5 _PyEval_EvalFrameDefault + 7340
27  org.python.python             	0x0000000102ecdc8a function_code_fastcall + 112
28  org.python.python             	0x0000000102ece60d _PyObject_Call_Prepend + 150
29  org.python.python             	0x0000000102ecd9bd PyObject_Call + 136
30  sip.so                        	0x0000000104b72122 call_method + 76
31  sip.so                        	0x0000000104b6df78 sip_api_call_procedure_method + 152
32  QtCore.so                     	0x00000001043cc2e4 sipQThread::run() + 84
33  org.qt-project.QtCore         	0x00000001045d275e 0x1045a4000 + 190302
34  libsystem_pthread.dylib       	0x00007fff6fb5a661 _pthread_body + 340
35  libsystem_pthread.dylib       	0x00007fff6fb5a50d _pthread_start + 377
36  libsystem_pthread.dylib       	0x00007fff6fb59bf9 thread_start + 13
brawner (Contributor, Author) commented Dec 19, 2018

To reproduce the crash it is important that node.destroy_node() is not called immediately after node.destroy_subscription(), which is why subscriber.py sleeps for a second afterward.

wjwwood (Member) commented Dec 19, 2018

I suppose rmw is not designed to be threadsafe.

It is not, but the client libraries should be. This is definitely a bug in rclpy, but it might be related to using QThread (since it also has C extension code). Can this be done with just threading.Thread?

brawner (Contributor, Author) commented Dec 19, 2018 via email

wjwwood (Member) commented Dec 19, 2018

I'm not sure destroying the subscription is the right thing to do. In C++ you just let the subscription go out of scope; the executor continues to hold ownership until it wakes up, at which point it releases its lock on the subscription too and the subscription can be destroyed.
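A minimal sketch of that C++ shared-ownership model, translated into Python reference-counting terms (illustrative only, not rclpy internals): the user drops their reference, but the executor's own reference keeps the subscription alive until the executor releases it.

```python
finalized = []

class Subscription:
    def __del__(self):
        finalized.append(True)  # stands in for destroying the rmw entity

user_ref = Subscription()
executor_ref = user_ref      # executor takes shared ownership
del user_ref                 # user code lets the subscription go out of scope
assert not finalized         # executor's reference still keeps it alive
del executor_ref             # executor wakes up and drops its reference
assert finalized             # now, and only now, is the entity destroyed
```

This relies on CPython's immediate reference counting; the point is that destruction happens when the last owner lets go, never while the executor is still using the object.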

@sloretz may have better insight into what should be done here or if the design of rclpy should be changed in some way.

brawner (Contributor, Author) commented Jan 12, 2019

Following up, it turns out that other destroy_* methods are not thread-safe either, including destroy_node and destroy_timer.

See the test functions test_destroy_node_while_spin_once(), test_destroy_publisher_while_spin_once(), and test_destroy_subscriptions_out_of_thread() in:

https://github.com/brawner/rclpy/blob/test_destruction_outside_thread/rclpy/test/test_destruction_outside_thread.py

brawner (Contributor, Author) commented Jan 12, 2019

It looks like #192 is a related issue with similar discussion.

sloretz (Contributor) commented Apr 9, 2019

Following up, it turns out that other destroy_* methods are not thread-safe, including destroy_node, and destroy_timer.

Reproducible example with a timer:
import threading
import time

import rclpy
from rclpy.executors import SingleThreadedExecutor

rclpy.init()

# Background executor for convenience
executor = SingleThreadedExecutor()
t = threading.Thread(target=executor.spin, daemon=True)
t.start()

try:
    node = rclpy.create_node('TestDestroy')
    timer = node.create_timer(0.1, lambda: print('Timer called'))
    executor.add_node(node)

    time.sleep(1)

    # All done with timer, better destroy it right?
    node.destroy_timer(timer)
finally:
    # Shutdown the executor since there's nothing left to execute
    executor.shutdown()

rclpy.shutdown()
Console output:
$ python3 destroy_timer_python.py
Timer called
Timer called
Timer called
Timer called
Timer called
Timer called
Timer called
Timer called
Timer called
Timer called
Segmentation fault (core dumped)

IIUC, the node.destroy_*() methods exist to avoid relying on the Python garbage collector; the concern is that there is no guarantee the garbage collector will ever run.

These methods do two things. First, destroy_*() methods free memory. I don't think this matters, though, because nothing makes the memory used by rclpy special; if running out of memory because the garbage collector may never run were a real concern, it would also justify "foobar".destroy_string(). Second, these methods unsubscribe, stop timers, stop service servers, stop clients, etc. node.destroy_subscription(...) is the only way to remove a subscription from the node graph without waiting for the garbage collector. Of course, if another thread is in rcl_wait() when this is called, the rmw implementation will use memory after it has been freed when checking whether a subscription has data.

File objects in Python seem like a good model to follow. Their memory is freed by the garbage collector, but they also have close() to stop using the resource right away. This looks like what ROS 1 rospy did: see Subscriber.unregister() and Service.shutdown().
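A hypothetical design sketch of that file-like close() pattern (illustrative only; not actual rclpy API): close() stops callback delivery immediately and can be called from any thread, while the object's memory is freed later by the garbage collector.

```python
import threading

class ClosableSubscription:
    """Hypothetical wrapper following the file-object close() model."""
    def __init__(self, callback):
        self._callback = callback
        self._closed = threading.Event()

    def deliver(self, msg):
        # Called from the executor thread; drops messages once closed.
        if not self._closed.is_set():
            self._callback(msg)

    def close(self):
        # Safe from any thread: nothing is freed here, so a concurrent
        # deliver() at worst hands over one final message.
        self._closed.set()

received = []
sub = ClosableSubscription(received.append)
sub.deliver('a')
sub.close()
sub.deliver('b')  # dropped: subscription is closed
```

The key property is that close() only flips a flag, so it never races with the executor over freed memory; actual teardown of the underlying handles can wait for garbage collection.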

Unfortunately, rmw and rcl don't separate stopping a subscriber/publisher/service/timer/etc. from freeing the memory used by it, and it would be a lot of work to add such APIs to all the rmw implementations. Maybe there's a way to make the destroy_*() methods wake the executor so the rcl types can be safely fini'd?
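That last idea can be sketched with plain threading (hypothetical names; a threading.Event stands in for an rcl guard condition): destroy requests are queued and the executor is woken, so the entity is finalized only by the executor thread, at a point where it is not inside its wait call.

```python
import queue
import threading
import time

class Executor:
    """Sketch of an executor that finalizes entities between waits."""
    def __init__(self):
        self._destroy_requests = queue.Queue()
        self._wake = threading.Event()   # guard-condition stand-in
        self._shutdown = False

    def destroy_entity(self, entity):
        # Callable from any thread: hand the entity to the executor thread
        # and wake it instead of racing with its wait set.
        self._destroy_requests.put(entity)
        self._wake.set()

    def spin(self):
        while not self._shutdown:
            self._wake.wait(timeout=0.1)  # stands in for rcl_wait()
            self._wake.clear()
            # Safe point: not waiting, so entities can be finalized here.
            while not self._destroy_requests.empty():
                entity = self._destroy_requests.get()
                entity['alive'] = False   # stands in for rcl_*_fini()

executor = Executor()
t = threading.Thread(target=executor.spin, daemon=True)
t.start()

timer = {'alive': True}
executor.destroy_entity(timer)  # wakes the executor instead of racing it

time.sleep(0.3)
executor._shutdown = True
t.join(timeout=1)
```

Since only the executor thread ever finalizes entities, nothing can be freed while rcl_wait() is inspecting it, which is exactly the guarantee the crash above needs.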

dirk-thomas added the ready Work is about to start (Kanban column) label on Apr 18, 2019
sloretz added the in progress Actively being worked on (Kanban column) label and removed the ready label on Apr 23, 2019
sloretz added the in review Waiting for review (Kanban column) label and removed the in progress label on Apr 23, 2019
sloretz removed the in review label on Apr 24, 2019