Threads should have their own signal handler #192

dhood · 2018-06-19T02:01:25Z

Before rcl_wait is called, an existing signal handler is stored so that it can be called from within rclpy's signal handler. The existing signal handler is restored after waking up.

This is not currently robust to rclpy_wait being called from multiple threads; each thread ought to store its own signal handler to restore.

The text was updated successfully, but these errors were encountered:

wjwwood · 2018-06-19T02:09:09Z

Well, I'm not sure that's the right fix for this. Instead, maybe there should be a thread-agnostic signal handler in the Python layer that acts like the rclcpp signal handler which triggers the sigint guard condition for all wait sets:

So to recap, I think the issue is that rclpy_wait (which as you said calls rcl_wait) is not thread-safe and should be. I think the solution is to have one signal handler which is registered during "rclpy init" (either in C or Python, not sure which yet), rather than within rclpy_wait, and then multiple guard conditions, one for each wait set just like rclcpp.

I'm not sure it ever make sense to have one "stored signal handler to be restored" for each thread. I did mention that when we were talking in person, but after thinking about it more, I'm not sure that would work correctly.

pbaughman · 2019-02-01T22:53:17Z

@dejanpan

I think we're getting burned by this right now. We have a python process with multiple nodes and multiple executors on different threads, but our process is still blowing up with

"Executor.wait_for_ready_callbacks() called concurrently in multiple threads"

Apparently in c++ it's OK to run multiple executors on multiple threads, but in python we're being prevented from doing this by the global g_wait_set_spinning bool defined in the module https://github.com/ros2/rclpy/blob/master/rclpy/rclpy/executors.py#L34

We can work around this by having a single worker thread pump messages, but this will mean needing additional synchronization whenever we want to access data received by a subscriber from the 'main' thread

esteve · 2019-02-12T11:59:28Z

I think the following might work:

when a WaitSet is created, create a GuardCondition and add it as a
PyCapsule to the Python object for the WaitSet
in rclpy_wait, retrieve the GuardCondition from the PyCapsule
wait on the GuardCondition before calling rcl_wait, after
https://github.com/ros2/rclpy/blob/master/rclpy/src/rclpy/_rclpy.c#L2572
trigger the GuardCondition once rcl_wait finishes to let other rclpy_wait calls run

I'll go ahead and implement this, but feel free to add any feedback in
case there's something I'm missing.

esteve · 2019-02-12T16:17:13Z

@wjwwood I've been working on this, but then I realized that I'd have to call rcl_wait twice, once with a WaitSet that only contains the GuardCondition and a timeout set to -1, and then for the existing rcl_wait call. Is that what you thought or is there more to it? Thanks.

wjwwood · 2019-02-15T08:12:15Z

@esteve sorry, I haven't had time to try and wrap my head around it. I'll get back to you asap.

dejanpan · 2019-03-04T07:15:23Z

@esteve did you want to open a PR that @wjwwood or someone else can look at or how did we want to proceed with this one?

esteve · 2019-03-04T09:20:02Z

@dejanpan I wanted to run it by @wjwwood first to make sure I understood the problem correctly, but I'll create a pull request with what I have just to have something tangible that we discuss over.

wjwwood · 2019-03-05T20:53:31Z

I don't fully follow your second comment about needing to call it twice, but from the bullet list proposal, I'd wonder if it actually solves the issue, because if a waitset is waiting and has not work, but another waitset is waiting on it to finish before actually waiting despite the fact that it has data available, then you'd starve a ready waitset with a blocking waitset. Which effectively doesn't let you wait from two threads at the same time, if I understand it correctly.

Perhaps I don't have a grasp on the solution you've proposed yet. I'm happy to keep discussing it, but a pr, even a partial one, might be a more concrete thing to discuss.

Sorry for taking so long to look at it.

esteve · 2019-03-22T15:59:12Z

@wjwwood I pushed master...esteve:thread_safe_rclpy_wait but I'm not sure if that's the correct approach. @pbaughman do you have a test or a script that fails? I'd like to check if my branch fixed the bug. Thanks.

pbaughman · 2019-03-22T18:14:31Z

@esteve - I do not - I'm actually not exactly sure what subtle failures were seen before the big scary runtime error RuntimeError: Executor.wait_for_ready_callbacks() called concurrently in multiple threads' was added.

Maybe something like this:

import threading                                                  
import time                                                       
                                                                  
import rclpy                                                      
import rclpy.executors                                            
import std_msgs.msg                                               
                                                                  
                                                                  
class ThreadSpinner(threading.Thread):                            
                                                                  
    def __init__(self, node_name):                                
        super().__init__()                                        
        self._node = rclpy.create_node(node_name)                 
        self._sub = self._node.create_subscription(               
            std_msgs.msg.String,                                  
            "/chatter",                                           
            self._cb                                              
        )                                                         
                                                                  
    def run(self):                                                
        executor = rclpy.executors.SingleThreadedExecutor()       
        try:                                                      
            executor.add_node(self._node)                         
            while rclpy.ok():                                     
                executor.spin_once(1.1)                           
        finally:                                                  
            executor.remove_node(self._node)                      
                                                                  
    def _cb(self, msg):                                           
        print("{} got {}".format(self._node.get_name(), msg.data))
                                                                  
                                                                  
rclpy.init()                                                      
                                                                  
spinner1 = ThreadSpinner("Node1")                                 
spinner2 = ThreadSpinner("Node2")                                 
                                                                  
spinner1.start()                                                  
spinner2.start()                                                  
                                                                  
try:                                                              
    while True:                                                   
        time.sleep(100)                                           
except KeyboardInterrupt:                                         
    rclpy.shutdown()

terminal 1> ros2 run demo_nodes_py talker
terminal 2> python3 script.py
# Expected output:
Node1 got Hello World: 10
Node2 got Hello World: 10
Node1 got Hello World: 11
Node2 got Hello World: 11
Node1 got Hello World: 12
Node2 got Hello World: 12
# etc. . .

Would probably want to run it for a while to make sure both nodes continue to receive data

wjwwood · 2019-03-22T18:27:59Z

I'm actually not exactly sure what subtle failures were seen before the big scary runtime error

It is to do with the management of the signal handlers from within the C extension, as the description of the issue says:

Before rcl_wait is called, an existing signal handler is stored so that it can be called from within rclpy's signal handler. The existing signal handler is restored after waking up.

This is not currently robust to rclpy_wait being called from multiple threads; each thread ought to store its own signal handler to restore.

So, running it for a long time is not the issue, I think instead you'd need to run it, sigint it, run it again, in a loop until you see (or do not see) some failure or deadlock.

dhood added the enhancement New feature or request label Jun 19, 2018

dhood mentioned this issue Jun 22, 2018

Register our signal handler in rclpy_init vs each wait #194

Merged

clalancette mentioned this issue Dec 14, 2018

rclpy crash sometimes when hitting Ctrl-C #253

Closed

This was referenced Jan 12, 2019

executor.spin_once crashes after destroy_subscription called in separate thread #255

Closed

Destroy listeners outside of spin_once() ros-visualization/rqt#184

Closed

esteve added a commit to esteve/rclpy that referenced this issue Feb 12, 2019

First cut at fixing ros2#192

f4ec2cc

pbaughman mentioned this issue Mar 13, 2019

Improve apex launchtest ros ApexAI/apex_rostest#29

Merged

esteve added a commit to esteve/rclpy that referenced this issue Mar 22, 2019

First cut at fixing ros2#192

8aa80bf

esteve added a commit to esteve/rclpy that referenced this issue Mar 22, 2019

First cut at fixing ros2#192

6937a13

sloretz mentioned this issue Apr 5, 2019

Every executor gets its own SIGINT guard condition #308

Merged

sloretz self-assigned this Apr 5, 2019

sloretz added the in progress Actively being worked on (Kanban column) label Apr 5, 2019

sloretz added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Apr 18, 2019

sloretz closed this as completed in #308 Apr 19, 2019

sloretz removed the in review Waiting for review (Kanban column) label Apr 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Threads should have their own signal handler #192

Threads should have their own signal handler #192

dhood commented Jun 19, 2018

wjwwood commented Jun 19, 2018

pbaughman commented Feb 1, 2019 •

edited

Loading

esteve commented Feb 12, 2019

esteve commented Feb 12, 2019

wjwwood commented Feb 15, 2019

dejanpan commented Mar 4, 2019

esteve commented Mar 4, 2019

wjwwood commented Mar 5, 2019

esteve commented Mar 22, 2019

pbaughman commented Mar 22, 2019 •

edited

Loading

wjwwood commented Mar 22, 2019

Threads should have their own signal handler #192

Threads should have their own signal handler #192

Comments

dhood commented Jun 19, 2018

wjwwood commented Jun 19, 2018

pbaughman commented Feb 1, 2019 • edited Loading

esteve commented Feb 12, 2019

esteve commented Feb 12, 2019

wjwwood commented Feb 15, 2019

dejanpan commented Mar 4, 2019

esteve commented Mar 4, 2019

wjwwood commented Mar 5, 2019

esteve commented Mar 22, 2019

pbaughman commented Mar 22, 2019 • edited Loading

wjwwood commented Mar 22, 2019

pbaughman commented Feb 1, 2019 •

edited

Loading

pbaughman commented Mar 22, 2019 •

edited

Loading