You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The SEGV is caused due to some sub-iface destructor code expecting living references to the node base or other interface but it was already released earlier in the Node destructor. Memory corruption occurs and eventually malloc checks abort later in the execution, usually somewhere in malloc_consolidate, FastRTPS library or waitfor collection resizing.
The destructor for the Node class is empty. Sure, all the sub-interfaces are shared_ptr and thus their destructors are called but many of these sub-interfaces reference each other. For example, all of the sub-interfaces reference at least the NodeBaseInterface so it must be released last. Furthermore, clock and time_source sub ifaces reference many others so those other dependencies should also stay alive until the clock and time_source is released.
NodeParametersInterface
For example, the NodeParametersInterface contains parameter service calls that captures a weak reference to the base interface. During destruction it de-references the weak reference, finds it is null and spits out this error:
[ERROR] [1605744975.811485921] [rclcpp]: Error in destruction of rcl service handle: the Node Handle was destructed too early. You will leak memory
NodeParametersInterface at least uses as weakptr to base iface and thus fails gracefully (with a leak) but I expect some of the other sub ifaces may not. I see many examples in my searches of similar crash backtraces as I've encountered, explained below.
Operating System:
Linux/Ubuntu 20.04 (probably all OSs)
Installation type:
latest apt binaries
Version or commit hash:
foxy, latest
DDS implementation:
Fast-RTPS
Client library (if applicable):
rclcpp
Steps to reproduce issue
The nature of these types of mem corruption can be hard to pin down since the debugger/malloc doesnt trip until much later in the execution. I encountered this error while creating and destroying many Nodes at runtime. Specifically, I wrote a node to do episodic parameter tuning/training for a humanoid robot in Gazebo simulation. For each episode, ROS nodes are created then torn-down (cycled). The SEGV would occur somewhere between episode 1 and ~15 and always occur immediately after a tear-down. It got progressively worse as I added more nodes to simulation that are cycled per episode. I would say any such program that continuously creates and destroys nodes will replicate this issue, such as the one in Issue #447 (I will test this and report).
Related Issues
I believe these issues may have encountered the same problem. In particular, the SEGV often happened in FastRTPS but I dont believe the problem is there, it just happens to be the next malloc operation. I encountered a number of crashes in wait_for_work related code.
Hang in ROS2 Fast-RTPS during destruction [3094] #235
In the destructor I added _sub_iface.reset() calls for each of the sub interfaces in reverse order as in the constructor. I was able to get over 500 episodes without crashes. I believe explicitly specifying the order of destruction is a safer way of unravelling the tear-down dependencies.
Node::~Node()
{
// release sub-interfaces in an order that allows them to consult with node_base
node_waitables_.reset();
node_time_source_.reset();
node_parameters_.reset();
node_clock_.reset();
node_services_.reset();
node_topics_.reset();
node_timers_.reset();
node_logging_.reset();
node_graph_.reset();
}
PR incoming.
The text was updated successfully, but these errors were encountered:
Looks like the associated PR has been merged. Can this be closed?
I didn't follow the entire discussion there but sounds like this is an okay fix for now, the more "appropriate" fix being more involved. If needed, a summarizing comment can be added here (and I don't know enough to write that comment), or we can close this and open a followup issue.
Bug report
The SEGV is caused due to some sub-iface destructor code expecting living references to the node base or other interface but it was already released earlier in the Node destructor. Memory corruption occurs and eventually malloc checks abort later in the execution, usually somewhere in malloc_consolidate, FastRTPS library or waitfor collection resizing.
The destructor for the Node class is empty. Sure, all the sub-interfaces are shared_ptr and thus their destructors are called but many of these sub-interfaces reference each other. For example, all of the sub-interfaces reference at least the NodeBaseInterface so it must be released last. Furthermore, clock and time_source sub ifaces reference many others so those other dependencies should also stay alive until the clock and time_source is released.
NodeParametersInterface
For example, the NodeParametersInterface contains parameter service calls that captures a weak reference to the base interface. During destruction it de-references the weak reference, finds it is null and spits out this error:
NodeParametersInterface at least uses as weakptr to base iface and thus fails gracefully (with a leak) but I expect some of the other sub ifaces may not. I see many examples in my searches of similar crash backtraces as I've encountered, explained below.
Steps to reproduce issue
The nature of these types of mem corruption can be hard to pin down since the debugger/malloc doesnt trip until much later in the execution. I encountered this error while creating and destroying many Nodes at runtime. Specifically, I wrote a node to do episodic parameter tuning/training for a humanoid robot in Gazebo simulation. For each episode, ROS nodes are created then torn-down (cycled). The SEGV would occur somewhere between episode 1 and ~15 and always occur immediately after a tear-down. It got progressively worse as I added more nodes to simulation that are cycled per episode. I would say any such program that continuously creates and destroys nodes will replicate this issue, such as the one in Issue #447 (I will test this and report).
Related Issues
I believe these issues may have encountered the same problem. In particular, the SEGV often happened in FastRTPS but I dont believe the problem is there, it just happens to be the next malloc operation. I encountered a number of crashes in wait_for_work related code.
Fix
In the destructor I added _sub_iface.reset() calls for each of the sub interfaces in reverse order as in the constructor. I was able to get over 500 episodes without crashes. I believe explicitly specifying the order of destruction is a safer way of unravelling the tear-down dependencies.
PR incoming.
The text was updated successfully, but these errors were encountered: