Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEGV caused by order of destruction of Node sub-interfaces #1468

Closed
guru-florida opened this issue Nov 19, 2020 · 3 comments
Closed

SEGV caused by order of destruction of Node sub-interfaces #1468

guru-florida opened this issue Nov 19, 2020 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@guru-florida
Copy link
Contributor

Bug report

The SEGV is caused due to some sub-iface destructor code expecting living references to the node base or other interface but it was already released earlier in the Node destructor. Memory corruption occurs and eventually malloc checks abort later in the execution, usually somewhere in malloc_consolidate, FastRTPS library or waitfor collection resizing.

The destructor for the Node class is empty. Sure, all the sub-interfaces are shared_ptr and thus their destructors are called but many of these sub-interfaces reference each other. For example, all of the sub-interfaces reference at least the NodeBaseInterface so it must be released last. Furthermore, clock and time_source sub ifaces reference many others so those other dependencies should also stay alive until the clock and time_source is released.

NodeParametersInterface
For example, the NodeParametersInterface contains parameter service calls that captures a weak reference to the base interface. During destruction it de-references the weak reference, finds it is null and spits out this error:

[ERROR] [1605744975.811485921] [rclcpp]: Error in destruction of rcl service handle: the Node Handle was destructed too early. You will leak memory

NodeParametersInterface at least uses as weakptr to base iface and thus fails gracefully (with a leak) but I expect some of the other sub ifaces may not. I see many examples in my searches of similar crash backtraces as I've encountered, explained below.

  • Operating System:
    • Linux/Ubuntu 20.04 (probably all OSs)
  • Installation type:
    • latest apt binaries
  • Version or commit hash:
    • foxy, latest
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

The nature of these types of mem corruption can be hard to pin down since the debugger/malloc doesnt trip until much later in the execution. I encountered this error while creating and destroying many Nodes at runtime. Specifically, I wrote a node to do episodic parameter tuning/training for a humanoid robot in Gazebo simulation. For each episode, ROS nodes are created then torn-down (cycled). The SEGV would occur somewhere between episode 1 and ~15 and always occur immediately after a tear-down. It got progressively worse as I added more nodes to simulation that are cycled per episode. I would say any such program that continuously creates and destroys nodes will replicate this issue, such as the one in Issue #447 (I will test this and report).

Related Issues

I believe these issues may have encountered the same problem. In particular, the SEGV often happened in FastRTPS but I dont believe the problem is there, it just happens to be the next malloc operation. I encountered a number of crashes in wait_for_work related code.

Fix

In the destructor I added _sub_iface.reset() calls for each of the sub interfaces in reverse order as in the constructor. I was able to get over 500 episodes without crashes. I believe explicitly specifying the order of destruction is a safer way of unravelling the tear-down dependencies.

Node::~Node()
{
  // release sub-interfaces in an order that allows them to consult with node_base
  node_waitables_.reset();
  node_time_source_.reset();
  node_parameters_.reset();
  node_clock_.reset();
  node_services_.reset();
  node_topics_.reset();
  node_timers_.reset();
  node_logging_.reset();
  node_graph_.reset();
}

PR incoming.

@mabelzhang
Copy link

Looks like the associated PR has been merged. Can this be closed?

I didn't follow the entire discussion there but sounds like this is an okay fix for now, the more "appropriate" fix being more involved. If needed, a summarizing comment can be added here (and I don't know enough to write that comment), or we can close this and open a followup issue.

@guru-florida
Copy link
Contributor Author

Yes, fix as described here was merged. We discussed other options but opted for this simpler fix for now. Agreed, this ticket can be closed.

@fujitatomoya
Copy link
Collaborator

@mabelzhang @guru-florida thanks for checking and comments, it's been closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants