Small Memory Leak with Controller Server, Planner Server, and AMCL #1889

@Michael-Equi

Bug report

  • Operating System:
    • Ubuntu 20.04
  • ROS2 Version:
    • ROS2 Foxy binaries
  • Version or commit hash:
    • Navigation2 built from source (c65e3aa)
  • DDS implementation:
    • Cyclone DDS (version 0.7.3)

Steps to reproduce issue

Run the Nav2 stack with the TB3 simulation for a long duration while recording memory usage using the system monitor.
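For anyone trying to reproduce this without watching the system monitor by hand, here is a minimal sketch of a logger that samples resident memory from `/proc` on Linux. It is not what was used for the numbers in this report. The function names and the CSV layout are made up for illustration, and the process-name prefixes in the example call are assumptions: the kernel truncates the name in `/proc/<pid>/comm` to 15 characters, so the controller server would show up as `controller_serv`.

```python
import os
import time
from typing import Iterable, List, Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:\t   12345 kB".
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS entry


def find_pids(comm_prefix: str) -> List[int]:
    """Return PIDs whose /proc/<pid>/comm starts with comm_prefix (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip().startswith(comm_prefix):
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def log_rss(comm_prefixes: Iterable[str], interval_s: float, samples: int) -> None:
    """Print 'unix_time,name,pid,rss_kib' rows, one batch per interval."""
    prefixes = list(comm_prefixes)
    for i in range(samples):
        now = time.time()
        for prefix in prefixes:
            for pid in find_pids(prefix):
                try:
                    with open(f"/proc/{pid}/status") as f:
                        rss = parse_vm_rss_kib(f.read())
                except OSError:
                    continue  # the process exited between the scan and the read
                if rss is not None:
                    print(f"{now:.0f},{prefix},{pid},{rss}", flush=True)
        if i + 1 < samples:
            time.sleep(interval_s)


# Example call (hypothetical prefixes, one sample per minute for 3 hours):
# log_rss(["controller_serv", "planner_server", "amcl"], interval_s=60.0, samples=180)
```

Redirecting the output to a file gives a time series per node that can be compared across machines and DDS implementations.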

Expected behavior

Memory usage of the controller, planner, and AMCL nodes should reach a stable point and stop increasing, especially when the robot is not actively navigating or being moved around.

Actual behavior

  • Controller Server shows a consistent memory increase of around 55 MB/h (measured over 3 hours without moving the robot).
  • Planner Server shows a consistent memory increase of around 50 MB/h (measured over 3 hours without moving the robot).
  • AMCL shows a consistent memory increase of around 25 MB/h (measured over 3 hours without moving the robot).
  • Gzserver and gzclient memory usage remains constant (perhaps this information helps in localizing the issue).
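The rates above were read off the system monitor. If the memory samples are logged instead, a growth rate can be estimated with a least-squares line fit, which is less sensitive to short-term jitter than comparing only the first and last reading. The following is a rough sketch under that assumption. The function name is hypothetical, and "MB" is taken to mean 1024 KiB.

```python
from typing import Sequence


def growth_rate_mb_per_hour(times_s: Sequence[float], rss_kib: Sequence[float]) -> float:
    """Least-squares slope of RSS over time, converted from KiB/s to MB/h."""
    n = len(times_s)
    if n < 2 or n != len(rss_kib):
        raise ValueError("need at least two samples and equal-length inputs")

    mean_t = sum(times_s) / n
    mean_r = sum(rss_kib) / n

    # Slope = covariance(t, rss) / variance(t).
    sxx = sum((t - mean_t) ** 2 for t in times_s)
    if sxx == 0:
        raise ValueError("all timestamps are identical")
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times_s, rss_kib))

    slope_kib_per_s = sxy / sxx
    return slope_kib_per_s * 3600.0 / 1024.0
```

A slope that stays clearly positive over several hours of idling points to steady growth, while a slope near zero after an initial ramp would match the expected behavior described above.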

Additional information

I have been having issues with the controller and planner servers suddenly crashing after running for long periods of time, especially with limited robot motion or use. While investigating the cause, I noticed what appear to be minor memory leaks in the troublesome nodes. I tried to get the controller server to crash under gdb so that I could get a more helpful segfault error, but I have not been able to make it crash under those conditions (really strange). Outside of gdb, both the controller and planner servers have crashed frequently after a couple of hours of limited use or idling.

While monitoring the controller server with gdb, I did see some errors. They did not seem to result in a fatal segmentation fault, but they may help debug the issue:

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'guard_condition_handle not from this implementation, at /tmp/binarydeb/ros-foxy-rmw-cyclonedds-cpp-0.7.3/src/rmw_node.cpp:2713, at /tmp/binarydeb/ros-foxy-rcl-1.1.6/src/rcl/guard_condition.c:160'

with this new error message:

  'guard condition implementation is invalid, at /tmp/binarydeb/ros-foxy-rcl-1.1.6/src/rcl/guard_condition.c:171'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1595816288.567870974] [rcl]: Failed to get trigger guard condition in jump callback

Upon killing the controller server, I got the following output:

Thread 1 "controller_serv" received signal SIGINT, Interrupt.
futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555559aeae8) at ../sysdeps/nptl/futex-internal.h:183
183	../sysdeps/nptl/futex-internal.h: No such file or directory.

While the nodes are running, I receive many repeated log messages such as the one below:
[planner_server-5] [INFO] [1595818100.434728681] [global_costmap.global_costmap_rclcpp_node]: Message Filter dropping message: frame 'scanner_link' at time 15239.380 for reason 'Unknown'

I am aware that the issues I am having could be caused by a multitude of factors, many of which are not in the scope of navigation2. Since this seems to be a particularly "severe" issue with nav2-related software, I thought it would be best to start the issue here and see what similar experiences exist and what conditions are like on different computers and setups.
