Force rclrs_tests to use the UDPv4 transport instead of the default shared memory. #364

Closed
maspe36 wants to merge 2 commits

Conversation

maspe36
Collaborator

@maspe36 maspe36 commented Feb 18, 2024

A few weeks ago, @luca-della-vedova pointed out that we had occasional segfaults when running our unit tests in CI. I looked, and sure enough we've had flaky tests for about as long as this repo has existed. Although GitHub no longer keeps the logs for those old runs, we can reasonably infer that the flakiness has been present since the beginning, or close to it.

A good place to start looking when you've got flaky tests is to check whether you're handling memory properly. I compiled rclrs_tests and ran them under valgrind.

Run Command
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./target/debug/deps/rclrs_tests-<hash>

The main takeaway here is that graph_tests.rs uses DDS (via rmw) in its tests. This can be problematic (especially on CI) because most DDS implementations default to a shared memory transport, and shared memory segments aren't exactly process isolated. If another DDS process is running on the machine, you run the risk of shared memory segments not being cleaned up properly, which can (and in this case did) interfere with unit testing [1*].
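For illustration, here is a minimal sketch of how such leftovers could be spotted on a host (a hypothetical diagnostic, not part of this PR; it assumes Fast DDS names its shared-memory segment files under /dev/shm with a fastrtps_ prefix):

use std::fs;
use std::io;

/// List leftover Fast DDS shared-memory segment files. Segments that survive a
/// crashed or uncleanly exited DDS process can interfere with later runs on the
/// same machine.
fn leftover_fastdds_segments() -> io::Result<Vec<String>> {
    let mut segments = Vec::new();
    for entry in fs::read_dir("/dev/shm")? {
        let name = entry?.file_name().to_string_lossy().into_owned();
        // Assumption: Fast DDS prefixes its SHM transport files with "fastrtps_".
        if name.starts_with("fastrtps_") {
            segments.push(name);
        }
    }
    Ok(segments)
}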

I have attached the output of three separate valgrind runs and pasted the summary of each below:

  1. Run straight from the head of main
==18492== LEAK SUMMARY:
==18492==    definitely lost: 64 bytes in 1 blocks
==18492==    indirectly lost: 75 bytes in 1 blocks
==18492==      possibly lost: 768 bytes in 2 blocks
==18492==    still reachable: 174,397 bytes in 123 blocks
==18492==         suppressed: 0 bytes in 0 blocks
==18492== 
==18492== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

As you can see, we are definitely leaking some memory. If we look through the log for the relevant entries...

==18492== 139 (64 direct, 75 indirect) bytes in 1 blocks are definitely lost in loss record 24 of 36
==18492==    at 0x4E05833: operator new(unsigned long) (vg_replace_malloc.c:483)
==18492==    by 0x7FC3A23: ??? (in /opt/ros/rolling/lib/librosidl_typesupport_cpp.so)
==18492==    by 0x7636D90: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x7637B74: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x763C73F: rmw_create_node (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x494C8D0: rcl_node_init (in /opt/ros/rolling/lib/librcl.so)
==18492==    by 0x1BA567: rclrs::node::builder::NodeBuilder::build (builder.rs:273)
==18492==    by 0x1409D9: rclrs_tests::graph_tests::construct_test_graph (graph_tests.rs:16)
==18492==    by 0x1429CA: rclrs_tests::graph_tests::test_subscriptions (graph_tests.rs:89)
==18492==    by 0x14018B: rclrs_tests::graph_tests::test_subscriptions::{{closure}} (graph_tests.rs:87)
==18492==    by 0x14E9D5: core::ops::function::FnOnce::call_once (function.rs:250)
==18492==    by 0x18E63E: call_once<fn() -> core::result::Result<(), alloc::string::String>, ()> (function.rs:250)
==18492==    by 0x18E63E: test::__rust_begin_short_backtrace (lib.rs:627)

...

==18492== 75 bytes in 1 blocks are indirectly lost in loss record 12 of 36
==18492==    at 0x4E050C5: malloc (vg_replace_malloc.c:442)
==18492==    by 0x498C225: rcutils_strndup (in /opt/ros/rolling/lib/librcutils.so)
==18492==    by 0x498D48E: rcutils_load_shared_library (in /opt/ros/rolling/lib/librcutils.so)
==18492==    by 0x4DBAAE2: rcpputils::SharedLibrary::SharedLibrary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /opt/ros/rolling/lib/librcpputils.so)
==18492==    by 0x7FC3A33: ??? (in /opt/ros/rolling/lib/librosidl_typesupport_cpp.so)
==18492==    by 0x7636D90: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x7637B74: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x763C73F: rmw_create_node (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x494C8D0: rcl_node_init (in /opt/ros/rolling/lib/librcl.so)
==18492==    by 0x1BA567: rclrs::node::builder::NodeBuilder::build (builder.rs:273)
==18492==    by 0x1409D9: rclrs_tests::graph_tests::construct_test_graph (graph_tests.rs:16)
==18492==    by 0x1429CA: rclrs_tests::graph_tests::test_subscriptions (graph_tests.rs:89)

It's pretty clear the leak comes from FastDDS itself (note the librmw_fastrtps_cpp.so references).

  2. Run of main with graph_tests.rs commented out
==19709== LEAK SUMMARY:
==19709==    definitely lost: 0 bytes in 0 blocks
==19709==    indirectly lost: 0 bytes in 0 blocks
==19709==      possibly lost: 768 bytes in 2 blocks
==19709==    still reachable: 131,088 bytes in 2 blocks
==19709==         suppressed: 0 bytes in 0 blocks
==19709== 
==19709== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
  3. This PR
==27283== LEAK SUMMARY:
==27283==    definitely lost: 0 bytes in 0 blocks
==27283==    indirectly lost: 0 bytes in 0 blocks
==27283==      possibly lost: 768 bytes in 2 blocks
==27283==    still reachable: 174,397 bytes in 123 blocks
==27283==         suppressed: 0 bytes in 0 blocks
==27283==
==27283== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

These definite and indirect leaks are not present in the valgrind logs where we either force the transport to UDPv4 or comment out graph_tests.rs entirely.

In conclusion, we can at least "fix" an obvious memory leak in rclrs_tests that causes occasional build failures by forcing UDPv4 instead of shared memory. This "fix" is of course contingent on ROS 2 defaulting to FastDDS (which it thankfully does for Humble, Iron, and, for now, Rolling). I really wish there were a way to do this via the rmw API, but I am not aware of one.
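For reference, a minimal sketch of one way to force this from the test side (a hypothetical helper, not necessarily how this PR's commits implement it; udp_only_profile.xml is an assumed file containing a default participant profile with useBuiltinTransports set to false and a single UDPv4 transport_descriptor):

use std::env;

/// Hypothetical test-setup helper: point Fast DDS at a UDPv4-only XML profile so
/// rmw_fastrtps never opens shared-memory segments during the test run.
fn force_udpv4_transport() {
    // Assumed profile file living at the root of the test crate.
    let profile = concat!(env!("CARGO_MANIFEST_DIR"), "/udp_only_profile.xml");
    env::set_var("FASTRTPS_DEFAULT_PROFILES_FILE", profile);
}

Since the profile is only picked up when the DDS participant is created, a helper like this would have to run before the first Context/node is built, e.g. at the top of construct_test_graph.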

Another option is to simply remove this test. In my experience, relying on DDS in tests is problematic, and I would be happy to move away from integration tests masquerading as unit tests.

Note 1*: this is from RTI (not FastDDS), but I believe it is a fundamental property of multiple processes using shared memory.

@luca-della-vedova
Collaborator

Thanks for the investigation! I wonder: do other client libraries really avoid any sort of test with DDS integration, or do they run them in an "integration test" setting that is forced to be single-threaded to avoid interaction between different parallel tests?

Collaborator

@jhdcs jhdcs left a comment

Looks to be a simple enough change for the moment. On the subject of disabling the integration/DDS tests, do we know what the other client libraries do? Do they use DDS in any of their tests?

@maspe36
Collaborator Author

maspe36 commented Mar 12, 2024

We discussed this at length in the March 11th monthly meetup and came to the conclusion that this fix is likely just a band-aid over a deeper problem. I will close this for now while further investigation is ongoing.

See #365, #366, #367, #369

@maspe36 maspe36 closed this Mar 12, 2024