Force rclrs_tests to use the UDPv4 transport instead of the default shared memory. #364

Closed
maspe36 wants to merge 2 commits

Conversation

maspe36
Collaborator

@maspe36 maspe36 commented Feb 18, 2024

A few weeks ago, @luca-della-vedova pointed out that we had occasional segfaults when running our unit tests in CI. I looked, and sure enough we've had flaky tests for about as long as this repo has existed. Although GitHub no longer keeps the logs for those old runs, we can reasonably infer that the flakiness has been present since the beginning, or close to it.

A good place to start looking when you've got flaky tests is to check whether you're handling memory properly. I compiled rclrs_tests and ran them under valgrind.

Run Command
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./target/debug/deps/rclrs_tests-<hash>

The main takeaway here is that graph_tests.rs uses DDS (via rmw) in its tests. This can be problematic (especially on CI) because most DDS implementations default to a shared memory transport, and shared memory segments aren't exactly process isolated. If another DDS process is running on the machine, you run the risk of shared memory segments not being cleaned up properly, which can (and in this case did) interfere with unit testing [1*].
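For illustration, here is a minimal sketch of how such leftovers could be spotted on a host (a hypothetical diagnostic, not part of this PR; it assumes Fast DDS names its shared-memory segment files under /dev/shm with a fastrtps_ prefix):

use std::fs;
use std::io;

/// List leftover Fast DDS shared-memory segment files. Segments that survive a
/// crashed or uncleanly exited DDS process can interfere with later runs on the
/// same machine.
fn leftover_fastdds_segments() -> io::Result<Vec<String>> {
    let mut segments = Vec::new();
    for entry in fs::read_dir("/dev/shm")? {
        let name = entry?.file_name().to_string_lossy().into_owned();
        // Assumption: Fast DDS prefixes its SHM transport files with "fastrtps_".
        if name.starts_with("fastrtps_") {
            segments.push(name);
        }
    }
    Ok(segments)
}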

I have attached the output of three separate valgrind runs and pasted the summary of each below:

  1. Run straight from the head of main
==18492== LEAK SUMMARY:
==18492==    definitely lost: 64 bytes in 1 blocks
==18492==    indirectly lost: 75 bytes in 1 blocks
==18492==      possibly lost: 768 bytes in 2 blocks
==18492==    still reachable: 174,397 bytes in 123 blocks
==18492==         suppressed: 0 bytes in 0 blocks
==18492== 
==18492== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

As you can see, we are definitely leaking some memory. If we look through the log for the relevant entries...

==18492== 139 (64 direct, 75 indirect) bytes in 1 blocks are definitely lost in loss record 24 of 36
==18492==    at 0x4E05833: operator new(unsigned long) (vg_replace_malloc.c:483)
==18492==    by 0x7FC3A23: ??? (in /opt/ros/rolling/lib/librosidl_typesupport_cpp.so)
==18492==    by 0x7636D90: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x7637B74: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x763C73F: rmw_create_node (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x494C8D0: rcl_node_init (in /opt/ros/rolling/lib/librcl.so)
==18492==    by 0x1BA567: rclrs::node::builder::NodeBuilder::build (builder.rs:273)
==18492==    by 0x1409D9: rclrs_tests::graph_tests::construct_test_graph (graph_tests.rs:16)
==18492==    by 0x1429CA: rclrs_tests::graph_tests::test_subscriptions (graph_tests.rs:89)
==18492==    by 0x14018B: rclrs_tests::graph_tests::test_subscriptions::{{closure}} (graph_tests.rs:87)
==18492==    by 0x14E9D5: core::ops::function::FnOnce::call_once (function.rs:250)
==18492==    by 0x18E63E: call_once<fn() -> core::result::Result<(), alloc::string::String>, ()> (function.rs:250)
==18492==    by 0x18E63E: test::__rust_begin_short_backtrace (lib.rs:627)

...

==18492== 75 bytes in 1 blocks are indirectly lost in loss record 12 of 36
==18492==    at 0x4E050C5: malloc (vg_replace_malloc.c:442)
==18492==    by 0x498C225: rcutils_strndup (in /opt/ros/rolling/lib/librcutils.so)
==18492==    by 0x498D48E: rcutils_load_shared_library (in /opt/ros/rolling/lib/librcutils.so)
==18492==    by 0x4DBAAE2: rcpputils::SharedLibrary::SharedLibrary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /opt/ros/rolling/lib/librcpputils.so)
==18492==    by 0x7FC3A33: ??? (in /opt/ros/rolling/lib/librosidl_typesupport_cpp.so)
==18492==    by 0x7636D90: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x7637B74: ??? (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x763C73F: rmw_create_node (in /opt/ros/rolling/lib/librmw_fastrtps_cpp.so)
==18492==    by 0x494C8D0: rcl_node_init (in /opt/ros/rolling/lib/librcl.so)
==18492==    by 0x1BA567: rclrs::node::builder::NodeBuilder::build (builder.rs:273)
==18492==    by 0x1409D9: rclrs_tests::graph_tests::construct_test_graph (graph_tests.rs:16)
==18492==    by 0x1429CA: rclrs_tests::graph_tests::test_subscriptions (graph_tests.rs:89)

It's pretty clear the leak comes from FastDDS itself (note the librmw_fastrtps_cpp.so references).

  2. Run of main with graph_tests.rs commented out
==19709== LEAK SUMMARY:
==19709==    definitely lost: 0 bytes in 0 blocks
==19709==    indirectly lost: 0 bytes in 0 blocks
==19709==      possibly lost: 768 bytes in 2 blocks
==19709==    still reachable: 131,088 bytes in 2 blocks
==19709==         suppressed: 0 bytes in 0 blocks
==19709== 
==19709== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
  3. This PR
==27283== LEAK SUMMARY:
==27283==    definitely lost: 0 bytes in 0 blocks
==27283==    indirectly lost: 0 bytes in 0 blocks
==27283==      possibly lost: 768 bytes in 2 blocks
==27283==    still reachable: 174,397 bytes in 123 blocks
==27283==         suppressed: 0 bytes in 0 blocks
==27283==
==27283== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

These definite and indirect leaks are not present in the valgrind logs where we either force the transport to UDPv4 or comment out graph_tests.rs entirely.

In conclusion, we can at least "fix" an obvious memory leak in rclrs_tests that causes occasional build failures by forcing UDPv4 instead of shared memory. This "fix" is of course contingent on ROS 2 defaulting to FastDDS (which it thankfully does for Humble, Iron, and, for now, Rolling). I really wish there were a way to do this via the rmw API, but I am not aware of one.
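For reference, a minimal sketch of one way to force this from the test side (a hypothetical helper, not necessarily how this PR's commits implement it; udp_only_profile.xml is an assumed file containing a default participant profile with useBuiltinTransports set to false and a single UDPv4 transport_descriptor):

use std::env;

/// Hypothetical test-setup helper: point Fast DDS at a UDPv4-only XML profile so
/// rmw_fastrtps never opens shared-memory segments during the test run.
fn force_udpv4_transport() {
    // Assumed profile file living at the root of the test crate.
    let profile = concat!(env!("CARGO_MANIFEST_DIR"), "/udp_only_profile.xml");
    env::set_var("FASTRTPS_DEFAULT_PROFILES_FILE", profile);
}

Since the profile is only picked up when the DDS participant is created, a helper like this would have to run before the first Context/node is built, e.g. at the top of construct_test_graph.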

Another option is to simply remove this test. In my experience, relying on DDS in tests is problematic, and I would be happy to move away from integration tests masquerading as unit tests.

Note 1*: this is from RTI (not FastDDS), but I believe it is a fundamental property of multiple processes using shared memory.

@luca-della-vedova
Collaborator

Thanks for the investigation! I wonder: do other client libraries really avoid any sort of test with DDS integration, or do they run them in an "integration test" setting that is forced to be single-threaded to avoid interaction between different parallel tests?

Collaborator

@jhdcs jhdcs left a comment

Looks to be a simple enough change for the moment. On the subject of disabling the integration/DDS tests, do we know what the other client libraries do? Do they use DDS in any of their tests?

@maspe36
Collaborator Author

maspe36 commented Mar 12, 2024

We discussed this at length in the March 11th monthly meetup and came to the conclusion that this fix is likely just a band-aid over a deeper problem. I will close this for now while further investigation is ongoing.

See #365, #366, #367, #369

@maspe36 maspe36 closed this Mar 12, 2024