Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default RMW vendor to CycloneDDS. #293

Merged
merged 1 commit into from
Jan 12, 2021

Conversation

clalancette
Copy link
Contributor

Signed-off-by: Chris Lalancette clalancette@openrobotics.org

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette
Copy link
Contributor Author

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status
  • Windows Debug Build Status
  • Linux CentOS Build Status

@ivanpauno
Copy link
Member

There's an interesting number of new failures, which might be expected considering that a lot of tests are only run with the default implementation.

It's strange that there aren't similar failures in the cyclonedds only rolling CI job.

Copy link
Member

@ivanpauno ivanpauno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should investigate the failures before switching, but the code of this PR LGTM!

@clalancette
Copy link
Contributor Author

Maybe we should investigate the failures before switching, but the code of this PR LGTM!

Yeah, I think we really need to figure out what is going on with the test failures before merging. Otherwise all builds will be yellow going forward. @eboasson and I are planning to take a look at this, though not until early January. Any help in that direction is of course welcome.

@eboasson
Copy link

I think it is all related to this: ros2/rmw_cyclonedds#271 (comment) :

It looks to me like this fix was insufficient to resolve all the test failures that check the error string because of the work done in make_message_value_type and make_request_response_value_types. See, e.g., https://ci.ros2.org/job/ci_linux/13296/#showFailuresLink

I really think the underlying issue is that get_typesupport_handle_function in rosidl_typesupport_cpp sets an error message even though the function is being used to poll for the presence of some type support, and so a lookup failure should simply not be treated as an error at that point.

One way to fix this would be to not set a global error string in get_typesupport_handle_function and instead let the callers do it once they know what the error means. That would also eliminate the messing about with error strings in handling typesupports in the RMW implementations, which is an improvement in itself.

I think the nicest option would be to enhance the rosidl_typesupport interface so that one can specify the allowed "suffixes", then dispatch in the RMW implementation based on the one that was found. That sounds like it is more work in the typesupport code, but it would remove the need to fix all the RMW layers.

@clalancette
Copy link
Contributor Author

New CI incorporating ros2/rmw_cyclonedds#274:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status
  • Windows Debug Build Status
  • Linux CentOS Build Status
  • Linux clang Build Status

@eboasson
Copy link

eboasson commented Jan 5, 2021

@clalancette, ros2/rmw_cyclonedds#274 certainly did what it was expected to do 😄

I had a look through the errors and after filtering out uncrustify and cppcheck issues on Windows and CentOS a handful of things warranting further investigation remain. I'll try to get my Windows box back up and see if I can easily replicate that problem.

Linux

Looks to me like it is unrelated (Connext-specific)

  • projectroot.test_requester_replier__rclcpp__rclpy__rmw_connext_cpp
    test_communication.TestRequesterReplierAfterShutdown.test_processes_finished_gracefully[Empty]
  • projectroot.test.test_graph__rmw_connext_cpp
    rcl.test_graph__rmw_connext_cpp.gtest.missing_result

Linux-aarch64

One failure that I do find suspicious:

macOS

Looks to me like it is unrelated (Connext-specific)

  • projectroot.test.test_graph__rmw_connext_cpp
    rcl.TestGraphFixture__rmw_connext_cpp.test_graph_guard_condition_trigger_check

CentOS

Tons of cppcheck errors, plus:

and a number of them that I suspect may be real (the Mimick warnings are also present in the nightly build, but the test failures are not):

Linux-clang

Compiler warnings only, in urdfdom and tlsf (seems like it must be unrelated).

Windows/Windows Debug

Tons of uncrustify errors, plus a substantial amount of other failed tests. I looked at a couple of them and those at least all have the test print "PASSED" but the framework declare a failure with "return code 3221225477". That is 0xc000000005 which is also the exception code used for an access violation, so that looks like a crash on termination:

e.g. https://ci.ros2.org/job/ci_windows/13394/testReport/junit/projectroot.test/rclcpp/test_wait_set/

-- run_test.py: extra environment variables to append:
 - PATH+=C:/ci/ws/build/rclcpp/test/rclcpp/Release
-- run_test.py: invoking following command in 'C:\ci\ws\build\rclcpp\test\rclcpp':
 - C:/ci/ws/build/rclcpp/test/rclcpp/Release/test_wait_set.exe --gtest_output=xml:C:/ci/ws/build/rclcpp/test_results/rclcpp/test_wait_set.gtest.xml
Running main() from C:\ci\ws\install\src\gtest_vendor\src\gtest_main.cc
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from TestWaitSet
[ RUN      ] TestWaitSet.construction_and_destruction
[WARN] [1609811695.766400700] [rclcpp]: logging was initialized more than once
[       OK ] TestWaitSet.construction_and_destruction (0 ms)
[ RUN      ] TestWaitSet.get_rcl_wait_set
[       OK ] TestWaitSet.get_rcl_wait_set (0 ms)
[ RUN      ] TestWaitSet.add_remove_guard_condition
[       OK ] TestWaitSet.add_remove_guard_condition (1 ms)
[ RUN      ] TestWaitSet.add_guard_condition_to_two_different_wait_set
[       OK ] TestWaitSet.add_guard_condition_to_two_different_wait_set (62 ms)
[ RUN      ] TestWaitSet.add_remove_wait
[       OK ] TestWaitSet.add_remove_wait (44 ms)
[ RUN      ] TestWaitSet.get_result_from_wait_result
[       OK ] TestWaitSet.get_result_from_wait_result (1 ms)
[ RUN      ] TestWaitSet.get_result_from_wait_result_not_ready_error
[       OK ] TestWaitSet.get_result_from_wait_result_not_ready_error (10 ms)
[----------] 7 tests from TestWaitSet (118 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (127 ms total)
[  PASSED  ] 7 tests.
-- run_test.py: return code 3221225477
-- run_test.py: inject classname prefix into gtest result file 'C:/ci/ws/build/rclcpp/test_results/rclcpp/test_wait_set.gtest.xml'
-- run_test.py: verify result file 'C:/ci/ws/build/rclcpp/test_results/rclcpp/test_wait_set.gtest.xml'

@clalancette
Copy link
Contributor Author

@clalancette, ros2/rmw_cyclonedds#274 certainly did what it was expected to do smile

Yep! I've now merged it, so at least that piece of the puzzle is in.

I had a look through the errors and after filtering out uncrustify and cppcheck issues on Windows and CentOS a handful of things warranting further investigation remain. I'll try to get my Windows box back up and see if I can easily replicate that problem.

Yeah, the uncrustify errors popped up during break. Our buildfarmer is looking into it, but they have nothing to do with this PR.

Linux

Looks to me like it is unrelated (Connext-specific)

* projectroot.test_requester_replier__rclcpp__rclpy__rmw_connext_cpp
  test_communication.TestRequesterReplierAfterShutdown.test_processes_finished_gracefully[Empty]

While I agree that this looks Connext specific, the one thing that worries me is that this problem doesn't seem to pop up in the Linux release nightlies or repeated nightlies. I'll leave this at the bottom of the pile to worry about.

* projectroot.test.test_graph__rmw_connext_cpp
  rcl.test_graph__rmw_connext_cpp.gtest.missing_result

This one shows up pretty consistently in the nightlies, so it is not the fault of this PR.

Linux-aarch64

One failure that I do find suspicious:

* https://ci.ros2.org/job/ci_linux-aarch64/8271/testReport/junit/projectroot.test/rclcpp/test_node_interfaces__node_graph/

Agreed that this one looks suspicious. I don't see any evidence of it in the nightly aarch64 release or aarch64 repeated one. This needs to be looked into.

macOS

Looks to me like it is unrelated (Connext-specific)

* projectroot.test.test_graph__rmw_connext_cpp
  rcl.TestGraphFixture__rmw_connext_cpp.test_graph_guard_condition_trigger_check

This one shows up in the nightlies, so we can ignore it.

CentOS

Tons of cppcheck errors, plus:

* https://ci.ros2.org/job/ci_linux-centos/95/testReport/junit/sros2.test/test_mypy/test_mypy/ -- which merely declares "Found errors", which is also present in the regular nightly build

Yes, we can ignore that one; ros2/sros2#248 is an attempt to fix it.

and a number of them that I suspect may be real (the Mimick warnings are also present in the nightly build, but the test failures are not):

* https://ci.ros2.org/job/ci_linux-centos/95/testReport/junit/projectroot.test/rclcpp/test_node_interfaces__node_base/
  projectroot.test_lifecycle_node

* https://ci.ros2.org/job/ci_linux-centos/95/testReport/junit/rclcpp_lifecycle/TestDefaultStateMachine/test_graph_topics/

* https://ci.ros2.org/job/ci_linux-centos/95/testReport/junit/(root)/rclcpp/test_node_interfaces__node_base_gtest_missing_result/

Yeah, those bear looking into to. CentOS isn't currently officially supported, but we do try and keep it working.

Linux-clang

Compiler warnings only, in urdfdom and tlsf (seems like it must be unrelated).

Yeah, these are present in the nightlies, so we can ignore it.

Windows/Windows Debug

Tons of uncrustify errors, plus a substantial amount of other failed tests. I looked at a couple of them and those at least all have the test print "PASSED" but the framework declare a failure with "return code 3221225477". That is 0xc000000005 which is also the exception code used for an access violation, so that looks like a crash on termination:

OK, yeah, that looks like the biggest problem. @eboasson I'm going to leave the Windows problems to you for now, and take a look at the aarch64 test_node_interfaces__node_graph failure myself.

@eboasson
Copy link

eboasson commented Jan 5, 2021

I think I've caught the crash on Windows crash in the debugger (this is a debug build, BTW), saving it here just to make sure it isn't lost, one never knows with Windows ...

C:\dev\ros2_ws\build\rclcpp\test\rclcpp>cdb Debug\test_allocator_memory_strategy.exe

Microsoft (R) Windows Debugger Version 10.0.18362.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

CommandLine: Debug\test_allocator_memory_strategy.exe
Symbol search path is: srv*
Executable search path is:
ModLoad: 00007ff7`e2fe0000 00007ff7`e3223000   test_allocator_memory_strategy.exe
ModLoad: 00007ffc`25710000 00007ffc`25906000   ntdll.dll
ModLoad: 00007ffc`24270000 00007ffc`2432d000   C:\WINDOWS\System32\KERNEL32.DLL
ModLoad: 00007ffc`233c0000 00007ffc`23689000   C:\WINDOWS\System32\KERNELBASE.dll
ModLoad: 00007ffc`03530000 00007ffc`035b1000   C:\dev\ros2_ws\install\bin\rcl.dll
ModLoad: 00007ffc`144d0000 00007ffc`144f6000   C:\dev\ros2_ws\install\bin\libstatistics_collector.dll
ModLoad: 00007ffc`14500000 00007ffc`14522000   C:\dev\ros2_ws\install\bin\test_msgs__rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`12380000 00007ffc`123a7000   C:\dev\ros2_ws\install\bin\rmw_implementation.dll
ModLoad: 00007ffc`12360000 00007ffc`1237d000   C:\dev\ros2_ws\install\bin\rmw.dll
ModLoad: 00007ffb`fc820000 00007ffb`fcc09000   C:\dev\ros2_ws\install\bin\rclcpp.dll
ModLoad: 00007ffc`11bc0000 00007ffc`11bd4000   C:\dev\ros2_ws\install\bin\statistics_msgs__rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`11a50000 00007ffc`11a90000   C:\dev\ros2_ws\install\bin\rcutils.dll
ModLoad: 00007ffb`fc730000 00007ffb`fc817000   C:\WINDOWS\SYSTEM32\MSVCP140D.dll
ModLoad: 00007ffc`11b90000 00007ffc`11bba000   C:\WINDOWS\SYSTEM32\VCRUNTIME140D.dll
ModLoad: 00007ffc`1b4f0000 00007ffc`1b4ff000   C:\WINDOWS\SYSTEM32\VCRUNTIME140_1D.dll
ModLoad: 00007ffb`fa1c0000 00007ffb`fa383000   C:\WINDOWS\SYSTEM32\ucrtbased.dll
ModLoad: 00007ffc`11a30000 00007ffc`11a45000   C:\dev\ros2_ws\install\bin\action_msgs__rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`11800000 00007ffc`11819000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_typesupport_c.dll
ModLoad: 00007ffc`0c010000 00007ffc`0c02b000   C:\dev\ros2_ws\install\bin\rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`09120000 00007ffc`09144000   C:\dev\ros2_ws\install\bin\rcl_yaml_param_parser.dll
ModLoad: 00007ffc`038a0000 00007ffc`038c6000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_generator_c.dll
ModLoad: 00007ffc`03b50000 00007ffc`03b6b000   C:\dev\ros2_ws\install\bin\rosidl_runtime_c.dll
ModLoad: 00007ffb`fc5b0000 00007ffb`fc728000   C:\dev\ros2_ws\install\bin\rcl_logging_spdlog.dll
ModLoad: 00007ffc`037a0000 00007ffc`037c6000   C:\dev\ros2_ws\install\bin\rcpputils.dll
ModLoad: 00007ffc`03780000 00007ffc`0379b000   C:\dev\ros2_ws\install\bin\rosidl_typesupport_c.dll
ModLoad: 00007ffc`036d0000 00007ffc`036e3000   C:\dev\ros2_ws\install\bin\rosgraph_msgs__rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`03760000 00007ffc`0377c000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`02060000 00007ffc`020c8000   C:\dev\ros2_ws\install\bin\yaml.dll
ModLoad: 00007ffc`036b0000 00007ffc`036c4000   C:\dev\ros2_ws\install\bin\builtin_interfaces__rosidl_generator_c.dll
ModLoad: 00007ffc`01f80000 00007ffc`01f92000   C:\dev\ros2_ws\install\bin\rcl_logging_interface.dll
(4e8.e0c): Break instruction exception - code 80000003 (first chance)
ntdll!LdrpDoDebuggerBreak+0x30:
00007ffc`257e06d0 cc              int     3
0:000> g
Running main() from C:\dev\ros2_ws\install\src\gtest_vendor\src\gtest_main.cc
[==========] Running 29 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 29 tests from TestAllocatorMemoryStrategy
[ RUN      ] TestAllocatorMemoryStrategy.construct_destruct
ModLoad: 00007ffb`fb290000 00007ffb`fb3d7000   C:\dev\ros2_ws\install\bin\rmw_cyclonedds_cpp.dll
ModLoad: 00007ffc`00a80000 00007ffc`00adc000   C:\dev\ros2_ws\install\bin\rmw_dds_common.dll
ModLoad: 00007ffb`f8460000 00007ffb`f86b1000   C:\dev\ros2_ws\install\bin\ddsc.dll
ModLoad: 00007ffc`1ac80000 00007ffc`1ac8f000   C:\dev\ros2_ws\install\bin\rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`1aac0000 00007ffc`1aacf000   C:\dev\ros2_ws\install\bin\rosidl_typesupport_introspection_c.dll
ModLoad: 00007ffc`249e0000 00007ffc`24a4b000   C:\WINDOWS\System32\WS2_32.dll
ModLoad: 00007ffc`01dd0000 00007ffc`01de4000   C:\dev\ros2_ws\install\bin\rmw_dds_common__rosidl_typesupport_cpp.dll
ModLoad: 00007ffb`fd8b0000 00007ffb`fd95c000   C:\WINDOWS\SYSTEM32\libssl-1_1-x64.dll
ModLoad: 00007ffc`22e50000 00007ffc`22f50000   C:\WINDOWS\System32\ucrtbase.dll
ModLoad: 000001e6`a4160000 000001e6`a44a9000   C:\WINDOWS\SYSTEM32\libcrypto-1_1-x64.dll
ModLoad: 00007ffb`f8110000 00007ffb`f8459000   C:\WINDOWS\SYSTEM32\libcrypto-1_1-x64.dll
ModLoad: 00007ffc`01d80000 00007ffc`01d99000   C:\WINDOWS\SYSTEM32\VCRUNTIME140.dll
ModLoad: 00007ffc`25140000 00007ffc`2526b000   C:\WINDOWS\System32\RPCRT4.dll
ModLoad: 00007ffc`24010000 00007ffc`240bc000   C:\WINDOWS\System32\ADVAPI32.dll
ModLoad: 00007ffc`23820000 00007ffc`238be000   C:\WINDOWS\System32\msvcrt.dll
ModLoad: 00007ffc`22260000 00007ffc`2229b000   C:\WINDOWS\SYSTEM32\IPHLPAPI.DLL
ModLoad: 00007ffc`25630000 00007ffc`256cc000   C:\WINDOWS\System32\sechost.dll
ModLoad: 00007ffc`24470000 00007ffc`24610000   C:\WINDOWS\System32\USER32.dll
ModLoad: 00007ffc`230e0000 00007ffc`23102000   C:\WINDOWS\System32\win32u.dll
ModLoad: 00007ffc`24440000 00007ffc`2446a000   C:\WINDOWS\System32\GDI32.dll
ModLoad: 00007ffc`22f50000 00007ffc`23059000   C:\WINDOWS\System32\gdi32full.dll
ModLoad: 00007ffc`232c0000 00007ffc`2335d000   C:\WINDOWS\System32\msvcp_win.dll
ModLoad: 00007ffc`23690000 00007ffc`236b7000   C:\WINDOWS\System32\bcrypt.dll
ModLoad: 00007ffc`243b0000 00007ffc`243e0000   C:\WINDOWS\System32\IMM32.DLL
ModLoad: 00007ffc`23060000 00007ffc`230e0000   C:\WINDOWS\System32\bcryptprimitives.dll
ModLoad: 00007ffc`208a0000 00007ffc`208b2000   C:\WINDOWS\SYSTEM32\kernel.appcore.dll
ModLoad: 00007ffc`25620000 00007ffc`25628000   C:\WINDOWS\System32\NSI.dll
ModLoad: 00007ffc`18b90000 00007ffc`18ba7000   C:\WINDOWS\SYSTEM32\dhcpcsvc6.DLL
ModLoad: 00007ffc`18ad0000 00007ffc`18aed000   C:\WINDOWS\SYSTEM32\dhcpcsvc.DLL
ModLoad: 00007ffc`222a0000 00007ffc`2236b000   C:\WINDOWS\SYSTEM32\DNSAPI.dll
ModLoad: 00007ffc`22570000 00007ffc`225da000   C:\WINDOWS\system32\mswsock.dll
ModLoad: 00007ffc`19040000 00007ffc`19057000   C:\WINDOWS\system32\napinsp.dll
ModLoad: 00007ffc`190b0000 00007ffc`190cb000   C:\WINDOWS\system32\pnrpnsp.dll
ModLoad: 00007ffc`15fe0000 00007ffc`15ff5000   C:\WINDOWS\system32\wshbth.dll
ModLoad: 00007ffc`18e20000 00007ffc`18e3d000   C:\WINDOWS\system32\NLAapi.dll
ModLoad: 00007ffc`1ac50000 00007ffc`1ac62000   C:\WINDOWS\System32\winrnr.dll
ModLoad: 00007ffc`01d60000 00007ffc`01d80000   C:\dev\ros2_ws\install\bin\rmw_dds_common__rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`00550000 00007ffc`0056c000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_typesupport_introspection_c.dll
ModLoad: 00007ffc`00530000 00007ffc`00543000   C:\dev\ros2_ws\install\bin\builtin_interfaces__rosidl_typesupport_introspection_c.dll
ModLoad: 00007ffb`fbe60000 00007ffb`fbea2000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`00510000 00007ffc`00524000   C:\dev\ros2_ws\install\bin\builtin_interfaces__rosidl_typesupport_introspection_cpp.dll
[       OK ] TestAllocatorMemoryStrategy.construct_destruct (273 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_remove_guard_conditions
[       OK ] TestAllocatorMemoryStrategy.add_remove_guard_conditions (3 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_remove_waitables
(4e8.e0c): C++ EH exception - code e06d7363 (first chance)
[       OK ] TestAllocatorMemoryStrategy.add_remove_waitables (3 ms)
[ RUN      ] TestAllocatorMemoryStrategy.number_of_entities_with_subscription
ModLoad: 00007ffb`fba10000 00007ffb`fbac0000   C:\dev\ros2_ws\install\bin\test_msgs__rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`004f0000 00007ffc`00505000   C:\dev\ros2_ws\install\bin\unique_identifier_msgs__rosidl_typesupport_introspection_cpp.dll
[       OK ] TestAllocatorMemoryStrategy.number_of_entities_with_subscription (163 ms)
[ RUN      ] TestAllocatorMemoryStrategy.number_of_entities_with_service
[       OK ] TestAllocatorMemoryStrategy.number_of_entities_with_service (107 ms)
[ RUN      ] TestAllocatorMemoryStrategy.number_of_entities_with_client
[       OK ] TestAllocatorMemoryStrategy.number_of_entities_with_client (93 ms)
[ RUN      ] TestAllocatorMemoryStrategy.number_of_entities_with_timer
[       OK ] TestAllocatorMemoryStrategy.number_of_entities_with_timer (108 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_bad_arguments
[ERROR] [1609865731.739643600] [rclcpp]: Couldn't add subscription to wait set: wait_set argument is null, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:323
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_bad_arguments (155 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_subscription
[ERROR] [1609865731.896142900] [rclcpp]: Couldn't add subscription to wait set: subscriptions set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:323
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_subscription (78 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_service

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'subscriptions set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:323'

with this new error message:

  'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at C:\dev\ros2_ws\src\ros2\rosidl_typesupport\rosidl_typesupport_cpp\src\type_support_dispatch.hpp:113'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1609865731.981892300] [rclcpp]: Couldn't add service to wait set: services set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:504
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_service (107 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_client

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'services set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:504'

with this new error message:

  'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at C:\dev\ros2_ws\src\ros2\rosidl_typesupport\rosidl_typesupport_cpp\src\type_support_dispatch.hpp:113'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1609865732.098596600] [rclcpp]: Couldn't add client to wait set: clients set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:493
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_client (117 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_guard_condition

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'clients set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:493'

with this new error message:

  'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at C:\dev\ros2_ws\src\ros2\rosidl_typesupport\rosidl_typesupport_cpp\src\type_support_dispatch.hpp:113'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1609865732.223195200] [rclcpp]: Couldn't add guard_condition to wait set: guard_conditions set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:459
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_guard_condition (149 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_timer

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'guard_conditions set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:459'

with this new error message:

  'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at C:\dev\ros2_ws\src\ros2\rosidl_typesupport\rosidl_typesupport_cpp\src\type_support_dispatch.hpp:113'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1609865732.372875200] [rclcpp]: Couldn't add timer to wait set: timers set is full, at C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\wait.c:473
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_timer (92 ms)
[ RUN      ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_waitable
[ERROR] [1609865732.411448200] [rclcpp]: Couldn't add waitable to wait set: error not set
[       OK ] TestAllocatorMemoryStrategy.add_handles_to_wait_set_waitable (7 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_subscription
[       OK ] TestAllocatorMemoryStrategy.get_next_subscription (258 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_service
[       OK ] TestAllocatorMemoryStrategy.get_next_service (815 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_client
[       OK ] TestAllocatorMemoryStrategy.get_next_client (796 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_timer
[       OK ] TestAllocatorMemoryStrategy.get_next_timer (811 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_waitable
[       OK ] TestAllocatorMemoryStrategy.get_next_waitable (812 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_subscription_mutually_exclusive
[       OK ] TestAllocatorMemoryStrategy.get_next_subscription_mutually_exclusive (310 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_service_mutually_exclusive
[       OK ] TestAllocatorMemoryStrategy.get_next_service_mutually_exclusive (370 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_client_mutually_exclusive
[       OK ] TestAllocatorMemoryStrategy.get_next_client_mutually_exclusive (296 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_timer_mutually_exclusive
[       OK ] TestAllocatorMemoryStrategy.get_next_timer_mutually_exclusive (312 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_waitable_mutually_exclusive
[       OK ] TestAllocatorMemoryStrategy.get_next_waitable_mutually_exclusive (317 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_subscription_out_of_scope
[       OK ] TestAllocatorMemoryStrategy.get_next_subscription_out_of_scope (192 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_service_out_of_scope
[       OK ] TestAllocatorMemoryStrategy.get_next_service_out_of_scope (156 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_client_out_of_scope
[       OK ] TestAllocatorMemoryStrategy.get_next_client_out_of_scope (155 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_timer_out_of_scope
[       OK ] TestAllocatorMemoryStrategy.get_next_timer_out_of_scope (155 ms)
[ RUN      ] TestAllocatorMemoryStrategy.get_next_waitable_out_of_scope
[       OK ] TestAllocatorMemoryStrategy.get_next_waitable_out_of_scope (155 ms)
[----------] 29 tests from TestAllocatorMemoryStrategy (7430 ms total)

[----------] Global test environment tear-down
[==========] 29 tests from 1 test suite ran. (7439 ms total)
[  PASSED  ] 29 tests.
(4e8.e0c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rmw_cyclonedds_cpp.dll
rmw_cyclonedds_cpp!clean_waitset_caches+0x8a:
00007ffb`fb2ae90a 488b00          mov     rax,qword ptr [rax] ds:feeefeee`feeefefe=????????????????
0:000> k
Child-SP          RetAddr           Call Site
00000062`968fa3a0 00007ffb`fb2ad1ac rmw_cyclonedds_cpp!clean_waitset_caches+0x8a
00000062`968fa450 00007ffb`fb2a87c8 rmw_cyclonedds_cpp!destroy_guard_condition+0x6c
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rmw_implementation.dll
00000062`968fa4a0 00007ffc`12386251 rmw_cyclonedds_cpp!rmw_destroy_guard_condition+0x28
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rcl.dll
00000062`968fa4d0 00007ffc`03545445 rmw_implementation!rmw_destroy_guard_condition+0x61
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rclcpp.dll
00000062`968fa510 00007ffb`fc8c81d2 rcl!rcl_guard_condition_fini+0xd5
00000062`968fa9a0 00007ffb`fc8c77de rclcpp!rclcpp::graph_listener::GraphListener::__shutdown+0xc2
00000062`968faa20 00007ffb`fc8c6fde rclcpp!rclcpp::graph_listener::GraphListener::shutdown+0x4e
00000062`968fcc40 00007ffb`fc8cbede rclcpp!rclcpp::graph_listener::GraphListener::~GraphListener+0x3e
00000062`968fcc70 00007ffb`fc926d79 rclcpp!rclcpp::graph_listener::GraphListener::`vector deleting destructor'+0x8e
00000062`968fcca0 00007ffb`fc928490 rclcpp!<lambda_7d4fb7196c5dd46371ba9d14810f4ed5>::operator()+0x49
00000062`968fcce0 00007ffb`fc8546a7 rclcpp!std::_Ref_count_resource<rclcpp::graph_listener::GraphListener *,<lambda_7d4fb7196c5dd46371ba9d14810f4ed5> >::_Destroy+0x40
00000062`968fcd10 00007ffb`fc854648 rclcpp!std::_Ref_count_base::_Decref+0x47
00000062`968fcd40 00007ffb`fc852388 rclcpp!std::_Ptr_base<void>::_Decref+0x38
00000062`968fcd70 00007ffb`fc87675f rclcpp!std::shared_ptr<void>::~shared_ptr<void>+0x28
00000062`968fcda0 00007ffb`fc87800c rclcpp!std::pair<std::type_index const ,std::shared_ptr<void> >::~pair<std::type_index const ,std::shared_ptr<void> >+0x2f
00000062`968fcdd0 00007ffb`fc8722df rclcpp!std::pair<std::type_index const ,std::shared_ptr<void> >::`scalar deleting destructor'+0x2c
00000062`968fce00 00007ffb`fc870541 rclcpp!std::_Default_allocator_traits<std::allocator<std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *> > >::destroy<std::pair<std::type_index const ,std::shared_ptr<void> > >+0x2f
00000062`968fce30 00007ffb`fc870450 rclcpp!std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *>::_Freenode<std::allocator<std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *> > >+0x41
00000062`968fce60 00007ffb`fc87c2ad rclcpp!std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *>::_Free_non_head<std::allocator<std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *> > >+0x70
00000062`968fcea0 00007ffb`fc87c1f9 rclcpp!std::list<std::pair<std::type_index const ,std::shared_ptr<void> >,std::allocator<std::pair<std::type_index const ,std::shared_ptr<void> > > >::clear+0x4d
00000062`968fcee0 00007ffb`fc86ccac rclcpp!std::_Hash<std::_Umap_traits<std::type_index,std::shared_ptr<void>,std::_Uhash_compare<std::type_index,std::hash<std::type_index>,std::equal_to<std::type_index> >,std::allocator<std::pair<std::type_index const ,std::shared_ptr<void> > >,0> >::clear+0x89
00000062`968fcf30 00007ffb`fc86bb1b rclcpp!rclcpp::Context::clean_up+0x5c
00000062`968fcf60 00007ffb`fc87f908 rclcpp!rclcpp::Context::~Context+0xab
00000062`968ff180 00007ffb`fc87f9bc rclcpp!rclcpp::contexts::DefaultContext::~DefaultContext+0x28
00000062`968ff1b0 00007ffb`fc87f39f rclcpp!rclcpp::contexts::DefaultContext::`scalar deleting destructor'+0x2c
00000062`968ff1e0 00007ffb`fc87faff rclcpp!std::_Destroy_in_place<rclcpp::contexts::DefaultContext>+0x2f
00000062`968ff210 00007ffb`fc8546a7 rclcpp!std::_Ref_count_obj2<rclcpp::contexts::DefaultContext>::_Destroy+0x2f
00000062`968ff240 00007ffb`fc87fa38 rclcpp!std::_Ref_count_base::_Decref+0x47
00000062`968ff270 00007ffb`fc87f8c8 rclcpp!std::_Ptr_base<rclcpp::contexts::DefaultContext>::_Decref+0x38
00000062`968ff2a0 00007ffb`fcab73f1 rclcpp!std::shared_ptr<rclcpp::contexts::DefaultContext>::~shared_ptr<rclcpp::contexts::DefaultContext>+0x28
00000062`968ff2d0 00007ffb`fa235107 rclcpp!`rclcpp::contexts::get_global_default_context'::`2'::`dynamic atexit destructor for 'default_context''+0x21
00000062`968ff300 00007ffb`fa234b15 ucrtbased!<lambda_d121dba8a4adeaf3a9819e48611155df>::operator()+0x127
00000062`968ff390 00007ffb`fa234c4a ucrtbased!__crt_seh_guarded_call<int>::operator()<<lambda_6a47f4c8fd0152770a780fc1d70204eb>,<lambda_d121dba8a4adeaf3a9819e48611155df> &,<lambda_6aaa2265f5b6a89667e7d7630012e97a> >+0x35
00000062`968ff3d0 00007ffb`fa2352b1 ucrtbased!__acrt_lock_and_call<<lambda_d121dba8a4adeaf3a9819e48611155df> >+0x4a
00000062`968ff420 00007ffb`fca78089 ucrtbased!_execute_onexit_table+0x31
00000062`968ff460 00007ffb`fca78cbc rclcpp!__scrt_dllmain_uninitialize_c+0x19
00000062`968ff490 00007ffb`fca78a97 rclcpp!dllmain_crt_process_detach+0x4c
00000062`968ff4d0 00007ffb`fca78e2e rclcpp!dllmain_crt_dispatch+0x67
00000062`968ff510 00007ffb`fca78fa1 rclcpp!dllmain_dispatch+0xfe
00000062`968ff560 00007ffc`25747cad rclcpp!_DllMainCRTStartup+0x31
00000062`968ff590 00007ffc`25773ff1 ntdll!LdrpCallInitRoutine+0x61
00000062`968ff600 00007ffc`25773e8d ntdll!LdrShutdownProcess+0x141
00000062`968ff700 00007ffc`2428e0ab ntdll!RtlExitUserProcess+0xad
00000062`968ff730 00007ffb`fa23467a KERNEL32!ExitProcessImplementation+0xb
00000062`968ff760 00007ffb`fa234629 ucrtbased!exit_or_terminate_process+0x3a
00000062`968ff790 00007ffb`fa2349a6 ucrtbased!common_exit+0x79
*** WARNING: Unable to verify checksum for test_allocator_memory_strategy.exe
00000062`968ff7f0 00007ff7`e3138167 ucrtbased!exit+0x16
00000062`968ff820 00007ff7`e313800e test_allocator_memory_strategy!__scrt_common_main_seh+0x147
00000062`968ff890 00007ff7`e3138339 test_allocator_memory_strategy!__scrt_common_main+0xe
00000062`968ff8c0 00007ffc`24287034 test_allocator_memory_strategy!mainCRTStartup+0x9
00000062`968ff8f0 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
00000062`968ff920 00000000`00000000 ntdll!RtlUserThreadStart+0x21
0:000>

Edit, it's at the *:

static void clean_waitset_caches()
{
  /* Called whenever a subscriber, guard condition, service or client is deleted (as these may
     have been cached in a waitset), and drops all cached entities from all waitsets (just to keep
     life simple). I'm assuming one is not allowed to delete an entity while it is still being
     used ... */
  std::lock_guard<std::mutex> lock(gcdds.lock);
  for (auto && ws : gcdds.waitsets) {
*   std::lock_guard<std::mutex> wslock(ws->lock);
    if (!ws->inuse) {
      waitset_detach(ws);
    }
  }
}

in the current state

0:000> dv
         wslock = class std::lock_guard<std::mutex>
             ws = 0xfeeefeee`feeefefe
     <range>$L0 = 0x00007ffb`fb3ba630 { size=0x0 }
     <begin>$L0 = class std::_List_unchecked_const_iterator<std::_List_val<std::_List_simple_types<CddsWaitset *> >,std::_Iterator_base0>
       <end>$L0 = class std::_List_unchecked_const_iterator<std::_List_val<std::_List_simple_types<CddsWaitset *> >,std::_Iterator_base0>
           lock = class std::lock_guard<std::mutex>

There's some issue with destructor ordering, it seems.

@clalancette
Copy link
Contributor Author

From my side, I looked a bit into the aarch64 failure. On an unloaded machine, I got it to fail only once in about 2 hours of testing. After that, I added some load to the machine, and now I'm able to make it happen fairly regularly (though not every time). It looks like under load, calling node_graph()->get_publishers_info_by_topic sometimes doesn't have the expected publisher.

It looks like there can be a delay between create_publisher, and the publisher actually showing up in the graph. This was likely a problem before the switch to CycloneDDS as the default, but may have been exacerbated by some details of CycloneDDS. In any case, I've opened ros2/rclcpp#1503 to use wait_for_graph_change to make this more robust, which should take care of this failure.

I also had some time to look into the CentOS failures. They generally fall into two categories that I'll discuss separately below.

Injection failures

For example https://ci.ros2.org/job/ci_linux-centos/94/testReport/junit/(root)/projectroot/test_lifecycle_node/

The mimick library does not support inject_on_return on CentOS. This has something to do with how the PLT is built on that platform, though I admit to not understanding the details. When this fails, you typically get an error like:

[WARNING] mocking_utils::inject_on_return() cannot forward call to original 'foo' function before injection

This has been a source of irritation in the past, though we have mostly worked around it at this point. It seems that a few more workarounds would be needed to make it fully work with CycloneDDS as the default. It is typically, though not always, possible to rewrite these tests to use patch_and_return or some other variant to handle these cases. Because of this, and because CentOS isn't a Tier 1 platform, I'm not going to worry about these failures right now.

Detecting type support failures

For example https://ci.ros2.org/job/ci_linux-centos/94/testReport/junit/projectroot.test/rclcpp/test_static_single_threaded_executor/

Even with the fix in ros2/rmw_cyclonedds#274 in place, we are still hitting areas of the code in CentOS where an error is set that is just not expected. Because of this, I'm going to look into solving ros2/rosidl_typesupport#100 now. Hopefully that will solve it for good, though we'll need a bunch of testing with different rmw vendors to make sure there are no side-effects.

@eboasson
Copy link

eboasson commented Jan 6, 2021

I think this is actually a bug in the rclcpp tests (or perhaps rclcpp itself) that's causing trouble with rmw_cyclonedds_cpp because of global destructor ordering. Below is a slightly shortened version of a cdb log and it shows the global destructor(s) of rmw_cyclonedds_cpp running before rclcpp stops using the RMW interface. I think that's broken ...

The global state that is being destroyed is really not very interesting, just a few mutexes, an empty map and an empty set. Perhaps it would be an option to simply do nothing in Cdds::~Cdds() but I don't know how to do that, nor what the implications of that might be.

I figured I could avoid the problem by converting the global variable into std::atomic<std::shared_ptr<Cdds>> because that would make it possible to keep it alive until the last reference disappears. Unfortunately, that type got introduced in C++20 ... Another option could be to rewrite the tests to not rely on global destructors, but I don't know if that'd be considered an acceptable option.

So, if anyone has a good suggestion ...

(2484.2754): Break instruction exception - code 80000003 (first chance)
ntdll!LdrpDoDebuggerBreak+0x30:
00007ffc`257e06d0 cc              int     3
0:000> bp rmw_cyclonedds_cpp!Cdds::~Cdds
Bp expression 'rmw_cyclonedds_cpp!Cdds::~Cdds' could not be resolved, adding deferred bp
0:000> g
Running main() from C:\dev\ros2_ws\install\src\gtest_vendor\src\gtest_main.cc
[==========] Running 20 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 17 tests from TestPublisher
<snip>
[----------] Global test environment tear-down
[==========] 20 tests from 3 test suites ran. (2839 ms total)
[  PASSED  ] 20 tests.
Breakpoint 0 hit
rmw_cyclonedds_cpp!Cdds::~Cdds:
00007ffc`038372e0 48894c2408      mov     qword ptr [rsp+8],rcx ss:0000008f`b72ff740=cccccccccccccccc


0:000> dx gcdds
gcdds            [Type: Cdds]
    [+0x000] lock             [Type: std::mutex]
    [+0x050] domains_lock     [Type: std::mutex]
    [+0x0a0] domains          : { size=0x0 } [Type: std::map<unsigned int,CddsDomain,std::less<unsigned int>,std::allocator<std::pair<unsigned int const ,CddsDomain> > >]
    [+0x0b8] gc_for_empty_waitset : 0 [Type: int]
    [+0x0c0] waitsets         : { size=0x0 } [Type: std::unordered_set<CddsWaitset *,std::hash<CddsWaitset *>,std::equal_to<CddsWaitset *>,std::allocator<CddsWaitset *> >]
0:000> k
Child-SP          RetAddr           Call Site
0000008f`b72ff738 00007ffc`038a99e1 rmw_cyclonedds_cpp!Cdds::~Cdds
0000008f`b72ff740 00007ffc`08af5107 rmw_cyclonedds_cpp!`dynamic atexit destructor for 'gcdds''+0x21
0000008f`b72ff770 00007ffc`08af4b15 ucrtbased!<lambda_d121dba8a4adeaf3a9819e48611155df>::operator()+0x127 [minkernel\crts\ucrt\src\appcrt\startup\onexit.cpp @ 206]
0000008f`b72ff800 00007ffc`08af4c4a ucrtbased!__crt_seh_guarded_call<int>::operator()<<lambda_6a47f4c8fd0152770a780fc1d70204eb>,<lambda_d121dba8a4adeaf3a9819e48611155df> &,<lambda_6aaa2265f5b6a89667e7d7630012e97a> >+0x35 [vccrt\vcruntime\inc\internal_shared.h @ 204]
0000008f`b72ff840 00007ffc`08af52b1 ucrtbased!__acrt_lock_and_call<<lambda_d121dba8a4adeaf3a9819e48611155df> >+0x4a [minkernel\crts\ucrt\inc\corecrt_internal.h @ 975]
0000008f`b72ff890 00007ffc`038979c9 ucrtbased!_execute_onexit_table+0x31 [minkernel\crts\ucrt\src\appcrt\startup\onexit.cpp @ 231]
0000008f`b72ff8d0 00007ffc`038991cc rmw_cyclonedds_cpp!__scrt_dllmain_uninitialize_c+0x19 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\utility\utility.cpp @ 399]
0000008f`b72ff900 00007ffc`03898fa7 rmw_cyclonedds_cpp!dllmain_crt_process_detach+0x4c [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 182]
0000008f`b72ff940 00007ffc`0389933e rmw_cyclonedds_cpp!dllmain_crt_dispatch+0x67 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 220]
0000008f`b72ff980 00007ffc`038994b1 rmw_cyclonedds_cpp!dllmain_dispatch+0xfe [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 293]
0000008f`b72ff9d0 00007ffc`25747cad rmw_cyclonedds_cpp!_DllMainCRTStartup+0x31 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 335]
0000008f`b72ffa00 00007ffc`25773ff1 ntdll!LdrpCallInitRoutine+0x61
0000008f`b72ffa70 00007ffc`25773e8d ntdll!LdrShutdownProcess+0x141
0000008f`b72ffb70 00007ffc`2428e0ab ntdll!RtlExitUserProcess+0xad
0000008f`b72ffba0 00007ffc`08af467a KERNEL32!ExitProcessImplementation+0xb
0000008f`b72ffbd0 00007ffc`08af4629 ucrtbased!exit_or_terminate_process+0x3a [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 144]
0000008f`b72ffc00 00007ffc`08af49a6 ucrtbased!common_exit+0x79 [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 282]
*** WARNING: Unable to verify checksum for test_publisher.exe
0000008f`b72ffc60 00007ff6`0ba559d7 ucrtbased!exit+0x16 [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 294]
0000008f`b72ffc90 00007ff6`0ba5587e test_publisher!__scrt_common_main_seh+0x147 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 297]
0000008f`b72ffd00 00007ff6`0ba55b69 test_publisher!__scrt_common_main+0xe [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331]
0000008f`b72ffd30 00007ffc`24287034 test_publisher!mainCRTStartup+0x9 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17]
0000008f`b72ffd60 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
0000008f`b72ffd90 00000000`00000000 ntdll!RtlUserThreadStart+0x21



0:000> g
(2484.2754): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
rmw_cyclonedds_cpp!clean_waitset_caches+0x8a:
00007ffc`037ee97a 488b00          mov     rax,qword ptr [rax] ds:feeefeee`feeefefe=????????????????



0:000> k
Child-SP          RetAddr           Call Site
0000008f`b72fa810 00007ffc`037ed21c rmw_cyclonedds_cpp!clean_waitset_caches+0x8a [C:\dev\ros2_ws\src\ros2\rmw_cyclonedds\rmw_cyclonedds_cpp\src\rmw_node.cpp @ 3225]
0000008f`b72fa8c0 00007ffc`037e87c8 rmw_cyclonedds_cpp!rmw_context_impl_t::init+0x1c [C:\dev\ros2_ws\src\ros2\rmw_cyclonedds\rmw_cyclonedds_cpp\src\rmw_node.cpp @ 3055]
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rmw_implementation.dll
0000008f`b72fa910 00007ffc`144d6251 rmw_cyclonedds_cpp!rmw_destroy_guard_condition+0x28 [C:\dev\ros2_ws\src\ros2\rmw_cyclonedds\rmw_cyclonedds_cpp\src\rmw_node.cpp @ 3064]
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rcl.dll
0000008f`b72fa940 00007ffc`0ced5445 rmw_implementation!rmw_destroy_guard_condition+0x61 [C:\dev\ros2_ws\src\ros2\rmw_implementation\rmw_implementation\src\functions.cpp @ 471]
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rclcpp.dll
0000008f`b72fa980 00007ffb`fc8c81d2 rcl!rcl_guard_condition_fini+0xd5 [C:\dev\ros2_ws\src\ros2\rcl\rcl\src\rcl\guard_condition.c @ 132]
0000008f`b72fae10 00007ffb`fc8c77de rclcpp!rclcpp::graph_listener::GraphListener::__shutdown+0xc2 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 354]
0000008f`b72fae90 00007ffb`fc8c6fde rclcpp!rclcpp::graph_listener::GraphListener::shutdown+0x4e [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 374]
0000008f`b72fd0b0 00007ffb`fc8cbede rclcpp!rclcpp::graph_listener::GraphListener::~GraphListener+0x3e [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 61]
0000008f`b72fd0e0 00007ffb`fc926d79 rclcpp!rclcpp::graph_listener::GraphListener::`vector deleting destructor'+0x8e
0000008f`b72fd110 00007ffb`fc928490 rclcpp!<lambda_7d4fb7196c5dd46371ba9d14810f4ed5>::operator()+0x49 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\include\rclcpp\context.hpp @ 266]
0000008f`b72fd150 00007ffb`fc8546a7 rclcpp!std::_Ref_count_resource<rclcpp::graph_listener::GraphListener *,<lambda_7d4fb7196c5dd46371ba9d14810f4ed5> >::_Destroy+0x40 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 702]
0000008f`b72fd180 00007ffb`fc854648 rclcpp!std::_Ref_count_base::_Decref+0x47 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 644]
0000008f`b72fd1b0 00007ffb`fc852388 rclcpp!std::_Ptr_base<void>::_Decref+0x38 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 878]
0000008f`b72fd1e0 00007ffb`fc87675f rclcpp!std::shared_ptr<void>::~shared_ptr<void>+0x28 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 1170]
0000008f`b72fd210 00007ffb`fc87800c rclcpp!std::pair<std::type_index const ,std::shared_ptr<void> >::~pair<std::type_index const ,std::shared_ptr<void> >+0x2f
0000008f`b72fd240 00007ffb`fc8722df rclcpp!std::pair<std::type_index const ,std::shared_ptr<void> >::`scalar deleting destructor'+0x2c
0000008f`b72fd270 00007ffb`fc870541 rclcpp!std::_Default_allocator_traits<std::allocator<std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *> > >::destroy<std::pair<std::type_index const ,std::shared_ptr<void> > >+0x2f [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\xmemory @ 707]
0000008f`b72fd2a0 00007ffb`fc870450 rclcpp!std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *>::_Freenode<std::allocator<std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *> > >+0x41 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\list @ 318]
0000008f`b72fd2d0 00007ffb`fc87c2ad rclcpp!std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *>::_Free_non_head<std::allocator<std::_List_node<std::pair<std::type_index const ,std::shared_ptr<void> >,void *> > >+0x70 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\list @ 330]
0000008f`b72fd310 00007ffb`fc87c1f9 rclcpp!std::list<std::pair<std::type_index const ,std::shared_ptr<void> >,std::allocator<std::pair<std::type_index const ,std::shared_ptr<void> > > >::clear+0x4d [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\list @ 1430]
0000008f`b72fd350 00007ffb`fc86ccac rclcpp!std::_Hash<std::_Umap_traits<std::type_index,std::shared_ptr<void>,std::_Uhash_compare<std::type_index,std::hash<std::type_index>,std::equal_to<std::type_index> >,std::allocator<std::pair<std::type_index const ,std::shared_ptr<void> > >,0> >::clear+0x89 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\xhash @ 1277]
0000008f`b72fd3a0 00007ffb`fc86bb1b rclcpp!rclcpp::Context::clean_up+0x5c [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\context.cpp @ 399]
0000008f`b72fd3d0 00007ffb`fc87f908 rclcpp!rclcpp::Context::~Context+0xab [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\context.cpp @ 161]
0000008f`b72ff5f0 00007ffb`fc87f9bc rclcpp!rclcpp::contexts::DefaultContext::~DefaultContext+0x28
0000008f`b72ff620 00007ffb`fc87f39f rclcpp!rclcpp::contexts::DefaultContext::`scalar deleting destructor'+0x2c
0000008f`b72ff650 00007ffb`fc87faff rclcpp!std::_Destroy_in_place<rclcpp::contexts::DefaultContext>+0x2f [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\xmemory @ 279]
0000008f`b72ff680 00007ffb`fc8546a7 rclcpp!std::_Ref_count_obj2<rclcpp::contexts::DefaultContext>::_Destroy+0x2f [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 1582]
0000008f`b72ff6b0 00007ffb`fc87fa38 rclcpp!std::_Ref_count_base::_Decref+0x47 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 644]
0000008f`b72ff6e0 00007ffb`fc87f8c8 rclcpp!std::_Ptr_base<rclcpp::contexts::DefaultContext>::_Decref+0x38 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 878]
0000008f`b72ff710 00007ffb`fcab73f1 rclcpp!std::shared_ptr<rclcpp::contexts::DefaultContext>::~shared_ptr<rclcpp::contexts::DefaultContext>+0x28 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\memory @ 1170]
0000008f`b72ff740 00007ffc`08af5107 rclcpp!`rclcpp::contexts::get_global_default_context'::`2'::`dynamic atexit destructor for 'default_context''+0x21
0000008f`b72ff770 00007ffc`08af4b15 ucrtbased!<lambda_d121dba8a4adeaf3a9819e48611155df>::operator()+0x127 [minkernel\crts\ucrt\src\appcrt\startup\onexit.cpp @ 206]
0000008f`b72ff800 00007ffc`08af4c4a ucrtbased!__crt_seh_guarded_call<int>::operator()<<lambda_6a47f4c8fd0152770a780fc1d70204eb>,<lambda_d121dba8a4adeaf3a9819e48611155df> &,<lambda_6aaa2265f5b6a89667e7d7630012e97a> >+0x35 [vccrt\vcruntime\inc\internal_shared.h @ 204]
0000008f`b72ff840 00007ffc`08af52b1 ucrtbased!__acrt_lock_and_call<<lambda_d121dba8a4adeaf3a9819e48611155df> >+0x4a [minkernel\crts\ucrt\inc\corecrt_internal.h @ 975]
0000008f`b72ff890 00007ffb`fca78089 ucrtbased!_execute_onexit_table+0x31 [minkernel\crts\ucrt\src\appcrt\startup\onexit.cpp @ 231]
0000008f`b72ff8d0 00007ffb`fca78cbc rclcpp!__scrt_dllmain_uninitialize_c+0x19 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\utility\utility.cpp @ 399]
0000008f`b72ff900 00007ffb`fca78a97 rclcpp!dllmain_crt_process_detach+0x4c [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 182]
0000008f`b72ff940 00007ffb`fca78e2e rclcpp!dllmain_crt_dispatch+0x67 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 220]
0000008f`b72ff980 00007ffb`fca78fa1 rclcpp!dllmain_dispatch+0xfe [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 293]
0000008f`b72ff9d0 00007ffc`25747cad rclcpp!_DllMainCRTStartup+0x31 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 335]
0000008f`b72ffa00 00007ffc`25773ff1 ntdll!LdrpCallInitRoutine+0x61
0000008f`b72ffa70 00007ffc`25773e8d ntdll!LdrShutdownProcess+0x141
0000008f`b72ffb70 00007ffc`2428e0ab ntdll!RtlExitUserProcess+0xad
0000008f`b72ffba0 00007ffc`08af467a KERNEL32!ExitProcessImplementation+0xb
0000008f`b72ffbd0 00007ffc`08af4629 ucrtbased!exit_or_terminate_process+0x3a [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 144]
0000008f`b72ffc00 00007ffc`08af49a6 ucrtbased!common_exit+0x79 [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 282]
0000008f`b72ffc60 00007ff6`0ba559d7 ucrtbased!exit+0x16 [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 294]
0000008f`b72ffc90 00007ff6`0ba5587e test_publisher!__scrt_common_main_seh+0x147 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 297]
0000008f`b72ffd00 00007ff6`0ba55b69 test_publisher!__scrt_common_main+0xe [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331]
0000008f`b72ffd30 00007ffc`24287034 test_publisher!mainCRTStartup+0x9 [d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17]
0000008f`b72ffd60 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
0000008f`b72ffd90 00000000`00000000 ntdll!RtlUserThreadStart+0x21
0:000>

@clalancette
Copy link
Contributor Author

I think this is actually a bug in the rclcpp tests (or perhaps rclcpp itself) that's causing trouble with rmw_cyclonedds_cpp because of global destructor ordering. Below is a slightly shortened version of a cdb log and it shows the global destructor(s) of rmw_cyclonedds_cpp running before rclcpp stops using the RMW interface. I think that's broken ...

Arg, yes, global destructor ordering. This has bitten us before.

The global state that is being destroyed is really not very interesting, just a few mutexes, an empty map and an empty set. Perhaps it would be an option to simply do nothing in Cdds::~Cdds() but I don't know how to do that, nor what the implications of that might be.

Yeah, that seems too hacky.

I figured I could avoid the problem by converting the global variable into std::atomic<std::shared_ptr<Cdds>> because that would make it possible to keep it alive until the last reference disappears. Unfortunately, that type got introduced in C++20 ...

I also couldn't make it work here with gcc, even with -std=c++20, but maybe I messed something up.

Another option could be to rewrite the tests to not rely on global destructors, but I don't know if that'd be considered an acceptable option.

I don't think that we should rely on that.

So, if anyone has a good suggestion ...

Two suggestions:

  1. Use the Construct on First Use Idiom for gcdds. Essentially that will just ensure that we never call the destructor of gcdds, which should fix the problem but will also show up as a memory "leak" to analyzers. It also has a non-zero overhead on every call, but that should be pretty minor.
  2. Revamp rmw_node.cpp to not need the global gcdds variable. I have a vague notion that we may be able to store a shared pointer to a heap-allocated Cdds structure somewhere else, but I'm not sure how feasible this is.

@clalancette
Copy link
Contributor Author

Another option:

  1. During rmw_init, allocate or take an additional reference to a heap-allocated Cdds structure. During rmw_shutdown, release the reference and destroy the structure if there are no references left. (There may be some thread safety issues to work out here)

@clalancette
Copy link
Contributor Author

OK, ros2/rosidl_typesupport#102 should solve most of the remaining issues on CentOS. I'm still not certain that is the way we want to go, but I'll gather additional opinions on that PR. Between that and ros2/rmw_cyclonedds#275 , I think we may be pretty close to getting this in.

@clalancette
Copy link
Contributor Author

clalancette commented Jan 7, 2021

All right. With ros2/rmw_cyclonedds#275 merged, we can look at the CI run in ros2/rmw_cyclonedds#275 (comment) again.

So I think the one remaining issue we have to deal with before merging this is the Linux test_graph__rmw_connext_cpp failure. I have so far been unable to reproduce it locally, but I'll keep trying. @eboasson any help with that is appreciated.

@eboasson
Copy link

eboasson commented Jan 7, 2021

So I think the one remaining issue we have to deal with before merging this is the Linux test_graph__rmw_connext_cpp failure. I have so far been unable to reproduce it locally, but I'll keep trying. @eboasson any help with that is appreciated.

A complication for me is that I don't have Connext and probably never will (RTI is quite sensitive about such things), so I can't try anything involving Connext myself.

Something that comes to mind that might play a role is the ros 2 daemon. If that one is perchance started prior to these tests and then remains running, it would be running in a slightly different environment with this PR in. I don't know when exactly that thing gets started, but you could try running the CI limiting testing to rcl only. That might be a datapoint.

@clalancette
Copy link
Contributor Author

Something that comes to mind that might play a role is the ros 2 daemon. If that one is perchance started prior to these tests and then remains running, it would be running in a slightly different environment with this PR in. I don't know when exactly that thing gets started, but you could try running the CI limiting testing to rcl only. That might be a datapoint.

Well, that is interesting. I ran a job just running the test_communication tests, and it did indeed succeed: https://ci.ros2.org/job/ci_linux/13371/ (ignore the flake8 errors; that has nothing to do with this PR).

So at the very least, it looks like there is some state being kept around and potentially causing problems. I'll add some code to kill off the daemon before running these tests, and see if that is any better when running the full suite.

@clalancette
Copy link
Contributor Author

clalancette commented Jan 8, 2021

It does look like killing the daemon before starting the tests improved things. Here is another CI with this PR and the contents of
https://github.com/ros2/system_tests/tree/clalancette/test-requester-debug . Assuming this is happy, I'll open a PR for that change, then I can convert this PR from a draft.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette
Copy link
Contributor Author

clalancette commented Jan 8, 2021

All right, one more mega-set of CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status
  • Windows Debug Build Status
  • Linux CentOS Build Status
  • Linux clang Build Status
  • Linux packaging (ros1_bridge) Build Status

@clalancette clalancette marked this pull request as ready for review January 8, 2021 23:05
@clalancette
Copy link
Contributor Author

Looking through the test failures in the last CI, most of them are known. However, after looking at it more carefully, there is one test failure in Windows which seems unique to this PR. That test failure is in rclcpp's test_node. What seems to happen is that all of tests complete successfully in a reasonable amount of time (about 1 second), but then the cleanup hangs until the timeout happens. @eboasson we will need to investigate that one, since something is clearly not right there.

Even so, I'm going to go ahead and merge this so that we can start doing testing beyond just the CI. Thanks for all of the work so far!

@clalancette clalancette merged commit eeb5859 into master Jan 12, 2021
@delete-merged-branch delete-merged-branch bot deleted the clalancette/change-default-rmw-to-cyclonedds branch January 12, 2021 01:00
@eboasson
Copy link

The test_node problem is caused by these two lines:
https://github.com/ros2/rclcpp/blob/56a037a3da15a7b3a5902a6cce2ef6b7c4aa8599/rclcpp/test/rclcpp/test_node.cpp#L71
and
https://github.com/ros2/rclcpp/blob/56a037a3da15a7b3a5902a6cce2ef6b7c4aa8599/rclcpp/test/rclcpp/test_node.cpp#L2525-L2554

Or more precisely, it's caused by Windows and those are the two lines that trigger it: in these two locations, the graph_listener thread defined at https://github.com/ros2/rclcpp/blob/56a037a3da15a7b3a5902a6cce2ef6b7c4aa8599/rclcpp/src/rclcpp/graph_listener.cpp#L109 is started, and it is then still blocked on a waitset by the time Windows decides that thread really shouldn't exist. Indiscriminately killing that thread from a global destructor then corrupts internal state of Cyclone and leaves it blocking on trying to delete the entity.

I'd say Windows needs to be eliminated for the good of humankind, or if that's impossible, fixed, or, assuming that too is an impossibility, those rclcpp tests need to be fixed because one can't support Windows and have the possibility that threads survive beyond main. But I don't quite know how to do that ... I suppose I should file an issue for rclcpp.

See below for a recording of a stripped down version of test_node, where everything but that very first call I mentioned has been commented out. Line 187 is the call to rcl_wait, line 188 the next line.

0:000> bp `graph_listener.cpp:187` "k;g"
*** WARNING: Unable to verify checksum for test_node.exe
*** WARNING: Unable to verify checksum for C:\dev\ros2_ws\install\bin\rclcpp.dll
0:000> bp `graph_listener.cpp:188` "k;g"
0:000> g
ModLoad: 00007ffc`243b0000 00007ffc`243e0000   C:\WINDOWS\System32\IMM32.DLL
Running main() from C:\dev\ros2_ws\install\src\gtest_vendor\src\gtest_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestNode
ModLoad: 00007ffb`f99b0000 00007ffb`f9afa000   C:\dev\ros2_ws\install\bin\rmw_cyclonedds_cpp.dll
ModLoad: 00007ffb`f5cc0000 00007ffb`f5f11000   C:\dev\ros2_ws\install\bin\ddsc.dll
ModLoad: 00007ffc`020c0000 00007ffc`0211c000   C:\dev\ros2_ws\install\bin\rmw_dds_common.dll
ModLoad: 00007ffc`1ae70000 00007ffc`1ae7f000   C:\dev\ros2_ws\install\bin\rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`036d0000 00007ffc`036e4000   C:\dev\ros2_ws\install\bin\rmw_dds_common__rosidl_typesupport_cpp.dll
ModLoad: 00007ffc`249e0000 00007ffc`24a4b000   C:\WINDOWS\System32\WS2_32.dll
ModLoad: 00007ffb`f2780000 00007ffb`f2ac9000   C:\WINDOWS\SYSTEM32\libcrypto-1_1-x64.dll
ModLoad: 00007ffc`25140000 00007ffc`2526b000   C:\WINDOWS\System32\RPCRT4.dll
ModLoad: 00007ffc`144a0000 00007ffc`144af000   C:\dev\ros2_ws\install\bin\rosidl_typesupport_introspection_c.dll
ModLoad: 00007ffc`24010000 00007ffc`240bc000   C:\WINDOWS\System32\ADVAPI32.dll
ModLoad: 00007ffc`23690000 00007ffc`236b7000   C:\WINDOWS\System32\bcrypt.dll
ModLoad: 00007ffc`23820000 00007ffc`238be000   C:\WINDOWS\System32\msvcrt.dll
ModLoad: 00007ffc`25630000 00007ffc`256cc000   C:\WINDOWS\System32\sechost.dll
ModLoad: 00007ffc`01ef0000 00007ffc`01f9c000   C:\WINDOWS\SYSTEM32\libssl-1_1-x64.dll
ModLoad: 00007ffc`22260000 00007ffc`2229b000   C:\WINDOWS\SYSTEM32\IPHLPAPI.DLL
ModLoad: 00007ffc`036b0000 00007ffc`036c9000   C:\WINDOWS\SYSTEM32\VCRUNTIME140.dll
ModLoad: 000001e6`666e0000 000001e6`666f9000   C:\WINDOWS\SYSTEM32\VCRUNTIME140.dll
ModLoad: 00007ffc`23060000 00007ffc`230e0000   C:\WINDOWS\System32\bcryptprimitives.dll
[ RUN      ] TestNode.construction_and_destructionModLoad: 00007ffc`208a0000 00007ffc`208b2000   C:\WINDOWS\SYSTEM32\kernel.appcore.dll

ModLoad: 00007ffc`25620000 00007ffc`25628000   C:\WINDOWS\System32\NSI.dll
ModLoad: 00007ffc`18b90000 00007ffc`18ba7000   C:\WINDOWS\SYSTEM32\dhcpcsvc6.DLL
ModLoad: 00007ffc`18ad0000 00007ffc`18aed000   C:\WINDOWS\SYSTEM32\dhcpcsvc.DLL
ModLoad: 00007ffc`222a0000 00007ffc`2236b000   C:\WINDOWS\SYSTEM32\DNSAPI.dll
ModLoad: 00007ffc`22570000 00007ffc`225da000   C:\WINDOWS\system32\mswsock.dll
ModLoad: 00007ffc`19040000 00007ffc`19057000   C:\WINDOWS\system32\napinsp.dll
ModLoad: 00007ffc`190b0000 00007ffc`190cb000   C:\WINDOWS\system32\pnrpnsp.dll
ModLoad: 00007ffc`15fe0000 00007ffc`15ff5000   C:\WINDOWS\system32\wshbth.dll
ModLoad: 00007ffc`18e20000 00007ffc`18e3d000   C:\WINDOWS\system32\NLAapi.dll
ModLoad: 00007ffc`1ac50000 00007ffc`1ac62000   C:\WINDOWS\System32\winrnr.dll
ModLoad: 00007ffc`020a0000 00007ffc`020c0000   C:\dev\ros2_ws\install\bin\rmw_dds_common__rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`02080000 00007ffc`0209c000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_typesupport_introspection_c.dll
ModLoad: 00007ffc`02060000 00007ffc`02073000   C:\dev\ros2_ws\install\bin\builtin_interfaces__rosidl_typesupport_introspection_c.dll
ModLoad: 00007ffc`00a90000 00007ffc`00ad2000   C:\dev\ros2_ws\install\bin\rcl_interfaces__rosidl_typesupport_introspection_cpp.dll
ModLoad: 00007ffc`00a70000 00007ffc`00a84000   C:\dev\ros2_ws\install\bin\builtin_interfaces__rosidl_typesupport_introspection_cpp.dll
Child-SP          RetAddr           Call Site
00000043`cd5fd800 00007ffb`f0e278c2 rclcpp!rclcpp::graph_listener::GraphListener::run_loop+0x407 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 187]
00000043`cd5fda60 00007ffb`f0e2a4de rclcpp!rclcpp::graph_listener::GraphListener::run+0x52 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 113]
00000043`cd5ffb80 00007ffb`f0e29ca5 rclcpp!std::invoke<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>+0x3e [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\type_traits @ 1614]
00000043`cd5ffbb0 00007ffb`fc7d542c rclcpp!std::thread::_Invoke<std::tuple<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>,0,1>+0x75 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\thread @ 44]
00000043`cd5ffc10 00007ffc`24287034 ucrtbased!thread_start<unsigned int (__cdecl*)(void *),1>+0x9c [minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 97]
00000043`cd5ffc70 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
00000043`cd5ffca0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Child-SP          RetAddr           Call Site
00000043`cd5fd800 00007ffb`f0e278c2 rclcpp!rclcpp::graph_listener::GraphListener::run_loop+0x429 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 188]
00000043`cd5fda60 00007ffb`f0e2a4de rclcpp!rclcpp::graph_listener::GraphListener::run+0x52 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 113]
00000043`cd5ffb80 00007ffb`f0e29ca5 rclcpp!std::invoke<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>+0x3e [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\type_traits @ 1614]
00000043`cd5ffbb0 00007ffb`fc7d542c rclcpp!std::thread::_Invoke<std::tuple<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>,0,1>+0x75 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\thread @ 44]
00000043`cd5ffc10 00007ffc`24287034 ucrtbased!thread_start<unsigned int (__cdecl*)(void *),1>+0x9c [minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 97]
00000043`cd5ffc70 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
00000043`cd5ffca0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Child-SP          RetAddr           Call Site
00000043`cd5fd800 00007ffb`f0e278c2 rclcpp!rclcpp::graph_listener::GraphListener::run_loop+0x407 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 187]
00000043`cd5fda60 00007ffb`f0e2a4de rclcpp!rclcpp::graph_listener::GraphListener::run+0x52 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 113]
00000043`cd5ffb80 00007ffb`f0e29ca5 rclcpp!std::invoke<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>+0x3e [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\type_traits @ 1614]
00000043`cd5ffbb0 00007ffb`fc7d542c rclcpp!std::thread::_Invoke<std::tuple<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>,0,1>+0x75 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\thread @ 44]
00000043`cd5ffc10 00007ffc`24287034 ucrtbased!thread_start<unsigned int (__cdecl*)(void *),1>+0x9c [minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 97]
00000043`cd5ffc70 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
00000043`cd5ffca0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Child-SP          RetAddr           Call Site
00000043`cd5fd800 00007ffb`f0e278c2 rclcpp!rclcpp::graph_listener::GraphListener::run_loop+0x429 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 188]
00000043`cd5fda60 00007ffb`f0e2a4de rclcpp!rclcpp::graph_listener::GraphListener::run+0x52 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 113]
00000043`cd5ffb80 00007ffb`f0e29ca5 rclcpp!std::invoke<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>+0x3e [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\type_traits @ 1614]
00000043`cd5ffbb0 00007ffb`fc7d542c rclcpp!std::thread::_Invoke<std::tuple<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>,0,1>+0x75 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\thread @ 44]
00000043`cd5ffc10 00007ffc`24287034 ucrtbased!thread_start<unsigned int (__cdecl*)(void *),1>+0x9c [minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 97]
00000043`cd5ffc70 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
00000043`cd5ffca0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Child-SP          RetAddr           Call Site
00000043`cd5fd800 00007ffb`f0e278c2 rclcpp!rclcpp::graph_listener::GraphListener::run_loop+0x407 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 187]
00000043`cd5fda60 00007ffb`f0e2a4de rclcpp!rclcpp::graph_listener::GraphListener::run+0x52 [C:\dev\ros2_ws\src\ros2\rclcpp\rclcpp\src\rclcpp\graph_listener.cpp @ 113]
00000043`cd5ffb80 00007ffb`f0e29ca5 rclcpp!std::invoke<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>+0x3e [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\type_traits @ 1614]
00000043`cd5ffbb0 00007ffb`fc7d542c rclcpp!std::thread::_Invoke<std::tuple<void (__cdecl rclcpp::graph_listener::GraphListener::*)(void),rclcpp::graph_listener::GraphListener *>,0,1>+0x75 [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\thread @ 44]
00000043`cd5ffc10 00007ffc`24287034 ucrtbased!thread_start<unsigned int (__cdecl*)(void *),1>+0x9c [minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 97]
00000043`cd5ffc70 00007ffc`2575d0d1 KERNEL32!BaseThreadInitThunk+0x14
00000043`cd5ffca0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
[       OK ] TestNode.construction_and_destruction (296 ms)
[----------] 1 test from TestNode (296 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (317 ms total)
[  PASSED  ] 1 test.
(1f50.2720): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffc`257afaa0 cc              int     3

@ivanpauno
Copy link
Member

those rclcpp tests need to be fixed because one can't support Windows and have the possibility that threads survive beyond main. But I don't quite know how to do that ... I suppose I should file an issue for rclcpp.

IIUC rclcpp::shutdown kills the graph listener thread, so adding that to the test should fix the issue (if not, the "default context" destructor will call shutdown, but that object has static lifetime and then you get this kind of problems ....)

@clalancette
Copy link
Contributor Author

@ivanpauno Thanks for the feedback. @eboasson and I discussed this, and he is going to give that a shot to see if it fixes the issue.

@wjwwood
Copy link
Member

wjwwood commented Jan 13, 2021

but that object has static lifetime and then you get this kind of problems

I'd like to understand this more. Can someone point me to some documentation or discussion about this buggy (insane?) behavior for Windows that it can just eliminate a thread before static objects which may or may not interact with it can be destroyed. That sounds broken to the point that it's not possible to me...

@ivanpauno
Copy link
Member

I don't know about this pull request, it sounds like we need to ensure this doesn't happen without an explicit shutdown.

Specifically, we allow and even encourage our users to not necessarily explicitly shutdown in their code and in our examples. So this would not work on Windows. It's not a very satisfying solution (requiring an explicit shutdown).

The best reference I've found is this and this.

See particularly steps 1. and 3. in the first link, step 1. is to kill threads, step 3. is to call global destructors (if I'm understanding things correctly, this is the kind of thing you need to re-read 10 times 😄 )

@wjwwood
Copy link
Member

wjwwood commented Jan 14, 2021

I'll read those tomorrow, but it seems really odd to me that we don't see problems elsewhere. Also, it's not obvious to me when ExitProcess is being called. Maybe it's called implicitly at the end of main? But I thought the main function wrapper did static object destruction... The documentation you linked to seems to be from the perspective that the user code calls that from within the main function or something. Similar to abort or exit/quick_exit.

@ivanpauno
Copy link
Member

Maybe it's called implicitly at the end of main? But I thought the main function wrapper did static object destruction...

My understanding is that ExitProcess is called at end of main, but I'm really not sure if that's the case (best reference saying that happens is this article).

But I thought the main function wrapper did static object destruction...

I think that the main wrapper may call destructors from static objects of the executable before calling ExitProcess, but destructors from static objects of loaded DLLs seem to be called by ExitProcess (which first kills all running threads).

To actually check what happens, we would need to check the generated object code or debug step by step process termination, but honestly I prefer not knowing the details and assume that I don't know when destructors of static lifetime objects will be run 😂.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants