Skip to content
This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

New windows debug exceptions #19

Closed
dhood opened this issue May 1, 2017 · 2 comments
Closed

New windows debug exceptions #19

dhood opened this issue May 1, 2017 · 2 comments
Labels
bug Something isn't working diagnosed

Comments

@dhood
Copy link
Member

dhood commented May 1, 2017

Lots of new failures on windows debug nightlies as of 8 April.

Impacted tests

This is the list of new failures (there are many other failures on these jobs for other reasons)

projectroot.gtest_client_wait_for_service_shutdown__rmw_fastrtps_cpp
projectroot.gtest_executor__rmw_fastrtps_cpp
projectroot.gtest_intra_process__rmw_fastrtps_cpp
projectroot.gtest_local_parameters__rmw_fastrtps_cpp
projectroot.gtest_multiple_service_calls__rmw_fastrtps_cpp
projectroot.gtest_multithreaded__rmw_fastrtps_cpp
projectroot.gtest_publisher__rmw_fastrtps_cpp
projectroot.gtest_repeated_publisher_subscriber__rmw_fastrtps_cpp
projectroot.gtest_services_in_constructor__rmw_fastrtps_cpp
projectroot.gtest_spin__rmw_fastrtps_cpp
projectroot.gtest_subscription__rmw_fastrtps_cpp
projectroot.gtest_timeout_subscriber__rmw_fastrtps_cpp
projectroot.gtest_timer__rmw_fastrtps_cpp
projectroot.test_callback_exceptions
projectroot.test_client__rmw_fastrtps_cpp
projectroot.test_composition__rmw_fastrtps_cpp
projectroot.test_default_state_machine
projectroot.test_externally_defined_services
projectroot.test_find_weak_nodes
projectroot.test_get_node_names__rmw_fastrtps_cpp
projectroot.test_lifecycle_node
projectroot.test_messages_c__rmw_fastrtps_cpp
projectroot.test_multiple_instances
projectroot.test_node__rmw_fastrtps_cpp
projectroot.test_publisher__rmw_fastrtps_cpp
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__BoundedArrayNested
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__BoundedArrayPrimitives
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__Builtins
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__DynamicArrayNested
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__DynamicArrayPrimitives
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__Empty
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__Nested
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__Primitives
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__StaticArrayNested
projectroot.test_publisher_subscriber_cpp__rmw_fastrtps_cpp__StaticArrayPrimitives
projectroot.test_service__rmw_fastrtps_cpp
projectroot.test_state_machine_info
projectroot.test_subscription__rmw_fastrtps_cpp
projectroot.test_subscription_valid_data_cpp__rmw_fastrtps_cpp
projectroot.test_tutorial_add_two_ints_server_add_two_ints_client__rmw_fastrtps_cpp
projectroot.test_tutorial_add_two_ints_server_add_two_ints_client_async__rmw_fastrtps_cpp
service_client__rmw_fastrtps_cpp.wait_for_service_shutdown
test_add_two_ints_server_add_two_ints_client__rmw_fastrtps_cpp_Debug.test_executable
test_add_two_ints_server_add_two_ints_client_async__rmw_fastrtps_cpp_Debug.test_executable
test_composition__rmw_fastrtps_cpp_Debug.test_api_pubsub_composition
test_composition__rmw_fastrtps_cpp_Debug.test_api_srv_composition
test_executor__rmw_fastrtps_cpp.multiple_executors
test_executor__rmw_fastrtps_cpp.multithreaded_spin_call
test_executor__rmw_fastrtps_cpp.notify
test_executor__rmw_fastrtps_cpp.recursive_spin_call
test_intra_process_within_one_node__rmw_fastrtps_cpp.nominal_usage
test_local_parameters__rmw_fastrtps_cpp.get_from_node_primitive_type
test_local_parameters__rmw_fastrtps_cpp.get_from_node_variant_type
test_local_parameters__rmw_fastrtps_cpp.get_parameter_or
test_local_parameters__rmw_fastrtps_cpp.helpers
test_local_parameters__rmw_fastrtps_cpp.local_asynchronous
test_local_parameters__rmw_fastrtps_cpp.local_synchronous
test_local_parameters__rmw_fastrtps_cpp.local_synchronous_repeated
test_local_parameters__rmw_fastrtps_cpp.set_parameter_if_not_set
test_multiple_service_calls__rmw_fastrtps_cpp.multiple_clients
test_multithreaded__rmw_fastrtps_cpp.multi_access_publisher
test_multithreaded__rmw_fastrtps_cpp.multi_access_publisher_intra_process
test_multithreaded__rmw_fastrtps_cpp.multi_consumer_clients
test_multithreaded__rmw_fastrtps_cpp.multi_consumer_intra_process
test_multithreaded__rmw_fastrtps_cpp.multi_consumer_single_producer
test_publisher__rmw_fastrtps_cpp.publish_with_const_reference


Most only impact the rmw_fastrtps variant of tests, but some are not run for each rmw implementation. The ones that don't have fastrtps in the name are:
projectroot.test_callback_exceptions
projectroot.test_default_state_machine
projectroot.test_externally_defined_services
projectroot.test_find_weak_nodes
projectroot.test_lifecycle_node
projectroot.test_multiple_instances
projectroot.test_state_machine_info

Without digging into the details of these tests I would think it's because they're using the default RMW implementation of rmw_fastrtps, given how rmw_connext_cpp tests seem unaffected. If these tests are actually rmw-agnostic could someone please point it out.

Symptom
They most common error output is an SEH exception (e.g. http://ci.ros2.org/view/nightly/job/nightly_win_deb/435/testReport/(root)/projectroot/test_find_weak_nodes/):

-- run_test.py: invoking following command in 'C:\J\workspace\nightly_win_deb\ws\src\ros2\rclcpp\rclcpp':
 - C:/J/workspace/nightly_win_deb/ws/build/rclcpp/Debug/test_find_weak_nodes.exe --gtest_output=xml:C:/J/workspace/nightly_win_deb/ws/build/rclcpp/test_results/rclcpp/test_find_weak_nodes.gtest.xml
Running main() from gtest_main.cc
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from TestFindWeakNodes
[ RUN      ] TestFindWeakNodes.allocator_strategy_with_weak_nodes
unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
[  FAILED  ] TestFindWeakNodes.allocator_strategy_with_weak_nodes (70 ms)
[ RUN      ] TestFindWeakNodes.allocator_strategy_no_weak_nodes
unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
[  FAILED  ] TestFindWeakNodes.allocator_strategy_no_weak_nodes (73 ms)
[----------] 2 tests from TestFindWeakNodes (143 ms total)

but sometimes it's just bad return codes e.g. http://ci.ros2.org/view/nightly/job/nightly_win_deb/435/testReport/(root)/projectroot/test_tutorial_add_two_ints_server_add_two_ints_client__rmw_fastrtps_cpp/

launch_testing.UnmatchedOutputError: Example output ([]) does not match expected output ([b'Result of add_two_ints: 5'])
-------------------- >> begin captured stdout << ---------------------
(test_executable_0) pid 8616: ['C:/J/workspace/nightly_win_deb/ws/build/demo_nodes_cpp/Debug/add_two_ints_server.exe', 'test_executable'] (all > console, InMemoryHandler: test_executable_0)
(test_executable_1) pid 12648: ['C:/J/workspace/nightly_win_deb/ws/build/demo_nodes_cpp/Debug/add_two_ints_client.exe', 'test_executable'] (all > console, InMemoryHandler: test_executable_1)
[test_executable_0] Incoming request
[test_executable_0] a: 2 b: 3
(test_executable_1) rc 3221225477
() tear down
(test_executable_0) signal SIGTERM
(test_executable_0) rc 1

Investigations
At the time we were still using a pinned version of fastrtps, so I would rule out any changes on that repo. There also don't seem to have been any work on relevant windows buildfarm machines then (eatable was touch on the 7th but it hasn't run debug nightlies since mach 29). These are the PRs that were merged on our repos around that time: https://github.com/search?p=1&q=user%3Aament+user%3Aros2+merged%3A%222017-04-06T22%3A00%3A00-08%3A00+..+2017-04-08T15%3A30%3A00-08%3A00%22&type=Issues&utf8=%E2%9C%93

It's almost all the namespace implementation. There were two other PRs in ament_tools, but even resetting the head of ament_tools to before those two PRs were merged, the tests still fail: http://ci.ros2.org/job/ci_windows/2679/#showFailuresLink

Conclusion
This seems to be caused by the changes to support namespaces in ros2/rmw#95 or one of the related PRs. Not all tests fail, (e.g. this pub-sub passes http://ci.ros2.org/job/ci_windows/2679/testReport/(root)/test_publisher_subscriber__StaticArrayNested__rclcpp__rmw_connext_cpp__rmw_fastrtps_cpp_Debug/).

My suspicion is that it's only tests that are creating nodes with anything more than the node name passed. This is because only the test_rclcpp tests modified in https://github.com/ros2/system_tests/pull/196/files were affected. @wjwwood does it seem plausible that those PRs introduced this? (I may have missed a possibility and I wouldn't want to declare an incorrect cause)

@dhood dhood added the diagnosed label May 2, 2017
@wjwwood
Copy link
Member

wjwwood commented May 2, 2017

I don't see how the changes in ros2/rmw#95 could have caused these failures. I'll spin up a Windows VM to check why it these tests are failing though.

@wjwwood
Copy link
Member

wjwwood commented May 3, 2017

Getting it running in Visual Studio, this is what I get (rmw_fastrtps/rmw_fastrtps_cpp/src/functions.cpp:1329):

Exception thrown: read access violation.
guard_condition was 0xFFFFFFFFFFFFFFF7.

This is the call stack:

rmw_fastrtps_cpp.dll!rmw_destroy_guard_condition(rmw_guard_condition_t * guard_condition)
  Line: 1329
  Language: C++
rmw_fastrtps_cpp.dll!rmw_destroy_node(rmw_node_t * node)
  Line: 774
  Language: C++
rcl.dll!rcl_node_fini(rcl_node_t * node)
  Line: 252
  Language: C
rclcpp.dll!rclcpp::node_interfaces::NodeBase::{ctor}::__l2::<lambda>(rcl_node_t * node)
  Line: 97
  Language: C++
rclcpp.dll!std::_Ref_count_del<rcl_node_t,void <lambda>(rcl_node_t *) >::_Destroy()
  Line: 610
  Language: C++
rclcpp.dll!std::_Ref_count_base::_Decref()
  Line: 538
  Language: C++
rclcpp.dll!std::_Ptr_base<rcl_node_t>::_Decref()
  Line: 762
  Language: C++
rclcpp.dll!std::shared_ptr<rcl_node_t>::~shared_ptr<rcl_node_t>()
  Line: 1000
  Language: C++
rclcpp.dll!rclcpp::node_interfaces::NodeBase::~NodeBase()
  Line: 123
  Language: C++

I don't see anything in that call stack that would have been affected by my changes. The only thing I could imagine is that the guard_condition pointer got overwritten by some array overrun. In fact, I'm not even sure any of my functions added in those prs are called in this test. Certainly I added an argument to a class constructor, but I don't see how that would cause this.

I guess we're dealing with heap overwrite, but according to this page it is usually Debug -> Release where these issues start popping up:

https://msdn.microsoft.com/en-us/library/aa236698%28v=vs.60%29.aspx?f=255&MSPPError=-2147217396#_core_heap_layout

Maybe we were just getting lucky in Release and in Debug we're finding the issues.

I don't know where to go from here, but I guess setting up watch points for memory and checking the memory state as you walk through the executable. I don't have time to do that right now.

I also audited my code changes and I flagged these spots as potentially dangerous, but I couldn't find that any of them were the root cause:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working diagnosed
Projects
None yet
Development

No branches or pull requests

3 participants