-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ActionClient Crash: Taking data from action client but nothing is ready #2242
Comments
Could you prepare a minimal example that one could build and run to reproduce the error? It would also provide clarity on how various callbacks are defined. |
I tried to create a minimal example, and figured out, that the cause of this issue was most likely due to deadlock issues in the client or the server. I can not reproduce the issue as original described in the minimal example. However I managed reproduce the error now with a slightly different setup. https://github.com/jmachowinski/rclcppActionDeadlock branch nothingReady start server Server:
client:
start the client client just runs server directly crashes with
Note, you need to do this multiple times, about 1 time out of 10 the error will trigger. |
Okay, I had a look at the code now and it it pretty obvious that a race condition is ongoing. rcl_action_server_wait_set_get_entities_ready or repectitive rcl_wait From the documentation of this function, it is unclear to me if it may return multiple events to be ready at once. |
I dug further into this issue, and from my tests I can see, that rcl_action_server_wait_set_get_entities_ready may return multiple events at one time. Now I am a bit more lost on how to correctly map this to the waitable interface, as is_ready can not signal that multiple events are ready. The only way I see currently is to keep track of incoming events, as reported events through the waitable interface, which is somewhat cumbersome. Any ideas ? |
I can reproduce the following exception on server side. terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): goal_handle attempted invalid transition from state EXECUTING with event CANCELED, at ./src/rcl_action/goal_handle.c:95
Aborted (core dumped) i believe this is because i think what expected here is, user callback should decide if accepted or not only, then process canceling sequence in the application then call btw, the following patch is required to build with diff --git a/CMakeLists.txt b/CMakeLists.txt
index 59c6de8..4c509bc 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,5 +25,9 @@ add_executable(client src/Client.cpp)
ament_target_dependencies(client PUBLIC rclcpp rclcpp_action)
target_link_libraries(client PRIVATE ${cpp_typesupport_target})
+install(TARGETS
+ server
+ client
+ DESTINATION lib/${PROJECT_NAME})
ament_package()
diff --git a/package.xml b/package.xml
index 6b65995..f865ad0 100644
--- a/package.xml
+++ b/package.xml
@@ -9,6 +9,9 @@
<buildtool_depend>ament_cmake</buildtool_depend>
+ <depend>rclcpp</depend>
+ <depend>rclcpp_action</depend>
+
<test_depend>ament_lint_auto</test_depend>
<test_depend>ament_lint_common</test_depend>
<buildtool_depend>rosidl_default_generators</buildtool_depend> gdb core stacktrace#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140540045682240) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140540045682240) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140540045682240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007fd22dc85476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007fd22dc6b7f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007fd22df2dbbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fd22df3924c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fd22df392b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007fd22df3923e in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007fd22ec7a8d9 in rclcpp::exceptions::throw_from_rcl_error (ret=2301, prefix="", error_state=0x0,
reset_error=0x7fd22e0ce794 <rcutils_reset_error>) at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/exceptions/exceptions.cpp:90
#10 0x00007fd22f1cbc8a in rclcpp_action::ServerGoalHandleBase::_canceled (this=0x7fd1ec001130)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server_goal_handle.cpp:96
#11 0x00005646ab15d534 in rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>::canceled (this=0x7fd1ec001130,
result_msg=std::shared_ptr<deadlock::action::SomeAction_Result_<std::allocator<void> >> (use count 2, weak count 0) = {...})
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/include/rclcpp_action/server_goal_handle.hpp:210
#12 0x00005646ab15b683 in ServerNode::cancelGoal (this=0x5646ac6fef60,
goalHandle=std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>> (use count 3, weak count 1) = {...})
at /root/ros2_ws/colcon_ws/src/ros2/rclcppActionDeadlock/src/Server.cpp:59
#13 0x00005646ab1663a5 in std::__invoke_impl<rclcpp_action::CancelResponse, rclcpp_action::CancelResponse (ServerNode::*&)(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), ServerNode*&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > > (
__f=@0x5646aca6aca0: (rclcpp_action::CancelResponse (ServerNode::*)(ServerNode * const, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)) 0x5646ab15b3da <ServerNode::cancelGoal(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>,
__t=@0x5646aca6acb0: 0x5646ac6fef60) at /usr/include/c++/11/bits/invoke.h:74
#14 0x00005646ab165043 in std::__invoke<rclcpp_action::CancelResponse (ServerNode::*&)(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), ServerNode*&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > > (
__fn=@0x5646aca6aca0: (rclcpp_action::CancelResponse (ServerNode::*)(ServerNode * const, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)) 0x5646ab15b3da <ServerNode::cancelGoal(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>)
at /usr/include/c++/11/bits/invoke.h:96
#15 0x00005646ab163d10 in std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::__call<rclcpp_action::CancelResponse, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&, 0ul, 1ul>(std::tuple<std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&>&&, std::_Index_tuple<0ul, 1ul>) (
this=0x5646aca6aca0, __args=...) at /usr/include/c++/11/functional:420
#16 0x00005646ab162fa0 in std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::operator()<std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >, rclcpp_action::CancelResponse>(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (this=0x5646aca6aca0)
at /usr/include/c++/11/functional:503
#17 0x00005646ab162353 in std::__invoke_impl<rclcpp_action::CancelResponse, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > >(std::__invoke_other, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (
__f=...) at /usr/include/c++/11/bits/invoke.h:61
#18 0x00005646ab160929 in std::__invoke_r<rclcpp_action::CancelResponse, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > >(std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (__fn=...)
--Type <RET> for more, q to quit, c to continue without paging--
at /usr/include/c++/11/bits/invoke.h:114
#19 0x00005646ab15e724 in std::_Function_handler<rclcpp_action::CancelResponse (std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (
__functor=..., __args#0=...) at /usr/include/c++/11/bits/std_function.h:290
#20 0x00005646ab16b335 in std::function<rclcpp_action::CancelResponse (std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::operator()(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >) const (this=0x5646ac84ba60,
__args#0=std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>> (empty) = {...})
at /usr/include/c++/11/bits/std_function.h:590
#21 0x00005646ab169b66 in rclcpp_action::Server<deadlock::action::SomeAction>::call_handle_cancel_callback (this=0x5646ac84b9b0, uuid=...)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/include/rclcpp_action/server.hpp:439
#22 0x00007fd22f1b6fde in rclcpp_action::ServerBase::execute_cancel_request_received (this=0x5646ac84b9b0,
data=std::shared_ptr<void> (use count 2, weak count 0) = {...}) at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server.cpp:459
#23 0x00007fd22f1b5be4 in rclcpp_action::ServerBase::execute (this=0x5646ac84b9b0, data=std::shared_ptr<void> (use count 2, weak count 0) = {...})
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server.cpp:299
#24 0x00007fd22ec84aec in rclcpp::Executor::execute_any_executable (this=0x7fff63e3e250, any_exec=...)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:546
#25 0x00007fd22ecafc51 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7fff63e3e250, this_thread_number=9)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:100
#26 0x00007fd22ecb1cb7 in std::__invoke_impl<void, void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&> (
__f=@0x5646aca6da98: (void (rclcpp::executors::MultiThreadedExecutor::*)(rclcpp::executors::MultiThreadedExecutor * const, unsigned long)) 0x7fd22ecafad4 <rclcpp::executors::MultiThreadedExecutor::run(unsigned long)>, __t=@0x5646aca6dab0: 0x7fff63e3e250) at /usr/include/c++/11/bits/invoke.h:74
#27 0x00007fd22ecb1bc9 in std::__invoke<void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&> (
__fn=@0x5646aca6da98: (void (rclcpp::executors::MultiThreadedExecutor::*)(rclcpp::executors::MultiThreadedExecutor * const, unsigned long)) 0x7fd22ecafad4 <rclcpp::executors::MultiThreadedExecutor::run(unsigned long)>) at /usr/include/c++/11/bits/invoke.h:96
#28 0x00007fd22ecb1ad2 in std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (this=0x5646aca6da98, __args=...) at /usr/include/c++/11/functional:420
#29 0x00007fd22ecb1a15 in std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::operator()<, void>() (this=0x5646aca6da98) at /usr/include/c++/11/functional:503
#30 0x00007fd22ecb19bc in std::__invoke_impl<void, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::__invoke_other, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#31 0x00007fd22ecb1965 in std::__invoke<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#32 0x00007fd22ecb1906 in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x5646aca6da98) at /usr/include/c++/11/bits/std_thread.h:253
#33 0x00007fd22ecb18d6 in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::operator()() (this=0x5646aca6da98) at /usr/include/c++/11/bits/std_thread.h:260
#34 0x00007fd22ecb18b6 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > > >::_M_run() (this=0x5646aca6da90) at /usr/include/c++/11/bits/std_thread.h:211
#35 0x00007fd22df672b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#36 0x00007fd22dcd7b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#37 0x00007fd22dd69a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 |
@jmachowinski i think doc section is not correct, see #2266 |
@fujitatomoya I agree. |
I think
on this one, i cannot reproduce this one, can you elaborate what the root case of this before starting review? that would be really appreciated to understand the problem 1st. |
This problem is hard to reproduce, you need a certain workload and a actions running with a multithreaded spinner. As for the explanation of what is going on, this is a long one... First we need to look at the multithreaded executor: rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp Lines 82 to 100 in e296583
The important point here is that get_next_executable is called with a lock and execute_any_executable(any_exec) is called without the lock. Now we need to look at the function rclcpp/rclcpp_action/src/client.cpp Lines 266 to 274 in e296583
This function will be called by get_next_executable deep under the hood. The important point here is that the kind of next executable is stored in the global pimpl. Next is the function rclcpp/rclcpp_action/src/client.cpp Line 622 in e296583
This function is called by execute_any_executable. As pointed out above, this call is not protected any more by a lock, and any other thread may have been calling ClientBase::is_ready in-between. As one can see, the execute function uses the values in the global pimpl object to determine the kind of action that needs processing. This it the point were the race occurs, as the global pimpl may have been altered by another thread, leading to the "Executing action client but nothing is ready" exception. if (pimpl_->is_feedback_ready) {
} else if (pimpl_->is_status_ready) {
} else if (pimpl_->is_goal_response_ready) {
} else if (pimpl_->is_result_response_ready) {
} else if (pimpl_->is_cancel_response_ready) {
} else {
throw std::runtime_error("Executing action client but nothing is ready");
} Note, I also saw races between is_ready and take_data. Which are a mystery to me, as this should be protected by the mutex in the spinner. The patch also addresses these. |
@fujitatomoya is the issue now clear to you ? |
Seeing a simialr error on a code that was running fine under |
Hi @jmachowinski , thank you for your digging and elaboration on the issue. We have been encountering it in our project since april, randomly in some 1/10 cases and seemingly dependent on the computer utilization. I am now testing our application with your fix and will let you know once I will be able to determine if it helps. |
@jmachowinski we ran into the same issue and it's kind of a blocker for us due to its quite consistent reproduction. Your explanation here was very useful to bootstrap the investigation we have started as well however there is one thing we cannot puzzle - originally you talked about runtime exception with a message "Taking data from action client but nothing is ready" (and we get exactly the same error) but your explanation leads to "Executing action client but nothing is ready" which seems to be different (at least we haven't seen that one yet). Does it mean that ones you fixed the race between |
@medvedevigorek Its a race between the tree function. is_ready() sets class members, which represent which data must be taken. In take_data() theses members are actually not cleared, as they are 'misused' to store the type of data that was taken. Later in execute the class members are used to figure out what kind of data was taken, and the class members are cleared. This leads to the three possible failures:
Note that #2250 is a working fix |
Thank you, @jmachowinski , for additional details. After getting the race condition caused by Quick overview of involved functions:
Now what sequence leads to the multiple calls of
So the actual problem must be in AllocatorMemoryStrategy as it blindly adds ready waitables to the triggered list. rclcpp/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp Lines 123 to 127 in e296583
I ended up with a simple change to add a waitable only if it's not present in the triggered list and it fixed the problem for me. This is in addition to a change in I also tried your fix and works as well as it makes Not sure what else might be impacted by the flow described above besides action client base - don't know what else uses Waitable interface in rclcpp. |
@medvedevigorek Nice debugging. I agree, this a second bug we are hitting here. |
@jmachowinski, I was thinking about clearing |
@medvedevigorek when you say this, do you mean it's not in a group (I assumed you mean callback group) anymore because it was deleted? Or for some other reason? |
@wjwwood Its not added by
|
@wjwwood yes, I meant callback group. |
Fixed by the merge of #2495 in jazzy |
Thank you for your efforts in fixing this. Is there a possibility to backport it to humble? 🙏 |
@alexmillane unfortunately i do not think we can do that, since #2495 is API/ABI break change. |
Its pretty much straight forward to modify the patch to be non ABI/API breaking. @alexmillane feel free to do it ;-) |
Looking at #2250 it seems this was applied to rolling/jazzy. Has anyone tried to apply to iron? Does the patch apply cleanly? Thanks! PS. Or can the (patched) rolling rclcpp package be used with an iron distro? |
@jmachowinski Could you eleborate how? I am willing to put in the effort into backporting this patch to iron as it basically makes it impossible to run any ros2control application for a longer period when using any controller with an action interface. |
Use the patch from #2495 and change it, so that the signature of execute_goal_request_received(const std::shared_ptr<void> & data);
execute_cancel_request_received(const std::shared_ptr<void> & data);
execute_result_request_received(const std::shared_ptr<void> & data); and the enum EntityType are not changed. Should be straight forward. |
Hello @alexmillane and @doisyg, I just backported the fix from |
Bug report
Required Info:
Steps to reproduce issue
Start action server
Start action client
wait
Expected behavior
no crash
Actual behavior
after some time we get the exception
terminate called after throwing an instance of 'std::runtime_error' what(): Taking data from action client but nothing is ready
Additional information
Using a multi threaded spinner, but the default callback group, I guess this one is mutual exclusive
The text was updated successfully, but these errors were encountered: