
[ 11 ] Resource temporarily unavailable #1347

Closed
niclar opened this issue May 4, 2022 · 15 comments · Fixed by #1349

@niclar (Contributor) commented May 4, 2022

Hi, we just experienced our first RouDi (v2.0.0) outage, on Ubuntu 20.04 LTS with clang 14.

Most of the clients kept working, but two of the publishing clients received the error below and the introspection program did not work; a machine restart was needed to resolve it.

Any pointers as to why? Pub/sub is set up as:
publisherOptions.subscriberTooSlowPolicy = iox::popo::ConsumerTooSlowPolicy::WAIT_FOR_CONSUMER;
subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::BLOCK_PRODUCER;
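
For context, this is roughly how we wire those options up, as a minimal sketch with the iceoryx v2 typed C++ API (app name, service description, and payload type are illustrative placeholders, not our actual setup; in our system publisher and subscriber of course live in separate processes):

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

struct Sample
{
    double value{0.0};
};

int main()
{
    // every iceoryx app registers at RouDi first
    iox::runtime::PoshRuntime::initRuntime("example-app");

    // publisher shall wait for slow consumers instead of overwriting their data
    iox::popo::PublisherOptions publisherOptions;
    publisherOptions.subscriberTooSlowPolicy = iox::popo::ConsumerTooSlowPolicy::WAIT_FOR_CONSUMER;
    iox::popo::Publisher<Sample> publisher({"Example", "Instance", "Topic"}, publisherOptions);

    // subscriber requests that the producer blocks when its queue is full
    iox::popo::SubscriberOptions subscriberOptions;
    subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::BLOCK_PRODUCER;
    iox::popo::Subscriber<Sample> subscriber({"Example", "Instance", "Topic"}, subscriberOptions);

    return 0;
}

The error the two publishing clients received: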

/mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/bfd6602e5f-2435b68bfd.clean/iceoryx_hoofs/source/posix_wrapper/unix_domain_socket.cpp:249 { cxx::expected iox::posix::UnixDomainSocket::timedSend(const std::string &, const units::Duration &) const -> iox_sendto } ::: [ 11 ] Resource temporarily unavailable
2022-05-04 06:53:18.758 [ Fatal ]: Timeout registering at RouDi. Is RouDi running?
2022-05-04 06:53:18.759 [ Error ]: ICEORYX error! IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE
libc++abi: terminating

/Thanks

@elfenpiff (Contributor) commented:

@niclar

The message originates right at the beginning, when the application calls iox::runtime::PoshRuntime::initRuntime(APP_NAME);. Here the application registers at RouDi and receives all required resources as the answer. But in your case RouDi did not answer. If RouDi were not running at all, the socket would not be available and you should get an error message like PoshError::IPC_INTERFACE__REG_UNABLE_TO_WRITE_TO_ROUDI_CHANNEL instead.
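
To make this concrete, the registration call sits at the very top of every iceoryx application, roughly like this (a minimal sketch; the app name is a placeholder):

#include "iceoryx_posh/runtime/posh_runtime.hpp"

int main()
{
    // Registers the process at RouDi over the unix domain socket and waits,
    // with a timeout, for RouDi's answer containing the shared resources.
    // If no answer arrives in time, you end up in the fatal
    // "Timeout registering at RouDi" / IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE path.
    iox::runtime::PoshRuntime::initRuntime("example-app");

    // ... publishers and subscribers are created only after this call ...
    return 0;
}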

So we have the situation that RouDi is running (since the socket is present) but not answering; we have to understand why!

Here are some questions:

  • Is it possible that in a previous run RouDi crashed or was killed with SIGKILL, aka kill -9?
  • The next time you encounter this issue, could you check whether /tmp/roudi is still present? If removing this file with rm -rf /tmp/roudi solves the issue, that is an indication that RouDi either crashed or was killed with SIGKILL.
  • Is it possible that your system was under high CPU load? If so, adjusting the priority of the iox-roudi process could help: start RouDi with nice -n -20 ./build/iox-roudi (-20 means highest priority). This could lessen the issue, but it would also indicate that we may have a small design issue on our side.
  • Is it possible that a lot of apps are running?

Another possibility is that RouDi was somehow blocked by the blocking policy. I will dig into this and let you know, but it would be very helpful if you could give me some hints in the meantime by answering the questions above.

@elfenpiff (Contributor) commented May 4, 2022

@niclar

I dug around a little: when a blocking publisher is unable to send data, it enters a busy loop, which is perfect for latency but horrible for CPU load. So if the subscriber is much slower than the publisher, you should see the CPU load spike to 100% in a system monitor like htop whenever the publisher waits for the subscriber to process the sample.
I suspect that your problem may originate here.

Could you implement your system without blocking by decreasing the publisher frequency and increasing the subscriber queue size?
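
To illustrate what I mean, here is a minimal sketch with the typed C++ API (service description, payload type, capacity value, and publishing period are placeholders; the effective queue capacity is also capped by the compile-time configuration, and in a real system publisher and subscriber live in separate processes):

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

#include <chrono>
#include <cstdint>
#include <thread>

struct Sample
{
    double value{0.0};
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("example-app");

    // give the subscriber more headroom before its queue runs full and the
    // BLOCK_PRODUCER policy makes the publisher wait
    iox::popo::SubscriberOptions subscriberOptions;
    subscriberOptions.queueCapacity = 256U; // placeholder value
    subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::BLOCK_PRODUCER;
    iox::popo::Subscriber<Sample> subscriber({"Example", "Instance", "Topic"}, subscriberOptions);

    iox::popo::PublisherOptions publisherOptions;
    publisherOptions.subscriberTooSlowPolicy = iox::popo::ConsumerTooSlowPolicy::WAIT_FOR_CONSUMER;
    iox::popo::Publisher<Sample> publisher({"Example", "Instance", "Topic"}, publisherOptions);

    for (std::uint64_t i = 0U; i < 100U; ++i)
    {
        auto loanResult = publisher.loan();
        if (!loanResult.has_error())
        {
            auto& sample = loanResult.value();
            sample->value = static_cast<double>(i);
            sample.publish();
        }
        // lower publishing frequency: less chance of the subscriber queue filling up
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return 0;
}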

@niclar (Contributor, Author) commented May 4, 2022

Thanks for the feedback!

  • it might have been that RouDi was SIGKILLed in a previous run during a machine restart
  • after the machine restart I now have a new /tmp/roudi and /tmp/roudi.pid ...
  • re machine load: I don't think so; maybe one core spiked at the time, but when trying to start the client process anew the same error popped up, and the CPU load would have been much lower at that point (50%)
  • I will add "nice -n -20 ./build/iox-roudi" if it repeats itself.
  • we are running 22 apps at the moment: 2 consumers and the rest producers. Producer throughput is gradually increasing, but everything has been stable since inception (the v2 release in March)

Our requirements are that the publisher frequency is fixed, data can't be lost, and consumers must keep up. If we get to a blocking halt, something is wrong and that's a critical error. I wouldn't see the above symptoms in that case, would I?

@elfenpiff (Contributor) commented:

@niclar

I think the 20 producers are the cause of your issue. I will implement a smarter waiting mechanism in the next few days to solve this problem once and for all.

I will ping you when the PR is out.

nice -n -20 ./build/iox-roudi should only be a temporary solution, but for the time being it could solve your problem. It can, however, cause other problems, for instance that all the remaining apps run much slower.
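
To sketch the direction of that waiting mechanism (illustration only, not the actual adaptive_wait implementation that will come with the PR): instead of pure busy-spinning, a blocked publisher could first spin, then yield, and finally fall back to short sleeps, keeping latency low while no longer burning a full core during long waits.

#include <chrono>
#include <cstdint>
#include <thread>

// illustrative backoff strategy; thresholds and sleep time are arbitrary
class BackoffWait
{
  public:
    void wait()
    {
        if (m_iteration < SPIN_ITERATIONS)
        {
            // busy-spin: lowest latency, full CPU usage
        }
        else if (m_iteration < SPIN_ITERATIONS + YIELD_ITERATIONS)
        {
            std::this_thread::yield();
        }
        else
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        ++m_iteration;
    }

    void reset()
    {
        m_iteration = 0U;
    }

  private:
    static constexpr std::uint64_t SPIN_ITERATIONS{10000U};
    static constexpr std::uint64_t YIELD_ITERATIONS{1000U};
    std::uint64_t m_iteration{0U};
};

// hypothetical usage in a blocking send loop:
//   BackoffWait backoff;
//   while (!trySend(sample)) { backoff.wait(); }
//   backoff.reset();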

elfenpiff self-assigned this May 4, 2022
elfenpiff added commits to ApexAI/iceoryx that referenced this issue on May 4 and May 5, 2022.
@niclar (Contributor, Author) commented May 6, 2022

We experienced the very same issue this morning (10 publishers, 1 subscriber), and nothing but a reboot solved it (we tried rm -r /tmp/roudi and restarting RouDi, the subscribers, and the publishers).

@elfenpiff (Contributor) commented:

@niclar The PR #1349 should fix this issue. At the moment it requires some fine-tuning, but I think it should be merged next week.

@niclar (Contributor, Author) commented May 6, 2022

@elfenpiff Many thanks, I'll monitor it.

elfenpiff added further commits that referenced this issue between May 6 and May 17, 2022, among them doxygen comments and tests for adaptive_wait.
@elfenpiff (Contributor) commented:

@niclar Could you please try your setup with the newest master? Your problem should be solved now; if not, please reopen this issue.

@niclar (Contributor, Author) commented May 18, 2022

@elfenpiff I don't seem to be able to reopen this issue, but we just experienced it again with HEAD 7ef8462.

Publisher:
/mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/a94b9a1d71-f35279b8e1.clean/iceoryx_hoofs/source/posix_wrapper/unix_domain_socket.cpp:249 { cxx::expected iox::posix::UnixDomainSocket::timedSend(const std::string &, const units::Duration &) const -> iox_sendto } ::: [ 11 ] Resource temporarily unavailable
2022-05-18 11:37:44.260 [ Fatal ]: Timeout registering at RouDi. Is RouDi running?
2022-05-18 11:37:44.260 [ Error ]: ICEORYX error! IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE
libc++abi: terminating

  • We had a few "Version mismatch" errors in RouDi earlier today, but that's fixed and shouldn't matter, I reckon.

  • No more publishers can join, and introspection also bails.

  • ~30% CPU utilization

elfenpiff reopened this May 18, 2022
@elfenpiff (Contributor) commented:

@niclar It may be that some unusually high CPU load is present somewhere. Could you please send me the output of htop at the moment this occurs?

@elfenpiff (Contributor) commented:

@niclar

Another problem could be that you have a lot of applications and you try to start them all at once. Then your system load increases suddenly and RouDi does not get enough CPU time to answer all of your requests.

One simple solution could be to start all applications sequentially with a one or two second sleep in between. Then every application should have enough time to register, and RouDi should get enough CPU time to handle them.

Furthermore, could you please start RouDi with ./build/iox-roudi -l debug and post the output? Maybe there is an issue we are overlooking.

@niclar (Contributor, Author) commented Jun 1, 2022

@elfenpiff Starting RouDi with the highest priority seems to have remedied the issue.

I'll close the issue (and re-open it with debug output if we encounter it again)

Thanks for your support

niclar closed this as completed Jun 1, 2022
@ciandonovan commented:

This issue occurs for us on some nodes but not others, yet as soon as it occurs on one, all of them become non-functional.

It seems to happen more often when we autostart the set of nodes at boot, beginning execution after multi-user.target at the default user session target. Manually restarting iox-roudi along with all the nodes sometimes solves the issue, but it is in no way deterministic.

We have also tried adjusting the memory pools, but iox-introspection shows no pool being exhausted.

We also built iox-roudi with various permutations of the build flags (https://github.com/eclipse-iceoryx/iceoryx/blob/master/doc/website/advanced/configuration-guide.md), which solved some issues with port exhaustion, but no combination made this issue reliably go away. Furthermore, no matter how the flags were tweaked, warnings about too many chunks being held in parallel persisted.

What do you suspect is the root cause of this issue with RouDi and the domain sockets? I saw a suggestion about changing RouDi's priority, but even if that worked, it strikes me as highly non-deterministic, which unfortunately negates some of the key touted advantages of iceoryx.

For context, we run the ROS 2 Nav2 stack along with 4 Intel RealSense cameras, 4 additional cameras, and a few other low-bandwidth nodes. The issue does seem to happen less frequently when we are not running the Nav2 stack, I suspect because it is quite heavy on pub/sub connections.

@niclar (Contributor, Author) commented Nov 22, 2023

@ciandonovan, "too many chunks being held in parallel" is a different issue/limit; change IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY and/or IOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY respectively.

@ciandonovan commented:

Increasing IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY from the default 8 to 16 causes publishing and subscribing to silently hang, although, interestingly, listing topics still works fine.

Increasing IOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY from the default 256 to 512 causes a similar issue, with [Warning]: Application iceoryx_rt_48_1700675369174225383 not responding.

Resetting both to their defaults restores basic functionality.

Could be an issue with the RMW.
