
[ 11 ] Resource temporarily unavailable #1347

Closed
niclar opened this issue May 4, 2022 · 15 comments · Fixed by #1349

@niclar (Contributor) commented May 4, 2022

Hi, we just experienced our first RouDi (v2.0.0) outage, on Ubuntu 20.04 LTS with clang 14.

Most of the clients kept working, but two of the publishing clients received the error below and the introspection program did not work; a machine restart was needed to resolve it.

Any pointers as to why? Pub/sub is set up as:
publisherOptions.subscriberTooSlowPolicy = iox::popo::ConsumerTooSlowPolicy::WAIT_FOR_CONSUMER;
subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::BLOCK_PRODUCER;
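
For context, this is roughly how we wire those options up, as a minimal sketch with the iceoryx v2 typed C++ API (app name, service description, and payload type are illustrative placeholders, not our actual setup; in our system publisher and subscriber of course live in separate processes):

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

struct Sample
{
    double value{0.0};
};

int main()
{
    // every iceoryx app registers at RouDi first
    iox::runtime::PoshRuntime::initRuntime("example-app");

    // publisher shall wait for slow consumers instead of overwriting their data
    iox::popo::PublisherOptions publisherOptions;
    publisherOptions.subscriberTooSlowPolicy = iox::popo::ConsumerTooSlowPolicy::WAIT_FOR_CONSUMER;
    iox::popo::Publisher<Sample> publisher({"Example", "Instance", "Topic"}, publisherOptions);

    // subscriber requests that the producer blocks when its queue is full
    iox::popo::SubscriberOptions subscriberOptions;
    subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::BLOCK_PRODUCER;
    iox::popo::Subscriber<Sample> subscriber({"Example", "Instance", "Topic"}, subscriberOptions);

    return 0;
}

The error the two publishing clients received: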

/mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/bfd6602e5f-2435b68bfd.clean/iceoryx_hoofs/source/posix_wrapper/unix_domain_socket.cpp:249 { cxx::expected iox::posix::UnixDomainSocket::timedSend(const std::string &, const units::Duration &) const -> iox_sendto } ::: [ 11 ] Resource temporarily unavailable
2022-05-04 06:53:18.758 [ Fatal ]: Timeout registering at RouDi. Is RouDi running?
2022-05-04 06:53:18.759 [ Error ]: ICEORYX error! IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE
libc++abi: terminating

/Thanks

@elfenpiff (Contributor) commented:

@niclar

The message originates right at the beginning, when the application calls iox::runtime::PoshRuntime::initRuntime(APP_NAME);. Here the application registers at RouDi and receives all required resources as the answer. But in your case RouDi did not answer. If RouDi were not running at all, the socket would not be available and you should get an error message like PoshError::IPC_INTERFACE__REG_UNABLE_TO_WRITE_TO_ROUDI_CHANNEL instead.
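
To make this concrete, the registration call sits at the very top of every iceoryx application, roughly like this (a minimal sketch; the app name is a placeholder):

#include "iceoryx_posh/runtime/posh_runtime.hpp"

int main()
{
    // Registers the process at RouDi over the unix domain socket and waits,
    // with a timeout, for RouDi's answer containing the shared resources.
    // If no answer arrives in time, you end up in the fatal
    // "Timeout registering at RouDi" / IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE path.
    iox::runtime::PoshRuntime::initRuntime("example-app");

    // ... publishers and subscribers are created only after this call ...
    return 0;
}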

So we have the situation that RouDi is running (since the socket is present) but not answering; we have to understand why!

Here are some questions:

  • Is it possible that in a previous run RouDi crashed or was killed with SIGKILL, aka kill -9?
  • The next time you encounter this issue, could you check whether /tmp/roudi is still present? If removing this file with rm -rf /tmp/roudi solves the issue, that is an indication that RouDi either crashed or was killed with SIGKILL.
  • Is it possible that your system was under high CPU load? If so, adjusting the priority of the iox-roudi process could help: start RouDi with nice -n -20 ./build/iox-roudi (-20 means highest priority). This could lessen the issue, but it would also indicate that we may have a small design issue on our side.
  • Is it possible that a lot of apps are running?

Another possibility is that RouDi was somehow blocked by the blocking policy. I will dig into this and let you know, but it would be very helpful if you could give me some hints in the meantime by answering the questions above.

@elfenpiff (Contributor) commented May 4, 2022

@niclar

I dug around a little: when a blocking publisher is unable to send data, it enters a busy loop, which is perfect for latency but horrible for CPU load. So if the subscriber is much slower than the publisher, you should see the CPU load spike to 100% in a system monitor like htop whenever the publisher waits for the subscriber to process the sample.
I suspect that your problem may originate here.

Could you implement your system without blocking by decreasing the publisher frequency and increasing the subscriber queue size?
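
To illustrate what I mean, here is a minimal sketch with the typed C++ API (service description, payload type, capacity value, and publishing period are placeholders; the effective queue capacity is also capped by the compile-time configuration, and in a real system publisher and subscriber live in separate processes):

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

#include <chrono>
#include <cstdint>
#include <thread>

struct Sample
{
    double value{0.0};
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("example-app");

    // give the subscriber more headroom before its queue runs full and the
    // BLOCK_PRODUCER policy makes the publisher wait
    iox::popo::SubscriberOptions subscriberOptions;
    subscriberOptions.queueCapacity = 256U; // placeholder value
    subscriberOptions.queueFullPolicy = iox::popo::QueueFullPolicy::BLOCK_PRODUCER;
    iox::popo::Subscriber<Sample> subscriber({"Example", "Instance", "Topic"}, subscriberOptions);

    iox::popo::PublisherOptions publisherOptions;
    publisherOptions.subscriberTooSlowPolicy = iox::popo::ConsumerTooSlowPolicy::WAIT_FOR_CONSUMER;
    iox::popo::Publisher<Sample> publisher({"Example", "Instance", "Topic"}, publisherOptions);

    for (std::uint64_t i = 0U; i < 100U; ++i)
    {
        auto loanResult = publisher.loan();
        if (!loanResult.has_error())
        {
            auto& sample = loanResult.value();
            sample->value = static_cast<double>(i);
            sample.publish();
        }
        // lower publishing frequency: less chance of the subscriber queue filling up
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return 0;
}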

@niclar (Contributor, Author) commented May 4, 2022

Thanks for the feedback!

  • it might have been that RouDi was SIGKILLed in a previous run during a machine restart
  • after the machine restart I now have a new /tmp/roudi and /tmp/roudi.pid ...
  • re machine load: I don't think so; maybe one core spiked at the time, but when trying to start the client process anew the same error popped up, and the CPU load would have been much lower at that point (50%)
  • I will add "nice -n -20 ./build/iox-roudi" if it repeats itself.
  • we are running 22 apps at the moment: 2 consumers and the rest producers. Producer throughput is gradually increasing, but everything has been stable since inception (the v2 release in March)

Our requirements are that the publisher frequency is fixed, data can't be lost, and consumers must keep up. If we get to a blocking halt, something is wrong and that's a critical error. I wouldn't see the above symptoms in that case, would I?

@elfenpiff (Contributor) commented:

@niclar

I think the 20 producers are the cause of your issue. I will implement a smarter waiting mechanism in the next few days to solve this problem once and for all.

I will ping you when the PR is out.

nice -n -20 ./build/iox-roudi should only be a temporary solution, but for the time being it could solve your problem. It can, however, cause other problems, for instance that all the remaining apps run much slower.
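
To sketch the direction of that waiting mechanism (illustration only, not the actual adaptive_wait implementation that will come with the PR): instead of pure busy-spinning, a blocked publisher could first spin, then yield, and finally fall back to short sleeps, keeping latency low while no longer burning a full core during long waits.

#include <chrono>
#include <cstdint>
#include <thread>

// illustrative backoff strategy; thresholds and sleep time are arbitrary
class BackoffWait
{
  public:
    void wait()
    {
        if (m_iteration < SPIN_ITERATIONS)
        {
            // busy-spin: lowest latency, full CPU usage
        }
        else if (m_iteration < SPIN_ITERATIONS + YIELD_ITERATIONS)
        {
            std::this_thread::yield();
        }
        else
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        ++m_iteration;
    }

    void reset()
    {
        m_iteration = 0U;
    }

  private:
    static constexpr std::uint64_t SPIN_ITERATIONS{10000U};
    static constexpr std::uint64_t YIELD_ITERATIONS{1000U};
    std::uint64_t m_iteration{0U};
};

// hypothetical usage in a blocking send loop:
//   BackoffWait backoff;
//   while (!trySend(sample)) { backoff.wait(); }
//   backoff.reset();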

elfenpiff self-assigned this May 4, 2022
elfenpiff added commits to ApexAI/iceoryx that referenced this issue on May 4 and May 5, 2022.
@niclar (Contributor, Author) commented May 6, 2022

We experienced the very same issue this morning (10 publishers, 1 subscriber), and nothing but a reboot solved it (we tried rm -r /tmp/roudi and restarting RouDi, the subscribers, and the publishers).

@elfenpiff (Contributor) commented:

@niclar The PR #1349 should fix this issue. At the moment it requires some fine-tuning, but I think it should be merged next week.

@niclar (Contributor, Author) commented May 6, 2022

@elfenpiff Many thanks, I'll monitor it.

elfenpiff added further commits that referenced this issue between May 6 and May 17, 2022, among them doxygen comments and tests for adaptive_wait.
@elfenpiff (Contributor) commented:

@niclar Could you please try your setup with the newest master? Your problem should be solved now; if not, please reopen this issue.

@niclar (Contributor, Author) commented May 18, 2022

@elfenpiff I don't seem to be able to reopen this issue, but we just experienced it again with HEAD 7ef8462.

Publisher:
/mnt/c/src/thirdparty/vcpkg/buildtrees/iceoryx/src/a94b9a1d71-f35279b8e1.clean/iceoryx_hoofs/source/posix_wrapper/unix_domain_socket.cpp:249 { cxx::expected iox::posix::UnixDomainSocket::timedSend(const std::string &, const units::Duration &) const -> iox_sendto } ::: [ 11 ] Resource temporarily unavailable
2022-05-18 11:37:44.260 [ Fatal ]: Timeout registering at RouDi. Is RouDi running?
2022-05-18 11:37:44.260 [ Error ]: ICEORYX error! IPC_INTERFACE__REG_ROUDI_NOT_AVAILABLE
libc++abi: terminating

  • We had a few "Version mismatch" errors in RouDi earlier today, but that's fixed and shouldn't matter, I reckon.

  • No more publishers can join, and introspection also bails.

  • ~30% CPU utilization

elfenpiff reopened this May 18, 2022
@elfenpiff (Contributor) commented:

@niclar It may be that some unusually high CPU load is present somewhere. Could you please send me the output of htop at the moment this occurs?

@elfenpiff (Contributor) commented:

@niclar

Another problem could be that you have a lot of applications and you try to start them all at once. Then your system load increases suddenly and RouDi does not get enough CPU time to answer all of your requests.

One simple solution could be to start all applications sequentially with a one or two second sleep in between. Then every application should have enough time to register, and RouDi should get enough CPU time to handle them.

Furthermore, could you please start RouDi with ./build/iox-roudi -l debug and post the output? Maybe there is an issue we are overlooking.

@niclar (Contributor, Author) commented Jun 1, 2022

@elfenpiff Starting RouDi with the highest priority seems to have remedied the issue.

I'll close the issue (and re-open it with debug output if we encounter it again)

Thanks for your support

niclar closed this as completed Jun 1, 2022
@ciandonovan commented:

This issue occurs for us on some nodes but not others, yet as soon as it occurs on one, all of them become non-functional.

It seems to happen more often when we autostart the set of nodes at boot, beginning execution after multi-user.target at the default user session target. Manually restarting iox-roudi along with all the nodes sometimes solves the issue, but it is in no way deterministic.

We have also tried adjusting the memory pools, but iox-introspection shows no pool being exhausted.

We also built iox-roudi with various permutations of the build flags (https://github.com/eclipse-iceoryx/iceoryx/blob/master/doc/website/advanced/configuration-guide.md), which solved some issues with port exhaustion, but no combination made this issue reliably go away. Furthermore, no matter how the flags were tweaked, warnings about too many chunks being held in parallel persisted.

What do you suspect is the root cause of this issue with RouDi and the domain sockets? I saw a suggestion about changing RouDi's priority, but even if that worked, it strikes me as highly non-deterministic, which unfortunately negates some of the key touted advantages of iceoryx.

For context, we run the ROS 2 Nav2 stack along with 4 Intel RealSense cameras, 4 additional cameras, and a few other low-bandwidth nodes. The issue does seem to happen less frequently when we are not running the Nav2 stack, I suspect because it is quite heavy on pub/sub connections.

@niclar (Contributor, Author) commented Nov 22, 2023

@ciandonovan, "too many chunks being held in parallel" is a different issue/limit; change IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY and/or IOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY respectively.

@ciandonovan commented:

Increasing IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY from the default 8 to 16 causes publishing and subscribing to silently hang, although, interestingly, listing topics still works fine.

Increasing IOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY from the default 256 to 512 causes a similar issue, with [Warning]: Application iceoryx_rt_48_1700675369174225383 not responding.

Resetting both to their defaults restores basic functionality.

Could be an issue with the RMW.
