
Fix a bug in the dispatch function #4500


Closed

wants to merge 2 commits into from

Conversation


@hesolium hesolium commented Jun 8, 2017

On Mbed OS 5.4 (NRF52-DK, no RTOS, no active SoftDevice), after the system has run for a long time (timer/counter overflow?), 'deadline' ends up equal to 511999.
In my case the timeout (parameter 'ms') is ZERO, yet dispatch(0) runs "forever".
This modification is only a simple safeguard; I don't know where the real problem is (Mbed, Nordic drivers, ...).

hesolium added 2 commits June 8, 2017 11:01
@hesolium hesolium changed the title Create equeue.c Bug in dispatch function Jun 8, 2017
@0xc0170 0xc0170 requested review from pan- and geky June 8, 2017 09:31
@pan-
Member

pan- commented Jun 8, 2017

@hesolium With which git revision of mbed OS do you encounter this issue?
Could you detail your issue a bit more and give us a way to reproduce it?

@hesolium
Author

hesolium commented Jun 8, 2017

I think it is simple to reproduce and does not depend on the revision. Modify the blinky example: declare an EventQueue and call 'queue.dispatch(0)' inside an endless loop. After several minutes the LED stops blinking.
Now you can run the debugger and halt the program. You can see that the program is bogged down inside the dispatch function (waiting for an event) because 'deadline' == 511999 (in my case). 'deadline' should not be larger than the timeout (variable 'ms'); in my case I run dispatch with a timeout of ZERO. Of course my program is more complicated, but I observe this phenomenon because I left the program running with no activity for a longer time. I think the timer overflows and the computation of 'deadline' does not take this into account.
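
A minimal sketch of the reproduction, assuming the stock blinky example on mbed OS 5.4 (NRF52-DK, no RTOS); LED1 and the 500 ms blink period are illustrative, not a verified test case:

    #include "mbed.h"
    #include "mbed_events.h"

    DigitalOut led(LED1);
    EventQueue queue;

    int main()
    {
        while (true) {
            led = !led;
            wait_ms(500);
            queue.dispatch(0);   // should return immediately; after several
                                 // minutes it blocks "forever" and the LED
                                 // stops blinking
        }
    }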

@hesolium
Author

hesolium commented Jun 8, 2017

Additional info: the wrong value of 'deadline' is computed in line 409:
deadline = equeue_tickdiff(timeout, tick);
Theoretically (see the comments inside) this function is immune to timer overflow.
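
For reference, a wraparound-safe tick difference is usually written roughly as below (a sketch of the assumed shape, not copied from equeue.c):

    // The subtraction is done on unsigned values and the result reinterpreted
    // as signed, so a single overflow of the tick counter between the two
    // samples still yields the correct small difference.
    static inline int tickdiff_sketch(unsigned a, unsigned b)
    {
        return (int)(a - b);
    }

    // e.g. tickdiff_sketch(0x00000010u, 0xFFFFFFF0u) == 32 across a 32-bit wrap

This only holds if the tick value really spans the full unsigned range; if the underlying timer wraps earlier, the difference can come out large and positive, which would be consistent with the 511999 value seen here.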

@hesolium
Author

hesolium commented Jun 8, 2017

Could this be due to the sleep() function? Sleep is called while waiting for an event in 'equeue_sema_wait()'. Maybe after a long time the NRF52 goes to sleep and wakes up on a timer interrupt, but the difference between 'timeout' (= tick() + ms) at the beginning of the function and tick() after wakeup is large.

@tomasfrisbergublox

@0xc0170
@andreaslarssonublox

There may be an additional problem as well, causing our use case to assert.

In "equeue_post" the semaphore count is increased. Hence, for every event we post it increases by one. The default max is 1024, which we hit after the tests have been running for a while.

In the "equeue_dispatch" function the events are executed in the first "while(es)" loop.
// dispatch events
while (es) {
.......
However, the semaphore count is not counted down at this stage.

Further down, before releasing any semaphores, there is a check whether we should return. Since we use no timeout (0 ms), I guess this is where the dispatch function returns for us.
// check if we should stop dispatching soon
if (ms >= 0) {
.......

This probably means that for our use case, where we post events with a timeout of 0, we only ever increase the semaphore count and never decrease it. Hence, sooner or later we hit the max count and it asserts (which is what happens).

Furthermore, if I understand it correctly, the semaphore count should always be down to 0 once the wait function is reached.
// wait for events
equeue_sema_wait(&q->eventsema, deadline);
Otherwise the wait function will return immediately, since the semaphore count can be more or less anything depending on the use case. This will not cause a major problem, only some additional loops.
We will, if I am not mistaken, only reach this point if we use a timeout > 0.

I noticed that there is a previous(?) POSIX implementation ("equeue_posix.c") where the semaphore seems to be binary (a bool), which means this problem would not occur there.
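
A stripped-down model of the imbalance described above (the names and constants are illustrative, not the actual equeue/RTX code):

    #include <cstdio>

    static int tokens = 0;
    static const int MAX_TOKENS = 1024;    // default maximum mentioned above

    void post_event()                      // what equeue_post effectively does
    {
        if (tokens < MAX_TOKENS) {
            tokens++;                      // one semaphore release per posted event
        } else {
            std::printf("semaphore count limit hit -> error/assert\n");
        }
    }

    void dispatch_zero_timeout()           // what a dispatch(0) pass effectively does
    {
        // ... execute pending events ...
        // with ms == 0 the function returns here, before equeue_sema_wait(),
        // so 'tokens' is never decremented
    }

    int main()
    {
        for (int i = 0; i < 2000; ++i) {   // eventually exceeds MAX_TOKENS
            post_event();
            dispatch_zero_timeout();
        }
    }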

@pan-
Member

pan- commented Jun 15, 2017

@tomasfrisbergublox

This probably means that for our use case, where we post events with a timeout of 0, we only ever increase the semaphore count and never decrease it. Hence, sooner or later we hit the max count and it asserts (which is what happens).

The counter of semaphore objects in RTX V1 (mbed OS [5.0 - 5.4]) is not bounded by the number of available resources passed during initialization. A semaphore release operation will increase the counter even if the result is above this number. As noted, the result is that calls to wait might not block even if there are no events in the queue.

The semaphore semantics have changed slightly with RTX 2 (mbed OS 5.5): a release operation will not create new resource tokens.

Do you see a change in your application's behavior if you use the master of the repo rather than mbed OS 5.4.x?

@tomasfrisbergublox

tomasfrisbergublox commented Jun 15, 2017

@pan-
@andreaslarssonublox

We should be testing with the master branch. I checked again with Andreas and he confirmed this should be the case.

Furthermore, I am running with the debug profile, which means that “MBED_TRAP_ERRORS_ENABLED” is set:
"GCC_ARM": {
"common": ["-c", "-Wall", "-Wextra",
"-Wno-unused-parameter", "-Wno-missing-field-initializers",
"-fmessage-length=0", "-fno-exceptions", "-fno-builtin",
"-ffunction-sections", "-fdata-sections", "-funsigned-char",
"-MMD", "-fno-delete-null-pointer-checks",
"-fomit-frame-pointer", "-O0", "-g3", "-DMBED_DEBUG",
"-DMBED_TRAP_ERRORS_ENABLED=1"],

Without this flag it does not crash, but it should maybe still be considered erroneous behavior.

What happens is that we eventually get a call from the “SVC_Handler” to “svcRtxSemaphoreRelease” in “rtos\rtx5\TARGET_CORTEX_M\rtx_semaphore.c”.

The assert is raised when the token is to be released:
// Try to release token
if (SemaphoreTokenIncrement(semaphore) == 0U) {
EvrRtxSemaphoreError(semaphore, osRtxErrorSemaphoreCountLimit);
return osErrorResource;
}
EvrRtxSemaphoreReleased(semaphore);

There is a check in “SemaphoreTokenIncrement” that causes the call to return 0, which triggers the “EvrRtxSemaphoreError” call:
if (semaphore->tokens < semaphore->max_tokens) {
semaphore->tokens++;
ret = 1U;
} else {
ret = 0U;
}

“EvrRtxSemaphoreError” is located in “/platform/mbed_retarget.cpp”:
extern "C" void EvrRtxSemaphoreError (osSemaphoreId_t semaphore_id, int32_t status)
{
error("Semaphore %p error %i\r\n", semaphore_id, status);
}

In the end we get the printout “Semaphore 0x20013ec8 error -17” and after that nothing works anymore.

@pan-
Member

pan- commented Jun 15, 2017

@tomasfrisbergublox Given that the semaphore is used more or less as a condition variable (signal/wait operations) in the event queue, a call to svcRtxSemaphoreRelease will return an error if the other end didn't acquire any token; this is expected in that precise case.

Unfortunately, MBED_TRAP_ERRORS_ENABLED will trap as soon as the kernel calls EvrRtxSemaphoreError.

A solution would be to stop using semaphore objects and start using osEventFlags as the signal mechanism. @geky, what do you think?
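
Roughly, the signal/wait pair could look like the sketch below with CMSIS-RTOS2 event flags (the function names here are hypothetical, not the actual patch):

    #include "cmsis_os2.h"

    static osEventFlagsId_t wakeup;
    #define EVENT_PENDING 0x01U

    void equeue_signal_init(void)
    {
        wakeup = osEventFlagsNew(NULL);
    }

    // post side: setting an already-set flag is idempotent, so repeated posts
    // cannot overflow any counter the way repeated semaphore releases can
    void equeue_signal(void)
    {
        osEventFlagsSet(wakeup, EVENT_PENDING);
    }

    // dispatch side: blocks until a post happens or the timeout expires;
    // the flag is cleared on return, so the next wait blocks again
    void equeue_wait(uint32_t timeout_ticks)
    {
        osEventFlagsWait(wakeup, EVENT_PENDING, osFlagsWaitAny, timeout_ticks);
    }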

@geky
Contributor

geky commented Jun 15, 2017

Thanks for all of the investigation! You're right that the event queue library is expecting a binary semaphore. Unfortunately, until the update to RTX 5 we didn't have any way to implement a binary semaphore (we had "thread signals", but this used a bitmask that was shared with users). We assumed that using a counting semaphore would be fine as a temporary solution, thinking that the count would be large enough for us to not have to worry about overflow. It would waste cycles but get the job done.

This was wrong; it turns out RTX just stores the count as a 16-bit number with no overflow protection:

In RTX 5 this limit is dropped down to the arbitrary number 1024:
https://github.com/ARMmbed/mbed-os/blob/master/rtos/Semaphore.cpp#L30

Using the osEventFlag is a perfect solution.

This issue also popped up in the socket layer (and may pop up in other places?). @deepikabhavnani was looking at how we should handle binary semaphores in the C++ layer over here: #4560.


However, I'm not sure this solves the original issue. The equeue_dispatch function is designed to never call equeue_sema_wait when the timeout is 0. Does the original implementation work in 5.5 with MBED_TRAP_ERRORS_ENABLED disabled?

Possible issues:

  • The equeue_tick function drifts backwards (here). Though this specific case should be tested for over in the events timing test (here). Maybe we need an assert on this condition in the actual dispatch function.
  • The equeue_tickdiff function is broken. This may be possible if the compiler doesn't play nice. What compiler and profile are you using?
  • The callback (here) corrupts something.

I'll have to look into this more once I get a chance with an NRF52-DK. You could try running the event timing tests to see if the bug was just missed somehow:

mbed test -m NRF52_DK -t COMPILER -n tests-events-timing -v

@geky geky mentioned this pull request Jun 15, 2017
@geky
Contributor

geky commented Jun 15, 2017

I put up a patch that adopts the osEventFlags: #4571

This should take care of the semaphore issue; feel free to add feedback.

    // Unrecognized bug on some platform (deadline == 511999)
    if (ms >= 0 && deadline > ms) {
        deadline = ms;
        if (ms == 0)

Member

@geky Before returning, is it necessary to update the background timer?

Contributor

Ah yes, a better short-term fix is probably to change line 410 above:

    // check if we should stop dispatching soon
    if (ms >= 0) {
        deadline = equeue_tickdiff(timeout, tick);
-       if (deadline <= 0) {
+       if (ms == 0 || deadline <= 0) {
            // update background timer if necessary
            if (q->background.update) {
                equeue_mutex_lock(&q->queuelock);
                if (q->background.update && q->queue) {
                    q->background.update(q->background.timer,
                            equeue_clampdiff(q->queue->target, tick));
                }
                q->background.active = true;
                equeue_mutex_unlock(&q->queuelock);
            }
            return;
        }
    }

@geky geky mentioned this pull request Jun 16, 2017
@theotherjimmy theotherjimmy changed the title Bug in dispatch function Fix a bug in the dispatch function Jun 19, 2017
@theotherjimmy
Contributor

theotherjimmy commented Jun 19, 2017

@geky Is this still relevant? Or does #4571 fix this bug?

@geky
Contributor

geky commented Jun 19, 2017

Not sure, I need to reproduce locally

@adbridge
Contributor

bump @geky

@theotherjimmy
Contributor

@geky were you able to reproduce this locally?

@0xc0170
Contributor

0xc0170 commented Jul 4, 2017

@hesolium Is this still relevant? There was a proposal to change this patch, plus some others referenced. Can you retest the latest master (rebase this on top of the latest master) and see if this patch is still needed? If it is, can you please respond to the comments?

@theotherjimmy
Contributor

@hesolium @geky Any news on this?

@theotherjimmy
Contributor

@hesolium @geky This has been idle for 2 weeks. We have fixed a few issues in the meantime, and this PR may no longer be relevant. I'm going to close it. Please reopen if it's still relevant.

@geky
Contributor

geky commented Jul 20, 2017

This is still relevant, but I haven't yet been able to verify whether the issue is still there. If we close this I'm probably going to forget about it, so I'm going to keep it open. If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have us debug it from scratch.

@geky geky reopened this Jul 20, 2017
@0xc0170
Contributor

0xc0170 commented Jul 20, 2017

If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have us debug it from scratch.

I'll try to summarise this into an issue today.

If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have us debug it from scratch.

Searching should find this. Keeping this PR open does not move it anywhere; a tracking issue sounds better.

@0xc0170
Contributor

0xc0170 commented Jul 20, 2017

@geky #4784; closing this. Please add any details to the tracking issue or edit it as needed.

@0xc0170 0xc0170 closed this Jul 20, 2017
@hesolium
Author

The whole thread focused on correcting the semaphore error, which probably has nothing to do with my case. My suggested change was to protect against large values of the 'deadline' variable. Does the semaphore bug fix affect programs that do not use the RTOS?

@Eric-A-Marks
Contributor

@hesolium I'm seeing this same problem with non-RTOS builds using dispatch(0) in mbed 5.11 (with backports of most of 5.15) on STM32, targeting ff1705_l151cc.

The fix of if (ms == 0 || deadline <= 0) appears to have solved my problems.

I'm aiming to do further testing with vanilla 5.15 to see if it still occurs, and potentially open a new issue.

@hesolium
Author

I am currently using Mbed v6.5 (bare metal). I can check whether the bug still exists. I have much better tools now, e.g. the Ozone debugger :), so I guess I can find the module that is causing the error.

HeS

@hesolium
Author

In Mbed v6.5 the bug has disappeared. It's hard to tell why, because the source code (in the 'events' subdirectory) has been deeply refactored compared to v5.11. I guess all you have to do is run away to the new version :)
Good luck,
HeS

@Eric-A-Marks
Contributor

Thanks for checking. Upgrading to 6.x isn't viable for my current project, but I'll keep it in mind next time around.
