
Fix a bug in the dispatch function #4500


Closed

wants to merge 2 commits into from

Conversation


@hesolium hesolium commented Jun 8, 2017

On Mbed OS 5.4 (NRF52-DK, no RTOS, no active SoftDevice), after the system has run for a long time (timer/counter overflow?), 'deadline' ends up equal to 511999.
In my case the timeout (parameter 'ms') is ZERO, yet dispatch(0) runs "forever".
This modification is only a simple safeguard; I don't know where the real problem is (Mbed, Nordic drivers, ...).

hesolium added 2 commits June 8, 2017 11:01
@hesolium hesolium changed the title Create equeue.c Bug in dispatch function Jun 8, 2017
@0xc0170 0xc0170 requested review from pan- and geky June 8, 2017 09:31
@pan-
Member

pan- commented Jun 8, 2017

@hesolium With which git revision of mbed OS do you encounter this issue?
Could you detail your issue a bit more and give us a way to reproduce it?

@hesolium
Author

hesolium commented Jun 8, 2017

I think it is simple to reproduce and does not depend on the revision. Modify the blinky example: declare an EventQueue and call 'queue.dispatch(0)' inside an endless loop. After several minutes the LED stops blinking.
Now you can run the debugger and halt the program. You can see that the program is bogged down inside the dispatch function (waiting for an event) because 'deadline' == 511999 (in my case). 'deadline' should not be larger than the timeout (variable 'ms'); in my case I run dispatch with a timeout of ZERO. Of course my program is more complicated, but I observe this phenomenon because I left the program running with no activity for a longer time. I think the timer overflows and the computation of 'deadline' does not take this into account.
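
A minimal sketch of the reproduction, assuming the stock blinky example on mbed OS 5.4 (NRF52-DK, no RTOS); LED1 and the 500 ms blink period are illustrative, not a verified test case:

    #include "mbed.h"
    #include "mbed_events.h"

    DigitalOut led(LED1);
    EventQueue queue;

    int main()
    {
        while (true) {
            led = !led;
            wait_ms(500);
            queue.dispatch(0);   // should return immediately; after several
                                 // minutes it blocks "forever" and the LED
                                 // stops blinking
        }
    }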

@hesolium
Author

hesolium commented Jun 8, 2017

Additional info: the wrong value of 'deadline' is computed in line 409:
deadline = equeue_tickdiff(timeout, tick);
Theoretically (see the comments inside) this function is immune to timer overflow.
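
For reference, a wraparound-safe tick difference is usually written roughly as below (a sketch of the assumed shape, not copied from equeue.c):

    // The subtraction is done on unsigned values and the result reinterpreted
    // as signed, so a single overflow of the tick counter between the two
    // samples still yields the correct small difference.
    static inline int tickdiff_sketch(unsigned a, unsigned b)
    {
        return (int)(a - b);
    }

    // e.g. tickdiff_sketch(0x00000010u, 0xFFFFFFF0u) == 32 across a 32-bit wrap

This only holds if the tick value really spans the full unsigned range; if the underlying timer wraps earlier, the difference can come out large and positive, which would be consistent with the 511999 value seen here.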

@hesolium
Author

hesolium commented Jun 8, 2017

Could this be due to the sleep() function? Sleep is called while waiting for an event in 'equeue_sema_wait()'. Maybe after a long time the NRF52 goes to sleep and wakes up on a timer interrupt, but the difference between 'timeout' (= tick() + ms) at the beginning of the function and tick() after wakeup is large.

@tomasfrisbergublox

@0xc0170
@andreaslarssonublox

There may be an additional problem as well, causing our use case to assert.

In "equeue_post" the semaphore count is increased. Hence, for every event we post it increases by one. The default max is 1024, which we hit after the tests have been running for a while.

In the "equeue_dispatch" function the events are executed in the first "while(es)" loop.
// dispatch events
while (es) {
.......
However, the semaphore count is not counted down at this stage.

Further down, before releasing any semaphores, there is a check whether we should return. Since we use no timeout (0 ms), I guess this is where the dispatch function returns for us.
// check if we should stop dispatching soon
if (ms >= 0) {
.......

This probably means that for our use case, where we post events with a timeout of 0, we only ever increase the semaphore count and never decrease it. Hence, sooner or later we hit the max count and it asserts (which is what happens).

Furthermore, if I understand it correctly, the semaphore count should always be down to 0 once the wait function is reached.
// wait for events
equeue_sema_wait(&q->eventsema, deadline);
Otherwise the wait function will return immediately, since the semaphore count can be more or less anything depending on the use case. This will not cause a major problem, only some additional loops.
We will, if I am not mistaken, only reach this point if we use a timeout > 0.

I noticed that there is a previous(?) POSIX implementation ("equeue_posix.c") where the semaphore seems to be binary (a bool), which means this problem would not occur there.
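
A stripped-down model of the imbalance described above (the names and constants are illustrative, not the actual equeue/RTX code):

    #include <cstdio>

    static int tokens = 0;
    static const int MAX_TOKENS = 1024;    // default maximum mentioned above

    void post_event()                      // what equeue_post effectively does
    {
        if (tokens < MAX_TOKENS) {
            tokens++;                      // one semaphore release per posted event
        } else {
            std::printf("semaphore count limit hit -> error/assert\n");
        }
    }

    void dispatch_zero_timeout()           // what a dispatch(0) pass effectively does
    {
        // ... execute pending events ...
        // with ms == 0 the function returns here, before equeue_sema_wait(),
        // so 'tokens' is never decremented
    }

    int main()
    {
        for (int i = 0; i < 2000; ++i) {   // eventually exceeds MAX_TOKENS
            post_event();
            dispatch_zero_timeout();
        }
    }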

@pan-
Member

pan- commented Jun 15, 2017

@tomasfrisbergublox

This probably means that for our use case, where we post events with a timeout of 0, we only ever increase the semaphore count and never decrease it. Hence, sooner or later we hit the max count and it asserts (which is what happens).

The counter of semaphore objects in RTX V1 (mbed OS [5.0 - 5.4]) is not bounded by the number of available resources passed during initialization. A semaphore release operation will increase the counter even if the result is above this number. As noted, the result is that calls to wait might not block even if there are no events in the queue.

The semaphore semantics have changed slightly with RTX 2 (mbed OS 5.5): a release operation will not create new resource tokens.

Do you see a change in your application's behavior if you use the master of the repo rather than mbed OS 5.4.x?

@tomasfrisbergublox

tomasfrisbergublox commented Jun 15, 2017

@pan-
@andreaslarssonublox

We should be testing with the master branch. I checked again with Andreas and he confirmed this should be the case.

Furthermore, I am running with the debug profile, which means that “MBED_TRAP_ERRORS_ENABLED” is set:
"GCC_ARM": {
"common": ["-c", "-Wall", "-Wextra",
"-Wno-unused-parameter", "-Wno-missing-field-initializers",
"-fmessage-length=0", "-fno-exceptions", "-fno-builtin",
"-ffunction-sections", "-fdata-sections", "-funsigned-char",
"-MMD", "-fno-delete-null-pointer-checks",
"-fomit-frame-pointer", "-O0", "-g3", "-DMBED_DEBUG",
"-DMBED_TRAP_ERRORS_ENABLED=1"],

Without this flag it does not crash, but it should maybe still be considered erroneous behavior.

What happens is that we eventually get a call from the “SVC_Handler” to “svcRtxSemaphoreRelease” in “rtos\rtx5\TARGET_CORTEX_M\rtx_semaphore.c”.

The assert is raised when the token is to be released:
// Try to release token
if (SemaphoreTokenIncrement(semaphore) == 0U) {
EvrRtxSemaphoreError(semaphore, osRtxErrorSemaphoreCountLimit);
return osErrorResource;
}
EvrRtxSemaphoreReleased(semaphore);

There is a check in “SemaphoreTokenIncrement” that causes the call to return 0, which triggers the “EvrRtxSemaphoreError” call:
if (semaphore->tokens < semaphore->max_tokens) {
semaphore->tokens++;
ret = 1U;
} else {
ret = 0U;
}

“EvrRtxSemaphoreError” is located in “/platform/mbed_retarget.cpp”:
extern "C" void EvrRtxSemaphoreError (osSemaphoreId_t semaphore_id, int32_t status)
{
error("Semaphore %p error %i\r\n", semaphore_id, status);
}

In the end we get the printout “Semaphore 0x20013ec8 error -17” and after that nothing works anymore.

@pan-
Member

pan- commented Jun 15, 2017

@tomasfrisbergublox Given that the semaphore is used more or less as a condition variable (signal/wait operations) in the event queue, a call to svcRtxSemaphoreRelease will return an error if the other end didn't acquire any token; this is expected in that precise case.

Unfortunately, MBED_TRAP_ERRORS_ENABLED will trap as soon as the kernel calls EvrRtxSemaphoreError.

A solution would be to stop using semaphore objects and start using osEventFlags as the signal mechanism. @geky, what do you think?
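
Roughly, the signal/wait pair could look like the sketch below with CMSIS-RTOS2 event flags (the function names here are hypothetical, not the actual patch):

    #include "cmsis_os2.h"

    static osEventFlagsId_t wakeup;
    #define EVENT_PENDING 0x01U

    void equeue_signal_init(void)
    {
        wakeup = osEventFlagsNew(NULL);
    }

    // post side: setting an already-set flag is idempotent, so repeated posts
    // cannot overflow any counter the way repeated semaphore releases can
    void equeue_signal(void)
    {
        osEventFlagsSet(wakeup, EVENT_PENDING);
    }

    // dispatch side: blocks until a post happens or the timeout expires;
    // the flag is cleared on return, so the next wait blocks again
    void equeue_wait(uint32_t timeout_ticks)
    {
        osEventFlagsWait(wakeup, EVENT_PENDING, osFlagsWaitAny, timeout_ticks);
    }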

@geky
Contributor

geky commented Jun 15, 2017

Thanks for all of the investigation! You're right that the event queue library is expecting a binary semaphore. Unfortunately, until the update to RTX 5 we didn't have any way to implement a binary semaphore (we had "thread signals", but this used a bitmask that was shared with users). We assumed that using a counting semaphore would be fine as a temporary solution, thinking that the count would be large enough for us to not have to worry about overflow. It would waste cycles but get the job done.

This was wrong; it turns out RTX just stores the count as a 16-bit number with no overflow protection:

In RTX 5 this limit is dropped down to the arbitrary number 1024:
https://github.com/ARMmbed/mbed-os/blob/master/rtos/Semaphore.cpp#L30

Using the osEventFlag is a perfect solution.

This issue also popped up in the socket layer (and may pop up in other places?). @deepikabhavnani was looking at how we should handle binary semaphores in the C++ layer over here: #4560.


However, I'm not sure this solves the original issue. The equeue_dispatch function is designed to never call equeue_sema_wait when the timeout is 0. Does the original implementation work in 5.5 with MBED_TRAP_ERRORS_ENABLED disabled?

Possible issues:

  • The equeue_tick function drifts backwards (here). Though this specific case should be tested for over in the events timing test (here). Maybe we need an assert on this condition in the actual dispatch function.
  • The equeue_tickdiff function is broken. This may be possible if the compiler doesn't play nice. What compiler and profile are you using?
  • The callback (here) corrupts something.

I'll have to look into this more once I get a chance with an NRF52-DK. You could try running the event timing tests to see if the bug was just missed somehow:

mbed test -m NRF52_DK -t COMPILER -n tests-events-timing -v

@geky geky mentioned this pull request Jun 15, 2017
@geky
Contributor

geky commented Jun 15, 2017

I put up a patch that adopts the osEventFlags: #4571

This should take care of the semaphore issue; feel free to add feedback.

    // Unrecognized bug on some platform (deadline == 511999)
    if (ms >= 0 && deadline > ms) {
        deadline = ms;
        if (ms == 0)

Member

@geky Before returning, is it necessary to update the background timer?

Contributor

Ah yes, a better short-term fix is probably to change line 410 above:

    // check if we should stop dispatching soon
    if (ms >= 0) {
        deadline = equeue_tickdiff(timeout, tick);
-       if (deadline <= 0) {
+       if (ms == 0 || deadline <= 0) {
            // update background timer if necessary
            if (q->background.update) {
                equeue_mutex_lock(&q->queuelock);
                if (q->background.update && q->queue) {
                    q->background.update(q->background.timer,
                            equeue_clampdiff(q->queue->target, tick));
                }
                q->background.active = true;
                equeue_mutex_unlock(&q->queuelock);
            }
            return;
        }
    }

@geky geky mentioned this pull request Jun 16, 2017
@theotherjimmy theotherjimmy changed the title Bug in dispatch function Fix a bug in the dispatch function Jun 19, 2017
@theotherjimmy
Contributor

theotherjimmy commented Jun 19, 2017

@geky Is this still relevant? Or does #4571 fix this bug?

@geky
Contributor

geky commented Jun 19, 2017

Not sure, I need to reproduce locally

@adbridge
Contributor

bump @geky

@theotherjimmy
Contributor

@geky were you able to reproduce this locally?

@0xc0170
Contributor

0xc0170 commented Jul 4, 2017

@hesolium Is this still relevant? There was a proposal to change this patch, plus some others referenced. Can you retest the latest master (rebase this on top of the latest master) and see if this patch is still needed? If it is, can you please respond to the comments?

@theotherjimmy
Contributor

@hesolium @geky Any news on this?

@theotherjimmy
Contributor

@hesolium @geky This has been idle for 2 weeks. We have fixed a few issues in the meantime, and this PR may no longer be relevant. I'm going to close it. Please reopen if it's still relevant.

@geky
Contributor

geky commented Jul 20, 2017

This is still relevant, but I haven't yet been able to verify whether the issue is still there. If we close this I'm probably going to forget about it, so I'm going to keep it open. If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have us debug it from scratch.

@geky geky reopened this Jul 20, 2017
@0xc0170
Contributor

0xc0170 commented Jul 20, 2017

If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have us debug it from scratch.

I'll try to summarise this into an issue today.

If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have us debug it from scratch.

Searching should find this. Keeping this PR open does not move it anywhere; a tracking issue sounds better.

@0xc0170
Contributor

0xc0170 commented Jul 20, 2017

@geky #4784; closing this. Please add any details to the tracking issue or edit it as needed.

@0xc0170 0xc0170 closed this Jul 20, 2017
@hesolium
Author

The whole thread focused on correcting the semaphore error, which probably has nothing to do with my case. My suggested change was to protect against large values of the 'deadline' variable. Does the semaphore bug fix affect programs that do not use the RTOS?

@Eric-A-Marks
Contributor

@hesolium I'm seeing this same problem with non-RTOS builds using dispatch(0) in mbed 5.11 (with backports of most of 5.15) on STM32, targeting ff1705_l151cc.

The fix of if (ms == 0 || deadline <= 0) appears to have solved my problems.

I'm aiming to do further testing with vanilla 5.15 to see if it still occurs, and potentially open a new issue.

@hesolium
Author

I am currently using Mbed v6.5 (bare metal). I can check whether the bug still exists. I have much better tools now, e.g. the Ozone debugger :), so I guess I can find the module that is causing the error.

HeS

@hesolium
Author

In Mbed v6.5 the bug has disappeared. It's hard to tell why, because the source code (in the 'events' subdirectory) has been deeply refactored compared to v5.11. I guess all you have to do is run away to the new version :)
Good luck,
HeS

@Eric-A-Marks
Contributor

Thanks for checking. Upgrading to 6.x isn't viable for my current project, but I'll keep it in mind next time around.
