Fix a bug in the dispatch function #4500
On Mbed OS 5.4 (NRF52-DK, no RTOS, no active SoftDevice), after a long run (timer/counter overflow?) `deadline` == 511999. In my case the timeout (parameter `ms`) is zero, yet `dispatch(0)` runs "forever". This modification is only a simple safeguard; I don't know where the real problem is (Mbed, Nordic drivers, ...).
@hesolium With which git revision of mbed OS do you encounter this issue?
I think it's simple to reproduce and does not depend on the revision. Modify the blinky example: declare an EventQueue and call `queue.dispatch(0)` in an endless loop. After several minutes the LED stops blinking.
Additional info: the wrong value of `deadline` is computed in line 409:
Could this be due to the sleep() function? Sleep is called while waiting for an event in `equeue_sema_wait()`. Maybe after a long time the NRF52 goes to sleep and wakes up on a timer interrupt, but the difference between `timeout` (= tick() + ms) at function entry and tick() after wakeup is large.
There may be an additional problem as well, causing our use case to assert. In `equeue_post` the semaphore count is increased, so every event we post increases it by one. The default maximum is 1024, which we hit after some test runs. In the `equeue_dispatch` function the events are executed in the first `while (es)` loop. Further down, before releasing any semaphores, there is a check whether we should return. Since we use no timeout (0 ms), I guess this is where the dispatch function returns for us. This probably means that in our use case, where we post events with a timeout of 0, we only ever increase the semaphore count and never decrease it. Hence, sooner or later we hit the max count and it asserts (which is what happens). Furthermore, if I understand correctly, the semaphore count should always be back down to 0 once the wait function is reached. I noticed that there is a previous(?) POSIX implementation (`equeue_posix.c`) where it seems to be a binary semaphore (bool), which means this problem would not occur.
The counter of semaphore objects in RTX v1 (mbed OS 5.0 - 5.4) is not bounded by the number of available resources passed during initialization: a release operation increases the counter even if the result exceeds that number. As noted, the consequence is that calls to wait might not block even if there are no events in the queue. The semaphore semantics changed slightly with RTX 2 (mbed OS 5.5): a release operation will not create new resource tokens. Do you see a change in your application's behavior if you use the master of the repo rather than mbed OS 5.4.x?
We should be testing with the master branch; I checked again with Andreas and he confirmed this should be the case. Furthermore, I am running with the debug profile, which means that `MBED_TRAP_ERRORS_ENABLED` is defined. Without this flag it does not crash, but it should maybe still be considered erroneous behavior. What happens is that we eventually get a call from `SVC_Handler` to `svcRtxSemaphoreRelease` in `rtos/rtx5/TARGET_CORTEX_M/rtx_semaphore.c`. The assert is raised when the token is to be released: a check in `SemaphoreTokenIncrement` causes the call to return 0, which triggers the `EvrRtxSemaphoreError` call, which is located in `/platform/mbed_retarget.cpp`. In the end we get the printout `Semaphore 0x20013ec8 error -17` and after that nothing works anymore.
@tomasfrisbergublox Given that the semaphore is used more or less as a condition variable (signal/wait operations) in the event queue, a call to Unfortunately, a solution would be to stop using semaphore objects and start using
Thanks for all of the investigation! You're right that the event queue library is expecting a binary semaphore. Unfortunately, until the update to RTX 5 we didn't have any way to implement a binary semaphore (we had "thread signals", but these used a bitmask that was shared with users). We assumed that using a counting semaphore would be fine as a temporary solution, thinking that the count would be large enough for us not to have to worry about overflow. It would waste cycles but get the job done. This was wrong: it turns out RTX just stores the count as a 16-bit number with no overflow protection:
In RTX 5 this limit is dropped down to the arbitrary number 1024. This issue also popped up in the socket layer (and may pop up in other places?). @deepikabhavnani was looking at how we should handle binary semaphores in the C++ layer over in #4560. However, I'm not sure this solves the original issue. The `equeue_dispatch` function is designed to never call `equeue_sema_wait` when the timeout is 0. Does the original implementation work in 5.5 with MBED_TRAP_ERRORS_ENABLED disabled? Possible issues:
I'll have to look into this more once I get a chance with an NRF52-DK. You could try running the event timing tests to see if the bug was just missed somehow: `mbed test -m NRF52_DK -t COMPILER -n tests-events-timing -v`
I put up a patch that adopts osEventFlags: #4571. This should take care of the semaphore issue; feel free to add feedback.
// Unrecognized bug on some platform (deadline == 511999)
if (ms >= 0 && deadline > ms) {
    deadline = ms;
    if (ms == 0)
@geky Before returning, is it necessary to update the background timer?
Ah yes, a better short-term fix is probably to change line 410 above:
// check if we should stop dispatching soon
if (ms >= 0) {
deadline = equeue_tickdiff(timeout, tick);
- if (deadline <= 0) {
+ if (ms == 0 || deadline <= 0) {
// update background timer if necessary
if (q->background.update) {
equeue_mutex_lock(&q->queuelock);
if (q->background.update && q->queue) {
q->background.update(q->background.timer,
equeue_clampdiff(q->queue->target, tick));
}
q->background.active = true;
equeue_mutex_unlock(&q->queuelock);
}
return;
}
}
Not sure, I need to reproduce locally.
bump @geky
@geky were you able to reproduce this locally?
@hesolium Is this still relevant? There was a proposal to change this patch, plus some others referenced. Can you retest latest master (rebase this on top of latest master) and see if this patch is still needed? If yes, can you please respond to the comments?
This is still relevant, but I haven't been able to verify whether the issue is still there yet. If we close this I'm probably going to forget about it, so I'm going to keep it open. If we really want to close this, please create an issue with a link here; I really don't want someone else to run into this bug and have to debug it from scratch.
I'll try to summarise this into an issue today.
Searching should find this. Opening this PR does not move this anywhere; a tracking issue sounds better.
The whole thread focused on correcting the semaphore error, which probably has nothing to do with my case. My suggested change was to protect against large values of the variable `deadline`. Does the semaphore bug fix affect programs that do not use the RTOS?
@hesolium I'm seeing this same problem with non-RTOS builds using dispatch(0) in mbed 5.11 (with back ports of most of 5.15) on STM32, targeting ff1705_l151cc. I'm aiming to do further testing with vanilla 5.15 to see if it still occurs, and potentially open a new issue.
I am currently using Mbed v6.5 (bare metal). I can check whether the bug still exists; I have much better tools now (e.g. the Ozone debugger :). I expect I can find the module that is causing the error. HeS
In Mbed v6.5 the bug has disappeared. It's hard to understand why, because the source code (in the 'events' subdirectory) has been deeply refactored compared to v5.11. I guess all you have to do is run away to the new version :)
Thanks for checking. Upgrading to 6.x isn't viable for my current project, but I'll keep it in mind next time around. |