
scheduler: threads using k_sleep can be _swap()'d back too early #8128

Closed
mike-scott opened this issue Jun 3, 2018 · 25 comments
Labels: area: Kernel; bug (The issue is a bug, or the PR is fixing a bug); priority: medium (Medium impact/importance bug)
@mike-scott
Contributor

mike-scott commented Jun 3, 2018

Scheduler uses a function "should_preempt()" to determine if the _current thread should be _swap()'d away from. Part of that calculation uses a function _is_thread_prevented_from_running() which evaluates if the _current thread is "runnable".

Currently, the k_sleep() API doesn't mark a thread as not runnable in any way other than making the following expression true:
thread->base.timeout.delta_ticks_from_prev != _INACTIVE

Originally, I felt that _is_thread_ready() would be a better replacement for _is_thread_prevented_from_running() as this takes into account active timeouts and would handle the k_sleep() case. But, a patch which changed this behavior is causing problems with the scheduler tests, so more research is needed.
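To make the distinction concrete, here is a minimal standalone sketch of the two checks being compared. The struct and function names are stand-ins, not the real Zephyr types: the point is only that a k_sleep()ing thread sets no state flag, so a flags-only check still sees it as runnable, while a check that also consults the armed timeout does not.

```c
#include <stdbool.h>

#define T_INACTIVE (-1)

/* Stand-in for the relevant bits of struct k_thread.base (mock only). */
struct mock_thread {
    unsigned int state_flags;   /* nonzero => PENDING/PRESTART/SUSPENDED */
    int delta_ticks_from_prev;  /* != T_INACTIVE while a timeout is armed */
};

/* Analogue of _is_thread_prevented_from_running(): consults state flags
 * only, so a sleeping thread with no flag set still looks runnable. */
static bool prevented_from_running(const struct mock_thread *t)
{
    return t->state_flags != 0;
}

/* Analogue of _is_thread_ready(): additionally requires that no timeout
 * is armed, which would cover the k_sleep() case. */
static bool thread_ready(const struct mock_thread *t)
{
    return t->state_flags == 0 && t->delta_ticks_from_prev == T_INACTIVE;
}
```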

@mike-scott
Contributor Author

@nashif @andyross I'm going to try and document every "issue" concerning the scheduler over the next day so we can see all of the various symptoms still in play. From there we can de-dupe as items are debugged.

This is issue #1. I attempted a patch:
#8126

But it's currently causing other issues.

@mike-scott
Contributor Author

Found the bug in #8126 and I'm running my own personal soak test with 13 IoT devices to see how well they perform.

@mike-scott
Contributor Author

@andyross I ran a local sanitycheck with #8126 and it's passing now as well.

@andyross
Contributor

andyross commented Jun 4, 2018

So.... looking at this from the context of trying to make it happen in a test, I'm flipping back. I thought I had a test case for this (about to submit) that showed the failure, but that was a bad assert. k_sleep() actually works correctly right now, though the logic is distressingly subtle:

  • Sleep does not flag the thread PENDING or otherwise unrunnable, but it does call _remove_thread_from_ready_q()

  • _r_t_f_r_q() then does the removal, and (because _current is the thread being removed) calls update_cache() with preempt_ok as TRUE, indicating that the cache should be updated to reflect whatever next thread was found even if the current thread is otherwise nonpreemptible.

  • update_cache() then calls next_up() to retrieve the next thread, which because _current was removed is guaranteed to be something else (an idle thread if nothing else is available).

  • Then it calls should_preempt() to see if it should replace _current with whatever it found, which will be 1 because preempt_ok is a short-circuit case there.

  • Then, having updated the cache, we return into k_sleep() and call _Swap(), guaranteeing that we stop running and switch to the cached thread. And because we are no longer in the run queue, we are guaranteed not to be scheduled again until our timeout expires and re-adds us.

Basically, I don't see how this can be the bug. The k_sleep() state management works now for, effectively, the same reason it worked before the change. I agree we want to clean it up (ideally to make PENDING and "sleeping" the same state), but I don't see a logic failure.
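The bullet sequence above can be sketched in miniature. Everything here is a simplified stand-in (a tiny singly linked run queue, invented names, single CPU), intended only to show why forcing preempt_ok when _current removes itself guarantees the cache is refreshed to some other thread:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; not the real kernel/sched.c definitions. */
struct mthread { struct mthread *next; };

static struct mthread idle_thread;
static struct mthread *run_queue;   /* head of a tiny singly linked list  */
static struct mthread *current;     /* analogue of _current               */
static struct mthread *cache_next;  /* analogue of _kernel.ready_q.cache  */

/* next_up(): by the time this runs, _current has already been removed,
 * so this yields some other thread, falling back to idle when empty. */
static struct mthread *next_up(void)
{
    return run_queue != NULL ? run_queue : &idle_thread;
}

/* should_preempt(): preempt_ok short-circuits every other check. */
static bool should_preempt(struct mthread *th, bool preempt_ok)
{
    (void)th;
    if (preempt_ok) {
        return true;
    }
    /* cooperative-priority checks elided in this sketch */
    return false;
}

static void update_cache(bool preempt_ok)
{
    struct mthread *th = next_up();

    if (should_preempt(th, preempt_ok)) {
        cache_next = th;   /* _Swap() will resume this thread */
    }
}

/* _remove_thread_from_ready_q() analogue: removing _current forces
 * preempt_ok = true, so the cache always reflects the next thread. */
static void remove_thread_from_ready_q(struct mthread *t)
{
    if (run_queue == t) {
        run_queue = t->next;
    }
    update_cache(t == current);
}
```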

All that said, I'm leaving the +1 in place and agree we should merge this. The change is correct logically, easily reviewed, and I'm willing to buy that it fixes some other yet-more-subtle state somewhere else, given that it seems to be resolving your problems.

@andyross
Contributor

andyross commented Jun 4, 2018

FWIW: if you want to catch the presumptive bug closer to the act, one suggestion would be to add an assertion (if possible, otherwise just a manual test + log/abort/whatever) in should_preempt() that looks like:

__ASSERT_NO_MSG(_is_thread_prevented_from_running(_current) != _is_thread_ready(_current));

I'd be very curious to see the results. A pretty careful re-read and grep of every path I can see to _add_timeout() shows that it's never possible for a thread to have a timeout in a case where it gets inspected as the top of the run queue.

@mike-scott
Contributor Author

@andyross I agree with your assessment that k_sleep does manage to sort of work initially.

However, with patch applied, I've had no oops or hangs in the 5 hours of testing on several devices. Previously, I would have seen at least 1 device go offline by now.

I'll set up an assert to see if I can catch the path that is causing errors.

@pfalcon pfalcon changed the title scheduler: threads using k_sleep can be _swap()'d back to early scheduler: threads using k_sleep can be _swap()'d back too early Jun 4, 2018
@pfalcon
Contributor

pfalcon commented Jun 4, 2018

I'll set up an assert to see if I can catch the path that is causing errors.

Not sure if this should be reopened then.

@mike-scott
Contributor Author

@pfalcon Hard to say.

The reality is that 6c95daf uses the "correct" check, which makes sure (even in some backwater path) that the thread we are moving to is in fact ready.

It has also been running on several of my devices for the past 18-ish hours without resulting in a hang or kernel oops which was previously showing up after an hour or so.

@andyross has agreed that there is more work to do in the scheduler with regards to cleanup and the first patch he is submitting in v1.13 will move k_sleep towards using the same pend/unpend methods that the other APIs use so it won't be a snowflake like it is now.
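A hedged sketch of that direction (all names hypothetical; this is not the actual v1.13 patch): sleep would flag the thread pended and arm a timeout, like the other blocking APIs do, and the timeout handler would unpend it, so the ordinary "prevented from running" check would then cover sleeping threads too.

```c
#include <stdbool.h>

#define PENDING_FLAG 0x1u

/* Mock thread; stand-in for the relevant k_thread state. */
struct sleep_thread {
    unsigned int state_flags;
    int wake_tick;          /* tick at which the timeout fires; -1 = none */
};

/* Sleep path: mark the thread pended and arm a timeout, as the other
 * blocking APIs do, instead of only removing it from the ready queue. */
static void sleep_pend(struct sleep_thread *t, int now, int ticks)
{
    t->state_flags |= PENDING_FLAG;
    t->wake_tick = now + ticks;
    /* ...then swap away, as k_sleep() does today... */
}

/* Timeout expiry: clear the pend flag and disarm the timeout, making the
 * thread runnable again through the normal unpend path. */
static void sleep_timeout_fired(struct sleep_thread *t)
{
    t->state_flags &= ~PENDING_FLAG;
    t->wake_tick = -1;
}

/* With this scheme, a flags-only runnability check is sufficient. */
static bool runnable(const struct sleep_thread *t)
{
    return (t->state_flags & PENDING_FLAG) == 0;
}
```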

Should we keep a general purpose "Cleanup the scheduler and look for concurrency bugs" issue open? Maybe? But I think this issue is a bit too specific for that.

@pfalcon
Contributor

pfalcon commented Jun 4, 2018

Sounds good, I hope the news about the assert above won't be lost/forgotten then ;-) (even if the news is "didn't get to it").

@mike-scott
Contributor Author

mike-scott commented Jun 5, 2018

So the k_sleep() issue is real.

Using this diff on master branch: https://hastebin.com/ihoyazofad.diff

I generate this log while performing an OTA update using Zephyr sample: https://hastebin.com/uxojakapul.log

Here's the backtrace from gdb: https://hastebin.com/bifoneripu.pas

(gdb) bt
#0  k_cpu_idle () at /home/scottml/osf/zmp-osf/zephyr/arch/arm/core/cpu_idle.S:135
#1  0x0000cdd0 in _SysFatalErrorHandler (reason=<optimized out>, pEsf=<optimized out>)
    at /home/scottml/osf/zmp-osf/zephyr/arch/arm/core/sys_fatal_error_handler.c:70
#2  0x0000c94e in _oops () at /home/scottml/osf/zmp-osf/zephyr/arch/arm/core/swap_helper.S:345
#3  <signal handler called>
#4  should_preempt (th=<optimized out>, preempt_ok=0) at /home/scottml/osf/zmp-osf/zephyr/kernel/sched.c:129
#5  update_cache (preempt_ok=preempt_ok@entry=0) at /home/scottml/osf/zmp-osf/zephyr/kernel/sched.c:215
#6  0x0002cc78 in _add_thread_to_ready_q (thread=0x200043f8 <prio_recv_thread_data>)
    at /home/scottml/osf/zmp-osf/zephyr/kernel/sched.c:237
#7  0x0002cf90 in _ready_thread (thread=0x200043f8 <prio_recv_thread_data>)
    at /home/scottml/osf/zmp-osf/zephyr/kernel/include/ksched.h:240
#8  do_sem_give (sem=0x2000a118 <sem_prio_recv>) at /home/scottml/osf/zmp-osf/zephyr/kernel/sem.c:102
#9  _impl_k_sem_give (sem=0x2000a118 <sem_prio_recv>) at /home/scottml/osf/zmp-osf/zephyr/kernel/sem.c:137
#10 0x0001af80 in isr_rx_conn (trx_done=<optimized out>, rssi_ready=<optimized out>, crc_ok=1 '\001')
    at /home/scottml/osf/zmp-osf/zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c:3695
#11 isr_radio_state_rx (rssi_ready=<optimized out>, irkmatch_id=<optimized out>, irkmatch_ok=<optimized out>, 
    devmatch_id=<optimized out>, devmatch_ok=<optimized out>, crc_ok=1 '\001', trx_done=<optimized out>)
    at /home/scottml/osf/zmp-osf/zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c:3850
#12 isr () at /home/scottml/osf/zmp-osf/zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c:4393
#13 0x0001c9fc in radio_nrf5_isr_body () at /home/scottml/osf/zmp-osf/zephyr/subsys/bluetooth/controller/ll_sw/ll.c:83
#14 radio_nrf5_isr () at /home/scottml/osf/zmp-osf/zephyr/subsys/bluetooth/controller/ll_sw/ll.c:81
#15 <signal handler called>
#16 __pendsv () at /home/scottml/osf/zmp-osf/zephyr/arch/arm/core/swap_helper.S:104
#17 <signal handler called>
#18 0x00000006 in ?? ()
#19 0x000000fe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@mike-scott mike-scott reopened this Jun 5, 2018
@mike-scott
Contributor Author

@pfalcon @andyross ^^

@andyross andyross modified the milestones: v1.12.0, v1.13.0 Jun 5, 2018
@andyross
Contributor

andyross commented Jun 5, 2018

Cool, thanks. Will rework for 1.13 and bug you to retest on the rig when it's ready.

@andyross andyross added the priority: high High impact/importance bug label Jun 5, 2018
@nashif nashif added the bug The issue is a bug, or the PR is fixing a bug label Jun 5, 2018
@mike-scott
Contributor Author

mike-scott commented Jun 5, 2018

Also here's the _kernel object: https://hastebin.com/ducusuxata.xml

This shows:

  • _current = 0x200027a0 <engine_thread_data>, which is the LwM2M client thread using k_sleep(500)
  • timeout_q also shows <engine_thread_data> & <net_neighbor_pool> in waiting status
  • ready_q.cache is set to <_idle_thread_s>
  • the only thread in the runq is <prio_recv_thread_data>

Backtrace shows:

  • ISR woke up bluetooth controller
  • eventually do_sem_give() was called for <sem_prio_recv>
  • in do_sem_give():
    1. we called _ready_thread(<prio_recv_thread_data>)
    2. which calls _add_thread_to_ready_q(<prio_recv_thread_data>)
    3. which calls update_cache(0)
    4. which calls should_preempt()

Discussion points:
Q: Why is the _current thread still <engine_thread_data>?
A: Is it possible we had the ISR happen right in the middle of the k_sleep call after <engine_thread_data> was removed from the runq?

Q: Why is the readyq.cache showing <_idle_thread_s>?

This in the backtrace looks odd:

#15 <signal handler called>
#16 __pendsv () at /home/scottml/osf/zmp-osf/zephyr/arch/arm/core/swap_helper.S:104
#17 <signal handler called>

Were we in the middle of another __pendsv when the ISR happened?
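As a standalone reconstruction of the window described above (all names, priorities, and both function bodies are invented for illustration, in the spirit of the _is_thread_ready() replacement proposed at the top of the issue): the ISR's _ready_thread() path reaches should_preempt() with preempt_ok = 0 while _current is still the k_sleep()ing, cooperative-priority engine thread, so the decision is made against a thread that has already left the run queue.

```c
#include <stdbool.h>

#define T_INACTIVE (-1)

/* Mock thread; negative prio = cooperative, following Zephyr convention. */
struct rthread {
    int prio;
    int delta_ticks_from_prev;  /* != T_INACTIVE while a timeout is armed */
    unsigned int state_flags;
};

static struct rthread *current_thread;   /* analogue of _current */

/* Flags-only view: a sleeping cooperative _current with no state flag set
 * is treated as non-preemptible, even though it is really asleep. */
static bool should_preempt_old(struct rthread *th, bool preempt_ok)
{
    (void)th;
    if (preempt_ok) {
        return true;
    }
    if (current_thread->state_flags == 0 && current_thread->prio < 0) {
        return false;   /* "current is cooperative and runnable": stale view */
    }
    return true;
}

/* Timeout-aware view: an armed timeout also counts as "not runnable". */
static bool should_preempt_new(struct rthread *th, bool preempt_ok)
{
    (void)th;
    if (preempt_ok) {
        return true;
    }
    bool cur_runnable = current_thread->state_flags == 0 &&
                        current_thread->delta_ticks_from_prev == T_INACTIVE;
    if (cur_runnable && current_thread->prio < 0) {
        return false;
    }
    return true;
}
```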

@mike-scott
Contributor Author

@andyross I don't think the scheduler is working correctly (still). The k_sleep() patch merely cleaned up the logic in should_preempt. But, TBH the above circumstances should never happen.

@andyross
Contributor

andyross commented Jun 5, 2018

Scheduling decisions should be atomic. I spent way too much time staring at the synchronization over the last two days and found nothing. But it's always possible to miss something. If you want to hunt yourself, MOST of the synchronization is in the LOCKED() blocks and can be easily verified to be a sane critical section. The scarier bits are in the timeout management code, which uses traditional irq_lock().

(Now I know you're shouting "Ah hah!" here, since it's the timeout code we're specifically looking at. And I agree. Except I couldn't see anything wrong. Nonetheless, a cleanup is going to move the timeout stuff under the scheduler spinlock where it belongs, so a fix is at least plausible.)

And surely a PendSV exception is not itself interruptible, no? I can't imagine how that would work if it was.

@mike-scott
Contributor Author

Ah, never mind the issue with readyq.cache pointing to idle: we're already panicked, and I'm looking at the current state, not what it was prior to k_panic(). I'll add a logging point to print readyq.cache before the panic.

@mike-scott
Contributor Author

@andyross as a follow-up. The above stack is completely legitimate as the k_sleep() thread was the _current thread when the IRQ hit.

I think the underlying problem is that the LwM2M engine service thread has too high of a priority for what it's doing. And it can't be preempted by most other threads.

@mike-scott
Contributor Author

@andyross I think we have a problem with k_sleep() still. It has to do with this change:
5b37cd7#diff-7f89e4cbbdcbb89c82dd2cbb00343602

It assumes that all calls to __swap() go through PendSV. We know that k_sleep() doesn't pend() the thread like the other paths do. This means we're in an "interesting" state when we hit line 104 "msr BASEPRI, r0" in swap_helper.S, doesn't it?

@mike-scott
Contributor Author

mike-scott commented Jun 6, 2018

Actually, the bug here (if there is one) is that _current isn't updated until 2 lines AFTER the msr BASEPRI, r0 call. Meaning that an IRQ hitting once interrupts are re-enabled will trigger all of the ready_thread code and use the OLD _current thread value for comparison, skipping the new_thread entirely.

@mike-scott
Contributor Author

Hrm... that's not true either: we shouldn't be getting an interrupt at line 104 at all.

We don't turn interrupts back on until line 163: msr BASEPRI, r0.

@mike-scott
Contributor Author

@cvinayak for reference

@andyross andyross added priority: medium Medium impact/importance bug and removed priority: high High impact/importance bug labels Jun 28, 2018
@andyross
Contributor

Dropping priority, as a workaround was committed for 1.12. The root cause (PendSV can be interrupted before entry, making _Swap() inherently non-atomic on ARM!) needs to be documented in a separate bug, and an ARM-only workaround put in place for 1.13.

@mike-scott
Contributor Author

@andyross I think this can be closed. Correct?

@nashif
Member

nashif commented Aug 17, 2018

Yes, let's close it.

@nashif nashif closed this as completed Aug 17, 2018
@andyross
Contributor

Yes, this is worked around. I really should write up a bug on the root cause issue (_Swap() on ARM is nonatomic because of the way pendsv works), and the code should cleanly reflect that and maybe implement it in an ARM-specific way instead of checking scheduler pending states needlessly. Someone please nag me to do that.
