
FRDM-K64F boot hang w/ mcuboot + lwm2m client #8005

Closed

mike-scott opened this issue May 29, 2018 · 27 comments
Labels: area: Kernel · bug (The issue is a bug, or the PR is fixing a bug) · priority: medium (Medium impact/importance bug)

Comments

@mike-scott (Contributor)

Using the latest master, I can produce a boot hang on the FRDM-K64F when using mcuboot + the lwm2m client chain-loaded from slot 0. The interesting bit is that IPv6 needs to be disabled.

Bisecting led to this commit:
1856e22#diff-23bba5938c8ac686f219f01f58478b9aR152

If I change line 152 from:

update_cache(0);

to

update_cache(1);

the boot is able to finish (not saying this is a fix, just noting that this makes the boot hang go away).
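For context, the logic around that line looks roughly like this (a simplified sketch, not the literal kernel/sched.c source; see the commit above for the real code):

/* Simplified sketch of the scheduler's ready-queue cache update.
 * With preempt_ok == 0, a currently running cooperative thread is
 * kept in the cache even if a "better" thread just became runnable;
 * with preempt_ok == 1, the cache is always repointed at the best
 * ready thread, i.e. anything becoming ready can preempt the
 * current thread.
 */
static void update_cache(int preempt_ok)
{
	struct k_thread *th = next_up();     /* best thread in the ready queue */

	if (should_preempt(th, preempt_ok)) {
		_kernel.ready_q.cache = th;        /* switch away */
	} else {
		_kernel.ready_q.cache = _current;  /* keep the current (coop) thread */
	}
}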

Even more curious, if I flash the lwm2m client directly and avoid mcuboot, this issue doesn't happen.

The thread that's hanging is started via k_work_q_start() here:
https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/net/ip/net_tc.c#L343
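For reference, net_tc.c starts those queues at a cooperative priority, roughly like this (a simplified sketch of the 1.12-era k_work_q_start() usage; the stack size and priority values are illustrative, not the real ones):

#include <kernel.h>

/* The important part is the cooperative priority: K_PRIO_COOP()
 * yields a negative priority, so the resulting work queue thread is
 * not preemptible and only gives up the CPU when it pends or yields.
 */
K_THREAD_STACK_DEFINE(tx_stack, 1024);
static struct k_work_q tx_work_q;

static void start_tx_queue(void)
{
	k_work_q_start(&tx_work_q, tx_stack,
		       K_THREAD_STACK_SIZEOF(tx_stack),
		       K_PRIO_COOP(7));
}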

TBH, I don't understand the scheduler changes enough to really debug this further.

@mike-scott (Contributor, Author)

@andyross Looking for some guidance in the best way to debug this further.

@mike-scott (Contributor, Author)

This bug has a VERY racy feel to it. When I add printk() calls for debugging, that can make the problem go away as well.

@mike-scott (Contributor, Author)

Adding @agross-linaro for mild interest.

@andyross (Contributor)

If I had to guess, it's likely a bad assumption in the app or maybe the network code. The scheduler change addressed circumstances in the new scheduler (and apparently the older one too?) where a thread at cooperative (i.e. non-preemptible) priority would release the CPU and switch to a different thread when it wasn't supposed to. Per the documentation, cooperative threads are not supposed to be descheduled until they deliberately pend or yield.

The change you made allows any thread being added to the ready queue (i.e. basically everything -- threads being woken up, having their priorities changed, being created, etc.) to preempt whatever was already running, which is wrong.

So... I'm guessing that somewhere there's an assumption that after a wakeup and/or thread creation the new high-priority thread will have "already run", but it doesn't, because the thread that is already running is at a cooperative priority. (And of course this kind of thing is absolutely going to be racy, which is exactly what you see.)

Try listing out the threads in the app, changing them one at a time to a preemptible priority, and see if you can identify the specific spot where a thread wakes up and "doesn't run right away as the app/library expects".
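For example, something along these lines in each thread under test (illustrative only; the priority value is arbitrary):

#include <kernel.h>

/* Illustrative only: drop the calling thread to a preemptible
 * (non-negative) priority so a newly woken higher-priority thread
 * gets the CPU immediately, instead of waiting for the caller to
 * pend or yield. K_PRIO_PREEMPT() gives a priority >= 0,
 * K_PRIO_COOP() gives a priority < 0 (never involuntarily descheduled).
 */
static void make_current_thread_preemptible(void)
{
	k_thread_priority_set(k_current_get(), K_PRIO_PREEMPT(7));
}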

@andyross (Contributor)

Oh, and the fix there might be as simple as stuffing a k_yield() call in to let the current thread chill for a bit.

@mike-scott (Contributor, Author)

There are only 2 other threads running when work_q_start() is called; one is the main system thread.

@andyross (Contributor)

I'm looking at this and getting stuck seeing how you could possibly get a hang inside work_q_start() because of NON-preemption. Can you post the .config file for the app? Or just grep out anything with "PRIO" in it?

@nashif added the "bug" label on May 29, 2018
@mike-scott (Contributor, Author)

@andyross I agree that this bug makes no sense.

I tracked down where the init was hanging and ended up in the net_tc.c work queue init functions. I then looked at work_q_main() (the thread entry function for a work queue), and this change seems to have made a difference:

diff --git a/kernel/work_q.c b/kernel/work_q.c
index 6121f2c3e..19148ec23 100644
--- a/kernel/work_q.c
+++ b/kernel/work_q.c
@@ -28,6 +28,7 @@ static void work_q_main(void *work_q_ptr, void *p2, void *p3)
 
                work = k_queue_get(&work_q->queue, K_FOREVER);
                if (!work) {
+                       k_yield();
                        continue;
                }

However, I'm not sure why. I've also tried adding a printk there and it doesn't seem to be hitting the if clause at all.

@mike-scott (Contributor, Author)

EDIT: I was wrong; it is coming back with NULL work:

***** Booting Zephyr OS 1.12.0-rc1 *****
[MCUBOOT] [INF] main: Starting bootloader
[MCUBOOT] [INF] boot_status_source: Image 0: magic=unset, copy_done=0xff, image_ok=0xff
[MCUBOOT] [INF] boot_status_source: Scratch: magic=unset, copy_done=0xe, image_ok=0xff
[MCUBOOT] [INF] boot_status_source: Boot source: slot 0
[MCUBOOT] [INF] boot_swap_type: Swap type: none
[MCUBOOT] [INF] main: Bootloader chainload address offset: 0x20000
[MCUBOOT] [INF] main: Jumping to the first image slot
[dev/eth_mcux] [DBG] eth_0_init: MAC 00:04:9f:87:c4:36
work:0x2000786c
work:0x2000786c
work:0x2000786c
work:0x00000000
work:0x00000000
work:0x00000000
work:0x00000000
[repeats endlessly]

@mike-scott (Contributor, Author)

Using a patch like this:

index 6121f2c3e..831224a17 100644
--- a/kernel/work_q.c
+++ b/kernel/work_q.c
@@ -27,6 +27,7 @@ static void work_q_main(void *work_q_ptr, void *p2, void *p3)
                k_work_handler_t handler;
 
                work = k_queue_get(&work_q->queue, K_FOREVER);
+printk("work:%p\n", work);
                if (!work) {
                        continue;
                }

@pfalcon (Contributor) commented May 30, 2018

Copied from #8012 (comment):

Note that those lines (if (!work) continue) were added in c1fa82b for the purpose of supporting cancellation of work queue items. Cancellation is a valid use case where k_queue_get() and friends can return NULL even when called with K_FOREVER. If you have another case, similar to #4358, it needs separate consideration.

@andyross (Contributor)

Referencing @pfalcon's question from #8012

In the default config, the main thread is preemptive. And it's fair to assume that the net subsys initialization should be "atomic". Running this initialization on the preemptive main thread is rather not ideal. We should not only remove the k_yield, we should do more: prevent preemption.

Our terminology is a little odd: most OSes use the term "preemption" to refer to an interrupt-driven context switch, but we also use it to mean a thread being scheduled away before it reaches a natural stopping point (i.e. going to sleep, pending, or yielding). From a cooperative Zephyr thread, creating a new thread, calling k_sem_give() to wake up a sleeper, etc. will not result in the kernel switching away to the new thread, even if it is nominally a higher priority. Apparently there were some bugs in the old scheduler's behavior here that allowed this, and the recent fix to be stricter has exposed code that relied on the old behavior.

All that said: "preventing preemption" isn't and shouldn't be the goal here. If you want those drivers and subsystems to work in an SMP context, cooperative threads stop being a useful synchronization technique entirely because even if your CPU doesn't run code unexpectedly, the other one will.

Really this just looks like it was a simple race. The driver is delivering packets too early somehow, and that's not something you can fix with thread priorities in the general case (e.g. they might be delivered out of an ISR). All I meant with that comment was that different apps can have different priorities and scheduling behavior for the system initialization thread and so might interact with the network initialization code in different ways.
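To be concrete about the alternative: if a consumer must not run before initialization completes, make that ordering explicit with a synchronization object rather than relying on a cooperative priority to keep other threads off the CPU. A minimal sketch of the idea (all names here are illustrative, not the actual net stack API), which also stays correct under SMP:

#include <kernel.h>

/* Gate packet processing on initialization having finished. */
K_SEM_DEFINE(net_init_done, 0, 1);

static void init_fn(void *a, void *b, void *c)
{
	/* ... bring up drivers, queues, etc. ... */
	k_sem_give(&net_init_done);
}

static void rx_fn(void *a, void *b, void *c)
{
	/* Block until init has finished, no matter which thread
	 * (or CPU) happens to run first.
	 */
	k_sem_take(&net_init_done, K_FOREVER);

	/* ... from here on, pend on the RX queue and process packets ... */
}

K_THREAD_DEFINE(init_tid, 1024, init_fn, NULL, NULL, NULL,
		K_PRIO_PREEMPT(7), 0, K_NO_WAIT);
K_THREAD_DEFINE(rx_tid, 1024, rx_fn, NULL, NULL, NULL,
		K_PRIO_PREEMPT(7), 0, K_NO_WAIT);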

@mike-scott (Contributor, Author)

@andyross @pfalcon this is an example boot log of -rc2 where the scheduler ends up going a bit crazy:
https://hastebin.com/zasuhepixi.log

The diff for adding the logging looks like this: https://hastebin.com/sakicuvuse.cpp

You can see near the end of the log that the net_tc TX workq eventually gets into a work==NULL loop and the __swap() call keeps putting it back to the same workq_thread. Even a k_yield() in work_q_main() doesn't help.

I'm still looking at how this ends up happening.

@mike-scott (Contributor, Author)

NOTE: The work==NULL loop happens well after the network is initialized and is sending out data.

@mike-scott (Contributor, Author)

Pinging @jukkar for network subsystem knowledge

@mike-scott (Contributor, Author)

For reference, the same application running on -rc1 with a similar set of logging added:
https://hastebin.com/weropocapi.log

@andyross (Contributor)

FWIW: the inability to yield out of the work queue thread's loop is correct. It's a high-priority cooperative thread. You can yield, but that just allows the scheduler to pick another thread at the same or higher priority to run. After a k_yield() the original thread remains runnable, and under no circumstances will the scheduler choose to run a lower-priority thread while a higher-priority one is available.

It's the spinning that's the problem. A high-priority thread spinning on the CPU is a deadlock condition by design.
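To make the yield semantics concrete, a tiny illustrative example (unrelated to the actual net code):

#include <kernel.h>

/* A cooperative thread that spins, even with k_yield(), starves every
 * preemptible thread: k_yield() only gives way to threads of equal or
 * higher priority, and cooperative priorities (< 0) always beat
 * preemptible ones (>= 0), so starved() below never runs.
 */
static void spinner(void *a, void *b, void *c)
{
	for (;;) {
		k_yield();   /* no effect: nothing of equal/higher prio is ready */
	}
}

static void starved(void *a, void *b, void *c)
{
	/* never reached while spinner() is runnable */
}

K_THREAD_DEFINE(spin_tid, 512, spinner, NULL, NULL, NULL,
		K_PRIO_COOP(7), 0, K_NO_WAIT);
K_THREAD_DEFINE(starved_tid, 512, starved, NULL, NULL, NULL,
		K_PRIO_PREEMPT(7), 0, K_NO_WAIT);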

@nashif added the "priority: medium" label on May 31, 2018
@mike-scott reopened this on May 31, 2018
@mike-scott (Contributor, Author)

@nashif there are more issues behind this bug.

@nashif (Member) commented May 31, 2018

You then need to open multiple bugs; having one bug entry with multiple issues will make it difficult to manage the release.

@mike-scott (Contributor, Author)

@nashif ok, I'll open a more generic bug indicating we have interoperability issues with:

  • the scheduler rewrite denying preemption of high-priority (cooperative) threads (before, it didn't)
  • the k_poll path in k_queue_get() not being covered by our tests -- yet any network-enabled sample has that path enabled, because NETWORKING selects POLL (see the sketch below)
  • the net_tc work queues being marked as high-priority cooperative threads, which are now effectively not preemptible

End result: the bug is not fixed yet; the previous patch just hid the issue a bit better.
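For reference, my understanding of the k_poll path mentioned above is roughly the following (a heavily simplified sketch, not the literal kernel/queue.c source; the helper name is made up):

/* With CONFIG_POLL enabled, k_queue_get() does not pend directly on the
 * queue. Instead the thread waits on a k_poll event and then tries a
 * non-blocking get -- so a wakeup without data actually being available
 * can hand back NULL even though the caller passed K_FOREVER, which is
 * what work_q_main() is seeing above.
 */
static void *queue_get_via_poll(struct k_queue *queue, s32_t timeout)
{
	struct k_poll_event event;

	k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
			  K_POLL_MODE_NOTIFY_ONLY, queue);

	if (k_poll(&event, 1, timeout) != 0) {
		return NULL;
	}

	/* may still be NULL if the item disappeared or the wakeup was spurious */
	return k_queue_get(queue, K_NO_WAIT);
}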

@mike-scott (Contributor, Author)

It's almost as if the thread pended in k_poll() mistakenly wakes up as the next ready thread.

@nashif (Member) commented May 31, 2018

the scheduler rewrite denying preempt on high priority threads (before it didn't)

Is this thread cooperative? It should not be preempted by higher-priority threads; it runs to the end and can only be preempted by ISRs.

Edit: that is also commented on by Andy above...

@mike-scott (Contributor, Author)

Yes, the net_tc TX/RX workq threads are marked cooperative.

@mike-scott (Contributor, Author)

The issue is, as @andyross says, that the workq thread is spinning in work_q_main(): it wakes up without having been handed any work (a NULL work item is returned along with an -EAGAIN error from _swap()), and the loop calls k_queue_get() again and wakes up again (rinse, repeat).

@mike-scott (Contributor, Author)

@andyross Should we add a check for the thread's POLLING status somewhere in update_cache(), to avoid setting the ready-queue cache to the same thread if it's in k_poll()?

@mike-scott (Contributor, Author)

Closing this issue and continuing the discussion in:
#8049

@pfalcon (Contributor) commented Jun 4, 2018

@andyross: Thanks for the response.

All that said: "preventing preemption" isn't and shouldn't be the goal here.

I bet it isn't. I treat it as a debugging/exploratory measure: if there's a suspicion of thread concurrency issues, then making all the threads involved cooperative is the easiest (and very coarse, of course) way to synchronize them.

But the funny thing is that while speculative analysis shows concurrency issues in the IP stack, switching the main thread to cooperative (the TX, RX, and system workq threads already are) doesn't affect the behavior of the practical issues I face (e.g. #7831). Of course, the explanation would be that there are still concurrency issues, but lower-hanging issues in the stack show up before they do.

That's why I'm interested in @mike-scott's case: it seems that there are concurrency issues there and that preemption does make a difference (or even exposes bugs in preemption).

For reference, my WIP notes on the IP stack concurrency here: https://github.com/pfalcon/zephyr/wiki/IPStackWalkthru
