Description
I just submitted #35301, which asserts when a thread tries to context switch with interrupts locked. This is an error, but it has traditionally worked in Zephyr and we haven't been able to detect it. Now we can.
But it broke in one spot in the test suite, on qemu_cortex_a53_smp. The symptom turned out to be that the ztest "1cpu" setup routine (which is a system call when USERSPACE=y) was coming out of the trap handler with interrupts masked. The function then creates a highest priority thread to "hold" a CPU, which I guess wants to preempt the current thread[1] and thus tries to context switch. That hits the new assertion.
It's a little suspicious that this got hit only in that one spot out of the whole test suite[2], so there may be more complexity here than I've diagnosed. But the symptom seems really clear. At the top of the system call I can do e.g. __ASSERT(arch_irq_unlocked(arch_irq_lock()), "") and watch it fail. Which is definitely 100% wrong. System calls need to operate with interrupts unmasked, otherwise the kernel becomes a giant latency trap.
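To make the logic of that probe concrete, here is a minimal host-side sketch of what the check relies on: the lock call returns a key that encodes the previous mask state, and the "unlocked" predicate inspects that key. Everything prefixed sim_ is a hypothetical stand-in I made up for illustration, not the real ARM64 arch layer, and the single boolean is only an assumed model of the CPU's interrupt mask bit. Unlike the one-line __ASSERT above, this version also restores the previous state after checking.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the CPU's interrupt mask bit (assumption). */
static bool sim_irqs_masked;

/* Mock of arch_irq_lock(): mask interrupts and return a key that
 * records whether they were already masked (1) or not (0). */
static unsigned int sim_arch_irq_lock(void)
{
    unsigned int key = sim_irqs_masked ? 1u : 0u;

    sim_irqs_masked = true;
    return key;
}

/* Mock of arch_irq_unlock(): restore the mask state recorded in key. */
static void sim_arch_irq_unlock(unsigned int key)
{
    sim_irqs_masked = (key != 0u);
}

/* Mock of arch_irq_unlocked(): true if interrupts were unmasked
 * at the moment the key was taken. */
static bool sim_arch_irq_unlocked(unsigned int key)
{
    return key == 0u;
}

/* The probe itself: reports whether the caller was entered with
 * interrupts unmasked, then puts the mask back the way it found it. */
static bool sim_entered_with_irqs_unmasked(void)
{
    unsigned int key = sim_arch_irq_lock();
    bool unmasked = sim_arch_irq_unlocked(key);

    sim_arch_irq_unlock(key);
    return unmasked;
}
```

In this model, a system call body entered with sim_irqs_masked already set makes the probe return false, which corresponds to the assertion failure I see at the top of the 1cpu setup call.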
[1] This might be something to look at too: on x86 and xtensa SMP, the IPI announcing that new thread to the other CPU almost always gets picked up synchronously before the current thread reaches the schedule point, so it's extremely rare for a newly spawned high priority thread to actually try to preempt the current thread. Maybe IPIs aren't being delivered promptly?
[2] And in fact that use of 1cpu in the test case was unnecessary anyway, so I removed it. We don't even need to work around this ARM64 issue right now.