Skip to content

tests: add test for context switch from ISR #16284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

maribu
Copy link
Member

@maribu maribu commented Apr 6, 2021

Contribution description

The new tests triggers context switching from ISR with a high load
with enabled stack tests. If the context switching code does not disable
IRQs until a context switch is fully completed, an IRQ might be served
while not all the saved context is consumed from the stack. Normally,
this is not an issue, as the context switch is just resumed after the
ISR completes. However, under high load data on the stack can accumulate
and eventually overflow.

This tests puts the board under high IRQ load to see if such stack
overflows can happen.

Testing procedure

make BOARD=<SOME_BOARD> -C tests/isr_context_switch flash test

Issues/PRs references

Came up in #16211

@maribu maribu added Area: tests Area: tests and testing framework Type: new feature The issue requests / The PR implemements a new feature for RIOT labels Apr 6, 2021
@fjmolinas
Copy link
Contributor

Looks good, can you add a README to the test? Something like your PR description would be enough IMO.

@fjmolinas
Copy link
Contributor

All my boards passed the test, I'll ACK once there is a small README, please trigger the CI afterward @maribu

@fjmolinas fjmolinas added Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines labels Apr 7, 2021
@fjmolinas
Copy link
Contributor

ping @maribu

@maribu
Copy link
Member Author

maribu commented Apr 20, 2021

Sorry for the delay. Sadly, there is indeed a bug in the AVR context switch implementation, as @nandojve pointed out, that this test doesn't trigger.

I'll try to spice up the test first. @nandojve found a way to do so, but it depends on a platform being able to trigger IRQs on GPIOs by configuring them as output and controlling them. While this works fine on AVR, I don't really want to have this test relying on this.

@nandojve: I think this was actually the approach used for context switching some time ago for AVR, but it resulting in a GPIO pin and a precious external interrupt being used. It was actually quite smart, as the context pushed to stack would be identical regardless whether it was triggered from within an ISR, or synchron by a "soft IRQ". Reverting to this implementation would solve the issue again. I have some ideas on how to fake a soft IRQ without relying on real hardware. This would solve the issue and use avrlibc for saving and restoring the context, getting rid of a lot of inline assembly in RIOT. It would also ease adding new AVR boards that are already supported by the avrlibc.

@maribu
Copy link
Member Author

maribu commented Apr 20, 2021

I'll rebase on master to get make compile-commands in this branch.

@nandojve
Copy link
Contributor

I'll rebase on master to get make compile-commands in this branch.

@maribu , I would like remind that if there is a possibility to use nested ISR I would say IRQ0 should be the answer. Why? when user enable IRQ any high priority can preempt current one. If context switch is on IRQ0 ( highest priority ) no one can preempt it and the bug is solved for that scenario too.

@maribu maribu force-pushed the tests/isr_context_switch branch from 98d2f6e to f6d119f Compare April 22, 2021 06:09
@maribu
Copy link
Member Author

maribu commented Apr 22, 2021

OK, I basically rewrote the test. This test is no able to confirm the issue for ATxmega board. However, I still fail to trigger the stack overflow on an Arduino Mega2560.

Since I had to use a hardware timer directly to trigger the issue reliably (well, for my atxmega-a1u-xpro at least), the test ended up being a bit more painful. It might be needed to adjust some test parameters for some platforms to get the test working :-(

@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 21, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@benpicco benpicco added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Jul 20, 2021
@github-actions github-actions bot added the Area: doc Area: Documentation label Jul 21, 2021
@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@kaspar030
Copy link
Contributor

still relevant I guess

@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Mar 2, 2022
@maribu maribu added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels May 30, 2022
@maribu maribu force-pushed the tests/isr_context_switch branch 5 times, most recently from 8d0ecb3 to e15b929 Compare May 30, 2022 10:37
@maribu
Copy link
Member Author

maribu commented May 30, 2022

Looks like on the samr21-xpro the test actually found an issue

@crasbe
Copy link
Contributor

crasbe commented Apr 2, 2025

Tested successfully with an arduino-uno and nrf52dk target. On the seeedstudio-xiao-nrf52840, this test will cause the device to crash and return to the bootloader.

This is what I get in hterm, I did not press anything.

Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2025.04-devel-385-gada67-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
{ "threads": [{ "name": "t1", "stack_size":

Edit: On an nrf52840dongle the test runs successfully:

/home/chris/RIOTstuff/riot-isr-context/RIOT/dist/tools/pyterm/pyterm -p "/dev/ttyACM0" -b "115200" -ln "/tmp/pyterm-chris" -rn "2025-04-02_18.04.36-tests_isr_context_switch-nrf52840dongle" --no-reconnect --noprefix --no-repeat-command-on-empty-line 
Twisted not available, please install it if you want to use pyterm's JSON capabilities
Connect to serial port /dev/ttyACM0
Welcome to pyterm!
Type '/exit' to exit.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2025.04-devel-385-gada67-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
TEST PASSED
{ "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 468}]}
{ "threads": [{ "name": "t2", "stack_size": 256, "stack_used": 168}]}
{ "threads": [{ "name": "t1", "stack_size": 256, "stack_used": 168}]}
s

make: Verzeichnis „/home/chris/RIOTstuff/riot-isr-context/RIOT/tests/isr_context_switch“ wird verlassen

@crasbe
Copy link
Contributor

crasbe commented Apr 2, 2025

Maybe relevant: I rebased the PR before testing (otherwise I couldn't have tested with the Xiao).

Testing was also positive on nucleo-f030r8 and nucleo-f302r8. I can test it on other Nucleos as well if required.

The new tests triggers context switching from ISR with a high load
with enabled stack tests. If the context switching code does not disable
IRQs until a context switch is fully completed, an IRQ might be served
while not all the saved context is consumed from the stack. Normally,
this is not an issue, as the context switch is just resumed after the
ISR completes. However, under high load data on the stack can accumulate
and eventually overflow.

This tests puts the board under high IRQ load to see if such stack
overflows can happen.
@maribu maribu force-pushed the tests/isr_context_switch branch from e15b929 to 40f905a Compare April 2, 2025 19:37
@riot-ci
Copy link

riot-ci commented Apr 2, 2025

Murdock results

✔️ PASSED

eb1d7a3 fixup! tests/core: add test for context switching from ISR

Success Failures Total Runtime
25 0 25 01m:49s

Artifacts

@maribu
Copy link
Member Author

maribu commented Apr 2, 2025

Maybe relevant: I rebased the PR before testing (otherwise I couldn't have tested with the Xiao).

Yeah, with the PR being that old, it really makes sense to rebase it. I just did so now to make use of the "new" test folder structure and the periph_timer_query_freqs feature.

Testing was also positive on nucleo-f030r8 and nucleo-f302r8. I can test it on other Nucleos as well if required.

Since this is more about the implementation of the context switching, I would expect all MCUs of a given architecture to either fail or all to pass, unless there are bugs in the test or in the periph_timer implementation.

This is why I'm more than a bit puzzled about the test failure on the seeedstudio-xiao-nrf52840, when it does work on other boards with the same MCU.

Looks like on the samr21-xpro the test actually found an issue

It just seems to be slow:

2025-04-02 21:42:45,922 # Help: Press s to start test, r to print it is ready
s
2025-04-02 21:42:48,281 # START
2025-04-02 21:42:48,288 # main(): This is RIOT! (Version: 2025.04-devel-388-g40f90-tests/isr_context_switch)
2025-04-02 21:42:48,293 # Testing 500 context switches triggered from ISR
2025-04-02 21:42:48,296 # INFO: timer running at 1000000 Hz
2025-04-02 21:43:20,986 # TEST PASSED
2025-04-02 21:43:20,993 # { "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 424 }]}

@maribu
Copy link
Member Author

maribu commented Apr 2, 2025

Whoopsie:

main(): This is RIOT! (Version: 2025.04-devel-388-g40f90-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
timeout short: 10, timeout long: 1000
scheduler(): stack overflow detected, pid=2
*** RIOT kernel panic:
STACK OVERFLOW

*** halted.

Inside isr -2

@maribu
Copy link
Member Author

maribu commented Apr 2, 2025

Interesting: Prefixing make with FEATURES_BLACKLIST=no_idle_thread solves the stack over flow on the samr21-xpro.

@crasbe
Copy link
Contributor

crasbe commented Apr 3, 2025

I'm not sure what changed, but now the test passes on the seeedstudio-xiao-nrf52840, but it still resets into the bootloader after the test:

/home/chris/RIOTstuff/riot-isr-context/RIOT/dist/tools/pyterm/pyterm -p "/dev/ttyACM0" -b "115200" -ln "/tmp/pyterm-chris" -rn "2025-04-03_11.41.43-tests_isr_context_switch-seeedstudio-xiao-nrf52840" --no-reconnect --noprefix --no-repeat-command-on-empty-line 
Twisted not available, please install it if you want to use pyterm's JSON capabilities
Connect to serial port /dev/ttyACM0
Welcome to pyterm!
Type '/exit' to exit.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
START
main(): This is RIOT! (Version: 2025.04-devel-392-geb1d7-tests/isr_context_switch)
Testing 5000 context switches triggered from ISR
INFO: timer running at 16000000 Hz.
Timeout at start: 125, incrementing timeout in steps of: 31
s
TEST PASSED
{ "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 504}]}
Serial port disconnected, waiting to get reconnected...
Try to reconnect to /dev/ttyACM0 again...
Reconnected to serial port /dev/ttyACM0
Timeout in expect script at "child.expect("Testing 500 context switches triggered from ISR")" (tests/core/isr_context_switch/tests/01-run.py:20)

make: *** [/home/chris/RIOTstuff/riot-isr-context/RIOT/makefiles/tests/tests.inc.mk:26: test] Fehler 1
make: Verzeichnis „/home/chris/RIOTstuff/riot-isr-context/RIOT/tests/core/isr_context_switch“ wird verlassen

Perhaps it has to do with stdio_cdc_acm? Many tests blacklist that.



def testfunc(child):
child.expect("Testing 500 context switches triggered from ISR")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
child.expect("Testing 500 context switches triggered from ISR")
child.expect(r"Testing 500+ context switches triggered from ISR")

otherwise it passes on adafruit-feather-nrf52840-sense, but also resets to the bootloader.

Testing Context Switches from ISR under Load
============================================

This tests launches two threads, t1 and t2, which each try to lock its own mutex for 500 times. A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This tests launches two threads, t1 and t2, which each try to lock its own mutex for 500 times. A
This tests launches two threads, t1 and t2, which each try to lock its own mutex for 500 or 5000 times, depending on the MCU. A

@maribu
Copy link
Member Author

maribu commented Apr 10, 2025

I will sadly have to change the test here again.

I'm able to cause a stack overflow when slightly changing the timing for pretty much any Cortex-M board. I would like to have this test fail reliably on every Cortex-M board, as long as the issue is there.

But the fact that this test can trigger the stack overflow if it hits the right timing in the context switch shows that the underlying strategy is working. I'll probably have to sweep with the timing gently over a wide range of timeouts and for each do enough repetitions to cause a stack overflow, if the stack would grow due to context switching. But that does sound like a pretty expensive test then.

@mguetschow
Copy link
Contributor

Maybe some more evidence: I've noticed that with BOARD=nrf52840dk the thread stacksize info would be printed for t1 and t2 when using stdio_cdc_acm (and a second USB cable), but not without it.

With

diff --git a/tests/core/isr_context_switch/main.c b/tests/core/isr_context_switch/main.c
index 13399c0a06..e522a85dc0 100644
--- a/tests/core/isr_context_switch/main.c
+++ b/tests/core/isr_context_switch/main.c
@@ -81,6 +81,7 @@ static void *t1_impl(void *unused)
     for (unsigned repetitions = 0; repetitions < TEST_REPETITIONS; repetitions++) {
         mutex_lock(&sig_t1);
     }
+    puts("t1 FINISHED");
     return NULL;
 }
 
@@ -91,6 +92,7 @@ static void *t2_impl(void *unused)
         mutex_lock(&sig_t2);
     }
     mutex_unlock(&sig_main);
+    puts("t2 FINISHED");
     return NULL;
 }

I now get no stacksize information any longer (and no extra output) with USEMODULE=stdio_cdc_acm, and without it:

2025-04-10 14:14:35,418 # START
2025-04-10 14:14:35,423 # main(): This is RIOT! (Version: 2025.04-devel-392-geb1d7-HEAD)
2025-04-10 14:14:35,427 # Testing 5000 context switches triggered from ISR
2025-04-10 14:14:35,431 # INFO: timer running at 16000000 Hz.
2025-04-10 14:14:35,436 # Timeout at start: 125, incrementing timeout in steps of: 31
2025-04-10 14:14:38,130 # TEST PASSED
2025-04-10 14:14:38,135 # { "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 396 }]}
2025-04-10 14:14:38,138 # t2*** RIOT kernel panic:
2025-04-10 14:14:38,139 # MEM MANAGE HANDLER
2025-04-10 14:14:38,139 # 
2025-04-10 14:14:38,140 # *** halted.
2025-04-10 14:14:38,141 # 
2025-04-10 14:14:38,142 # Inside isr -12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: doc Area: Documentation Area: tests Area: tests and testing framework CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Type: new feature The issue requests / The PR implemements a new feature for RIOT
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants