-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zephyr/wrapper.c: dereference NULL in POSIX for better k_panic() trace #8886
Conversation
As discussed in the alternative approach zephyrproject-rtos/zephyr#68494, k_panic() in POSIX mode has various shortcomings that do not provide a useful trace. Useless pointers to signal handlers or other cleanup routines are printed instead. Leverage our already existing k_sys_fatal_error_handler() and dereference a NULL pointer there when in POSIX mode. This "fails fast" and provides a complete and relevant stack trace in CI when fuzzing or when using some other static analyzer. Example of how fuzzing failure thesofproject#8832 would have looked like in CI results thanks to this commit: ``` ./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each. Running: ./rballoc_align_fuzz_crash *** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 *** [00:00:00.000,000] <inf> main: SOF on native_posix [00:00:00.000,000] <inf> main: SOF initialized @ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391 [00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic [00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown) [00:00:00.000,000] <err> zephyr: Halting emulation AddressSanitizer:DEADLYSIGNAL ================================================================= ==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 ==1784402==The signal is caused by a WRITE memory access. ==1784402==Hint: address points to the zero page. #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19 #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3 thesofproject#2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16 thesofproject#3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11 thesofproject#4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11 thesofproject#5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8 thesofproject#6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9 thesofproject#7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2 thesofproject#8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9 thesofproject#9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9 thesofproject#10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16 thesofproject#11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3 thesofproject#12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2 ``` Signed-off-by: Marc Herbert <marc.herbert@intel.com>
They shouldn't do much but they can't hurt. See thesofproject#8632 for more details. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Real-world demo: https://github.com/thesofproject/sof/actions/runs/8072264170/job/22053627730?pr=8887 Funny how hard this can be in general: https://github.com/scottt/debugbreak |
One known failure in https://sof-ci.01.org/sofpr/PR8886/build2954/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50, everything else ACE is green. One suspend/resume failure in https://sof-ci.01.org/sofpr/PR8886/build2955/devicetest/index.html, everything else green. This PR does not affect hardware. @lrudyX , @wszypelt , ALL PUBLISH steps in all QB runs are currently failing like this:
https://sof-ci.01.org/sof-pr-viewer/#/build/PR8886/build13640547 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how the "answer" to all things in this context is also the cause of a crash. Nice choice on magic numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would indeed be good to have it solved properly in Zephyr...
* cleanup routine. See Zephyr POSIX limitations discussed in: | ||
* https://github.com/zephyrproject-rtos/zephyr/pull/68494 | ||
*/ | ||
*_sof_fatal_null = 42; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually do it with something as simple as *(int *)NULL = 0;
Any reason this wouldn't be enough? Never failed me until now, except on some particular hardware platforms like #6093
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimizers and modern code flow analysis tend to make a mess of code that does undefined things. Putting it in an extern guarantees (at least until LTO lands) that the compiler can't "fix" this. The fundamental problem is that the desired behavior here isn't in fact portable: there's no way under the C standard to "crash in a way that ASAN detects before the OS".
@marc-hb generete json shouldn't be a problem for the PR(of course we'll probably fix it today/tomorrow anyway). The blocker is the unstable LNL test machine. Due to this, tests were not performed on it and it failed PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tend to agree that the right spot for this is in the Zephyr platform layers, but it's a complicated thing to explain to anyone not actually looking at fuzzer output. And this is simple and obvious, and certainly app-defined error handling is a Zephyr feature too.
Ack - @andyross @nashif I guess Zephyr can own this as a future feature and we can transition to Zephyr version when its' ready. |
Zephyr could probably make printing stack traces in POSIX easier but to be fair the incomplete stack trace on SIGABRT is a libFuzzer problem. It's quite complicated, long story in zephyrproject-rtos/zephyr#68494 |
2 commits. Main one:
As discussed in the alternative approach zephyrproject-rtos/zephyr#68494, k_panic() in
POSIX mode has various shortcomings that do not provide a useful
trace. Useless pointers to signal handlers or other cleanup routines are
printed instead. Leverage our already existing
k_sys_fatal_error_handler() and dereference a NULL pointer there when in
POSIX mode. This "fails fast" and provides a complete and relevant stack
trace in CI when fuzzing or when using some other static
analyzer. Example of how fuzzing failure #8832 would have looked like in
CI results thanks to this commit: