Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zephyr/wrapper.c: dereference NULL in POSIX for better k_panic() trace #8886

Merged
merged 2 commits into from
Feb 28, 2024

Conversation

marc-hb
Copy link
Collaborator

@marc-hb marc-hb commented Feb 27, 2024

2 commits. Main one:

As discussed in the alternative approach zephyrproject-rtos/zephyr#68494, k_panic() in
POSIX mode has various shortcomings that do not provide a useful
trace. Useless pointers to signal handlers or other cleanup routines are
printed instead. Leverage our already existing
k_sys_fatal_error_handler() and dereference a NULL pointer there when in
POSIX mode. This "fails fast" and provides a complete and relevant stack
trace in CI when fuzzing or when using some other static
analyzer. Example of how fuzzing failure #8832 would have looked like in
CI results thanks to this commit:

./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each.
Running: ./rballoc_align_fuzz_crash
*** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 ***
[00:00:00.000,000] <inf> main: SOF on native_posix
[00:00:00.000,000] <inf> main: SOF initialized
@ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391
[00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic
[00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown)
[00:00:00.000,000] <err> zephyr: Halting emulation
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000
==1784402==The signal is caused by a WRITE memory access.
==1784402==Hint: address points to the zero page.
    #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19
    #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3
    #2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16
    #3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11
    #4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11
    #5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8
    #6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9
    #7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2
    #8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9
    #9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9
    #10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16
    #11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3
    #12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2

As discussed in the alternative approach
zephyrproject-rtos/zephyr#68494, k_panic() in
POSIX mode has various shortcomings that do not provide a useful
trace. Useless pointers to signal handlers or other cleanup routines are
printed instead. Leverage our already existing
k_sys_fatal_error_handler() and dereference a NULL pointer there when in
POSIX mode. This "fails fast" and provides a complete and relevant stack
trace in CI when fuzzing or when using some other static
analyzer. Example of how fuzzing failure thesofproject#8832 would have looked like in
CI results thanks to this commit:

```
./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each.
Running: ./rballoc_align_fuzz_crash
*** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 ***
[00:00:00.000,000] <inf> main: SOF on native_posix
[00:00:00.000,000] <inf> main: SOF initialized
@ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391
[00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic
[00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown)
[00:00:00.000,000] <err> zephyr: Halting emulation
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000
==1784402==The signal is caused by a WRITE memory access.
==1784402==Hint: address points to the zero page.
    #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19
    #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3
    thesofproject#2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16
    thesofproject#3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11
    thesofproject#4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11
    thesofproject#5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8
    thesofproject#6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9
    thesofproject#7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2
    thesofproject#8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9
    thesofproject#9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9
    thesofproject#10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16
    thesofproject#11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3
    thesofproject#12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2
```

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
They shouldn't do much but they can't hurt.

See thesofproject#8632 for more details.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 27, 2024

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 27, 2024

One known failure in https://sof-ci.01.org/sofpr/PR8886/build2954/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50, everything else ACE is green.

One suspend/resume failure in https://sof-ci.01.org/sofpr/PR8886/build2955/devicetest/index.html, everything else green.

This PR does not affect hardware.

@lrudyX , @wszypelt , ALL PUBLISH steps in all QB runs are currently failing like this:

10:33:55,890 ERROR - Step 'master>PUBLISH LOGS>GENERATE build.json' is failed: java.lang.IllegalArgumentException: The provided File object is not a directory: /localdisk/processteam/qba3/workspace/24733/Logs/dwmix/stdout.log

https://sof-ci.01.org/sof-pr-viewer/#/build/PR8886/build13640547

Copy link
Member

@cujomalainey cujomalainey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how the "answer" to all things in this context is also the cause of a crash. Nice choice on magic numbers.

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would indeed be good to have it solved properly in Zephyr...

* cleanup routine. See Zephyr POSIX limitations discussed in:
* https://github.com/zephyrproject-rtos/zephyr/pull/68494
*/
*_sof_fatal_null = 42;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually do it with something as simple as *(int *)NULL = 0; Any reason this wouldn't be enough? Never failed me until now, except on some particular hardware platforms like #6093

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimizers and modern code flow analysis tend to make a mess of code that does undefined things. Putting it in an extern guarantees (at least until LTO lands) that the compiler can't "fix" this. The fundamental problem is that the desired behavior here isn't in fact portable: there's no way under the C standard to "crash in a way that ASAN detects before the OS".

@wszypelt
Copy link

wszypelt commented Feb 28, 2024

@marc-hb generete json shouldn't be a problem for the PR(of course we'll probably fix it today/tomorrow anyway). The blocker is the unstable LNL test machine. Due to this, tests were not performed on it and it failed PRs.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tend to agree that the right spot for this is in the Zephyr platform layers, but it's a complicated thing to explain to anyone not actually looking at fuzzer output. And this is simple and obvious, and certainly app-defined error handling is a Zephyr feature too.

@lgirdwood
Copy link
Member

Tend to agree that the right spot for this is in the Zephyr platform layers, but it's a complicated thing to explain to anyone not actually looking at fuzzer output. And this is simple and obvious, and certainly app-defined error handling is a Zephyr feature too.

Ack - @andyross @nashif I guess Zephyr can own this as a future feature and we can transition to Zephyr version when its' ready.

@lgirdwood lgirdwood merged commit 0fee9d9 into thesofproject:main Feb 28, 2024
37 of 44 checks passed
@marc-hb marc-hb deleted the posix-null-kpanic branch February 28, 2024 17:01
@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 28, 2024

Ack - @andyross @nashif I guess Zephyr can own this as a future feature and we can transition to Zephyr version when its' ready.

Zephyr could probably make printing stack traces in POSIX easier but to be fair the incomplete stack trace on SIGABRT is a libFuzzer problem.

It's quite complicated, long story in zephyrproject-rtos/zephyr#68494

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants