Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][ADL-P] DSP panic on device boot-up or resume-from-suspend for sof-adl-max98357a-rt5682-2way tplg #6964

Open
johnylin76 opened this issue Jan 18, 2023 · 14 comments
Assignees
Labels
ADL Applies to Alder Lake platform bug Something isn't working as expected chrome Chromebooks or ChromeOS P2 Critical bugs or normal features

Comments

@johnylin76
Copy link
Contributor

Describe the bug
ChromeOS device: Brya/Taniks (ADL-P)
ChromeOS image version: R111-15313.0.0

Under this OS version the backporting v5.10 kernel commit chain is landed.
The waves-integrated SOF build is generated from Google-internal rpl-001-drop-stable branch (checkout head: https://chrome-internal-review.googlesource.com/c/chromeos/third_party/sound-open-firmware-private/+/5325552/1).
While issues being fixed on other devices e.g. Vell, Primus, we found the DSP panic is now observable on Taniks devices. And it seems to be (not strictly-verified) reproducible on Taniks only, and only after kernel backporting commits landing (it can pass by the latest SOF build + OS version 15308.0.0)

To Reproduce
Can be reproduced by Taniks (tplg: sof-adl-max98357a-rt5682-waves-2way). Verified that even if we removed waves (use sof-adl-max98357a-rt5682-2way.tplg instread) the issue is still reproducible.

  1. Flash OS image R111-15313.0.0
  2. DSP panic may be observed after device rebooted (not 100%)
  3. DSP panic can be ~100% reproducible after resume-from-suspend by command suspend_stress_test -c 1

Observation
To summarize the observation so far:

proposed fixes attached conf/tplg Core SPK Core DMIC48K Core DMIC16K/KWD DSP panic?
the present tplg sof-adl-max98357a-rt5682-waves-2way 1 1 0 Has DSP panic
tplg w/ removing Waves sof-adl-max98357a-rt5682-2way 1 1 0 Has DSP panic
tplg running on one core sof-adl-max98357a-rt5682-waves-2way-core0 0 0 0 NO
tplg w/ removing dmic16k/KWD sof-adl-max98357a-rt5682-waves-2way-nohotword 1 1 0 NO

(tplg/conf files can be found in the attached zip-file)

Impact
Audio broken on Taniks devices

Log Analysis

The observed DSP panic information is like the following:
image

Although the error was shown on DMIC0 (DMIC48K), by comparing between sof-loggers from the present tplg (left) and the one-core tplg (right), the error seems to start from the suspicious behavior while re-loading DMIC16K/KWD topology of the present tplg after resume-from-suspend. (logs can be found in the attached zip-file)

The following sof-loggers are extracted from the timing after resume-from-suspend. Arrows in yellow are starting points for loading a new pipeline. Both of the first loading pipe_id is 12 (KWD pipeline). Logs on the right (one-core tplg) show the next actions to create selector (arrow in green) and google-hotword-detect which are located on KWD pipeline, and then load pipe_id 11 (DMIC16K pipeline).
However, logs on the left (present tplg) show the different behavior which jumps to Core#1 for edf_scheduler_init (arrow in read) and then starts to load pipe_id 10 (DMIC48K pipeline). KWD/DMIC16K pipelines seem to be skipped loading since I didn't find in the following logs.
image

In the end of logs from the present tplg we can find the error message on ipc_comp_connect which leads to DSP issue like the example below (sink_id 59 stands for the selector located on KWD pipeline while source_id 60 is its source buffer).
image

From my perspective the missing logs for loading KWD/DMIC16K pipelines might be caused from the side effect of multi-core processing. However the ipc_comp_connect error shouldn't be expected which implies the defect by DSP recovering during suspend-resume. However, I have no idea why it is only observed on sof-adl-max98357a-rt5682-2way cases (w/ and w/o Waves). Would it happen to meet the corner case for AMP_SSP=2 or codec in TDM mode?

rpl-001-waves-dsp-panic-issue-taniks.zip

@johnylin76 johnylin76 added bug Something isn't working as expected chrome Chromebooks or ChromeOS ADL Applies to Alder Lake platform labels Jan 18, 2023
@sathyap-chrome
Copy link
Contributor

sathyap-chrome commented Jan 19, 2023

@johnylin76 similar issue i saw with suspend resume cycling on Skolas too ( sof-adl-max98360a-rt5682.tplg) Attach below dmesg logs.. Will perform more experiments.

I also think the DMIC16k loading is the cause from the minimal logs.

[  985.150618] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx timed out for 0x30110000 (msg/reply size: 12/12)
[  985.150629] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ IPC dump start ]------------
[  985.150668] sof-audio-pci-intel-tgl 0000:00:1f.3: hda irq intsts 0x00000000 intlctl 0xc0000000 rirb 00
[  985.150671] sof-audio-pci-intel-tgl 0000:00:1f.3: dsp irq ppsts 0x00000000 adspis 0x00000000
[  985.150698] sof-audio-pci-intel-tgl 0000:00:1f.3: error: host status 0x00000000 dsp status 0x00000000 mask 0x00000003
[  985.150701] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ IPC dump end ]------------
[  985.150704] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  985.150707] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware exception
[  985.150710] sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (6)
[  985.150735] sof-audio-pci-intel-tgl 0000:00:1f.3: status: fw entered - code 00000005
[  985.151265] sof-audio-pci-intel-tgl 0000:00:1f.3: unexpected fault 0x00000000 trace 0x00004000
[  985.151268] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------
**[  985.151274] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to free widget PIPELINE.10.DMIC0.IN
[  985.654502] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx timed out for 0x40010000 (msg/reply size: 76/12)**
[  985.654512] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware exception
[  985.654516] sof-audio-pci-intel-tgl 0000:00:1f.3: ctx_save ipc error -110, proceeding with suspend

@kv2019i

@kv2019i
Copy link
Collaborator

kv2019i commented Jan 19, 2023

@johnylin76 @sathyap-chrome One suspicion is whether the SOF tree used here, has any components that are not safe w.r.t. to freeing widgets.

For some background, SOF kernel&FW have enabled dynamic loading of pipeline elements in upstream. As one requirement, kernel has started to free FW widgets (versus old FW versions where the free was not used at all, everything was static and FW power was just cut off with all state lost).

Recently we found issues with newer kernel and older 2.0-based SOF FW. This was fixed with patch merged as https://lore.kernel.org/r/20221101114913.1292671-1-kai.vehmanen@linux.intel.com and this should be present in ChromeOS kernels as well.

That patch used SOF2.2 as a cutoff point when the new flow is used. E.g. if FW is newer than 2.2.0, then FW is expected to cleanly implement the free widget flows. We have tests against stable-2.2 tree in upstream CI that regularly tests this for all kernel changes.

But, but, I now wonder whether the FW in this case has some module (or just a different type of code flow with existing code), that triggers on error. This FW branch is based on 2.2 SOF, so kernel will use the new flow.

A quick way to test for this hypothesis is to modify the version check in sof_ipc3_tear_down_all_pipelines() and to skip the widget free flow for 2.2.0 as well. So something like:

-                   SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 2, 0)) {
+                   SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 3, 0)) {

FYI to @ranj063 as well

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 20, 2023

I'm afraid the version in the firmware comes from parsing the output of git describe with a regular expression which is obviously super brittle:

@kv2019i
Copy link
Collaborator

kv2019i commented Jan 20, 2023

@marc-hb wrote:

I'm afraid the version in the firmware comes from parsing the output of git describe with a regular expression which is obviously super brittle:

I'm all for moving to a plain-version-in-make (like Linux and many other projects do), but in this specific case, a vanilla FW built from rpl-001-drop-stable branch has:

$ git describe --tags --abbrev=12 --match v* --dirty
v2.2-59-g7f8e01dfc753

... so the version is correctly set and kernel will use the new flow (that assume FW can handle free_widget calls).

Vamshigopal pushed a commit to Vamshigopal/linux that referenced this issue Jan 24, 2023
Though tear down pipelines is well tested from v2.2, there could be
3P modules which need to support.

Hence we will see ipc error and DSP panic following tear down flow
So extending the upstream commit:
https://lore.kernel.org/all/20221101114913.1292671-1-kai.vehmanen@linux.intel.com/
to RPL001 FW too till we introduce dynamic pipelines and
all 3P are tested with dynamic pipelines.

Discussion: thesofproject/sof#6964

BUG=b:251144395
TEST=Verify suspend stress for 100 cycles

Change-Id: Id75eff542d3ed32c836e689ff84e7a0a67cd9cfb
Signed-off-by: Sathya Prakash M R <sathya.prakash.m.r@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/4179517
Tested-by: Sathya Prakash <sathya.prakash.m.r@intel.corp-partner.google.com>
Reviewed-by: Sathya Prakash <sathya.prakash.m.r@intel.corp-partner.google.com>
Reviewed-by: Curtis Malainey <cujomalainey@chromium.org>
Commit-Queue: Curtis Malainey <cujomalainey@chromium.org>
@kfrydryx kfrydryx added the P1 Blocker bugs or important features label Jan 24, 2023
@lgirdwood
Copy link
Member

@kfrydryx @kv2019i any status update ?

@kv2019i
Copy link
Collaborator

kv2019i commented Jan 27, 2023

@lgirdwood wrote:

@kfrydryx @kv2019i any status update ?

@Vamshigopal @sathyap-chrome any update on this? I see there's a patch merged to chrome kernel tree...

@Vamshigopal
Copy link
Contributor

@kv2019i we have a patch by increasing the SOF_FW_VER check to 2.3 in chrome kernel tree as suggested by you.
Issue is not observed after that.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 30, 2023

A quick way to test for this hypothesis is to modify the version check in ...

--- a/sound/soc/sof/ipc3-topology.c
+++ b/sound/soc/sof/ipc3-topology.c
@@ -2330,7 +2330,7 @@ static int sof_ipc3_tear_down_all_pipelines(struct snd_sof_dev *sdev, bool verif
 
                /* Do not free widgets for static pipelines with FW older than SOF2.2 */
                if (!verify && !swidget->dynamic_pipeline_widget &&
-                   SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 2, 0)) {
+                   SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 3, 0)) {
                        mutex_lock(&swidget->setup_mutex);
                        swidget->use_count = 0;
                        mutex_unlock(&swidget->setup_mutex);

we have a patch by increasing the SOF_FW_VER check to 2.3 in chrome kernel tree as suggested by you/@kv2019i. Issue is not observed after that.

@Vamshigopal I believe @kv2019i's suggestion was only a temporary test hack to narrow down the issue, not a fix. I think it was only intended to point the finger at the code that frees widgets in order to understand the issue better; not to fix it. It fools the kernel into thinking the firmware is "old", I mean < 2.2. But that firmware branch is not actually old.

One suspicion is whether the SOF tree used here, has any components that are not safe w.r.t. to freeing widgets.

... except maybe for some component(s) which are actually "old"? In that case is such a mix of old and new firmware code safe? No idea sorry.

@Vamshigopal
Copy link
Contributor

@Vamshigopal I believe @kv2019i's suggestion was only a temporary test hack to narrow down the issue, not a fix. I think it was only intended to point the finger at the code that frees widgets in order to understand the issue better; not to fix it. It fools the kernel into thinking the firmware is "old", I mean < 2.2. But that firmware branch is not actually old.

Agree @marc-hb we took it as a temporary fix in chrome kernel
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/4179517

... except maybe for some component(s) which are actually "old"? In that case is such a mix of old and new firmware code safe? No idea sorry.

We have seen this issue when we have thirdpatry modules in pipeline, may be we should investigate more on why tp modules cannot free widgets .

@johnylin76
Copy link
Contributor Author

We have seen this issue when we have thirdpatry modules in pipeline, may be we should investigate more on why tp modules cannot free widgets .

I don't think it's exactly right. In my description above DSP panic issues are reproducible with and without Waves (third-party module). However it might have bad influence since we mostly found out this issue on 3P module-integrated devices.

@lgirdwood
Copy link
Member

@johnylin76 do we need the pipeline teardown feature for this device ? If not, then safest option is probably to disable it for the moment. This could be a use after free() situation. e.g. does the following patch make any difference ?

diff --git a/src/lib/alloc.c b/src/lib/alloc.c
index a42ba3729..27dc0a79a 100644
--- a/src/lib/alloc.c
+++ b/src/lib/alloc.c
@@ -523,6 +523,7 @@ static void free_block(void *ptr)
         * future, on top of the memory region now being used for
         * different purposes on another core.
         */
+       memset(ptr, 0, block_map->block_size * hdr->size);
        dcache_writeback_invalidate_region(ptr, block_map->block_size * hdr->size);
 
        heap_is_full = !block_map->free_count;

This would cause a NULL deref on any reuse after free().

@kv2019i kv2019i added P2 Critical bugs or normal features and removed P1 Blocker bugs or important features labels Feb 3, 2023
@kv2019i
Copy link
Collaborator

kv2019i commented Feb 3, 2023

@Vamshigopal @johnylin76 I'll lower priority as a workaround is available and deployed.

I think there are two ways to go about this in upstream:

  1. identify the module that is having trouble with the free_widget,
  2. modify upstream Linux kernel so that old flow is used also for SOF2.2

Now I'd prefer not to go via (2) approach. Avoiding this flow will significantly reduce validation coverage of pipeline setup/teardown paths, which are essential to enabling dynamic pipelines. And as SOF2.2 is our long-term support branch for cAVS2.5 hardware with IPC3, we'd like to keep covering this feature as well.

The issue does not happen in SOF main or SOF stable-v2.2, so this does point to some out-of-tree component. It seems Waves is not the cause, so this would point to some other delta.

@johnylin76
Copy link
Contributor Author

This issue probably has the same root cause as what @cujomalainey proposed the fix #7056 for.

Unfortunately we still got some user feedback with DSP panic encountered after we landed #6964 (comment) for skipping static pipeline tear-down and #6928 for running core re-distribution, although they considerably lowered the reproducing rate in most devices.

@lgirdwood
Copy link
Member

This issue probably has the same root cause as what @cujomalainey proposed the fix #7056 for.

Unfortunately we still got some user feedback with DSP panic encountered after we landed #6964 (comment) for skipping static pipeline tear-down and #6928 for running core re-distribution, although they considerably lowered the reproducing rate in most devices.

@mwasko fyi - fix for #7056 could help here too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADL Applies to Alder Lake platform bug Something isn't working as expected chrome Chromebooks or ChromeOS P2 Critical bugs or normal features
Projects
None yet
Development

No branches or pull requests

7 participants