-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG][ADL-P] DSP panic on device boot-up or resume-from-suspend for sof-adl-max98357a-rt5682-2way tplg #6964
Comments
@johnylin76 similar issue i saw with suspend resume cycling on Skolas too ( sof-adl-max98360a-rt5682.tplg) Attach below dmesg logs.. Will perform more experiments. I also think the DMIC16k loading is the cause from the minimal logs.
|
@johnylin76 @sathyap-chrome One suspicion is whether the SOF tree used here, has any components that are not safe w.r.t. to freeing widgets. For some background, SOF kernel&FW have enabled dynamic loading of pipeline elements in upstream. As one requirement, kernel has started to free FW widgets (versus old FW versions where the free was not used at all, everything was static and FW power was just cut off with all state lost). Recently we found issues with newer kernel and older 2.0-based SOF FW. This was fixed with patch merged as https://lore.kernel.org/r/20221101114913.1292671-1-kai.vehmanen@linux.intel.com and this should be present in ChromeOS kernels as well. That patch used SOF2.2 as a cutoff point when the new flow is used. E.g. if FW is newer than 2.2.0, then FW is expected to cleanly implement the free widget flows. We have tests against stable-2.2 tree in upstream CI that regularly tests this for all kernel changes. But, but, I now wonder whether the FW in this case has some module (or just a different type of code flow with existing code), that triggers on error. This FW branch is based on 2.2 SOF, so kernel will use the new flow. A quick way to test for this hypothesis is to modify the version check in sof_ipc3_tear_down_all_pipelines() and to skip the widget free flow for 2.2.0 as well. So something like: - SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 2, 0)) {
+ SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 3, 0)) { FYI to @ranj063 as well |
I'm afraid the version in the firmware comes from parsing the output of |
@marc-hb wrote:
I'm all for moving to a plain-version-in-make (like Linux and many other projects do), but in this specific case, a vanilla FW built from rpl-001-drop-stable branch has: $ git describe --tags --abbrev=12 --match v* --dirty ... so the version is correctly set and kernel will use the new flow (that assume FW can handle free_widget calls). |
Though tear down pipelines is well tested from v2.2, there could be 3P modules which need to support. Hence we will see ipc error and DSP panic following tear down flow So extending the upstream commit: https://lore.kernel.org/all/20221101114913.1292671-1-kai.vehmanen@linux.intel.com/ to RPL001 FW too till we introduce dynamic pipelines and all 3P are tested with dynamic pipelines. Discussion: thesofproject/sof#6964 BUG=b:251144395 TEST=Verify suspend stress for 100 cycles Change-Id: Id75eff542d3ed32c836e689ff84e7a0a67cd9cfb Signed-off-by: Sathya Prakash M R <sathya.prakash.m.r@intel.com> Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/4179517 Tested-by: Sathya Prakash <sathya.prakash.m.r@intel.corp-partner.google.com> Reviewed-by: Sathya Prakash <sathya.prakash.m.r@intel.corp-partner.google.com> Reviewed-by: Curtis Malainey <cujomalainey@chromium.org> Commit-Queue: Curtis Malainey <cujomalainey@chromium.org>
@lgirdwood wrote: @Vamshigopal @sathyap-chrome any update on this? I see there's a patch merged to chrome kernel tree... |
@kv2019i we have a patch by increasing the SOF_FW_VER check to 2.3 in chrome kernel tree as suggested by you. |
--- a/sound/soc/sof/ipc3-topology.c
+++ b/sound/soc/sof/ipc3-topology.c
@@ -2330,7 +2330,7 @@ static int sof_ipc3_tear_down_all_pipelines(struct snd_sof_dev *sdev, bool verif
/* Do not free widgets for static pipelines with FW older than SOF2.2 */
if (!verify && !swidget->dynamic_pipeline_widget &&
- SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 2, 0)) {
+ SOF_FW_VER(v->major, v->minor, v->micro) < SOF_FW_VER(2, 3, 0)) {
mutex_lock(&swidget->setup_mutex);
swidget->use_count = 0;
mutex_unlock(&swidget->setup_mutex);
@Vamshigopal I believe @kv2019i's suggestion was only a temporary test hack to narrow down the issue, not a fix. I think it was only intended to point the finger at the code that frees widgets in order to understand the issue better; not to fix it. It fools the kernel into thinking the firmware is "old", I mean < 2.2. But that firmware branch is not actually old.
... except maybe for some component(s) which are actually "old"? In that case is such a mix of old and new firmware code safe? No idea sorry. |
Agree @marc-hb we took it as a temporary fix in chrome kernel
We have seen this issue when we have thirdpatry modules in pipeline, may be we should investigate more on why tp modules cannot free widgets . |
I don't think it's exactly right. In my description above DSP panic issues are reproducible with and without Waves (third-party module). However it might have bad influence since we mostly found out this issue on 3P module-integrated devices. |
@johnylin76 do we need the pipeline teardown feature for this device ? If not, then safest option is probably to disable it for the moment. This could be a use after free() situation. e.g. does the following patch make any difference ? diff --git a/src/lib/alloc.c b/src/lib/alloc.c
index a42ba3729..27dc0a79a 100644
--- a/src/lib/alloc.c
+++ b/src/lib/alloc.c
@@ -523,6 +523,7 @@ static void free_block(void *ptr)
* future, on top of the memory region now being used for
* different purposes on another core.
*/
+ memset(ptr, 0, block_map->block_size * hdr->size);
dcache_writeback_invalidate_region(ptr, block_map->block_size * hdr->size);
heap_is_full = !block_map->free_count; This would cause a NULL deref on any reuse after free(). |
@Vamshigopal @johnylin76 I'll lower priority as a workaround is available and deployed. I think there are two ways to go about this in upstream:
Now I'd prefer not to go via (2) approach. Avoiding this flow will significantly reduce validation coverage of pipeline setup/teardown paths, which are essential to enabling dynamic pipelines. And as SOF2.2 is our long-term support branch for cAVS2.5 hardware with IPC3, we'd like to keep covering this feature as well. The issue does not happen in SOF main or SOF stable-v2.2, so this does point to some out-of-tree component. It seems Waves is not the cause, so this would point to some other delta. |
This issue probably has the same root cause as what @cujomalainey proposed the fix #7056 for. Unfortunately we still got some user feedback with DSP panic encountered after we landed #6964 (comment) for skipping static pipeline tear-down and #6928 for running core re-distribution, although they considerably lowered the reproducing rate in most devices. |
|
Describe the bug
ChromeOS device: Brya/Taniks (ADL-P)
ChromeOS image version: R111-15313.0.0
Under this OS version the backporting v5.10 kernel commit chain is landed.
The waves-integrated SOF build is generated from Google-internal rpl-001-drop-stable branch (checkout head: https://chrome-internal-review.googlesource.com/c/chromeos/third_party/sound-open-firmware-private/+/5325552/1).
While issues being fixed on other devices e.g. Vell, Primus, we found the DSP panic is now observable on Taniks devices. And it seems to be (not strictly-verified) reproducible on Taniks only, and only after kernel backporting commits landing (it can pass by the latest SOF build + OS version 15308.0.0)
To Reproduce
Can be reproduced by Taniks (tplg: sof-adl-max98357a-rt5682-waves-2way). Verified that even if we removed waves (use sof-adl-max98357a-rt5682-2way.tplg instread) the issue is still reproducible.
suspend_stress_test -c 1
Observation
To summarize the observation so far:
(tplg/conf files can be found in the attached zip-file)
Impact
Audio broken on Taniks devices
Log Analysis
The observed DSP panic information is like the following:
Although the error was shown on DMIC0 (DMIC48K), by comparing between sof-loggers from the present tplg (left) and the one-core tplg (right), the error seems to start from the suspicious behavior while re-loading DMIC16K/KWD topology of the present tplg after resume-from-suspend. (logs can be found in the attached zip-file)
The following sof-loggers are extracted from the timing after resume-from-suspend. Arrows in yellow are starting points for loading a new pipeline. Both of the first loading pipe_id is 12 (KWD pipeline). Logs on the right (one-core tplg) show the next actions to create selector (arrow in green) and google-hotword-detect which are located on KWD pipeline, and then load pipe_id 11 (DMIC16K pipeline).
However, logs on the left (present tplg) show the different behavior which jumps to Core#1 for edf_scheduler_init (arrow in read) and then starts to load pipe_id 10 (DMIC48K pipeline). KWD/DMIC16K pipelines seem to be skipped loading since I didn't find in the following logs.
In the end of logs from the present tplg we can find the error message on ipc_comp_connect which leads to DSP issue like the example below (sink_id 59 stands for the selector located on KWD pipeline while source_id 60 is its source buffer).
From my perspective the missing logs for loading KWD/DMIC16K pipelines might be caused from the side effect of multi-core processing. However the ipc_comp_connect error shouldn't be expected which implies the defect by DSP recovering during suspend-resume. However, I have no idea why it is only observed on sof-adl-max98357a-rt5682-2way cases (w/ and w/o Waves). Would it happen to meet the corner case for AMP_SSP=2 or codec in TDM mode?
rpl-001-waves-dsp-panic-issue-taniks.zip
The text was updated successfully, but these errors were encountered: