Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [multi-core] FW panic when stream is tested on core 1 on MTL RVP #7108

Closed
RanderWang opened this issue Feb 16, 2023 · 33 comments
Closed
Assignees
Labels
bug Something isn't working as expected IPC4 Issues observed with IPC4 (same IPC as Windows) MTL Applies to Meteor Lake platform mtl-005 multicore Issues observed when not only core#0 is used. P1 Blocker bugs or important features
Milestone

Comments

@RanderWang
Copy link
Collaborator

RanderWang commented Feb 16, 2023

When Core 1 is enabled by topology for stream 1, test the following cmd: aplay works for the first time but is failed at the second time

root@sh-mtlp-rvp-nocodec-04:~# aplay 48K_Let_It_Go.wav -Dhw:0,1 -d 3;sleep 1;aplay 48K_Let_It_Go.wav -Dhw:0,1 -d 3
Playing WAVE '48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
aplay: pcm_write:2127: write error: Input/output error

No issue on TGL RVP

FW: branch sof main and mtl-003-stable
Kernel: branch sof-dev + thesofproject/linux#4198
kernel_log.txt
trace.txt

Kernel log:

[   98.517121] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[   98.517996] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[   98.518220] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[   98.518229] Message payload: 00000000: 00000002 00000002
[   98.518233] snd_sof:snd_sof_dsp_core_get: sof-audio-pci-intel-mtl 0000:00:1f.3: Core 1 powered up
[   98.518237] snd_sof:snd_sof_dsp_core_get: sof-audio-pci-intel-mtl 0000:00:1f.3: Core ref 1 
[   98.518243] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x11000005|0x100000: GLB_CREATE_PIPELINE
[   99.025987] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x11000005|0x100000
[   99.026000] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[   99.026015] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0x91000005|0x100000|0x0, target: 0x1b060000|0x0|0x0, ctl: 0x3
[   99.026021] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[   99.026024] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[   99.026027] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[   99.026030] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[   99.026041] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0x5, ROM error: 0x0
[   99.026044] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[   99.026052] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit enabled
[   99.026054] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[   99.026059] sof-audio-pci-intel-mtl 0000:00:1f.3: failed to create module pipeline.3
[   99.026069] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[   99.528976] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[   99.528990] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[   99.528998] Message payload: 00000000: 00000002 00000000
[   99.529004] sof-audio-pci-intel-mtl 0000:00:1f.3: Failed to set up connected widgets

FW log:

[00:00:03.565,590] <inf> dai_intel_ssp: dai_ssp_pm_runtime_dis_ssp_power: dai_ssp_pm_runtime_dis_ssp_power index 1
[00:00:03.565,595] <inf> dai_intel_ssp: dai_ssp_pm_runtime_dis_ssp_power: dai_ssp_pm_runtime_dis_ssp_power I2SLCTL
[00:00:03.566,001] <inf> ipc: ipc_cmd: rx	: 0x47000000|0x0
[00:00:04.583,588] <inf> ipc: ipc_cmd: rx	: 0x47000000|0x0
[00:00:04.584,705] <inf> ipc: ipc_cmd: rx	: 0x11000005|0x100000
[00:00:04.584,728] <err> os:  ** FATAL EXCEPTION
[00:00:04.584,735] <err> os:  ** CPU 1 EXCCAUSE 9 (load/store alignment)
[00:00:04.584,740] <err> os:  **  PC 0xa006c3ac VADDR 0x60313
[00:00:04.584,745] <err> os:  **  PS 0x50f20
[00:00:04.584,760] <err> os:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:15 CALLINC:1)
[00:00:04.584,768] <err> os:  **  A0 0xa006c3d8  SP 0xa00e7820  A2 0x60313  A3 0x60323
[00:00:04.584,775] <err> os:  **  A4 0x6006c4fe  A5 0xa00ec540  A6 0xa00e781c  A7 0x400f14fc
[00:00:04.584,781] <err> os:  **  A8 0x6006c510  A9 0xa00ec520 A10 0xa00e781c A11 0x10
[00:00:04.584,788] <err> os:  ** A12 (nil) A13 0x400de6a0 A14 0xfffffffe A15 0x1
[00:00:04.584,793] <err> os:  ** LBEG 0xa0037126 LEND 0xa0037137 LCOUNT 0xa00683e8
[00:00:04.584,798] <err> os:  ** SAR 0x1


Backtrace:0xa006c3a9:0xa00e7820 0xa006c3d5:0xa00e7850 0xa006790c:0xa00e7880 0xa0067ced:0xa00e78a0 0xa0045594:0xa00e78d0 0xa0034e16:0xa00e7900 0xa003311b:0xa00e7930 

[00:00:04.584,853] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
[00:00:04.584,858] <err> os: Current thread: 0x400f0b88 (unknown)
@RanderWang RanderWang added bug Something isn't working as expected MTL Applies to Meteor Lake platform IPC4 Issues observed with IPC4 (same IPC as Windows) P1 Blocker bugs or important features labels Feb 16, 2023
@RanderWang
Copy link
Collaborator Author

Found a hint in kernel driver, try it first

@RanderWang
Copy link
Collaborator Author

SW control core 0 only, other core is controlled by fw, so it is not related to driver.

@RanderWang
Copy link
Collaborator Author

a simple debug found that the second core was enabled but exception happen when it process idc interrupt.

@mszleszy
Copy link

Confirmed in firmware-automated-testing.fat framework.
I'm passing logs and repro instruction to dev team

@RanderWang
Copy link
Collaborator Author

RanderWang commented Mar 20, 2023

SOF and Zephyr PRs are under review:
#7174
zephyrproject-rtos/zephyr#55182

@lgirdwood lgirdwood added this to the v2.6 milestone Mar 21, 2023
@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Mar 22, 2023

Another defect fix PR that may related to this problem
#7322

@wszypelt wszypelt self-assigned this Mar 31, 2023
@wszypelt
Copy link

wszypelt commented Apr 14, 2023

@RanderWang the potential fix is ready, does this issue still reproduces?
Can you confirm and close?

@RanderWang
Copy link
Collaborator Author

RanderWang commented Apr 18, 2023

@wszypelt which branch ? I pull latest main branch and still got FW panic. test case stream 0,0 on core 0, stream 0,1 on core 1: aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,0 ;aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,1;aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,0 ;aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,1;

[00:00:22.786,053] os: ** FATAL EXCEPTION
[00:00:22.786,060] os: ** CPU 1 EXCCAUSE 9 (load/store alignment)
[00:00:22.786,063] os: ** PC 0xa0042c5d VADDR 0x9
[00:00:22.786,066] os: ** PS 0x60c20
[00:00:22.786,070] os: ** (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:12 CALLINC:2)
[00:00:22.786,090] cbprintf_package: (unsigned) char * used for %p argument. It's recommended to cast it to void * because it may cause misbehavior in certain configurations. String:" ** A0 %p SP %p A2 %p A3 %p" argument:1
[00:00:22.786,095] os: ** A0 0xa0042d3c SP 0xa0099790 A2 0xa00a7940 A3 0x3e8
[00:00:22.786,098] os: ** A4 (nil) A5 (nil) A6 0xa00ac340 A7 0xa00997f0
[00:00:22.786,101] os: ** A8 0x1 A9 0xa00a77c0 A10 0x1 A11 (nil)
[00:00:22.786,105] os: ** A12 0xffffffff A13 (nil) A14 (nil) A15 0x1
[00:00:22.786,108] os: ** LBEG 0xa0032885 LEND 0xa003288b LCOUNT (nil)
[00:00:22.786,111] os: ** SAR 0x20

Backtrace:0xa0042c5a:0xa0099790 0xa0042d39:0xa00997c0 0xa00433d2:0xa00997f0 0xa00467ff:0xa0099860 0xa0064191:0xa0099970 0xa0043c90:0xa00999d0 0xa00354c6:0xa0099a00 0xa0033b23:0xa0099a30

[00:00:22.786,175] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
[00:00:22.786,181] os: Current thread: 0x400a21b8 (unknown)
[00:00:22.788,095] os: Halting system

@wszypelt
Copy link

wszypelt commented Apr 24, 2023

@RanderWang Please correct me if I'm wrong, but wasn't the first problem only with the secondary core? and the above example is related to the changing from main core to secondary?

@RanderWang
Copy link
Collaborator Author

@RanderWang Please correct me if I'm wrong, but wasn't the first problem only with the secondary core? and the above example is related to the changing from main core to secondary?

yes, you are right. @wszypelt do you suggest create another issue for core switch ?

@wszypelt
Copy link

@RanderWang yes, in this case I would like to ask you to create a new issue if possible

@RanderWang
Copy link
Collaborator Author

@wszypelt I tested the following cmd on core 1 : speaker-test -c 2 -Dhw:0,1 -d 5;speaker-test -c 2 -Dhw:0,1 -d 5; The FW panic can happen easily. The issue still valid

@wszypelt
Copy link

wszypelt commented Apr 28, 2023

Reproduction also occurs in our tests:
pytest -s "tests/conv/fw_000_basic/test_01_multicore_host.py::TestMulticoreHostTopology::test_000_01_pipeline_on_secondary_core_stress[16000Hz_16in16bit_2ch-core_1]"

@lgirdwood lgirdwood modified the milestones: v2.6, v2.7 May 10, 2023
@RanderWang
Copy link
Collaborator Author

RanderWang commented May 12, 2023

MTL-005 is failed with core 0, core 1 switch

Test recipe:
• Kernel: sof-dev/6694b85ddceae (CPC fixes included)
• SOF: mtl-005-drop-stable/36a7a817fb8
• TPLG: mtl-005-drop-stable + PR#7582

@RanderWang
Copy link
Collaborator Author

My test script. It is failed in a very few loops in most time

root@sh-mtlp-rvp-nocodec-04:~# ./sleep.sh 
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
aplay: set_params:1416: Unable to install hw params:
ACCESS:  RW_INTERLEAVED
FORMAT:  S16_LE
SUBFORMAT:  STD
SAMPLE_BITS: 16
FRAME_BITS: 32
CHANNELS: 2
RATE: 48000
PERIOD_TIME: (85333 85334)
PERIOD_SIZE: 4096
PERIOD_BYTES: 16384
PERIODS: 4
BUFFER_TIME: (341333 341334)
BUFFER_SIZE: 16384
BUFFER_BYTES: 65536
TICK_TIME: 0

#!/bin/bash

for i in $(seq 1 100)
do
	aplay ~/48K_Let_It_Go.wav -d 1;
	if [ $? != 0 ]; then
		break;
	fi

	aplay ~/48K_Let_It_Go.wav -d 1 -Dhw:0,1;
	if [ $? != 0 ]; then
		break;
	fi

	aplay ~/48K_Let_It_Go.wav -d 1 -Dhw:0,2;
	if [ $? != 0 ]; then
		break;
	fi
done

@RanderWang
Copy link
Collaborator Author

Error always happens with SET_DX for core 1 or core 2

[ 540.062143] snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-mtl 0000:00:1f.3: copier copier.SSP.4.1, IPC size is 188
[ 540.062152] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 540.563706] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[ 540.563782] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 540.563842] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0x47000000|0x0|0x0, target: 0x1b060000|0x0|0x0, ctl: 0x3
[ 540.563941] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[ 540.563985] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 540.564032] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[ 540.564064] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[ 540.564116] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0x5, ROM error: 0x0
[ 540.564395] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[ 540.564499] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit enabled
[ 540.564544] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 540.564619] Message payload: 00000000: 00000002 00000002
[ 540.564626] sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed to enable target core for widget pipeline.3
[ 540.564683] sof-audio-pci-intel-mtl 0000:00:1f.3: Failed to set up connected widgets
[ 540.564743] sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed widget list set up for pcm 1 dir 0
[ 540.564800] sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_hw_params on 0000:00:1f.3: -110
[ 540.564926] Port1: ASoC: error at __soc_pcm_hw_params on Port1: -110
[ 540.564989] Port1: ASoC: error at dpcm_fe_dai_hw_params on Port1: -110

@RanderWang
Copy link
Collaborator Author

If the core2 is not involved in the test, sometimes the test can pass, sometimes it is failed after 30+ cycles

31
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
32
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
33
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
aplay: set_params:1416: Unable to install hw params:
ACCESS:  RW_INTERLEAVED
FORMAT:  S16_LE
SUBFORMAT:  STD
SAMPLE_BITS: 16
FRAME_BITS: 32
CHANNELS: 2
RATE: 48000
PERIOD_TIME: (85333 85334)
PERIOD_SIZE: 4096
PERIOD_BYTES: 16384
PERIODS: 4
BUFFER_TIME: (341333 341334)
BUFFER_SIZE: 16384
BUFFER_BYTES: 65536
TICK_TIME: 0

@RanderWang
Copy link
Collaborator Author

RanderWang commented May 17, 2023

Made progress: if mtrace is being using, the the stress test can pass. This makes the issue more complex

@RanderWang
Copy link
Collaborator Author

RanderWang commented May 19, 2023

The new updated mtl-005 (f2fec74) makes thing even worse that multicore feature completely can't work now. The original 005 can't pass stress test but now it is failed at the first try

@RanderWang
Copy link
Collaborator Author

I see the latest update for set_pipeline_state, but now topology setting make sure that all the pipelines for one stream are on single core, and different streams uses different core. So the change is no useful now. Now the error happens with SET_DX

@mengdonglin
Copy link
Collaborator

I see the latest update for set_pipeline_state, but now topology setting make sure that all the pipelines for one stream are on single core, and different streams uses different core. So the change is no useful now. Now the error happens with SET_DX

@RanderWang Is the problem you observe on mtl-005 (f2fec74) same as #7649?

@abonislawski
Copy link
Member

abonislawski commented May 19, 2023

@RanderWang on 005 multiple set state will now work on all configurations like primary+2x secondary or only single secondary or only several secondary cores

@RanderWang
Copy link
Collaborator Author

Make great progress, with this change the 005 branch can pass the stress test (the change is based on windows FW). Can any one to check it and improve it ?

diff --git a/soc/xtensa/intel_adsp/ace/multiprocessing.c b/soc/xtensa/intel_adsp/ace/multiprocessing.c
index 0d26fc6139..b966f40236 100644
--- a/soc/xtensa/intel_adsp/ace/multiprocessing.c
+++ b/soc/xtensa/intel_adsp/ace/multiprocessing.c
@@ -124,6 +124,8 @@ void soc_start_core(int cpu_num)
	/* Setting the Power Active bit to the off state before powering up the core. This step is
	 * required by the HW if we are starting core for a second time. Without this sequence, the
	 * core will not power on properly when doing transition D0->D3->D0.
	 */
	DSPCS.capctl[cpu_num].ctl &= ~DSPCS_CTL_SPA;

	/* Checking current power status of the core. */
	while (((DSPCS.capctl[cpu_num].ctl & DSPCS_CTL_CPA) == DSPCS_CTL_CPA)) {
		k_busy_wait(HW_STATE_CHECK_DELAY);
	}

+	DSPCS.bootctl[cpu_num].bctl |= DSPBR_BCTL_WAITIPCG ;

	DSPCS.capctl[cpu_num].ctl |= DSPCS_CTL_SPA;

	/* Waiting for power up */
	while (((DSPCS.capctl[cpu_num].ctl & DSPCS_CTL_CPA) != DSPCS_CTL_CPA) &&
	       (retry > 0)) {
		k_busy_wait(HW_STATE_CHECK_DELAY);
		retry--;
	}

@RanderWang
Copy link
Collaborator Author

RanderWang commented May 19, 2023

@RanderWang on 005 multiple set state will now work on all configurations like primary+2x secondary or only single secondary or only several secondary cores

@abonislawski can you check my comments ? thanks!
@marcinszkudlinski @serhiy-katsyuba-intel @kv2019i @lgirdwood FYI.

@RanderWang
Copy link
Collaborator Author

No such issue on TGL, as we know both cavs & ace share most code, only difference in hw register setting.

@RanderWang
Copy link
Collaborator Author

@abonislawski I got such warning with latest 005-branch

-- Generated new /home/rander/zephyr/build-mtl/zephyr/include/generated/sof_versions.h
[77/105] Performing build step for 'smex_ep'
ninja: no work to do.
[80/105] Building C object modules/sof/CMakeFiles/modules_sof.dir/home/rander/zephyr/sof/src/idc/idc.c.obj
/home/rander/zephyr/sof/src/idc/idc.c:290:8: warning: implicit declaration of function 'set_pipeline_state' is invalid in C99 [-Wimplicit-function-declaration]
        ret = set_pipeline_state(ppl_icd, cmd, &delayed);
              ^
1 warning generated.

@abonislawski
Copy link
Member

@abonislawski can you check my comments ? thanks!

On 005 we enabled clock gating by default (WAITIPCG), looks like it should be disabled in multicore scenario? @tmleman

@abonislawski
Copy link
Member

@RanderWang I looked at this and looks like DSPBR_BCTL_WAITIPCG is already enabled for secondary core in soc_mp_startup(), so the difference is when it is enabled, before or after power up. Potentially this could be our issue.
I moved this registry setup out of soc_mp_startup() to earliest moment in soc_start_core() (like in your test change), please verify this test branch, we can push this to 005 if it will pass
https://github.com/abonislawski/sof/tree/mtl-005-drop-stable
@keqiaozhang please also verify #7649 on this branch

@RanderWang
Copy link
Collaborator Author

@abonislawski With your branch, the stress test will be failed after device reboot, but pass stress test after reloading driver.

@RanderWang
Copy link
Collaborator Author

According to @keqiaozhang test, Current Main branch can pass the stress test without any change.

@abonislawski
Copy link
Member

@RanderWang clock gating is disabled on main branch, you can revert this from 005 and it will work too

@abonislawski
Copy link
Member

Verified today 2 working solutions:

  1. Rander proposal with DSPBR_BCTL_WAITIPCG without removing WAITIPCG soc_mp_startup()
  2. Set DSPBR_BCTL_WAITIPCG for core 0 when more than one core used

For now pushed Rander proposal to my 005 branch but on main we need to develop correct clock gating management.
https://github.com/abonislawski/sof/tree/mtl-005-drop-stable

@RanderWang
Copy link
Collaborator Author

@abonislawski Thanks! I tested for a few hours. 005branch can pass my test script. But I found a new way to reproduce the bug: power off the mtl rvp and boot up it after a few seconds, then I can reproduce it at the first test. I will set up another bug for this behavior. We can close this bug for current state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected IPC4 Issues observed with IPC4 (same IPC as Windows) MTL Applies to Meteor Lake platform mtl-005 multicore Issues observed when not only core#0 is used. P1 Blocker bugs or important features
Projects
None yet
Development

No branches or pull requests

7 participants