[BUG] [multi-core] FW panic when stream is tested on core 1 on MTL RVP #7108
Comments
Found a hint in the kernel driver; I will try it first. |
The driver (SW) controls core 0 only; the other cores are controlled by the FW, so this is not related to the driver. |
A simple debug session found that the second core was enabled, but an exception happens when it processes the IDC interrupt. |
Confirmed in firmware-automated-testing.fat framework. |
SOF and Zephyr PRs are under review: |
Another defect-fix PR that may be related to this problem |
@RanderWang The potential fix is ready. Does this issue still reproduce? |
@wszypelt Which branch? I pulled the latest main branch and still got a FW panic. Test case: stream 0,0 on core 0, stream 0,1 on core 1:
aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,0 ;aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,1;aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,0 ;aplay 48K_Let_It_Go.wav -d 5 -Dhw:0,1;
[00:00:22.786,053] os: ** FATAL EXCEPTION
Backtrace:0xa0042c5a:0xa0099790 0xa0042d39:0xa00997c0 0xa00433d2:0xa00997f0 0xa00467ff:0xa0099860 0xa0064191:0xa0099970 0xa0043c90:0xa00999d0 0xa00354c6:0xa0099a00 0xa0033b23:0xa0099a30
[00:00:22.786,175] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1 |
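The backtrace in the log above looks like a list of address pairs. Assuming each pair is PC:SP (program counter first, stack pointer second, which is an assumption and not stated in this thread), a small helper can pull out the first address of each pair so they can be resolved against the firmware ELF. The function name `extract_pcs` is hypothetical.

```shell
#!/bin/bash
# Sketch only: extract_pcs is a hypothetical helper, not part of SOF or Zephyr.
# It prints the first address of every 0x...:0x... pair found after
# "Backtrace:" in the given text, one address per line.
extract_pcs() {
    local line=${1#*Backtrace:}   # drop everything up to and including "Backtrace:"
    local -a pairs
    local pair
    # read -a splits on whitespace without expanding glob characters
    read -r -a pairs <<< "$line"
    for pair in "${pairs[@]}"; do
        case $pair in
            0x*:0x*) printf '%s\n' "${pair%%:*}" ;;   # keep the part before the first ':'
        esac
    done
}
```

The printed addresses could then be passed to the `addr2line` tool of whichever Xtensa toolchain built the firmware, together with the matching ELF file. The exact tool name and ELF path depend on the build setup and are not given in this thread.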
@RanderWang Please correct me if I'm wrong, but wasn't the first problem only with the secondary core? And isn't the above example related to switching from the main core to a secondary core? |
Yes, you are right. @wszypelt, do you suggest creating another issue for the core switch? |
@RanderWang yes, in this case I would like to ask you to create a new issue if possible |
@wszypelt I tested the following command on core 1: speaker-test -c 2 -Dhw:0,1 -d 5;speaker-test -c 2 -Dhw:0,1 -d 5; The FW panic happens easily, so the issue is still valid. |
Reproduction also occurs in our tests: |
MTL-005 fails when switching between core 0 and core 1. Test recipe: |
My test script. Most of the time it fails within a very few loops:
#!/bin/bash
# Stress test: play 1 second on each PCM in turn, up to 100 loops,
# and stop at the first playback that fails.
for i in $(seq 1 100)
do
    aplay ~/48K_Let_It_Go.wav -d 1 || break
    aplay ~/48K_Let_It_Go.wav -d 1 -Dhw:0,1 || break
    aplay ~/48K_Let_It_Go.wav -d 1 -Dhw:0,2 || break
done |
The error always happens with SET_DX for core 1 or core 2:
[ 540.062143] snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-mtl 0000:00:1f.3: copier copier.SSP.4.1, IPC size is 188 |
If core 2 is not involved in the test, the test sometimes passes and sometimes fails after 30+ cycles:
31
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
32
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
33
Playing WAVE '/root/48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
aplay: set_params:1416: Unable to install hw params:
ACCESS: RW_INTERLEAVED
FORMAT: S16_LE
SUBFORMAT: STD
SAMPLE_BITS: 16
FRAME_BITS: 32
CHANNELS: 2
RATE: 48000
PERIOD_TIME: (85333 85334)
PERIOD_SIZE: 4096
PERIOD_BYTES: 16384
PERIODS: 4
BUFFER_TIME: (341333 341334)
BUFFER_SIZE: 16384
BUFFER_BYTES: 65536
TICK_TIME: 0 |
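The runs with and without core 2 differ only in the list of PCM devices, so the stress loop can be sketched with the device list as a parameter. This is a minimal sketch, not the script used in this thread: `stress_loop`, `PLAYER` and `WAV` are hypothetical names, and passing `hw:0,0` explicitly for the first stream is an assumption (the original script used the default device there).

```shell
#!/bin/bash
# Sketch only: PLAYER, WAV and stress_loop are hypothetical names.
PLAYER=${PLAYER:-aplay}
WAV=${WAV:-$HOME/48K_Let_It_Go.wav}

# stress_loop MAX_LOOPS DEVICE...
# Prints the loop number, then plays 1 second on each device in turn.
# Returns 0 if every loop passed, 1 at the first failing playback.
stress_loop() {
    local max=$1
    shift
    local i dev
    for ((i = 1; i <= max; i++)); do
        echo "$i"
        for dev in "$@"; do
            if ! "$PLAYER" "$WAV" -d 1 "-D$dev"; then
                echo "failed in loop $i on $dev" >&2
                return 1
            fi
        done
    done
    return 0
}
```

For example, `stress_loop 100 hw:0,0 hw:0,1` would leave core 2 out of the test, while `stress_loop 100 hw:0,0 hw:0,1 hw:0,2` would include it, assuming the topology maps those PCMs to cores 0, 1 and 2 as described in this thread.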
Made progress: if mtrace is in use, the stress test passes. This makes the issue more complex. |
The newly updated mtl-005 (f2fec74) makes things even worse: the multicore feature does not work at all now. The original 005 could not pass the stress test, but now it fails on the first try. |
I see the latest update for set_pipeline_state, but the topology now ensures that all the pipelines for one stream are on a single core and that different streams use different cores, so the change is not useful now. The error now happens with SET_DX. |
@RanderWang Is the problem you observe on mtl-005 (f2fec74) the same as #7649? |
@RanderWang On 005, multiple set state will now work in all configurations, such as primary + 2x secondary, a single secondary only, or several secondary cores only. |
Made great progress: with this change the 005 branch can pass the stress test (the change is based on the Windows FW). Can anyone check it and improve it?
diff --git a/soc/xtensa/intel_adsp/ace/multiprocessing.c b/soc/xtensa/intel_adsp/ace/multiprocessing.c
index 0d26fc6139..b966f40236 100644
--- a/soc/xtensa/intel_adsp/ace/multiprocessing.c
+++ b/soc/xtensa/intel_adsp/ace/multiprocessing.c
@@ -124,6 +124,8 @@ void soc_start_core(int cpu_num)
/* Setting the Power Active bit to the off state before powering up the core. This step is
* required by the HW if we are starting core for a second time. Without this sequence, the
* core will not power on properly when doing transition D0->D3->D0.
*/
DSPCS.capctl[cpu_num].ctl &= ~DSPCS_CTL_SPA;
/* Checking current power status of the core. */
while (((DSPCS.capctl[cpu_num].ctl & DSPCS_CTL_CPA) == DSPCS_CTL_CPA)) {
k_busy_wait(HW_STATE_CHECK_DELAY);
}
+ DSPCS.bootctl[cpu_num].bctl |= DSPBR_BCTL_WAITIPCG ;
DSPCS.capctl[cpu_num].ctl |= DSPCS_CTL_SPA;
/* Waiting for power up */
while (((DSPCS.capctl[cpu_num].ctl & DSPCS_CTL_CPA) != DSPCS_CTL_CPA) &&
(retry > 0)) {
k_busy_wait(HW_STATE_CHECK_DELAY);
retry--;
} |
@abonislawski can you check my comments ? thanks! |
There is no such issue on TGL. As we know, cAVS and ACE share most of the code; the only difference is in the HW register settings. |
@abonislawski I got this warning with the latest 005 branch
|
On 005 we enabled clock gating by default (WAITIPCG); it looks like it should be disabled in the multicore scenario? @tmleman |
@RanderWang I looked at this, and it looks like DSPBR_BCTL_WAITIPCG is already enabled for the secondary core in soc_mp_startup(), so the difference is when it is enabled: before or after power-up. Potentially this could be our issue. |
@abonislawski With your branch, the stress test fails after a device reboot but passes after reloading the driver. |
According to @keqiaozhang's test, the current main branch can pass the stress test without any change. |
@RanderWang Clock gating is disabled on the main branch; you can revert this from 005 and it will work too. |
Verified two working solutions today:
For now I pushed Rander's proposal to my 005 branch, but on main we need to develop correct clock gating management. |
@abonislawski Thanks! I tested for a few hours, and the 005 branch passes my test script. But I found a new way to reproduce the bug: power off the MTL RVP and boot it up after a few seconds; then I can reproduce it on the first test. I will file another bug for this behavior. We can close this bug in its current state. |
When core 1 is enabled by the topology for stream 1, test the following command: aplay works the first time but fails the second time
root@sh-mtlp-rvp-nocodec-04:~# aplay 48K_Let_It_Go.wav -Dhw:0,1 -d 3;sleep 1;aplay 48K_Let_It_Go.wav -Dhw:0,1 -d 3
Playing WAVE '48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
Playing WAVE '48K_Let_It_Go.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
aplay: pcm_write:2127: write error: Input/output error
No issue on TGL RVP
FW: branch sof main and mtl-003-stable
Kernel: branch sof-dev + thesofproject/linux#4198
kernel_log.txt
trace.txt
Kernel log:
FW log: