Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpu/cortexm_common: replace irq_restore by __set_PRIMASK for stm32l152re #11919

Merged
merged 1 commit into from
Aug 5, 2019

Conversation

fjmolinas
Copy link
Contributor

@fjmolinas fjmolinas commented Jul 25, 2019

Contribution description

This PR was split from #11830.

When #7385 was introduced when going to sleep (i.e. calling __WFI()), irq_enable was changed to irq_restore and that broke stm32l152re.

#8518 fixed the issue by introducing a __NOP() after __WFI(). This fixed the issue until #11159 where the call to cortexm_sleep was changed and now __NOP() didn't fix the issue but instead somehow triggered it.

After debugging in #11820 it was discovered that the changes in #7385 didn't actually break the code, but broke the code only when DBGMCU_CR_DBG_STANDBY | DBGMCU_CR_DBG_STOP | DBGMCU_CR_DBG_SLEEP where enabled. By default openocd sets these bits after an examine-end event. This is done by default for all stm32 boards.

https://github.com/ntfreak/openocd/blob/263deb3802a515ba8155b6c59146f0f539de4e43/tcl/target/stm32l1.cfg#L93-L100

This could be fixed by 510ffce so those bits are set only when gdb is launched. No need to enable debugging in other cases. When debugging this issue was still present.

Multiple changes/fixes where attempted in #11820 to fix the issue. Non of them sufficiently consistent or un-intrusive.

As for all I can tell, and as stated in #8518, the issue was the loss of the content of r0 when jumping to irq_restore which resulted in a corrupted pc trying to execute instructions out of the flash region.

2019-07-09 17:38:58,314 - INFO # Attempting to reconstruct state for debugging...
2019-07-09 17:38:58,315 - INFO # In GDB:
2019-07-09 17:38:58,316 - INFO #   set $pc=0x7822d5e
2019-07-09 17:38:58,317 - INFO #   frame 0
2019-07-09 17:38:58,317 - INFO #   bt
2019-07-09 17:38:58,318 - INFO # 
2019-07-09 17:38:58,319 - INFO # ISR stack overflowed by at least 16 bytes.

Since the problem was the branch, with 5d96127 irq_restore was replaced by __set_PRIMASK(state); which inlines the function call avoiding the jump and the whole issue all together.

The disassembly:

 80006e6:       f023 0304       bic.w   r3, r3, #4
 80006ea:       6113            str     r3, [r2, #16]
 80006ec:       f7ff ffac       bl      8000648 <irq_disable>
 80006f0:       f3bf 8f4f       dsb     sy
 80006f4:       bf30            wfi
 80006f6:       f380 8810       msr     PRIMASK, r0
 80006fa:       b33c            cbz     r4, 800074c <pm_set+0x74>
 80006fc:       e8bd 4010       ldmia.w sp!, {r4, lr}
 8000700:       f000 bc82       b.w     8001008 <stmclk_init_sysclk>
 8000704:       4b13            ldr     r3, [pc, #76]   ; (8000754 <pm_set+0x7c>)

Original disassembly:

 80006e4:       6913            ldr     r3, [r2, #16]
 80006e6:       f023 0304       bic.w   r3, r3, #4
 80006ea:       6113            str     r3, [r2, #16]
 80006ec:       f7ff ffac       bl      8000648 <irq_disable>
 80006f0:       f3bf 8f4f       dsb     sy
 80006f4:       bf30            wfi
 80006f6:       bf00            nop
 80006f8:       f7ff ffae       bl      8000658 <irq_restore>
 80006fc:       b33c            cbz     r4, 800074e <pm_set+0x76>
 80006fe:       e8bd 4010       ldmia.w sp!, {r4, lr}
 8000702:       f000 bc85       b.w     8001010 <stmclk_init_sysclk>

Testing procedure

  • To a simple check to verify the issue is fixed.

In master this hardfaults:

make -C tests/xtimer_drift BOARD=nucleo-l152re -j3 flash term

  • For the same command, verify the change robustness:

  • to be safe run python dist/tools/compile_and_test_for_board/compile_and_test_for_board.py --jobs 4 . nucleo-l152re, I got the following tests fails (expected fails)

ERROR:nucleo-l152re:Tests failed: 7
Failures during test:
- [tests/driver_grove_ledbar](tests/driver_grove_ledbar/test.failed)
- [tests/driver_my9221](tests/driver_my9221/test.failed)
- [tests/gnrc_ipv6_ext](tests/gnrc_ipv6_ext/test.failed)
- [tests/gnrc_rpl_srh](tests/gnrc_rpl_srh/test.failed)
- [tests/gnrc_sock_dns](tests/gnrc_sock_dns/test.failed)
- [tests/pkg_fatfs_vfs](tests/pkg_fatfs_vfs/test.failed)
- [tests/pkg_qdsa](tests/pkg_qdsa/test.failed)

Issues/PRs references

Fixes #11820
See also #8518 & #8024

@fjmolinas fjmolinas added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Platform: ARM Platform: This PR/issue effects ARM-based platforms Area: pm Area: (Low) power management labels Jul 25, 2019
@fjmolinas fjmolinas requested review from aabadie, kaspar030 and kYc0o July 25, 2019 16:40
@maribu
Copy link
Member

maribu commented Jul 26, 2019

I'd expect that irq_restore(), irq_disable() and irq_enable() will be functions with a single machine instruction (not taking calling convention boilerplate into account) on all platforms. E.g. if disabling IRQs would be two instructions, an interrupt could occur between those two. I bet, that no architecture wants this.

With that in mind: Wouldn't it be much better to have irq_restore(), irq_disable() and irq_enable() as header only functions that are always inlined? That would get rid of the caller convention boiler plate (and likely safe a few bytes of ROM that way), have less runtime overhead and - as a nice side effect - would solve the issue here without having to handle the stm32l152re differently than the other CPUs.

@fjmolinas
Copy link
Contributor Author

I'd expect that irq_restore(), irq_disable() and irq_enable() will be functions with a single machine instruction (not taking calling convention boilerplate into account) on all platforms. E.g. if disabling IRQs would be two instructions, an interrupt could occur between those two. I bet, that no architecture wants this.

With that in mind: Wouldn't it be much better to have irq_restore(), irq_disable() and irq_enable() as header only functions that are always inlined? That would get rid of the caller convention boiler plate (and likely safe a few bytes of ROM that way), have less runtime overhead and - as a nice side effect - would solve the issue here without having to handle the stm32l152re differently than the other CPUs.

I agree, I had though of it when first making the change. The main reason I decided to push a intermediate change was that irq_restore(), irq_disable() and irq_enable() is a much bigger change affecting many platforms, so will require more testing. Right now stm32l152re has a problem and is kind of broken (unless a cold boot is performed). I wanted to get the fix merged quickly so it is usable again, a more profound change could be done later on.

That being said, how would you have header only functions in this case? irq_restore(), irq_disable() and irq_enable() are part of a generic api. Would you move the generic irq.h header to specific implementations per cpu?

@maribu
Copy link
Member

maribu commented Jul 26, 2019

I wanted to get the fix merged quickly so it is usable again, a more profound change could be done later on.

+1

Would you move the generic irq.h header to specific implementations per cpu?

I'd say it should forward declare them as static inline __attribute__((always_inline)) unsigned irq_disable(void); and give the API documentation there. Then it should include irq_arch.h which has the header-only implementation for the specific architecture.

The forward declaration will allow the compiler to verify that the architecture specific implementation has the correct signature. Also, it will allow to have the doxygen doc written once at the generic header, and not for every implementation.

@maribu maribu added Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines labels Jul 26, 2019
@maribu
Copy link
Member

maribu commented Jul 26, 2019

I do not have the hardware to test, sadly. Otherwise I'd ACK

@fjmolinas
Copy link
Contributor Author

@jia200x the maintainer hardware access wiki says you have a nucleo-l152re, would you mind testing?

@fjmolinas fjmolinas added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Jul 26, 2019
@cladmi
Copy link
Contributor

cladmi commented Jul 29, 2019

Question, I do not see the nop in the "original disassembly", is that normal? Or was it the code to show the crash without it?

As for all I can tell, and as stated in #8518, the issue was the loss of the content of r0 when jumping to irq_restore which resulted in a corrupted pc trying to execute instructions out of the flash region.

I think it was "loss of the content of r0" that, had the consequence to crash when jumping. Maybe I am wrong and missing the right pointer.
I say this as it seems to be written as "loss of r0 because of jumping".

If my previous statement is correct and that r0 is still lost, I think it would mean that the restored value given to PRIMASK may be wrong. or am I missing something?
I am not sure how we managed to show the wrong r0 value, was it with gdb @kYc0o ? the other PR says we do not see it when 'breaking' at some instructions.

@fjmolinas
Copy link
Contributor Author

Question, I do not see the nop in the "original disassembly", is that normal? Or was it the code to show the crash without it?

I think I copied the wrong file disassembly, I posted to correct one now.

@jia200x
Copy link
Member

jia200x commented Jul 30, 2019

@jia200x the maintainer hardware access wiki says you have a nucleo-l152re, would you mind testing?

Sure!

@fjmolinas
Copy link
Contributor Author

Talking to @cladmi IRL he suggested I mention more details on the possible explanation I have for this bug, even if still very unclear...

The faulty scenario

So from debugging output at some point after wake-up the pc gets corrupted and un-reachable instructions gets executed. This only happens when DBGMCU_CR_DBG_STANDBY | DBGMCU_CR_DBG_STOP | DBGMCU_CR_DBG_SLEEP are set. An examples for the debugging output is here:

2019-07-09 17:38:58,314 - INFO # Attempting to reconstruct state for debugging...
2019-07-09 17:38:58,315 - INFO # In GDB:
2019-07-09 17:38:58,316 - INFO #   set $pc=0x7822d5e
2019-07-09 17:38:58,317 - INFO #   frame 0
2019-07-09 17:38:58,317 - INFO #   bt
2019-07-09 17:38:58,318 - INFO # 
2019-07-09 17:38:58,319 - INFO # ISR stack overflowed by at least 16 bytes.

Hints to cause

Looking around in stm32 and cortex-m3 erratas and datasheet. I found a mention of a similar issue with stm32f4 in this ERRATA section 2.1.3. In this errata there are some hints as issue that happen the WFE/WFI are placed at 4 byte alignment and problems with the pref-etch buffer. Although this is a differtent cpu (cortex-m4), it made me snoop around the pref-etch buffer and made me think a similar issue might be happening on cortex-m3

In the case of cortex-m3 the pref-etch buffer can fetch two 32bits instructions or 4 16bits instructions but only in sequential code execution. In our code the PRFTEN and ACC64 are enabled so we are reading 64 bits at a time. It stm32l1xxx reference manual it is stated:

When the code is not sequential (branch), the instruction may not be present neither in the
current instruction line used nor in the prefetched instruction line. In this case, the penalty in
terms of number of cycles is at least equal to the number of Wait States.

This lead me to believe that for some reason when the branch instruction is present it is executing a corrupted pre-fetch buffer instruction, or in other terms an instruction that isn't present. For some reason this only happens when the HCLK and FCLK stays enabled in sleep mode. This might have something to do with different wake-up times since the clock is always enabled for the core?? I wasn't able to find many details of what happens on wake-up, and what could be different when HCLK stays enabled.


In conclusion, I don't really know the reason why the hard fault occurs, why jumping just after WFI causes a hardfault, I only have hints to the reason. But this fix avoids getting into the scenario where the fault may occur, as stm32 erratas would phrase it It is a workaround a faulty scenario, but IMO the less intrusive one, no instructions are added and the code behavior is exactly the same, but avoiding the jump so with fewer instructions instead of more. Also this only happens when DBGMCU_CR_DBG_STANDBY | DBGMCU_CR_DBG_STOP | DBGMCU_CR_DBG_SLEEP is set, which is only set because of openocd.

@cladmi
Copy link
Contributor

cladmi commented Aug 2, 2019

I tried the given test:

  • With the PR:
    • it does not crash
  • with master
    • after flashing with openocd, it crashes as described.
    • after un-powering the board it does not crash anymore.
    • I tried putting 3 NOP as described in the f4 errata and it does not crash

From the sources I think they would advice to put 3 NOPs as workaround, but as gut feeling I can understand how msr PRIMASK, r0 would also fix the issue.
Also putting the price of always doing 3 NOPs for just the after flashing with openocd case is a bit expensive and would not happen in real deployments.

With the references the pull request gives enough context on where to start looking again if this re-appears in the future so think it is fine to merge even without understanding why it works (we did not know when merging the nop).

@STMicroelectronics may know about it?

Copy link
Contributor

@aabadie aabadie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this PR and confirm that it fixes the described problem. I also agree with @cladmi that we can merge this even if we don't fully understand why it fixes the problem.

I found an unrelated change. Please squash it directly and I'll ACK when this is done.

@@ -165,16 +165,16 @@ static inline void cortexm_sleep(int deep)
else {
SCB->SCR &= ~(SCB_SCR_SLEEPDEEP_Msk);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this change is unrelated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

- The __NOP() that was added in RIOT-OS#8518 is now remooved.
- When DBG_STANDBY, DBG_STOP or DBG_SLEEP are set in DBG_CR a hardfault
  occurs on wakeup from sleep. This was first diagnosed in RIOT-OS#8518. When
  enabled, a hardfault occured when returning from a branch to irq_restore()
  we avoid the call by inlining the function call. See RIOT-OS#11830 for more
  details.
@aabadie aabadie added the Reviewed: 3-testing The PR was tested according to the maintainer guidelines label Aug 5, 2019
Copy link
Contributor

@aabadie aabadie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, ACK

@aabadie
Copy link
Contributor

aabadie commented Aug 5, 2019

And go!

@aabadie aabadie merged commit 1026fe3 into RIOT-OS:master Aug 5, 2019
@fjmolinas fjmolinas deleted the pr_fix_11820_2 branch August 7, 2019 15:40
@fjmolinas
Copy link
Contributor Author

@aabadie Thanks for the review! It's great to having nucleo-l152re working again :)

@benpicco
Copy link
Contributor

benpicco commented May 4, 2020

stm32l152re is oddly specific - should this affect all stm32l1x then?

@fjmolinas
Copy link
Contributor Author

fjmolinas commented May 4, 2020

stm32l152re is oddly specific - should this affect all stm32l1x then?

No, see #8518, #11820, #11919 and #8024 for more details. EDIT: at lesast not present on stm32l151 #8518 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: pm Area: (Low) power management CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Platform: ARM Platform: This PR/issue effects ARM-based platforms Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stm32l152re: hard-fault unless power-cycled after flash, or depending on optimization
7 participants