Design document for Crash Reporting feature in MbedOS #8561
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,21 +20,21 @@ | |
|
||
### Revision history | ||
|
||
1.0 - Initial version - Senthil Ramakrishnan - 10/10/2018 | ||
1.0 - Initial version - Senthil Ramakrishnan - 10/22/2018 | ||
|
||
# Introduction | ||
|
||
### Overview and background | ||
|
||
MbedOS currently implements error/exception handlers which get invoked when the system encounters a fatal error/exception. The error handler captures information such as register context, thread info etc., which is valuable for debugging the issue later. This information is currently printed over the serial port, but in many cases the serial port is not accessible and the serial terminal log is not captured, particularly in the case of field-deployed devices. We cannot send this information using mechanisms like the network because the state of the system might be unstable after the fatal error. Thus a different mechanism is needed to record and report this data. So, if we can auto-reboot the system after a fatal error has occurred, without losing the RAM contents where we have the error information collected, we can send this information over the network or other interfaces to be logged externally (e.g. ARM Pelion cloud), or it can even be written to the file system if required. | |
MbedOS currently implements error/exception handlers which get invoked when the system encounters a fatal error/exception. The error handler captures information such as register context, thread info etc., which is valuable for debugging the problem later. This information is currently printed over the serial port, but in many cases the serial port is not accessible and the serial terminal log is not captured, particularly in the case of field-deployed devices. We cannot save this information by sending it over the network or writing it to a file, as the state of the system might be unstable after the fatal error. Thus a different mechanism is needed to record and report this data. The idea here is to auto-reboot the system after a fatal error has occurred to bring the system back to a stable state, without losing the RAM contents where we have the error information collected. We can then save this information reliably, either by logging it externally (e.g. ARM Pelion cloud) or by writing it to the file system. | |
|
||
### Requirements and assumptions | ||
|
||
This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information. | ||
Review comment: Allocated where? When? How?
Reply: @cmonr - This is captured in detail in Detailed Design section. |
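As a rough illustration of the 256-byte budget only (the actual allocation is what the Detailed Design section covers), the region could be modelled as a fixed-size union with a compile-time size check. All type, field and macro names below are hypothetical and are not taken from MbedOS; on a real target the object would also need to be placed in a RAM area that startup code does not initialize, which is not shown here.

```c
#include <stdint.h>

/* Size stated in the requirement above. */
#define EXAMPLE_CRASH_REPORT_RAM_SIZE 256u

/* Hypothetical error context; field names are illustrative only. */
typedef struct {
    int32_t  error_status;
    uint32_t error_address;
    uint32_t thread_id;
    uint32_t reboot_count;
    uint32_t crc;           /* integrity check over the stored data */
} example_error_ctx_t;

/* Hypothetical fault context: core registers saved by a fault handler. */
typedef struct {
    uint32_t r[13];
    uint32_t sp;
    uint32_t lr;
    uint32_t pc;
    uint32_t xpsr;
} example_fault_ctx_t;

/* The raw[] member pins the region to exactly 256 bytes. */
typedef union {
    struct {
        example_error_ctx_t error;
        example_fault_ctx_t fault;
    } report;
    uint8_t raw[EXAMPLE_CRASH_REPORT_RAM_SIZE];
} example_crash_region_t;

/* Fails the build if the contexts ever outgrow the reserved space. */
_Static_assert(sizeof(example_crash_region_t) == EXAMPLE_CRASH_REPORT_RAM_SIZE,
               "crash report contexts must fit in the reserved RAM");
```

The static assertion turns the size requirement into a build-time check, so adding a field that overflows the budget is caught before the code ever runs on a device.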
||
|
||
# System architecture and high-level design | ||
|
||
Below are the high-level design goals for the "Crash Reporting" feature: | |
Below are the high-level goals for the "Crash Reporting" feature: | |
|
||
**Error information collection including exception context** | ||
|
||
|
@@ -54,7 +54,7 @@ During reboot the system should check if the reboot is caused by a fatal error a | |
|
||
**Implementation should provide a mechanism to prevent constant reboot loop by limiting the number of auto-reboots** | ||
|
||
The system should implement a mechanism to track the number of times the system is auto-rebooted and be able to stop auto-rebooting when a configurable limit is reached. The number of times auto-reboot happens should be configurable. | |
The system should implement a mechanism to track the number of times the system has auto-rebooted and be able to stop auto-rebooting when a configurable limit is reached. | |
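A minimal sketch of how such a limit could work, assuming the counter lives in RAM that survives a warm reset. The names `example_reboot_state_t`, `example_should_auto_reboot` and `example_reset_reboot_count` are hypothetical and only illustrate the counting logic, not the MbedOS implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state kept in retained RAM across warm resets. */
typedef struct {
    uint32_t reboot_count;
} example_reboot_state_t;

/*
 * Called from the fatal-error path. Returns true and records the reboot
 * if another auto-reboot is still allowed; returns false once the
 * configured limit has been reached, so the caller can halt instead.
 */
bool example_should_auto_reboot(example_reboot_state_t *state,
                                uint32_t max_reboots)
{
    if (state->reboot_count >= max_reboots) {
        return false;
    }
    state->reboot_count++;
    return true;
}

/* Clears the counter, for example after the report has been handled. */
void example_reset_reboot_count(example_reboot_state_t *state)
{
    state->reboot_count = 0;
}
```

With a limit of 2, the first two fatal errors trigger a reboot and the third leaves the system halted until the counter is reset.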
|
||
**Implementation should provide following configuration options** | ||
Review comment: Open-ended question. How do we expect this to be tested?
Reply: Good question, we are going to test this using Greentea, similar to how the system_test feature is tested. But note that it's still not possible to test the reboot limit mechanism using Greentea; for that I have a test application which I'm using and which is bench tested. The app is at https://github.com/ARMmbed/mbed-os-example-crash-reporting |
||
|
||
|
@@ -69,7 +69,7 @@ The below diagram shows overall architecture of crash-reporting implementation. | |
|
||
 | ||
|
||
As depicted in the above diagram, when the system gets into a fatal error state, the information collected by the error and fault handlers is saved into the RAM space allocated for Crash-Report. This is followed by an auto-reboot triggered from the error handler. On reboot, the initialization routine validates the contents of Crash-Report RAM. This validation serves two purposes: it validates the captured content itself, and it tells the system whether the previous reboot was caused by a fatal error. It then reads this information and calls an application-defined callback function, passing the crash-report information. The callback is invoked just before the entry to main(), and thus the callback implementation may access libraries and other resources, as other parts of the system (like SDK, HAL etc.) have already initialized, or just capture the error information to be acted upon later. | |
As depicted in the above diagram, when the system gets into a fatal error state, the information collected by the error and fault handlers is saved into the RAM space allocated for Crash-Report. This is followed by an auto-reboot triggered from the error handler. On reboot, the initialization routine validates the contents of the Crash-Report space in RAM. This validation serves two purposes: it validates the captured content itself, and it tells the system whether the previous reboot was caused by a fatal error. It then reads this information and calls an application-defined callback function, passing the crash-report information. The callback is invoked just before the entry to main(), and thus the callback implementation may access libraries and other resources, as other parts of the system (like SDK, HAL etc.) have already initialized, or can just capture the error information in application space to be acted upon later. | |
Review comment: At what point is the section of RAM zero'd?
Reply: It can be explicitly zeroed by calling the reboot-error reset APIs described further down in the document, or, if the system goes through a cold reset, it will be left in an uninitialized state. That's why we have the CRC as part of the stored data, to check the integrity of the data.
Review comment: Another silly question. Is this able to live side by side with mbed-trace?
Reply: Yes, of course, this doesn't have any impact on or conflict with mbed-trace. |
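The boot-time check described above can be sketched roughly as follows, assuming a CRC-32 stored alongside the report. The record layout and every `example_` name are hypothetical, chosen only to illustrate the validate-then-callback flow; they are not the MbedOS API. Because retained RAM is left uninitialized after a cold reset, the CRC comparison is what separates a real report from leftover garbage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical crash record; the CRC covers every field before it. */
typedef struct {
    int32_t  error_status;
    uint32_t error_address;
    uint32_t reboot_count;
    uint32_t crc;
} example_crash_report_t;

typedef void (*example_crash_callback_t)(const example_crash_report_t *report);

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t example_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Error-handler side: stamp the record before triggering the reboot. */
void example_seal_report(example_crash_report_t *r)
{
    r->crc = example_crc32((const uint8_t *)r,
                           offsetof(example_crash_report_t, crc));
}

/* Boot side: a matching CRC means the last reboot followed a fatal error. */
bool example_check_on_boot(const example_crash_report_t *r,
                           example_crash_callback_t cb)
{
    uint32_t expected = example_crc32((const uint8_t *)r,
                                      offsetof(example_crash_report_t, crc));
    if (expected != r->crc) {
        return false;   /* cold reset or corrupted contents */
    }
    if (cb != NULL) {
        cb(r);          /* application-defined callback, run before main() */
    }
    return true;
}

/* Sample callback that only copies the report to be acted upon later. */
static example_crash_report_t example_saved_report;
static bool example_report_pending = false;

void example_store_for_later(const example_crash_report_t *report)
{
    example_saved_report = *report;
    example_report_pending = true;
}
```

The sample callback takes the "capture now, act later" route mentioned above; a callback that sends the report over the network would plug into the same hook.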
||
|
||
# Detailed design | ||
|
||
|
Review comment: How are we going to achieve that? I would say that modifying all the linker scripts is not feasible, and asking all the HW vendors to make the change even less so.