Design document for Crash Reporting feature in MbedOS #8561

Merged
merged 14 commits into from
Nov 22, 2018

Fixed sentences and context
SenRamakri committed Nov 19, 2018
commit c3d2c446d02e323d762ee7bcc4ed97c7b0d7aa7e
@@ -20,21 +20,21 @@

### Revision history

-1.0 - Initial version - Senthil Ramakrishnan - 10/10/2018
+1.0 - Initial version - Senthil Ramakrishnan - 10/22/2018

# Introduction

### Overview and background

-MbedOS currently implements error/exception handlers which gets invoked when the system encounters a fatal error/exception. The error handler capture information such as register context/thread info etc and these are valuable information required to debug the issue later. This information is currently printed over the serial port, but in many cases the serial port is not accessible and the serial terminal log is not captured, particularly in the case of field deployed devices. We cannot send this information using mechanisms like Network because the state of the system might be unstable after the fatal error. And thus a different mechanism is needed to record and report this data. So, if we can auto-reboot the system after a fatal error has occurred, without losing the RAM contents where we have the error information collected, we can send this information over network or other interfaces to be logged externally(E.g:- ARM Pelion cloud) or can even be written to file system if required.
+MbedOS currently implements error/exception handlers which get invoked when the system encounters a fatal error/exception. The error handler captures information such as register context and thread info, which is valuable for debugging the problem later. This information is currently printed over the serial port, but in many cases the serial port is not accessible and the serial terminal log is not captured, particularly in the case of field-deployed devices. We cannot save this information by sending it over the network or writing it to a file, as the state of the system might be unstable after the fatal error. Thus a different mechanism is needed to record and report this data. The idea here is to auto-reboot the system after a fatal error has occurred to bring the system back to a stable state, without losing the RAM contents where the error information is collected. We can then reliably save this information by logging it externally (e.g. ARM Pelion cloud) or writing it to the file system.

### Requirements and assumptions

This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.
Member

How are we going to achieve that? I would say that modifying all the linker scripts is not feasible, and asking all the HW vendors to make the change even less so.

Contributor

Allocated where? When? How?
A single line for requirements and assumptions feels really light to me.

Contributor Author

@cmonr - This is captured in detail in the Detailed Design section.


# System architecture and high-level design

-Below are the high-level design goals for "Crash Reporting" feature:
+Below are the high-level goals for the "Crash Reporting" feature:

**Error information collection including exception context**

@@ -54,7 +54,7 @@ During reboot the system should check if the reboot is caused by a fatal error a

**Implementation should provide a mechanism to prevent a constant reboot loop by limiting the number of auto-reboots**

-System should implement mechanism to track number of times the system is auto-rebooted and be able to stop auto-reboot when a configurable limit is reached. The number of times auto-reboot happens should be configurable.
+The system should implement a mechanism to track the number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.
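
The reboot-limit requirement above can be sketched roughly as follows. This is only an illustration under stated assumptions: `MAX_AUTO_REBOOTS`, `crash_report_counter_t` and `auto_reboot_allowed()` are hypothetical names that do not appear in the design document, and the counter is assumed to live in the RAM region that survives the warm reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical configuration value; the real option name is not given in this excerpt. */
#define MAX_AUTO_REBOOTS 3u

/* Hypothetical stand-in for a counter kept in the retained Crash-Report RAM. */
typedef struct {
    uint32_t reboot_count;
} crash_report_counter_t;

/*
 * Called from the error handler before triggering a reboot.
 * Returns true and counts the attempt while the limit has not been reached;
 * returns false once the limit is hit, so the caller can halt instead.
 */
bool auto_reboot_allowed(crash_report_counter_t *ctr)
{
    if (ctr->reboot_count >= MAX_AUTO_REBOOTS) {
        return false;
    }
    ctr->reboot_count++;
    return true;
}
```

With a limit of 3, the first three fatal errors would each trigger a reboot and the fourth would leave the system halted, which is what breaks a constant reboot loop.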

**Implementation should provide the following configuration options**
Contributor

Open-ended question. How do we expect this to be tested?

Contributor Author

Good question. We are going to test this using Greentea, similar to how the system_test feature is tested. Note, however, that it's still not possible to test the reboot limit mechanism using Greentea; for that I am using a test application that has been bench tested. The app is at https://github.com/ARMmbed/mbed-os-example-crash-reporting
This app will also serve as the example application for this feature.


@@ -69,7 +69,7 @@ The below diagram shows overall architecture of crash-reporting implementation.

![System architecture and component interaction](./diagrams/crash-report1.jpg)

-As depicted in the above diagram, when the system gets into fatal error state the information collected by error and fault handlers are saved into RAM space allocated for Crash-Report. This is followed by a auto-reboot triggered from error handler. On reboot the the initialization routine validates the contents of Crash-Report RAM. This validation serves two purposes - to validate the captured content itself and also it tells the system if the previous reboot was caused by a fatal error. It then reads this information and calls an application defined callback function passing the crash-report information. The callback is invoked just before the entry to main() and thus the callback implementation may access libraries and other resources as other parts of the system have already initialized(like SDK, HAL etc) or just capture the error information to be acted upon later.
+As depicted in the above diagram, when the system gets into a fatal error state, the information collected by the error and fault handlers is saved into the RAM space allocated for Crash-Report. This is followed by an auto-reboot triggered from the error handler. On reboot, the initialization routine validates the contents of the Crash-Report space in RAM. This validation serves two purposes: it validates the captured content itself, and it tells the system whether the previous reboot was caused by a fatal error. It then reads this information and calls an application-defined callback function, passing the crash-report information. The callback is invoked just before the entry to main(), so the callback implementation may access libraries and other resources, as other parts of the system (like SDK, HAL etc.) have already been initialized, or it can just capture the error information in application space to be acted upon later.
Contributor

At what point is the section of RAM zero'd?

Contributor Author

It can be explicitly zeroed by calling the reboot-error reset APIs described further down in the document; if the system goes through a cold reset, it will be left in an uninitialized state. That's why we have the CRC as part of the stored data, to verify the integrity of the data.
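
To illustrate the integrity check described in this reply, here is a minimal sketch. It assumes a hypothetical `crash_report_t` layout and hypothetical helper names; the real structure and reset APIs are defined in the Detailed Design section, which is not part of this excerpt. The CRC-32 variant is also an assumption, chosen only for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of the retained region; not the actual Mbed OS structure. */
typedef struct {
    int32_t  error_status;
    uint32_t error_address;
    uint32_t reboot_count;
    uint32_t crc;            /* CRC-32 over all preceding fields */
} crash_report_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), used here only for illustration. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Error handler side: store the CRC after filling in the report. */
void crash_report_seal(crash_report_t *r)
{
    r->crc = crc32_calc((const uint8_t *)r, offsetof(crash_report_t, crc));
}

/* Boot side: a matching CRC is taken to mean the last reboot followed a fatal error. */
bool crash_report_is_valid(const crash_report_t *r)
{
    return r->crc == crc32_calc((const uint8_t *)r, offsetof(crash_report_t, crc));
}

/* Explicit reset: a zeroed region no longer carries a matching CRC. */
void crash_report_reset(crash_report_t *r)
{
    memset(r, 0, sizeof(*r));
}
```

After a cold reset the region holds arbitrary data, so the stored and computed CRC values are very unlikely to match, and the boot code can treat the region as empty without it ever having been zeroed.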

Contributor

Another silly question. Is this able to live side by side with mbed-trace?

Contributor Author

Yes, of course, this doesn't have any impact or conflict with mbed-trace.


# Detailed design
