
Race condition in control requests #950

Closed
jphickey opened this issue Oct 15, 2020 · 0 comments · Fixed by #960 or #975
@jphickey (Contributor)

Describe the bug
Due to the order of operations during cleanup, the ES global lock is released and then re-acquired:

CFE_ES_UnlockSharedData(__func__,__LINE__);
CFE_ES_ProcessControlRequest(AppPtr);
CFE_ES_LockSharedData(__func__,__LINE__);

The problem is that this provides a window of opportunity for the underlying state to change externally while the global data is unlocked.

To Reproduce
This can happen, for instance, if the task that is being cleaned up calls CFE_ES_ExitApp() while this state machine is also cleaning up the app.
This actually does happen, because CFE_ES_RunLoop() returns false when an exit request is pending. It is just masked by the fact that most apps pend in a message receive queue, so they do not self-exit; ES deletes them instead.

I was able to make CFE crash with a segfault by allowing SAMPLE_APP to exit itself at the same time that this state machine was cleaning it up.
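
For illustration, the interleaving below sketches how the window can be hit. The ordering shown is an assumption pieced together from the description above, not a captured trace.

    ES background task (cleanup)            App's own task
    ----------------------------            --------------
    CFE_ES_UnlockSharedData(...)
                                            CFE_ES_RunLoop() returns false
                                            CFE_ES_ExitApp() updates/releases
                                              the same app record
    CFE_ES_ProcessControlRequest(AppPtr)    AppPtr now refers to a record that
                                              changed underneath the cleanup
    CFE_ES_LockSharedData(...)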

Expected behavior
No crashes; proper cleanup.

System observed on:
Ubuntu 20.04

Additional context
Due to the ~5 second exit/cleanup delay, this is unlikely to occur "in the wild", but it can easily be forced. In my test I used a slightly modified sample_app that does not pend forever in CFE_SB_RcvMsg and delays itself so that it self-exits at the exact time the ES background job is running; this segfaults reliably every time.
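
For reference, here is a minimal sketch of the kind of modification described above, assuming the stock sample_app structure (SAMPLE_AppMain, a command pipe created during init, and CFE_SB_RcvMsg). The timeout and delay values are illustrative, not the exact numbers used in the test.

    #include "cfe.h"

    /* Sketch only - not the exact app used to reproduce the crash.
     * MsgPtr and CommandPipe come from the usual sample_app init,
     * which is omitted here. */
    void SAMPLE_AppMain(void)
    {
        CFE_SB_MsgPtr_t MsgPtr;
        CFE_SB_PipeId_t CommandPipe;
        uint32          RunStatus = CFE_ES_RunStatus_APP_RUN;

        /* ... normal registration, pipe creation, subscriptions ... */

        while (CFE_ES_RunLoop(&RunStatus))
        {
            /* Finite timeout instead of CFE_SB_PEND_FOREVER, so the loop
             * wakes up and notices a pending exit request from ES. */
            CFE_SB_RcvMsg(&MsgPtr, CommandPipe, 500);
        }

        /* Delay so the self-exit lands while the ES background cleanup
         * is processing the delete request for this app. */
        OS_TaskDelay(4000);

        CFE_ES_ExitApp(RunStatus);
    }

Timed this way, the app's own CFE_ES_ExitApp() call collides with the ES cleanup pass described above.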

Reporter Info
Joseph Hickey, Vantage Systems, Inc.

@jphickey jphickey self-assigned this Oct 15, 2020
@jphickey jphickey added the bug label Oct 15, 2020
jphickey added a commit to jphickey/cFE that referenced this issue Oct 19, 2020
Because the process of handling a control request involves
calling other subsystems, the ES lock needs to be released.
However, this also means that the app record can change state
for other reasons, such as the app self-exiting at the same
time.

To avoid this possibility, process in two phases:

First assemble a list of tasks that have timed out
and need to be cleaned up, while ES is locked.

Next actually perform the cleanup, while ES is unlocked.
In areas during cleanup that need to update the ES global,
the lock is locally re-acquired and released.
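
As a rough illustration of that two-phase pattern (a sketch, not the exact code from the linked commits; CFE_ES_AppNeedsCleanup() and the local list are hypothetical, while the lock calls, the app table, and CFE_ES_ProcessControlRequest() are the ES internals already named in this issue):

    /* Phase 1: while holding the ES lock, snapshot which app records
     * currently need a control/cleanup action. */
    CFE_ES_AppRecord_t *CleanupList[CFE_PLATFORM_ES_MAX_APPLICATIONS];
    uint32              NumApps = 0;
    uint32              i;

    CFE_ES_LockSharedData(__func__, __LINE__);
    for (i = 0; i < CFE_PLATFORM_ES_MAX_APPLICATIONS; ++i)
    {
        CFE_ES_AppRecord_t *AppPtr = &CFE_ES_Global.AppTable[i];

        if (CFE_ES_AppNeedsCleanup(AppPtr))   /* hypothetical predicate */
        {
            CleanupList[NumApps] = AppPtr;
            ++NumApps;
        }
    }
    CFE_ES_UnlockSharedData(__func__, __LINE__);

    /* Phase 2: perform the cleanup with the lock released; any step that
     * must touch the ES global re-acquires the lock locally. */
    for (i = 0; i < NumApps; ++i)
    {
        CFE_ES_ProcessControlRequest(CleanupList[i]);
    }

The actual fix may track apps by ID rather than by record pointer so that a stale pointer is never dereferenced after the state changes; the sketch keeps the pointer form from the snippet at the top of this issue for brevity.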
jphickey added a commit to jphickey/cFE that referenced this issue Oct 21, 2020
(same commit message as above)
@jphickey jphickey linked a pull request Oct 21, 2020 that will close this issue
@astrogeco astrogeco added this to the 7.0.0 milestone Oct 21, 2020
jphickey added a commit to jphickey/cFE that referenced this issue Oct 26, 2020
(same commit message as above)