
Race condition in control requests #950

@jphickey

Describe the bug
Due to the order of operations during cleanup, the ES global lock is released and then re-acquired:

CFE_ES_UnlockSharedData(__func__,__LINE__);
CFE_ES_ProcessControlRequest(AppPtr);
CFE_ES_LockSharedData(__func__,__LINE__);

The problem is that this provides a window of opportunity for the underlying state to change externally while the global data is unlocked.
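
To illustrate the window in isolation, here is a minimal sketch that uses a plain pthread mutex rather than the ES lock. Every name in it (`slot_t`, `table_lock`, `release_slot`, `cleanup_slot`) is hypothetical and does not come from cFE. It shows one generic way to detect that a record changed while the lock was dropped, by comparing a generation counter after re-acquiring the lock. It is not meant as the actual fix for this issue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an app record; not a cFE type. */
typedef enum { SLOT_IN_USE, SLOT_FREE } slot_state_t;

typedef struct {
    slot_state_t state;
    unsigned     generation; /* bumped on every state change */
} slot_t;

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the task releasing its own record (a "self-exit"). */
static void release_slot(slot_t *slot)
{
    pthread_mutex_lock(&table_lock);
    slot->state = SLOT_FREE;
    slot->generation++;
    pthread_mutex_unlock(&table_lock);
}

/*
 * Cleanup that drops the lock around one step, then checks whether the
 * record changed in the meantime. Returns true if this call performed the
 * cleanup, false if the record was modified during the unlocked window.
 */
static bool cleanup_slot(slot_t *slot, void (*unlocked_step)(slot_t *))
{
    bool cleaned = false;

    assert(slot != NULL);

    pthread_mutex_lock(&table_lock);
    unsigned seen = slot->generation; /* snapshot taken under the lock */

    pthread_mutex_unlock(&table_lock); /* window opens */
    if (unlocked_step != NULL) {
        unlocked_step(slot);           /* anything may touch the slot here */
    }
    pthread_mutex_lock(&table_lock);   /* window closes */

    /* Proceed only if nobody changed the record while it was unlocked. */
    if (slot->generation == seen && slot->state == SLOT_IN_USE) {
        slot->state = SLOT_FREE;
        slot->generation++;
        cleaned = true;
    }

    pthread_mutex_unlock(&table_lock);
    return cleaned;
}
```

Calling `cleanup_slot(&slot, release_slot)` simulates the self-exit landing inside the window: the function returns false and leaves the already-freed record alone instead of cleaning it up a second time.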

To Reproduce
This can happen, for instance, if the task that is being cleaned up calls CFE_ES_ExitApp() while this state machine is also cleaning up the app.
This does happen in practice, because CFE_ES_RunLoop() returns false when an exit request is pending. It is masked only by the fact that most apps are pending on a message receive queue, so they never self-exit; ES deletes them instead.
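
The masking effect can be modeled with a small decision function. This is a toy model under stated assumptions, not cFE code: `model_run_loop` stands in for a run-loop check that returns false once an exit request is pending, and `model_receive_returns` stands in for a message receive that either pends forever or polls with a timeout. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these come from cFE. */
typedef enum {
    RECV_PEND_FOREVER, /* blocks until a message arrives */
    RECV_POLL          /* returns after a timeout even without a message */
} recv_mode_t;

typedef enum {
    OUTCOME_KEEPS_WAITING, /* never re-checks the run loop; must be deleted externally */
    OUTCOME_SELF_EXIT      /* sees the exit request and exits by itself */
} outcome_t;

typedef struct {
    bool exit_requested;    /* set when a control request asks the app to exit */
    bool message_available; /* true if traffic would wake a pending receive */
} model_app_t;

/* Stand-in for the run-loop check: false once an exit request is pending. */
static bool model_run_loop(const model_app_t *app)
{
    return !app->exit_requested;
}

/* Does the receive call hand control back to the app's main loop? */
static bool model_receive_returns(const model_app_t *app, recv_mode_t mode)
{
    return app->message_available || mode == RECV_POLL;
}

/* What the app does after an exit request arrives while it sits in receive. */
static outcome_t model_app_after_exit_request(model_app_t *app, recv_mode_t mode)
{
    assert(app != NULL);

    app->exit_requested = true;

    if (!model_receive_returns(app, mode)) {
        /* Still blocked in receive, so the run-loop check is never reached. */
        return OUTCOME_KEEPS_WAITING;
    }

    /* Back at the top of the main loop: the run-loop check now fails. */
    return model_run_loop(app) ? OUTCOME_KEEPS_WAITING : OUTCOME_SELF_EXIT;
}
```

In this model an app that pends forever with no traffic never reaches the run-loop check, which matches the usual case where ES deletes it. An app that polls, or that happens to receive a message at the wrong moment, reaches the check and self-exits, which is the path that can collide with the cleanup.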

I was able to get CFE to segfault/crash by allowing SAMPLE_APP to exit itself at the very same time that this state machine was also cleaning it up.

Expected behavior
No crashes, proper clean up.

System observed on:
Ubuntu 20.04

Additional context
Due to the ~5 second exit/cleanup delay, this is unlikely to occur "in the wild", but it can easily be forced to happen. In my test I used a slightly modified sample_app that does not pend forever on CFE_SB_RcvMsg and that delays itself so that it self-exits at the same moment the ES background job is running. This segfaults every time.

Reporter Info
Joseph Hickey, Vantage Systems, Inc.
