
CFE_ES_PerfLogAdd needs better mutual exclusion #456

Closed
jphickey opened this issue Jan 7, 2020 · 24 comments · Fixed by #595
Comments

@jphickey
Contributor

jphickey commented Jan 7, 2020

Describe the bug
The CFE_ES_PerfLogAdd() routine uses OS_IntLock() in an attempt to get exclusive access to a common global data structure to record performance metrics/state. This is insufficient.

The OS_IntLock function is generally not implemented on SMP, and even if it is, it probably only affects the current core. Either way, it will not provide exclusivity, because the other cores can still access the data even when interrupts are disabled.

This function is also a no-op in the POSIX OSAL.

To Reproduce
Enable performance monitoring on a POSIX system and observe that samples occasionally appear in the log out of order or otherwise corrupted. This is likely due to concurrent writes to the same entry caused by insufficient locking.

Expected behavior
The function should use some form of primitive that actually does provide exclusivity between threads (such as a mutex/spinlock) and not an interrupt lock.

Code snips

/* disable interrupts to guarentee exclusive access to the data structures */
IntFlags = OS_IntLock();
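
For illustration only, a minimal sketch of the mutex-guarded alternative described above, using the standard OSAL mutex calls (the mutex ID name here is hypothetical and would be created during ES initialization):

/* Sketch: replace the interrupt lock with a mutex created at ES init.
   "PerfLogMutex" is a hypothetical ID, not the final implementation. */
OS_MutSemTake(PerfLogMutex);      /* excludes other tasks, even on SMP */
/* ... update the shared performance log entry ... */
OS_MutSemGive(PerfLogMutex);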

System observed on:
Ubuntu 18.04 LTS 64-bit

Reporter Info
Joseph Hickey, Vantage Systems, Inc.

@skliper skliper added the bug label Jan 7, 2020
@skliper skliper added this to the 6.8.0 milestone Jan 7, 2020
@jphickey
Contributor Author

jphickey commented Mar 3, 2020

Topic for CCB discussion:

This is the tradeoff -- do we place higher value on the constant-time, nonblocking behavior, or on memory efficiency? And how does the CCB feel about dynamic allocation?

Background:

Historically, CFE_ES_PerfLogAdd() has been used within realtime tasks to track performance issues. As such, it is important that it runs in a deterministic (constant-time) fashion so as not to interfere with the realtime tasks it is recording.

If we simply change to a conventional OSAL mutex, the performance log function is no longer deterministic: if two threads call it at the same time, one will block.

One option to preserve the non-blocking nature of the performance log is to give each task a dedicated block of entries in the log, as sketched below. Writes would be fast and efficient, but the major downside is that this is a much less efficient use of memory: we either have to do some dynamic allocation based on how often the task uses this API, or we divide the log into fixed-size segments, which means tasks that don't even use the API are allocated memory that goes to waste. Or we could add a new API so that a task can control the size of its performance log at runtime (this could leverage the ES mempool API, not necessarily "malloc").
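
A rough sketch of the per-task segment idea (the segment size and names are assumptions; the per-entry type is shown as the existing CFE_ES_PerfDataEntry_t):

/* Sketch: each task owns a fixed slice of the log, so CFE_ES_PerfLogAdd()
   writes only to its caller's slice and needs no lock. Names and sizes are
   illustrative, not a proposed implementation. */
#define PERF_ENTRIES_PER_TASK  64

typedef struct
{
    uint32                 WriteIndex;                        /* private to the owning task */
    CFE_ES_PerfDataEntry_t Entries[PERF_ENTRIES_PER_TASK];    /* dedicated entry block */
} PerfLogTaskSegment_t;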

Looking for input from CCB.

@skliper
Contributor

skliper commented Mar 3, 2020

And how does the CCB feel about dynamic allocation?

Dynamic allocation after initialization is against GSFC coding standard.

@jphickey
Contributor Author

jphickey commented Mar 3, 2020

Dynamic allocation after initialization is against GSFC coding standard.

This would be ES memory pools, not malloc. Note that Software Bus and Table Services both currently allocate buffers from ES pools after initialization.

That being said, preferably this would happen during initialization, as part of task creation, not afterward; otherwise you would lose the constant-time behavior, which is the whole point. During init, a task would indicate how large a performance monitoring buffer it needs, and the memory for that buffer would be pulled out of a pool at that time.

The biggest issue with that approach is of course that all existing tasks would have to be updated and/or be issued a minimal buffer by default.
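
For what it's worth, a sketch of pulling a per-task buffer from an ES pool at creation time might look like the following (the pool handle, entry count, and error handling are illustrative, and the exact CFE_ES_GetPoolBuf signature varies between cFE versions):

/* Sketch: issue a per-task perf buffer from an ES memory pool during init */
uint32 *TaskPerfBuf = NULL;
int32   Status;

Status = CFE_ES_GetPoolBuf(&TaskPerfBuf, PerfPoolHandle,
                           RequestedEntries * sizeof(CFE_ES_PerfDataEntry_t));
if (Status < 0)
{
    /* no buffer issued -- this task's perf log calls would become no-ops */
}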

@jphickey
Contributor Author

jphickey commented Mar 3, 2020

Alternatively, instead of using an ES memory pool, the app could theoretically declare its own storage buffer (however big it wants) as a global and then register it with ES as part of startup.

That might be a viable alternative?
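
As a sketch of what that could look like from the app side (CFE_ES_PerfLogRegisterBuffer is a hypothetical API used only for illustration, and the buffer size is arbitrary):

/* Sketch: app-declared perf log storage, registered with ES at startup.
   CFE_ES_PerfLogRegisterBuffer() is hypothetical -- no such API exists today. */
static CFE_ES_PerfDataEntry_t MY_APP_PerfBuffer[128];

int32 MY_APP_Init(void)
{
    return CFE_ES_PerfLogRegisterBuffer(MY_APP_PerfBuffer, sizeof(MY_APP_PerfBuffer));
}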

@skliper
Contributor

skliper commented Mar 3, 2020

This would be ES memory pools...

Copy, memory pool is fine. An app declaring its own storage seems fine also, just need a way to query for reporting.

Or is there a way to handle allocating and reporting atomically to avoid the race conditions? For example: atomically tag an entry with the task ID (or whatever), have tasks skip over tagged entries, and to avoid a race when reporting, set a field that makes task calls skip processing (this assumes task processing is higher priority than reporting)? Or something similar?

@skliper
Contributor

skliper commented Mar 3, 2020

just need a way to query for reporting.

Nevermind, register makes sense.

@skliper
Contributor

skliper commented Mar 3, 2020

I like your local storage idea the more I think about it.

@jphickey
Contributor Author

jphickey commented Mar 3, 2020

Atomically tag an entry with the task ID (or whatever), tasks should skip over tagged entries

Any single-buffer (shared memory) architecture would require a read-compare-write operation somewhere, which needs to be atomic for thread safety. That means either a mutex or an atomic compare-and-swap routine, which was not standardized until C11.
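
For reference, a minimal sketch of the C11 route, assuming a C11-capable toolchain (names are illustrative; this uses atomic_fetch_add to reserve a slot, which is a simpler primitive than a full compare-and-swap but serves the same purpose here):

#include <stdatomic.h>
#include <stdint.h>

#define PERF_LOG_DEPTH 512u             /* illustrative depth */

static atomic_uint PerfLogWriteIndex;   /* shared next-slot counter */

/* Two concurrent callers get distinct indices, so they never write the
   same entry; only the index reservation itself needs to be atomic. */
static uint32_t PerfLog_ReserveSlot(void)
{
    return atomic_fetch_add(&PerfLogWriteIndex, 1u) % PERF_LOG_DEPTH;
}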

I like your local storage idea the more I think about it.

The biggest issue with local storage is that it is a breaking change without much of a backward compatibility option. Currently apps just register themselves (via CFE_ES_RegisterApp) and start calling CFE_ES_PerfLogAdd() at will.

With a mempool-based approach, apps could at least be issued a minimal performance buffer to (somewhat?) preserve the existing performance log behavior if they don't make any changes.

With a local allocation, this would require an extra registration step for each of the perf IDs that an app plans on using, which could be considered a good thing. If the app does not register the perf ID, then log entries are silently dropped because there is nowhere to store them.

Maybe that's an OK tradeoff?

In the end it just means that Perf Log operations become no-ops until the app is updated to register its own perf log buffer.

@skliper
Contributor

skliper commented Mar 3, 2020

silently dropped

Can we make this - silently dropped unless debug print is enabled?

@jphickey
Contributor Author

Getting further into this code, I noted that the performance log data is actually stored within the "PSP Reset Area", which is really just a block of memory provided by the PSP that is (theoretically) supposed to be maintained across a processor reset. Obviously, this won't be the case if each app provides its own memory.

Going back to the CFE requirements, this log seems to be covered by cES1021 and cES1022, neither of which indicates any requirement for the performance analyzer log to be preserved across a processor reset.

Should there be any sort of requirement for preservation of this log in the context of a processor reset? (i.e. do we need to add a requirement)

-OR-

Can we move the structure to regular RAM and just accept that it will be purged by a processor reset?

ALSO NOTE - I don't see any guarantee that the underlying system tick counter provided by the PSP will be consistent across a reset. In fact most likely it is not. Therefore perf samples acquired prior to a reset will not be comparable to samples acquired after a reset, even if the log itself is preserved. IMO this substantially reduces the value of trying to save the log, if it isn't really usable data after the reset.

@jphickey
Contributor Author

Looking for input from @jwilmot and @acudmore - can you provide insight on whether it is important to store the performance log in the reset area memory provided by the PSP? Or can we move it to normal RAM?

The data in the log will still be comparable to other samples in the log before the reset, so it does have some value in determining what happened before the processor reset occurred, even if new data that goes into the log is not comparable.

We are also changing the exception handling, so the other possibility is that the PSP simply takes a snapshot of the log when an exception occurs. That would still preserve it, and would be better as it keeps it isolated from new data coming in.

Please advise on what you think is the best way forward!
Thanks

@skliper
Contributor

skliper commented Mar 11, 2020

My two cents - remove this undocumented functionality (there is no requirement to store performance data across resets).

@jphickey
Contributor Author

That approach would certainly be simpler, but I do see value in at least having an interface where the PSP could capture and store the data as part of its exception handling. It could be valuable info when debugging why an unexpected exception/reset occurred.

I assume that's the reason it was put into the reset area in the first place, although I'm not convinced it is the best way to accomplish that goal (assuming that was the goal).

@skliper
Contributor

skliper commented Mar 12, 2020

I don't think it's the best way, and it doesn't justify the extra complexity. Trace/context, sure... performance? Questionable.

@jphickey
Contributor Author

Performance data is also context data, which is why I think it could be very relevant when debugging a failure. The data will show (for instance) if two tasks attempted to run at the same time and/or if one preempted the other prior to the exception/reset occurring.

This is why I think it could be very useful and beneficial to at least have a method to capture the data as part of exception handling.

@jphickey
Contributor Author

Update - While I was initially gravitating toward giving each task its own dedicated lockless memory block for logging, this creates some challenges in implementing the "trigger" semantics, which need to operate at a global scope across all tasks.

Although I think it is still possible to do this by implementing a background task to merge all the realtime task data and handle the trigger logic, it will be considerably more complex than just using a mutex in the perf data log and implementing the logic in-line as it is today, so I'm not convinced it meets the cost-benefit tradeoff.

So - my new thinking is to lean toward simplicity and use a mutex. The big downside is that, when using a mutex, the CFE_ES_PerfLogAdd function may now invoke a task/context switch, so it can actually interfere with the realtime behavior of the process being monitored. However, none of this happens unless performance logging is enabled (i.e. when the perf log state is idle, no locking occurs).
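
A rough sketch of that shape (the function signature matches the existing API; the mutex ID, state check, and field names are illustrative rather than the final implementation):

void CFE_ES_PerfLogAdd(uint32 Marker, uint32 EntryExit)
{
    if (Perf->MetaData.State == CFE_ES_PERF_IDLE)
    {
        return;                      /* logging disabled: no lock taken, no blocking */
    }

    OS_MutSemTake(PerfLogMutex);     /* may block/context-switch while logging is active */
    /* ... existing trigger logic and entry write ... */
    OS_MutSemGive(PerfLogMutex);
}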

Still looking for feedback as to whether the user community would consider this an issue.

@skliper
Contributor

skliper commented Mar 17, 2020

I prefer the mutex. We can document it as part of the limitations; historically I've found this good enough for general software timing in the majority of cases. There are other, hardware-specific mechanisms projects can use if they need more accurate numbers (e.g. twiddling a register to control an I/O line for a logic analyzer) if what we provide isn't sufficient. I don't think we need to implement anything heroic.

edit - bad english.

@jphickey
Contributor Author

OK - will proceed with the mutex-based approach, which is a fairly straightforward change.

@ghost

ghost commented Mar 25, 2020

I agree that the mutex based approach works.
I'll check with PACE, but he might have used spin locks on the PACE SMP work. That is not available to us now, but would that be an option when we add SMP capability? Would that prevent a context switch?

As for preserving the log: several GSFC missions have not used the performance log in flight, only for performance analysis during development. Is it worth a poll to ask the community?

@jphickey
Contributor Author

Yes, a spinlock is better for situations like this, where multiple tasks may try to access a shared resource but only for short periods of time (i.e. atomic read-modify-write ops) and simply re-trying the operation a second time is likely to avoid the conflict.

Someday if/when OSAL gets a spinlock API, that may be a good alternative. Just have to make sure whatever spinlock is used falls back to sane behavior if/when running on a single core processor or the CFE is locked to a single core.
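
If/when such an API exists, a portable building block could be C11's atomic_flag; a minimal sketch (not an existing OSAL call) follows. On a single-core target it would need to degrade to a mutex or a yield rather than busy-waiting against a task that cannot run.

#include <stdatomic.h>

/* Minimal spinlock sketch (hypothetical -- OSAL has no spinlock API today). */
static atomic_flag PerfSpinLock = ATOMIC_FLAG_INIT;

static inline void PerfSpin_Lock(void)
{
    while (atomic_flag_test_and_set_explicit(&PerfSpinLock, memory_order_acquire))
    {
        /* busy-wait: the critical section is only a few instructions long */
    }
}

static inline void PerfSpin_Unlock(void)
{
    atomic_flag_clear_explicit(&PerfSpinLock, memory_order_release);
}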

jphickey added a commit to jphickey/cFE that referenced this issue Apr 3, 2020
Use a normal mutex for access control to the perf log structure
Perform file write in background using a dedicated background task
@astrogeco
Contributor

@jphickey do you think this fix will be ready soon? I'm debating whether to wait for it or go ahead with the integration merge without nasa/osal#396

@jphickey
Contributor Author

jphickey commented Apr 7, 2020

Just got some initial buy-in on the WIP branches and I need to complete the unit test updates to match the proposed implementation.

It should be ready by the next CCB (4/15).

@jphickey
Contributor Author

jphickey commented Apr 7, 2020

My recommendation would be to push the current integration candidate and accept that it does not build with the omit-deprecated configuration.

This is why other CI systems let one "quarantine" certain failures that we know are temporarily broken but we still want to watch for additional failures. We can't do that, so it is what it is.

@skliper
Contributor

skliper commented Apr 7, 2020

Optionally, wrap the interrupt lock calls in the deprecated define (or just remove them) to fix the build in the short term... it's what I did to test, and the performance log has race issues either way until it is fully fixed by this change.
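
A sketch of that short-term workaround (the guard name shown is an assumption standing in for whatever the actual omit-deprecated switch is):

#ifndef CFE_OMIT_DEPRECATED_6_7           /* assumed guard name, for illustration */
    IntFlags = OS_IntLock();              /* only compiled when deprecated APIs are kept */
#endif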
