[Journaling] Thread-safety issue during recovery #9624
lock (_lock)
{
stateMachine.OnRecoveryCompleted();
foreach (var stateMachine in _stateMachines.Values)
This could exhibit the same issue if OnRecoveryCompleted is able to register another state machine.
True, enumerating over a copy would be the safer choice. With the lock in place, at least the failure can no longer come from concurrency (which comes as a surprise to users).
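The "enumerate over a copy" idea can be sketched as follows. This is a minimal, self-contained illustration and not the Orleans implementation: `SnapshotDemo`, `NotifyDirect`, `NotifySnapshot`, and `BuildReentrantMachines` are hypothetical names, and a plain `Dictionary<string, Action>` stands in for the manager's state machine collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a manager whose callbacks may register new entries.
public static class SnapshotDemo
{
    // Enumerates the live collection. If a callback adds an entry, the next
    // MoveNext() throws InvalidOperationException ("Collection was modified").
    public static void NotifyDirect(Dictionary<string, Action> machines)
    {
        foreach (var callback in machines.Values)
        {
            callback();
        }
    }

    // Enumerates a copy, so callbacks may add entries without invalidating
    // the enumerator. Entries added during this pass are not notified by it.
    public static void NotifySnapshot(Dictionary<string, Action> machines)
    {
        foreach (var callback in machines.Values.ToArray())
        {
            callback();
        }
    }

    // Builds a dictionary whose only callback registers a second entry,
    // imitating a state machine that registers another one when notified.
    public static Dictionary<string, Action> BuildReentrantMachines()
    {
        var machines = new Dictionary<string, Action>();
        machines["first"] = () => { machines["second"] = () => { }; };
        return machines;
    }
}
```

Calling `NotifyDirect` on the dictionary from `BuildReentrantMachines` throws, while `NotifySnapshot` completes and leaves two entries registered. The trade-off of the copy is that anything registered during notification misses that notification pass.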
Do you know where the concurrency is coming from?
The recovery process itself is done safely, but as new activations are created, so are the instances of state machines, as part of the grain's ctor. Each of them registers itself with the state machine manager as its ctor runs. That process happens outside the work loop of the manager, and it can modify the collection of state machines that is currently being enumerated to notify all state machines that recovery is completed. Those two operations happen concurrently on different threads.
Locking during the recovery notification means that concurrent attempts by state machines to register themselves have to wait until all SMs are notified. That is, RegisterStateMachine(name, sm) correctly takes the lock, but that lock protects nothing if the enumeration does not hold it as well (which it now does in this PR).
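The locking argument above can be sketched like this. It is a simplified, hypothetical `MiniManager`, not the actual `StateMachineManager`: only the names `_lock`, `_stateMachines`, and `RegisterStateMachine` come from this thread, and `Action` callbacks stand in for state machines and their `OnRecoveryCompleted` method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified manager used only to illustrate the locking pattern.
public sealed class MiniManager
{
    private readonly object _lock = new object();
    private readonly Dictionary<string, Action> _stateMachines =
        new Dictionary<string, Action>();

    // Registration mutates the dictionary, so it takes the lock.
    public void RegisterStateMachine(string name, Action onRecoveryCompleted)
    {
        lock (_lock)
        {
            _stateMachines.Add(name, onRecoveryCompleted);
        }
    }

    // Enumeration takes the same lock, so registrations from other threads
    // wait until every state machine has been notified.
    // Returns the number of callbacks invoked.
    public int NotifyRecoveryCompleted()
    {
        var notified = 0;
        lock (_lock)
        {
            foreach (var onRecoveryCompleted in _stateMachines.Values)
            {
                onRecoveryCompleted();
                notified++;
            }
        }

        return notified;
    }
}
```

Note that a C# `lock` is reentrant for the thread that owns it, so this pattern only excludes other threads. A callback that registers a state machine on the notifying thread would still modify the dictionary mid-enumeration, which is the case raised in the first review comment.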
That's how I see it, hope it makes sense!
StateMachineManager is supposed to be created per-activation, though, so all of this should be happening on the same thread. Do you have a repro somewhere that I could look at?
You are right, how did I miss that! Yeah, it happens sometimes in the "automatic job scenario":
Orleans.Journaling.StateMachineManager[2114651837]
Error processing work items.
System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.Collections.Generic.Dictionary`2.ValueCollection.Enumerator.MoveNext()
at Orleans.Journaling.StateMachineManager.RecoverAsync(CancellationToken cancellationToken) in /_/src/Orleans.Journaling/StateMachineManager.cs:line 306
at Orleans.Journaling.StateMachineManager.WorkLoop() in /_/src/Orleans.Journaling/StateMachineManager.cs:line 104
I did not look too closely into that, as I was under the assumption that the SM manager was per silo, so I got fooled by the lack of a lock.
On recovery, the StateMachineManager notifies all state machines that the operation has completed, but it does so without holding a lock. This can result in the typical "Collection was modified; enumeration operation may not execute." exception, especially when exceptions in grain code trigger recovery while another grain is activating and registering its state machines with the manager.