[Journaling] Thread-safety issue during recovery #9624

Merged: ReubenBond merged 1 commit into dotnet:main from ledjon-behluli:sm-manager-thread-safety on Jul 31, 2025

ledjon-behluli (Contributor) commented Jul 26, 2025

On recovery, the StateMachineManager notifies all state machines that the operation has completed, but it does so without holding a lock. This can result in the familiar "Collection was modified; enumeration operation may not execute." exception, especially when an exception in grain code triggers recovery while another grain is activating and registering its state machines with the manager.
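
For readers unfamiliar with this failure mode, here is a minimal sketch of how the exception arises. It is not the Orleans code: `EnumerationDemo` and the string-valued dictionary are hypothetical stand-ins for the manager and its registered state machines. It only shows that adding an entry to a `Dictionary` while its `Values` are being enumerated invalidates the enumerator, whether the addition comes from another thread or from the same one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; none of these names come from Orleans.
public static class EnumerationDemo
{
    // Returns true if adding an entry mid-enumeration raises
    // InvalidOperationException ("Collection was modified; ...").
    public static bool AddDuringEnumerationThrows()
    {
        var machines = new Dictionary<string, string>
        {
            ["a"] = "A",
            ["b"] = "B",
        };

        try
        {
            foreach (var value in machines.Values)
            {
                // Stand-in for a registration that arrives while the
                // "recovery completed" notification is still in progress.
                machines.Add("late-" + value, "new");
            }
        }
        catch (InvalidOperationException)
        {
            // The next MoveNext() detects the modification and throws.
            return true;
        }

        return false;
    }
}
```

Calling `EnumerationDemo.AddDuringEnumerationThrows()` is expected to return `true`, because the enumerator detects the added key on its next `MoveNext()` call.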

lock (_lock)
{
    foreach (var stateMachine in _stateMachines.Values)
    {
        stateMachine.OnRecoveryCompleted();
    }
}
Member

This could exhibit the same issue if OnRecoveryCompleted is able to register another state machine.

Contributor Author

True, enumerating over a copy would be the safer choice. With the lock, at least the exception can no longer come from concurrency (which comes as a surprise to users).
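
A minimal sketch of the "enumerate over a copy" idea, under stated assumptions: `SnapshotNotifier`, `Register`, and `NotifyRecoveryCompleted` are hypothetical names, and callbacks are modeled as plain `Action` delegates rather than Orleans state machines. The point it illustrates is that C# `lock` is reentrant on the same thread, so holding the lock during enumeration would not stop a callback from registering a new entry; taking a snapshot under the lock and invoking callbacks outside it avoids that case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; not the StateMachineManager implementation.
public sealed class SnapshotNotifier
{
    private readonly object _lock = new object();
    private readonly Dictionary<string, Action> _callbacks =
        new Dictionary<string, Action>();

    public int Count
    {
        get { lock (_lock) { return _callbacks.Count; } }
    }

    public void Register(string name, Action onRecoveryCompleted)
    {
        lock (_lock)
        {
            _callbacks.Add(name, onRecoveryCompleted);
        }
    }

    public void NotifyRecoveryCompleted()
    {
        Action[] snapshot;
        lock (_lock)
        {
            // Copy the current callbacks while holding the lock.
            snapshot = _callbacks.Values.ToArray();
        }

        // Invoke outside the lock, over the copy, so a callback that
        // registers another entry cannot invalidate this enumeration.
        foreach (var callback in snapshot)
        {
            callback();
        }
    }
}
```

One trade-off of this sketch: an entry registered during the notification pass is not itself notified in that pass, so a real implementation would need to decide whether that is acceptable.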

Member

Do you know where the concurrency is coming from?

Contributor Author

The recovery process itself is done safely, but as new activations are created, so are the state machine instances, as part of the grain's constructor. Each of them registers itself with the state machine manager as its constructor runs. That happens outside the manager's work loop, and it can modify the collection of state machines while it is being enumerated to notify all state machines that recovery has completed. Those two operations happen concurrently on different threads.

Contributor Author

Locking around the recovery notification means that concurrent attempts by state machines to register themselves have to wait until all SMs have been notified. That is, RegisterStateMachine(name, sm) already takes the lock correctly, but that lock protects nothing if the enumeration does not also hold it (which it now does in this PR).
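
The locking argument above can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `LockedRegistry` and its members are hypothetical, and callbacks are plain `Action` delegates. It shows only the general rule that a lock taken by the writer is effective when the enumerating reader takes the same lock.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the StateMachineManager implementation.
public sealed class LockedRegistry
{
    private readonly object _lock = new object();
    private readonly Dictionary<string, Action> _machines =
        new Dictionary<string, Action>();

    public int Count
    {
        get { lock (_lock) { return _machines.Count; } }
    }

    // Writer: takes the lock before mutating the dictionary.
    public void Register(string name, Action onRecoveryCompleted)
    {
        lock (_lock)
        {
            _machines.Add(name, onRecoveryCompleted);
        }
    }

    // Reader: holds the same lock for the whole enumeration, so a
    // Register call from another thread blocks until this returns.
    public void NotifyRecoveryCompleted()
    {
        lock (_lock)
        {
            foreach (var callback in _machines.Values)
            {
                callback();
            }
        }
    }
}
```

With this shape, a `Register` call issued from a different thread while `NotifyRecoveryCompleted` is running waits for the notification pass to finish and then proceeds. A `Register` call made from inside a callback on the notifying thread would still get through, because the lock is reentrant.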

Contributor Author

That's how I see it; I hope it makes sense!

Member

StateMachineManager is supposed to be created per-activation, though, so all of this should be happening on the same thread. Do you have a repro somewhere that I could look at?

Contributor Author

You are right, how did I miss that! Yes, it happens sometimes in the "automatic job scenario":

https://github.com/ledjon-behluli/DurableStateMachines/blob/main/playground/DurableStateMachines.CTS/Program.cs#L49

 Orleans.Journaling.StateMachineManager[2114651837]
      Error processing work items.
      System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
         at System.Collections.Generic.Dictionary`2.ValueCollection.Enumerator.MoveNext()
         at Orleans.Journaling.StateMachineManager.RecoverAsync(CancellationToken cancellationToken) in /_/src/Orleans.Journaling/StateMachineManager.cs:line 306
         at Orleans.Journaling.StateMachineManager.WorkLoop() in /_/src/Orleans.Journaling/StateMachineManager.cs:line 104

Contributor Author

I did not look too closely into that because I was under the assumption that the SM manager was per silo, so the lack of a lock fooled me.

Member

The lock is good anyway.

ReubenBond merged commit 6039bac into dotnet:main on Jul 31, 2025
28 checks passed
github-actions bot locked and limited conversation to collaborators on Sep 1, 2025