Deadlock when using SinglePhaseCommit with distributed transactions #1800
Description
In .NET 7.0, the OleTx/MSDTC distributed transactions support has been ported over from .NET Framework (dotnet/runtime#715). This means that it will now be possible to use distributed transactions with .NET 7.0, on Windows only.
We've received a bug report (dotnet/runtime#76010) with two separate console programs which propagate a transaction between them. The initiating program (the "outer" transaction) enlists only a single SqlConnection to the transaction, and hangs when the transaction is committed. This happens quite consistently, but not 100% reliably. To reproduce this, simply run the .NET 7 version of the application in dotnet/runtime#76010 in two separate windows; the first one should hang.
When the program hangs, two threads are stuck with the following stack traces:
Stack 1:
SinglePhaseEnlistment.Committed() at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\d42a13b4dda85c7ff6fe99f383b4a1b279abd39735cb45449b7401d99c387\SinglePhaseEnlistment.cs:line 62
SqlDelegatedTransaction.SinglePhaseCommit() at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\7c62ff4f1cfc3e61c3e65f286d919f9f0bbb2ef6a822b99cb37623d6d3a891\SqlDelegatedTransaction.cs:line 389
TransactionStateDelegatedCommitting.EnterState()
CommittableTransaction.Commit()
TransactionScope.InternalDispose()
TransactionScope.Dispose()
Program.StartTransaction()
async Program.Main()
AsyncMethodBuilderCore.Start<System.__Canon>()
Program.Main()
Program.<Main>()
Stack 2:
DbConnectionInternal.DetachTransaction() at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\71847c97c677567f13e9df7e4f1f44a99f349762bb77c34f819752ee84b71\DbConnectionInternal.cs:line 413
DbConnectionInternal.CleanupConnectionOnTransactionCompletion() at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\71847c97c677567f13e9df7e4f1f44a99f349762bb77c34f819752ee84b71\DbConnectionInternal.cs:line 436
DbConnectionInternal.TransactionCompletedEvent()
TransactionStatePromotedCommitted.EnterState()
InternalTransaction.DistributedTransactionOutcome()
RealOletxTransaction.FireOutcome()
OutcomeEnlistment.InvokeOutcomeFunction()
OletxTransactionManager.ShimNotificationCallback()
PortableThreadPool.CompleteWait()
ThreadPoolWorkQueue.Dispatch()
PortableThreadPool.WorkerThread.WorkerThreadStart()
[Native to Managed Transition]
Here is the general flow leading up to this state:
1. When the TransactionScope is disposed, we get to SqlDelegatedTransaction.SinglePhaseCommit(), which is the SqlClient part that interacts with System.Transactions.
   a. SqlDelegatedTransaction.SinglePhaseCommit() locks connection (the SqlInternalConnectionTds).
   b. It then sends Commit to SQL Server. Since the transaction is delegated, this triggers the commit with MSDTC, which causes commit notifications to be sent to the enlisted parties. This includes the running program, which starts thread 2 below running concurrently.
   c. Finally, it proceeds to call enlistment.Committed(), which is SinglePhaseEnlistment.Committed(), while keeping the lock on SqlInternalConnectionTds. That method then takes _internalEnlistment.SyncRoot, which is the InternalTransaction.
2. The above commit causes us to get a notification from the native layer (MSDTC) that the transaction committed (triggered above in 1b).
   a. We get to InternalTransaction.DistributedTransactionOutcome(), which locks the InternalTransaction.
   b. We then get to DbConnectionInternal.DetachTransaction() (DbConnectionInternal is registered as a listener on the Transaction.TransactionCompleted event), which attempts to lock this (the SqlInternalConnectionTds).
So thread 1 locks the SqlInternalConnectionTds and then the InternalTransaction, while thread 2 - which runs concurrently - locks InternalTransaction and then SqlInternalConnectionTds. This produces a deadlock.
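The lock-order inversion described above can be reproduced in isolation. The following is only a minimal sketch: `connectionLock` and `transactionLock` are hypothetical stand-ins for the SqlInternalConnectionTds and the InternalTransaction, and `TryTakeBoth` is an illustrative helper, not code from SqlClient or System.Transactions. It uses a timeout on the second lock so that the demo reports the conflict instead of hanging.

```csharp
using System;
using System.Threading;

// Hypothetical stand-ins for the two objects that are used as locks in the report.
object connectionLock = new();   // plays the role of the SqlInternalConnectionTds
object transactionLock = new();  // plays the role of the InternalTransaction

// Two-participant barrier, reused for two phases, to force the problematic interleaving.
using var barrier = new Barrier(2);

bool thread1GotBoth = false, thread2GotBoth = false;

// Thread 1 mimics SinglePhaseCommit(): connection first, then transaction.
var t1 = new Thread(() => thread1GotBoth = TryTakeBoth(connectionLock, transactionLock));
// Thread 2 mimics the MSDTC outcome notification: transaction first, then connection.
var t2 = new Thread(() => thread2GotBoth = TryTakeBoth(transactionLock, connectionLock));

t1.Start();
t2.Start();
t1.Join();
t2.Join();

Console.WriteLine($"thread 1 acquired both locks: {thread1GotBoth}"); // prints False
Console.WriteLine($"thread 2 acquired both locks: {thread2GotBoth}"); // prints False

bool TryTakeBoth(object first, object second)
{
    lock (first)
    {
        // Phase 1: wait until both threads hold their first lock.
        barrier.SignalAndWait();

        // With a plain lock (second) this would block forever; the timeout
        // lets the sketch observe the deadlock condition and move on.
        bool gotSecond = Monitor.TryEnter(second, TimeSpan.FromMilliseconds(200));
        if (gotSecond)
        {
            Monitor.Exit(second);
        }

        // Phase 2: keep holding the first lock until both attempts have finished.
        barrier.SignalAndWait();
        return gotSecond;
    }
}
```

In the real hang there is no timeout, so both threads wait indefinitely, which matches the two stacks shown earlier.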
I'm not an expert in this code, but moving the call to enlistment.Committed() outside of the lock could be a way to resolve this deadlock.
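As an illustration of that idea only, here is a sketch of the general pattern: do the work that needs the lock, record the outcome, release the lock, and only then invoke the callback. `FakeConnection` and `notifyCommitted` are hypothetical names for this example; this is not SqlDelegatedTransaction's actual code, and whether such a change is safe there depends on what the connection lock is protecting, which I haven't verified.

```csharp
using System;
using System.Threading;

var connection = new FakeConnection();

// The callback stands in for enlistment.Committed(); it reports whether the
// calling thread still holds the connection lock when it runs.
connection.SinglePhaseCommit(() =>
    Console.WriteLine($"connection lock held during callback: {Monitor.IsEntered(connection)}"));
// prints: connection lock held during callback: False

// Hypothetical stand-in for the connection object.
sealed class FakeConnection
{
    public void SinglePhaseCommit(Action notifyCommitted)
    {
        bool committed;

        lock (this) // mirrors locking the connection object itself, as in the report
        {
            // The commit would be sent to the server here (elided in this sketch).
            committed = true;
        }

        // Invoked after the lock is released, so a callback that needs another
        // lock can no longer take it while this thread holds the connection lock.
        if (committed)
        {
            notifyCommitted();
        }
    }
}
```

With this ordering the commit path never holds both locks at once, which removes the circular wait, at the cost of running the notification without the connection lock's protection.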
Note that I couldn't reproduce the deadlock in .NET Framework; there are likely some timing differences which make the deadlock manifest far more frequently on .NET Core, but AFAICT the bug is in Framework as well. Note that this repro uses SinglePhaseCommit: only one connection is enlisted to the TransactionScope in the application. I recommend also testing with two connections enlisted, to force 2PC, since the same bug could be present in that flow as well.
Many thanks to @nathangreaves for providing the original repro code.
/cc @ajcvickers