Skip to content

Hung Dispose() caused by infinite socket timeout #1275

Closed
@jkillingsworth

Description

@jkillingsworth

Problem

From time to time, my application hangs indefinitely. My application is a headless system that runs on a schedule. When the system hangs, it requires a human operator to notice that the system is hung and then kill the process manually. I have asked the operator to capture process dumps before killing the process. I have analyzed several of these dumps with WinDbg. They each indicate that the system hangs during an SFTP operation.

Based on the information I have gathered, it looks like the problem is a combination of two separate threads that are hanging indefinitely. One thread is waiting for a lock. The other thread is waiting for a socket read operation to complete.

Evidence from WinDbg that a thread is waiting for a lock:

OS Thread Id: 0x27d0 (12)
        Child SP               IP Call Site
00000038d72b8e28 00007ff9f4066974 [GCFrame: 00000038d72b8e28] 
00000038d72b8f48 00007ff9f4066974 [GCFrame: 00000038d72b8f48] 
00000038d72b8fa8 00007ff9f4066974 [HelperMethodFrame_1OBJ: 00000038d72b8fa8] System.Threading.Monitor.Enter(System.Object)
00000038d72b90a0 00007ff98bbbbd1a Renci.SshNet.Session.IsSocketConnected()
00000038d72b9110 00007ff98bbbba9d Renci.SshNet.Channels.Channel.Close()
00000038d72b9160 00007ff98bbbb9f0 Renci.SshNet.Channels.ChannelSession.Close()
00000038d72b9190 00007ff98bbbb6cd Renci.SshNet.Channels.Channel.Dispose(Boolean)
00000038d72b91d0 00007ff98bbbb417 Renci.SshNet.Channels.ChannelSession.Dispose(Boolean)
00000038d72b9210 00007ff98bbbb3ca Renci.SshNet.Channels.Channel.Dispose()
00000038d72b9240 00007ff98bbbb02a Renci.SshNet.SubsystemSession.Disconnect()
00000038d72b9280 00007ff98bbbaec0 Renci.SshNet.SubsystemSession.Dispose(Boolean)
00000038d72b92b0 00007ff98bbbae67 Renci.SshNet.Sftp.SftpSession.Dispose(Boolean)
00000038d72b92f0 00007ff98bbbae1a Renci.SshNet.SubsystemSession.Dispose()
00000038d72b9320 00007ff98bbbadc5 Renci.SshNet.SftpClient.OnDisconnecting()
00000038d72b9350 00007ff98bbbad3d Renci.SshNet.BaseClient.Disconnect()
00000038d72b9380 00007ff98bbbacc0 Renci.SshNet.BaseClient.Dispose(Boolean)
00000038d72b93b0 00007ff98bbbac57 Renci.SshNet.SftpClient.Dispose(Boolean)
00000038d72b93f0 00007ff98bbbac15 ////////////////////REDACTED////////////////////.Dispose()
00000038d72b9430 00007ff98bb7219b ////////////////////REDACTED////////////////////
00000038d72bcef8 00007ff9e6e3db92 [HelperMethodFrame: 00000038d72bcef8] 
00000038d72bcfe0 00007ff98bbb6fe5 Renci.SshNet.SubsystemSession.WaitOnHandle(System.Threading.WaitHandle, Int32)
00000038d72bd030 00007ff98bbb9432 Renci.SshNet.Sftp.SftpSession.RequestReadDir(Byte[])
00000038d72bd0b0 00007ff98bbb8992 Renci.SshNet.SftpClient.InternalListDirectory(System.String, System.Action`1)
--------------------TRUNCATED--------------------

Evidence from WinDbg that another thread is waiting for a socket read operation to complete:

OS Thread Id: 0x1414 (10)
        Child SP               IP Call Site
00000038d633e778 00007ff9f4065ea4 [InlinedCallFrame: 00000038d633e778] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
00000038d633e778 00007ff98a4ac501 [InlinedCallFrame: 00000038d633e778] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
00000038d633e750 00007ff98a4ac501 DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
00000038d633e800 00007ff98a4ae819 System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
00000038d633e880 00007ff98bb94922 Renci.SshNet.Abstractions.SocketAbstraction.Read(System.Net.Sockets.Socket, Byte[], Int32, Int32, System.TimeSpan)
00000038d633e8f0 00007ff98bb95c5f Renci.SshNet.Session.TrySocketRead(System.Net.Sockets.Socket, Byte[], Int32, Int32)
00000038d633e920 00007ff98bb95a40 Renci.SshNet.Session.ReceiveMessage(System.Net.Sockets.Socket)
00000038d633e9b0 00007ff98bb9517e Renci.SshNet.Session.MessageListener()
00000038d633ea00 00007ff9e3ec00b2 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
00000038d633ead0 00007ff9e3ebff35 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
00000038d633eb00 00007ff9e3e8bc06 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
00000038d633eb40 00007ff9e3e8ae46 System.Threading.ThreadPoolWorkQueue.Dispatch()
--------------------TRUNCATED--------------------

My application code looks something like this:

using (var client = new SftpClient(connectionInfo))
{
    client.OperationTimeout = TimeSpan.FromSeconds(30);
    client.Connect();

    var items = client.ListDirectory("/");

    // ...
}

Note

There is an implicit call to client.Dispose() at the end of the using block in the code above.

Reproduction

My theory is that the following sequence of events occurs:

  1. The client connects to the server successfully.
  2. The client attempts to do a list directory operation.
  3. The main thread invokes a secondary thread to perform the operation.
  4. The secondary thread acquires a lock.
  5. The secondary thread attempts a socket read operation.
  6. The socket read operation never completes.
  7. The main thread throws an SshOperationTimeoutException exception.
  8. The Dispose() method of the SftpClient class is called implicitly at the end of the using block.
  9. The Dispose() method attempts to acquire the lock in the main thread.
  10. The main thread blocks because the secondary thread still has the lock.
  11. The secondary thread never releases the lock.
  12. Both threads are blocked indefinitely.

The evidence I have gathered so far supports this theory. I have not been able to reproduce this problem the way it happens in the wild. However, I have been able to simulate this scenario with some small code changes. I have created a bugdemo-01 branch in my fork that demonstrates this scenario. You can find the code here:

jkillingsworth@d8e46c8

If you want a local copy, you can clone it like this:

git clone --branch bugdemo-01 https://github.com/jkillingsworth/SSH.NET.git

There is a RunBugDemo project I added to the solution. You can build and run the code to reproduce this scenario.

screenshot-01

When you run it, the process hangs indefinitely. You can use task manager to capture a dump of the hung process.

screenshot-02

You can open the dump file with WinDbg for analysis. If you enter the !threads command, you can see all the threads in the process.

screenshot-03

Keep in mind the thread identifiers might be different for each unique dump. Also notice the lock counts in the screenshot above.

Enter ~0s and then !clrstack to see the stack trace for the main thread.

screenshot-04

Observe System.Threading.Monitor.Enter where it is waiting for the lock.

Enter ~4s and then !clrstack to see the stack trace for the secondary thread.

screenshot-05

Observe System.Net.Sockets.Socket.Receive where it is waiting to read data from the socket.

Debugging

I have included three hardcoded breakpoints in my demo code that simulates this bug. These breakpoints are placed at the key points necessary to understand this issue. You can run the RunBugDemo project with a debugger attached to hit the breakpoints. If using Visual Studio, you might find the thread debugging tools helpful.

From the menu bar

Debug ⟶ Windows ⟶ Parallel Stacks
Debug ⟶ Windows ⟶ Threads

You can open these tool windows when the debugger is running.

Breakpoint #1

Line 1204. While performing the list directory operation, the secondary thread acquires a lock on the _socketReadLock object.

screenshot-06

Line 241. The main thread waits while the secondary thread continues.

screenshot-07

Breakpoint #2

Line 256. The main thread throws an exception after the timeout interval has elapsed.

screenshot-08

Line 309. The secondary thread is trying to read from a socket when the timeout exception is thrown in the main thread.

screenshot-09

Breakpoint #3

Line 1798. The main thread tries to acquire a lock on the _socketReadLock object. It blocks here because the secondary thread still has the lock. Notice in the Parallel Stacks pane that this is deep inside the Dispose() call.

screenshot-10

Line 309. The secondary thread is still trying to read from a socket. The socket read operation never completes, and so the secondary thread never releases the lock.

screenshot-11

Solution

I don't fully understand the threading model for this library, so I am not in a position to propose the best possible solution. However, I think it would be wise to use finite timeouts when performing socket operations. Right now, the socket timeouts are hardcoded to use an infinite timeout value with no way for the user to override. A hardcoded timeout of, say, 30 seconds would be a more reasonable value in my estimation.

The following screenshot shows the expected behavior of my demo program.

screenshot-12

Instead of hanging indefinitely, I would expect it to throw an exception if there is a problem reading from the socket. My application can handle exceptions such as these without manual intervention. It is the infinite hangs that are causing problems.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions