Deadlock when using TryStartNoGCRegion and/or GC.Collect #84096

Closed

Description

In Nethermind we have Prevent GC during NewPayload · Pull Request #5381, which apparently introduces a deadlock: the whole app just stops working (all threads seem to be stalled).
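
For context, the idea of that PR is to keep the runtime in a no-GC region while a NewPayload request is being processed and to schedule a collection once it is done. A minimal sketch of that pattern (illustrative only; the class name and budget below are made up and this is not the actual Nethermind code):

using System;
using System.Runtime;

public sealed class NoGCGuard : IDisposable
{
    private readonly bool _entered;

    public NoGCGuard(long budgetBytes)
    {
        // Ask the runtime to defer garbage collection while the payload is processed.
        _entered = GC.TryStartNoGCRegion(budgetBytes);
    }

    public void Dispose()
    {
        if (_entered && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            GC.EndNoGCRegion();
            // Reclaim whatever accumulated while the region was active.
            GC.Collect();
        }
    }
}

// Usage: using (new NoGCGuard(200_000_000)) { /* process NewPayload */ }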

Reproduction Steps

To fully reproduce it, you'd need to start a new Nethermind node as explained in Running Nethermind & CL - Nethermind Docs:

  • build Nethermind from source, from the performance/new-payload-no-gc branch
  • install a consensus client, e.g. Lighthouse
  • create a jwtsecret file containing a random 64-character hex string
  • run the consensus client: ./lighthouse bn --network mainnet --execution-endpoint http://localhost:8551 --execution-jwt ~/ethereum/jwtsecret --checkpoint-sync-url https://mainnet.checkpoint.sigp.io --disable-deposit-contract-sync --datadir ~/ethereum/lighthouse
  • run Nethermind from src/Nethermind/Nethermind.Runner: dotnet run -c Release -- --config mainnet --datadir "/root/ethereum/nethermind" --JsonRpc.JwtSecretFile="/root/ethereum/jwtsecret"
  • (unfortunately) wait until the node becomes synced (it will take a day or two)

When the node becomes synced, it should eventually deadlock after a few minutes or hours.

Expected behavior

Application works :)

Actual behavior

Application deadlocks.

Regression?

No response

Known Workarounds

No response

Configuration

Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-67-generic x86_64)
AMD EPYC 7642 48-Core Processor (16C16T), 2299Mhz - 0Mhz, 64GB RAM
.NET 7.0.4

Other information

I've attached lldb to a deadlocked Nethermind.Runner process; please find attached my investigation, containing the output of thread backtrace all merged (manually) with clrstack for the managed parts: stacks.txt

As you can see there, all GC threads are waiting on gc_t_join.join in the mark_phase, while thread 172 is waiting on wait_for_gc_done from the GC.Collect coming from ScheduleGC in:

public void Dispose()
{
    if (_failCause == FailCause.None)
    {
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            try
            {
                System.GC.EndNoGCRegion();
                _gcKeeper.ScheduleGC();
            }
            catch (InvalidOperationException)
            {
                if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with Exception with {_size} bytes");
            }
            catch (Exception e)
            {
                if (_logger.IsError) _logger.Error($"{nameof(System.GC.EndNoGCRegion)} failed with exception.", e);
            }
        }
        else if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with {_size} bytes");
    }
    else if (_logger.IsDebug) _logger.Debug($"Failed to start NoGCRegion with {_size} bytes with cause {_failCause.FastToString()}");
}
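
ScheduleGC itself is not quoted above; my understanding (an assumption, not the actual Nethermind implementation) is that it simply defers a full blocking collection to a worker thread, roughly:

using System;
using System.Threading.Tasks;

public class GcScheduler // hypothetical wrapper, for illustration only
{
    // Assumption: ScheduleGC hands a full blocking collection off to the thread pool;
    // that GC.Collect call is where thread 172's wait_for_gc_done shows up.
    public Task ScheduleGC() =>
        Task.Run(() => GC.Collect(2, GCCollectionMode.Forced, blocking: true));
}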

Most of the other threads are just waiting, in a typical state, I'd say.

The file also contains the beginning of my synchronization data investigation, but I'm not sure what to do with the mutex 0x0000000a00000004 info, or whether that is even a good direction.

I still have lldb attached to the deadlocked process, happy to investigate further if you'd guide me.
