Deadlock when using TryStartNoGCRegion and/or GC.Collect #84096

Closed

Description

In Nethermind we have Prevent GC during NewPayload · Pull Request #5381, which apparently introduces a deadlock: the whole app just stops working (all threads seem to be stalled).
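
For context, the idea of that PR is to keep the runtime in a no-GC region while a NewPayload request is being processed and to schedule a collection once it is done. A minimal sketch of that pattern (illustrative only; the class name and budget below are made up and this is not the actual Nethermind code):

using System;
using System.Runtime;

public sealed class NoGCGuard : IDisposable
{
    private readonly bool _entered;

    public NoGCGuard(long budgetBytes)
    {
        // Ask the runtime to defer garbage collection while the payload is processed.
        _entered = GC.TryStartNoGCRegion(budgetBytes);
    }

    public void Dispose()
    {
        if (_entered && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            GC.EndNoGCRegion();
            // Reclaim whatever accumulated while the region was active.
            GC.Collect();
        }
    }
}

// Usage: using (new NoGCGuard(200_000_000)) { /* process NewPayload */ }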

Reproduction Steps

To fully reproduce it, you'd need to start a new Nethermind node as explained in Running Nethermind & CL - Nethermind Docs:

  • build Nethermind from source, from the performance/new-payload-no-gc branch
  • install a consensus client, e.g. Lighthouse
  • create a jwtsecret file containing a random 64-character hex string
  • run the consensus client: ./lighthouse bn --network mainnet --execution-endpoint http://localhost:8551 --execution-jwt ~/ethereum/jwtsecret --checkpoint-sync-url https://mainnet.checkpoint.sigp.io --disable-deposit-contract-sync --datadir ~/ethereum/lighthouse
  • run Nethermind from src/Nethermind/Nethermind.Runner: dotnet run -c Release -- --config mainnet --datadir "/root/ethereum/nethermind" --JsonRpc.JwtSecretFile="/root/ethereum/jwtsecret"
  • (unfortunately) wait until the node becomes synced (it will take a day or two)

When the node becomes synced, it should eventually deadlock after a few minutes or hours.

Expected behavior

Application works :)

Actual behavior

Application deadlocks.

Regression?

No response

Known Workarounds

No response

Configuration

Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-67-generic x86_64)
AMD EPYC 7642 48-Core Processor (16C16T), 2299Mhz - 0Mhz, 64GB RAM
.NET 7.0.4

Other information

I've attached lldb to a deadlocked Nethermind.Runner process; please find attached my investigation, containing the output of thread backtrace all merged (manually) with clrstack for the managed parts: stacks.txt

As you can see there, all GC threads are waiting on gc_t_join.join in the mark_phase, while thread 172 is waiting on wait_for_gc_done from the GC.Collect coming from ScheduleGC in:

public void Dispose()
{
    if (_failCause == FailCause.None)
    {
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            try
            {
                System.GC.EndNoGCRegion();
                _gcKeeper.ScheduleGC();
            }
            catch (InvalidOperationException)
            {
                if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with Exception with {_size} bytes");
            }
            catch (Exception e)
            {
                if (_logger.IsError) _logger.Error($"{nameof(System.GC.EndNoGCRegion)} failed with exception.", e);
            }
        }
        else if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with {_size} bytes");
    }
    else if (_logger.IsDebug) _logger.Debug($"Failed to start NoGCRegion with {_size} bytes with cause {_failCause.FastToString()}");
}
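
ScheduleGC itself is not quoted above; my understanding (an assumption, not the actual Nethermind implementation) is that it simply defers a full blocking collection to a worker thread, roughly:

using System;
using System.Threading.Tasks;

public class GcScheduler // hypothetical wrapper, for illustration only
{
    // Assumption: ScheduleGC hands a full blocking collection off to the thread pool;
    // that GC.Collect call is where thread 172's wait_for_gc_done shows up.
    public Task ScheduleGC() =>
        Task.Run(() => GC.Collect(2, GCCollectionMode.Forced, blocking: true));
}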

Most of the other threads are just waiting, in a typical state, I'd say.

The file also contains the beginning of my synchronization data investigation, but I'm not sure what to do with the mutex 0x0000000a00000004 info, or whether that is even a good direction.

I still have lldb attached to the deadlocked process, happy to investigate further if you'd guide me.
