Description
In Nethermind we have Prevent GC during NewPayload · Pull Request #5381, which apparently introduces a deadlock: the whole app simply stops working (all threads appear to be stalled).
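For context, a minimal sketch of the pattern the PR's title implies, assuming illustrative names (`NoGCScope` and the usage comment below are not the actual Nethermind API): enter a no-GC region via `GC.TryStartNoGCRegion` before processing a payload and end it afterwards.

```csharp
using System;
using System.Runtime;

// Illustrative sketch only; Nethermind's real implementation (the GCKeeper
// code quoted later in this report) is more involved.
public sealed class NoGCScope : IDisposable
{
    private readonly bool _started;

    public NoGCScope(long budgetBytes)
    {
        // Reserve an allocation budget; while the region is active, the runtime
        // avoids collections. Returns false if the budget cannot be committed.
        _started = GC.TryStartNoGCRegion(budgetBytes);
    }

    public void Dispose()
    {
        // The runtime may have exited the region on its own (budget exceeded),
        // in which case EndNoGCRegion would throw; check the latency mode first.
        if (_started && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            GC.EndNoGCRegion();
        }
    }
}

// Usage: using (new NoGCScope(100_000_000)) { /* handle the new payload here */ }
```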
Reproduction Steps
To fully reproduce it you'd need to start a new Nethermind node as explained in Running Nethermind & CL - Nethermind Docs:
- build Nethermind from source, from the `performance/new-payload-no-gc` branch
- install a consensus client, e.g. Lighthouse:
  - just download the binaries from https://github.com/sigp/lighthouse/releases
  - check that `lighthouse --version` executes
- create a `jwtsecret` file containing a random 64-character hex string (see the sketch after these steps)
- run the consensus client:
  ```
  ./lighthouse bn --network mainnet --execution-endpoint http://localhost:8551 --execution-jwt ~/ethereum/jwtsecret --checkpoint-sync-url https://mainnet.checkpoint.sigp.io --disable-deposit-contract-sync --datadir ~/ethereum/lighthouse
  ```
- run Nethermind from `src/Nethermind/Nethermind.Runner`:
  ```
  dotnet run -c Release -- --config mainnet --datadir "/root/ethereum/nethermind" --JsonRpc.JwtSecretFile="/root/ethereum/jwtsecret"
  ```
- (unfortunately) wait until the node becomes synced (it will take a day or two)
Once the node is synced, it should eventually deadlock after a few minutes or hours.
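A note on the `jwtsecret` step above: any 32 random bytes, hex-encoded, will do (for example `openssl rand -hex 32 > ~/ethereum/jwtsecret`). A minimal C# equivalent, with a hypothetical helper name, would be:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: writes a 64-character hex string (32 random bytes)
// to the jwtsecret file, equivalent to `openssl rand -hex 32`.
class JwtSecretGenerator
{
    static void Main()
    {
        string secret = Convert.ToHexString(RandomNumberGenerator.GetBytes(32)).ToLowerInvariant();
        File.WriteAllText("jwtsecret", secret);
    }
}
```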
Expected behavior
Application works :)
Actual behavior
Application deadlocks.
Regression?
No response
Known Workarounds
No response
Configuration
Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-67-generic x86_64)
AMD EPYC 7642 48-Core Processor (16C16T), 2299Mhz - 0Mhz, 64GB RAM
.NET 7.0.4
Other information
I've attached `lldb` to a deadlocked Nethermind.Runner process; please find attached my investigation, containing `thread backtrace all` output merged (manually) with `clrstack` for the managed parts: stacks.txt
As you can see there, all GC threads are waiting on `gc_t_join.join` in `mark_phase`, while thread 172 is waiting on `wait_for_gc_done` from the `GC.Collect` coming from `ScheduleGC` in:
```csharp
public void Dispose()
{
    if (_failCause == FailCause.None)
    {
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            try
            {
                System.GC.EndNoGCRegion();
                _gcKeeper.ScheduleGC();
            }
            catch (InvalidOperationException)
            {
                if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with Exception with {_size} bytes");
            }
            catch (Exception e)
            {
                if (_logger.IsError) _logger.Error($"{nameof(System.GC.EndNoGCRegion)} failed with exception.", e);
            }
        }
        else if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with {_size} bytes");
    }
    else if (_logger.IsDebug) _logger.Debug($"Failed to start NoGCRegion with {_size} bytes with cause {_failCause.FastToString()}");
}
```
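For reference, a hedged reconstruction (inferred from the stack trace only, not the actual GCKeeper source) of the call thread 172 is blocked in: `ScheduleGC` ends up issuing a blocking induced collection, and a blocking `GC.Collect` does not return until the runtime signals `wait_for_gc_done`.

```csharp
// Hedged reconstruction, not the actual GCKeeper code: the stack trace shows
// ScheduleGC reaching a blocking GC.Collect. The caller parks on
// wait_for_gc_done until the collection completes, while the GC worker
// threads themselves are stuck in gc_t_join.join inside mark_phase.
public void ScheduleGC()
{
    System.GC.Collect(2, System.GCCollectionMode.Forced, blocking: true);
}
```

If the GC worker threads never leave `mark_phase`, that wait never completes, which would match the stall observed here.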
Most of the other threads are just waiting, in a typical state, I'd say.
The file also contains the beginning of my synchronization-data investigation, but I'm not sure what to do with the mutex `0x0000000a00000004` info, or whether that is even a good direction.
I still have `lldb` attached to the deadlocked process and am happy to investigate further if you'd drive me.