Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analysis failure: LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069 #3986

Open
deepfire opened this issue Jun 2, 2022 · 10 comments
Assignees
Labels
bug Something isn't working comp: benchmark era: babbage type: blocker Major issue, the tagged version cannot be released type: internal feature Non user-facing functionality user type: internal Created by an IOG employee Vasil_blocker Vasil

Comments

@deepfire
Copy link
Contributor

deepfire commented Jun 2, 2022

processing ../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/logs-node-9.flt.json
{ "tip": "a1fad4dfb8fc82410026f8edd7008af0c1e32b50837689e2dd378c3a14d0f65d" }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain-raw.txt" }
{ "filtered-chain-slot-domain": {"ddFilteredCount":39953,"ddFilteredFirst":16026,"ddFilteredLast":55979,"ddRawCount":65619,"ddRawFirst":865,"ddRawLast":66484} }
{ "filtered-chain-block-domain": {"ddFilteredCount":1815,"ddFilteredFirst":739,"ddFilteredLast":2554,"ddRawCount":3074,"ddRawFirst":0,"ddRawLast":3073} }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain.txt" }

locli: LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069
CallStack (from HasCallStack):
  error, called at src/Cardano/Analysis/MachPerf.hs:343:10 in locli-1.29-inplace:Cardano.Analysis.MachPerf
real    9m8.705s
@deepfire deepfire self-assigned this Jun 2, 2022
@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

run 2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs on AWS

@deepfire deepfire added bug Something isn't working Vasil Vasil_blocker labels Jun 2, 2022
@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

The logic of error here is that we're seeing a LOTraceLeadershipDecided before LOTraceStartLeadershipCheck -- which indeed is weird, if true and if locli didn't trip over itself in some way.

I've misread the basic message, see down below.

@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

Whether it's true or not, can be determined from the source of truth -- the logs from the particular node -- however the error message is unhepfully not mentioning the particular node -- so we'll need to fix the error reporting here.

@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

Ok, so we've ran the new locli on an old run, which we're 100% sure has all log messages proper -- and we've got the same failure.

So it is a regression in legacy trace handling.

@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

Ok, giving --dump-logobjects to wb a will give us insight on how locli views the stream of logobjects:

wb --cls a --filters size-small --dump-logobjects std 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null

@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

Ok, so we know it's the performance-analysis portion of the analysis that fails (aka MachPerf in haskell speak).

Next thing -- iterating over all the gigabytes of logs is slow, and we know the host, whose logs trigger this, so we now have a new subcommand -- performance-single-host (or perf-single for short):

wb --cls a --filters size-small perf-single node-0 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null

@deepfire
Copy link
Contributor Author

deepfire commented Jun 2, 2022

What does the error message say?

LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069

What does it mean, on the most basic level?

It means, that we've encountered a LOTraceLeadershipDecided with a slot number that is behind the analysis tip (cur)).

Hmm, so the handler for LOTraceLeadershipDecided expects to be operating on tip always.
Is that a reasonable assumption?

No, it's not -- there can be things that legitimately progress analysis -- and break this assumption:

  1. https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L247
  2. https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L257

Of this, the second one is not the culprit -- LOTraceStartLeadershipCheck always happens before LOTraceLeadershipDecided for any given slot.

Hence, it must be the first one, for LOResources -- and those are indeed asynchronous -- there is no ordering wrt any other messages.

deepfire added a commit that referenced this issue Jun 2, 2022
deepfire added a commit that referenced this issue Jun 2, 2022
deepfire added a commit that referenced this issue Jun 2, 2022
deepfire added a commit that referenced this issue Jun 3, 2022
deepfire added a commit that referenced this issue Jun 3, 2022
deepfire added a commit that referenced this issue Jun 5, 2022
deepfire added a commit that referenced this issue Jun 5, 2022
@deepfire deepfire closed this as completed Jun 6, 2022
deepfire added a commit that referenced this issue Jun 6, 2022
deepfire added a commit that referenced this issue Jun 6, 2022
deepfire added a commit that referenced this issue Jun 7, 2022
deepfire added a commit that referenced this issue Jun 7, 2022
deepfire added a commit that referenced this issue Jun 7, 2022
deepfire added a commit that referenced this issue Jun 13, 2022
deepfire added a commit that referenced this issue Jun 13, 2022
@deepfire
Copy link
Contributor Author

Ok, this fiasco with node-51 had another consequence -- crazy stats:

$ jq -C '.sResourceCDFs | .[] | select(objects) | select(.cdfAverage > 100000) | { cdfAverage: .cdfAverage, loc: input_filename }' logs-node-*.flt.perf-stats.json
{
  "cdfAverage": 261244935.45651916,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145198.9159274,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145168.1841264,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145153.52692,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429010.34,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429033.2,
  "loc": "logs-node-51.flt.perf-stats.json"
}
``

@deepfire
Copy link
Contributor Author

The easy solution is to just exclude node-51 from machine performance analysis, only keeping its BlockProp contribution -- without which we can't reconstruct the blockchain.

This would need some analyse.sh extension -- since the same data files are fed into the same processing anaconda..

@deepfire deepfire reopened this Jun 24, 2022
@dorin100 dorin100 added type: blocker Major issue, the tagged version cannot be released user type: internal Created by an IOG employee comp: benchmark era: babbage type: internal feature Non user-facing functionality labels Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working comp: benchmark era: babbage type: blocker Major issue, the tagged version cannot be released type: internal feature Non user-facing functionality user type: internal Created by an IOG employee Vasil_blocker Vasil
Projects
None yet
Development

No branches or pull requests

2 participants