2023-02-17-3e0550b hangs after a while #41
Comments
@lidel Can you get me some Bifrost logs to investigate?
This time it froze after ~15k block requests:
In the metric being reported here, a code of 0 can currently be caused by any non-200 error code: https://github.com/filecoin-saturn/caboose/blob/56920bcce7c3e8e25c3ed9b58bb61c13d858ef5e/pool.go#L392-L394. We should fix that.
I added commit filecoin-saturn/caboose@15dc34f to the caboose branch to set codes before erroring. With that commit you should get a better breakdown of the cause of code 0.
@willscott FYSA, most of the time 0 means hitting the context timeout (failed to retrieve the block in under 60s). Your fix will help, but it will not fix that case: timeouts will still be shown as 0.
I was able to reproduce the deadlock with 1000 L1s from the ring cohort, and the deadlock got hit very early:
By deadlock, do you mean you see those numbers stop increasing (or slow way down), along with timeouts at that point?
I feel this will be hard to debug without metrics about the caboose pool workers: how many there are, how many are free, how many requests are in flight, etc. It looks like a missing timeout or a failure to release resources somewhere (no worker left to pick up work, everyone waiting, but no clear error).
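As a rough sketch of the kind of worker gauges meant here (total, in-flight, free), assuming a plain channel-fed worker pool. `poolStats` and `runPool` are hypothetical names for this example and do not reflect how caboose's pool is actually structured. It needs Go 1.19+ for `atomic.Int64`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// poolStats is a hypothetical set of gauges for a worker pool.
type poolStats struct {
	workers  atomic.Int64 // workers currently alive
	inFlight atomic.Int64 // jobs currently being executed
}

// free reports workers that are alive but not executing a job.
func (s *poolStats) free() int64 {
	return s.workers.Load() - s.inFlight.Load()
}

// runPool starts n workers that drain jobs and keep stats up to date.
// The returned WaitGroup is done once jobs is closed and drained.
func runPool(n int, jobs <-chan func(), stats *poolStats) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		stats.workers.Add(1)
		go func() {
			defer wg.Done()
			defer stats.workers.Add(-1)
			for job := range jobs {
				stats.inFlight.Add(1)
				job()
				stats.inFlight.Add(-1)
			}
		}()
	}
	return wg
}

func main() {
	stats := &poolStats{}
	jobs := make(chan func())
	wg := runPool(4, jobs, stats)

	started := make(chan struct{})
	release := make(chan struct{})
	// Occupy three of the four workers with jobs that block until released.
	for i := 0; i < 3; i++ {
		jobs <- func() { started <- struct{}{}; <-release }
	}
	for i := 0; i < 3; i++ {
		<-started
	}
	fmt.Printf("workers=%d in_flight=%d free=%d\n",
		stats.workers.Load(), stats.inFlight.Load(), stats.free())

	close(release)
	close(jobs)
	wg.Wait()
}
```

If `free` sits at 0 while the request counters stop moving, that would point at workers stuck inside a job (for example waiting on something with no timeout) rather than at a lack of incoming work.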
Closes #41 by applying fixes from filecoin-saturn/caboose#19
filecoin-saturn/caboose#30 fixed the panic, but we see a different problem now:
It runs for a few minutes, but then the entire binary dies at ~4k requests:
If there is no easy fix, I'd have to revert caboose updates before the EOD and run with the old version.