This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

2023-02-17-3e0550b hangs after a while #41

Closed
lidel opened this issue Feb 17, 2023 · 10 comments · Fixed by #44

@lidel
Collaborator

lidel commented Feb 17, 2023

Upstream issue: filecoin-saturn/caboose#31

filecoin-saturn/caboose#30 fixed the panic, but we see a different problem now:

bifrost-bank1-ny:/data# curl http://127.0.0.1:8080/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  116k  100  116k    0     0  1228k      0 --:--:-- --:--:-- --:--:-- 1231k

It runs for a few minutes, but then the entire binary dies after roughly 4k requests:

bifrost-bank1-ny:/data# curl http://127.0.0.1:8041/debug/metrics/prometheus -s | grep caboose_fetch_err
# HELP ipfs_caboose_fetch_errors Errors fetching from Caboose Peers
# TYPE ipfs_caboose_fetch_errors counter
ipfs_caboose_fetch_errors{code="0"} 3563
ipfs_caboose_fetch_errors{code="200"} 369
bifrost-bank1-ny:/data# curl http://127.0.0.1:8080/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (56) Recv failure: Connection reset by peer

If there is no easy fix, I'd have to revert caboose updates before the EOD and run with the old version.

@aarshkshah1992
Collaborator

@lidel Can you get me some Bifrost logs to investigate?

@lidel lidel moved this to 🏗 In progress in bifrost-gateway Feb 17, 2023
@lidel
Collaborator Author

lidel commented Feb 17, 2023

This time it froze after ~15k block requests:

ipfs_caboose_fetch_errors{code="0"} 12721
ipfs_caboose_fetch_errors{code="200"} 2822

@willscott
Collaborator

the metric being reported here is:
https://github.com/filecoin-saturn/caboose/blob/56920bcce7c3e8e25c3ed9b58bb61c13d858ef5e/pool.go#L345

0 means the doFetch function is returning before a code is set.

that can be because of any non-200 error code, currently: https://github.com/filecoin-saturn/caboose/blob/56920bcce7c3e8e25c3ed9b58bb61c13d858ef5e/pool.go#L392-L394

we should fix that
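
A minimal sketch of that kind of fix, assuming the metric is recorded in a deferred call when the fetch returns: assign the status to `code` before the non-200 check, so the counter sees the real status instead of 0. The `counter` type, `doFetchSketch`, and the `get` callback are hypothetical stand-ins for illustration, not caboose's actual code.

```go
package main

import "fmt"

// counter is a hypothetical stand-in for the Prometheus counter,
// keyed by the value of the "code" label.
type counter map[int]int

// doFetchSketch mimics a fetch that records its status code on exit.
// get is a hypothetical callback returning the HTTP status or a transport error.
func doFetchSketch(get func() (int, error), errs counter) error {
	code := 0
	// The deferred closure reads code when the function returns,
	// so every return path is counted exactly once.
	defer func() { errs[code]++ }()

	status, err := get()
	if err != nil {
		return err // no response at all: code stays 0
	}
	code = status // set the code before any status-based early return
	if status != 200 {
		return fmt.Errorf("unexpected status %d", status)
	}
	return nil
}

func main() {
	errs := counter{}
	_ = doFetchSketch(func() (int, error) { return 200, nil }, errs)
	_ = doFetchSketch(func() (int, error) { return 502, nil }, errs)
	_ = doFetchSketch(func() (int, error) { return 0, fmt.Errorf("connection reset") }, errs)
	fmt.Println(errs[200], errs[502], errs[0]) // prints: 1 1 1
}
```

With this ordering, code 0 is left only for requests that never produced a response.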

@willscott
Collaborator

I added commit filecoin-saturn/caboose@15dc34f to the caboose branch to set codes before erroring. with that commit you should get a better breakdown of the cause of code 0

@lidel
Collaborator Author

lidel commented Feb 17, 2023

@willscott fysa, most of the time 0 means hitting the context timeout (failed to retrieve a block in under 60s). Your fix will help, but it will not cover that case: timeouts will still be shown as 0.

@lidel
Collaborator Author

lidel commented Feb 17, 2023

I was able to reproduce the deadlock with 1000 L1s from the ring cohort, using ab -k -n 20000 -c 100 -w "http://en.wikipedia-on-ipfs.org.ipns.localhost:8081/wiki/Ottoman_Empire" (/wiki/ is a HAMT, so multiple blocks) and --block-cache-size 2 (to avoid cache hits, send more requests, and hit the deadlock faster)

The deadlock was hit very early:

ipfs_caboose_fetch_errors{code="0"} 722
ipfs_caboose_fetch_errors{code="200"} 747

@willscott
Collaborator

deadlock means you see those numbers stop increasing (or slow way down) along with timeouts at that point?

@lidel
Collaborator Author

lidel commented Feb 17, 2023

The ipfs_caboose_fetch_errors numbers stop (frozen), and new requests just hang and wait (I assume in prod the available slots get used up pretty fast, and then new requests just error):

> curl -v "http://en.wikipedia-on-ipfs.org.ipns.localhost:8081/wiki/Ottoman_Empire"
* Connected to en.wikipedia-on-ipfs.org.ipns.localhost (127.0.0.1) port 8081 (#0)
> GET /wiki/Ottoman_Empire HTTP/1.1
> Host: en.wikipedia-on-ipfs.org.ipns.localhost:8081
> User-Agent: curl/7.87.0
> Accept: */*
>

@lidel
Collaborator Author

lidel commented Feb 17, 2023

I feel this will be hard to debug without metrics about the caboose pool workers: how many there are, how many are free, how many are in flight, etc.

It looks like a missing timeout or a missing resource release somewhere (no worker left to pick up work, everyone waiting, but no clear error).

lidel added a commit that referenced this issue Feb 18, 2023
lidel added a commit that referenced this issue Feb 18, 2023
@lidel lidel closed this as completed in #44 Feb 18, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in bifrost-gateway Feb 18, 2023