
Long running migrations can cause systemd to terminate daemon on start #7269

Closed
Stebalien opened this issue May 4, 2020 · 9 comments · Fixed by #8040
Labels
exp/intermediate (Prior experience is likely helpful) · help wanted (Seeking public contribution on this issue) · kind/bug (A bug in existing code, including security flaws) · P3 (Low: Not priority right now) · status/ready (Ready to be worked) · topic/daemon + init · topic/repo

Comments

@Stebalien (Member)

Version information:

0.5.0

Description:

Now that go-ipfs supports systemd's "notification" system, we need to tell systemd to repeatedly extend the startup timeout while performing repo migrations. Otherwise, systemd may kill the daemon, thinking it timed out on startup.

We can do this by repeatedly sending EXTEND_TIMEOUT_USEC=... to systemd's notification socket using github.com/coreos/go-systemd/v22/daemon. See cmd/ipfs/daemon_linux.go for how we interact with systemd's notification service.
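A minimal sketch of what such a loop could look like (hedged: the helper name, the 10s/30s values, and the runRepoMigration stub are illustrative and not the actual go-ipfs code; only daemon.SdNotify and the EXTEND_TIMEOUT_USEC state come from go-systemd and systemd itself):

```go
// Hedged sketch, not the go-ipfs implementation: keep extending systemd's
// startup timeout while a long-running repo migration is in progress.
package main

import (
	"fmt"
	"time"

	"github.com/coreos/go-systemd/v22/daemon"
)

// extendStartupTimeout asks systemd, every `interval`, for `extend` more time
// before startup is considered failed. The returned function stops the loop.
func extendStartupTimeout(interval, extend time.Duration) (stop func()) {
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				// EXTEND_TIMEOUT_USEC is in microseconds. SdNotify is a no-op
				// when NOTIFY_SOCKET is unset (i.e. not started by systemd).
				_, _ = daemon.SdNotify(false,
					fmt.Sprintf("EXTEND_TIMEOUT_USEC=%d", extend.Microseconds()))
			}
		}
	}()
	return func() { close(done) }
}

func main() {
	stop := extendStartupTimeout(10*time.Second, 30*time.Second)
	runRepoMigration() // hypothetical stand-in for the actual migration
	stop()

	// Signal normal readiness once startup has finished.
	_, _ = daemon.SdNotify(false, daemon.SdNotifyReady)
}

// runRepoMigration is a placeholder for the real repo migration work.
func runRepoMigration() { time.Sleep(45 * time.Second) }
```

Each EXTEND_TIMEOUT_USEC message has to arrive before the currently granted timeout expires, so the ping interval needs to stay well below the extension it requests.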

Stebalien added the kind/bug, help wanted, P3, and exp/intermediate labels on May 4, 2020
@RubenKelevra (Contributor) commented May 4, 2020

Can we notify systemd the same way while we open the database on startup (which is basically the reason for the slow startup documented here), and on shutdown, when we for example run database compaction jobs, flush data to disk, or clean up sockets?

@Stebalien (Member, Author)

However, we don't do database compactions on stop (explicitly disabled for this reason), so we shouldn't delay there. If IPFS gets stuck on shutdown, we should just die and clean up when we next restart.

@RubenKelevra (Contributor)

I've seen ipfs hit the one-minute-thirty-second limit dozens of times on different machines, without badger db. My guess is that it's a TCP cleanup issue.

I think it's still nicer to clean everything up properly, and to keep notifying systemd as long as we make progress.

Maybe add a hard limit, after which we stop notifying on shutdown, to make sure we don't hang indefinitely.

@Stebalien (Member, Author)

Odd. We shouldn't be spending any time cleaning up TCP connections or anything like that, unless we have a bug somewhere. Can you reproduce this?

@RubenKelevra (Contributor)

> Odd. We shouldn't be spending any time cleaning up TCP connections or anything like that, unless we have a bug somewhere. Can you reproduce this?

Well, I converted all my nodes to badgerds with the 0.5.0 release.

I need to convert one back to see how it can be reproduced.

@RubenKelevra (Contributor)

@Stebalien

> I've seen ipfs hit the one-minute-thirty-second limit dozens of times on different machines, without badger db. My guess is that it's a TCP cleanup issue.

I can report that this bug seems to be gone. Not sure when it went away, but I haven't noticed it at all on 0.8.

Apart from this, wouldn't it make sense for systemd to send go-ipfs a SIGABRT instead of a SIGKILL by default if this happens? That would give the user a stack trace to share with us.
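(A hedged sketch of a unit drop-in that could request this; the TimeoutStartFailureMode=/TimeoutStopFailureMode= settings are documented in systemd.service(5) and, as far as I know, need systemd 246 or newer. The Go runtime dumps goroutine stacks on SIGABRT.)

```ini
# Illustrative drop-in path: /etc/systemd/system/ipfs.service.d/abort-on-timeout.conf
[Service]
# On a start/stop timeout, send SIGABRT instead of SIGKILL so the Go runtime
# prints goroutine stacks before the daemon dies.
TimeoutStartFailureMode=abort
TimeoutStopFailureMode=abort
```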

@Stebalien (Member, Author)

Stebalien commented Mar 31, 2021 via email

@Stebalien (Member, Author)

Ok, I'm actually just going to disable the startup timeout. It's not helping.
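(For reference, a hedged sketch of what disabling the startup timeout could look like at the unit level; whether #8040 does it exactly this way isn't shown here.)

```ini
[Service]
# Never time out waiting for READY=1; long migrations can take as long as they need.
TimeoutStartSec=infinity
```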
