
observation: docker system dial-stdio processes do not die #837

Open
wdiechmann opened this issue Jun 12, 2024 · 10 comments

Comments

@wdiechmann

not sure what is going on - but my observation is a slowly "degenerating system" as I keep deploying; if this is just me being (unknowingly) high as a kite on ethanol, please forgive me for wasting your bandwidth 🙏

Symptoms

Either deploys fail or take forever - and service response is measurably below par

Diagnostics

root@ubuntu-4gb-hel1-mortimer-1:~# ps ax
...8<...
3479828 ?        Ssl    0:00 docker system dial-stdio
3479855 ?        Ssl    0:00 docker system dial-stdio
3479861 ?        Ss     0:00 sshd: root@notty
3479928 ?        Ssl    0:00 docker system dial-stdio
3479946 ?        Ssl    0:00 buildctl dial-stdio
3480065 ?        Ss     0:00 sshd: root@pts/0
3480118 ?        I      0:00 [kworker/1:1-events]
3480138 pts/0    Ss     0:00 -bash
3480958 ?        I      0:00 [kworker/u4:3-flush-8:0]
3481521 pts/0    R+     0:00 ps ax
root@ubuntu-4gb-hel1-mortimer-1:~# ps ax | grep dial-stdio | wc -l
99
root@ubuntu-4gb-hel1-mortimer-1:~# shutdown -r now
...8<...
root@ubuntu-4gb-hel1-mortimer-1:~# ps ax | grep dial-stdio | wc -l
1
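
(Aside, not part of the original output: the grep in the pipeline above also counts itself, so a machine with no leftover processes still reports 1. Bracketing the first character of the pattern avoids that:)

ps ax | grep -c '[d]ial-stdio'   # counts dial-stdio processes without the grep matching itself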

Remediation

I'm barking up the "Kamal communicates via a pipe helped by docker system dial-stdio" tree - suspecting the "remote" process doesn't know when to exit, so it hangs around indefinitely - just a (wild) guess 🤷🏻‍♂️

Somehow signaling the process to 'go die' would perhaps solve the matter - in a perfect world not until the deploy has finished (exit 0 or otherwise), but failing that after each command --

Reproduction

All I do is kamal env push && kamal deploy - once or twice per 2-hour slot - effectively demanding a reboot every other day

#/config/deploy.yml
    ....8<...

builder:
  remote:
    arch: arm64
    host: ssh://bob_the_builder@1.2.3.4

# Deploy to these servers.
servers:
  web:
    hosts:
      - 1.2.3.4
    options:
    ....8<...

ssh:
  user: bob_the_builder

System

it's a rental, what can I say 😉

happy user of Hetzner services

root@ubuntu-4gb-hel1-mortimer-1:~# uname -a
Linux ubuntu-4gb-hel1-mortimer-1 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 07:51:32 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

and the ruby/rails env is

rails@e8d5d7728a6a:/rails$ bin/rails -v
Rails 8.0.0.alpha
rails@e8d5d7728a6a:/rails$ ruby -v
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]

and finally Kamal is

√ bellis % kamal version
1.3.1
@djmb
Collaborator

djmb commented Jun 13, 2024

What are you running on ubuntu-4gb-hel1-mortimer-1? Is it used as the remote builder?

@wdiechmann
Author

wdiechmann commented Jun 13, 2024 via email

@djmb
Collaborator

djmb commented Jun 13, 2024

I think the docker system dial-stdio processes are related to the connections to your remote builder then. We have seen similar problems with ours. Looks maybe a bit like this - https://forums.docker.com/t/docker-continuously-making-unnecessary-ssh-connections-to-remote-servers/136132?

For now I'd suggest moving the remote builder to its own server to avoid affecting your app.
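
For illustration, that would just mean pointing the builder stanza in config/deploy.yml at a dedicated machine (the 5.6.7.8 address below is a placeholder, not a host from this thread):

builder:
  remote:
    arch: arm64
    host: ssh://bob_the_builder@5.6.7.8   # dedicated builder host, not one of the web servers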

@wdiechmann
Author

wdiechmann commented Jun 13, 2024 via email

@wdiechmann
Author

disclaimer:
I've not yet enjoyed Kamal 2.0 - and I do build on the (test) host - which may add significantly to the number of open connections

If anyone ends up here either because the host dies under the weight of dial-stdio (well, it is less dramatic than that - it runs out of memory, so it ends up in a kind of "coma" -- a kamal-infused coma you might say), or because the client (your Mac/Linux/PC) keeps every single SSH connection used by Kamal open till hell freezes over - or until you show them the kitchen door 😉 - these are the clean-ups:

# your client - hang up lingering Kamal SSH connections
kill -HUP `ps aux | grep 'ConnectTimeout' | awk '{ print $2}'`

# host - kill leftover dial-stdio helpers
kill $(ps ax | grep 'docker system dial-stdio' | grep Ssl | awk '{print $1}')
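
An equivalent clean-up, assuming pkill (procps) is available on both ends, avoids the grep/awk pipeline and never matches itself:

# your client
pkill -HUP -f 'ConnectTimeout'

# host
pkill -f 'docker system dial-stdio'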

@rogermarlow

I am having similar problems, although I think there are two separate problems here.

I use kamal 2.1.1 to deploy from my MBP to three low cost Ubuntu servers. The one chosen as the remote builder runs out of memory in something like 18-24 hours and has to be restarted, after which the process repeats. If I manage to get onto the server before it runs out of memory there are around 100 docker system dial-stdio processes. They appear to be created at a rate of four per hour. This is problem 1.

Problem 2 is that there are also lots of processes on the laptop from which I deploy, 56 as I write, of the form:
ssh -o ConnectTimeout=30 -l [deploy user] -- [IP of server 1, 2 or 3] docker system dial-stdio
All apparently left over from past deployments and easy to clean up. They are not connected to the processes on the remote build server: if I kill the laptop processes, the remote build processes are still created. If I stop the buildkit container, the processes are still spawned. Currently I don't know what causes this continual spawning, and I just have to kill them manually if I want the server to remain up.

@rogermarlow

Update: the problem is docker buildx running locally. It connects to the remote host every few minutes - not sure why, I don't want a build - but it leaves a docker system dial-stdio process on the remote build host. And it's not just buildx running on my laptop, it is buildx running on the laptops of all the developers working on this project. (docker buildx stop .... does not stop the builder.) I have resorted to cronjobs that delete the processes every 30 minutes on the client and server.

This was raised in March in a Docker community forum post.
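
The server-side cronjob mentioned above might look something like this sketch (the 30-minute schedule and the use of pkill are assumptions, not details from this thread):

# crontab -e on the remote build host: every 30 minutes, kill leftover dial-stdio helpers
*/30 * * * * pkill -f 'docker system dial-stdio'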

@rogermarlow

Update 2: as we don't strictly need a remote builder, we dropped the remote option and build locally instead. We also had to go into Docker Desktop on every developer machine that had deployed and remove the remote builders (Settings -> Builders). Once we cleaned up the dial-stdio processes on the remote build machine, we have rock-steady memory usage.

@wdiechmann
Author

@rogermarlow I fiddled on with the script(s) and now they look like this (I have one named prod for deploying to production, too). With this I can have my cake and eat it too 😄 (building remotely without risking exhausting the host)

# bin/stage
ssh docker5 ls
kamal env push --destination=staging
kamal deploy --destination=staging
echo Cleaning SSH local: `ps aux | grep 'ConnectTimeout' | wc -l` remote: `ssh docker5 -lroot "ps aux | grep 'ConnectTimeout' | wc -l"`
clean_ssh > /dev/null 2>&1
echo Cleaned SSH local: `ps aux | grep 'ConnectTimeout' | wc -l` remote: `ssh docker5 -lroot "ps aux | grep 'ConnectTimeout' | wc -l"`
# bin/clean_ssh
kill -HUP `ps aux | grep 'ConnectTimeout' | awk '{ print $2}'` > /dev/null 2>&1
ssh docker5 -lroot "kill \$(ps ax | grep 'docker system dial-stdio' | awk '{print \$1}')" > /dev/null 2>&1

notes:
Line 2 in the stage script "wakes up" the VM on Hetzner -- it's not necessary if you're not 'on the cheap' 😆
Lines 5,7 are only for reporting - not necessary
Line 2 in the clean_ssh script cleans local processes
Line 3 does the same on the host

@jeremy
Member

jeremy commented Nov 2, 2024

Perhaps Kamal could stop the builder when its work is done:

docker buildx stop kamal-remote-ssh--username-hostname

You can try that out in your app with a post-deploy hook. In .kamal/hooks/post-deploy:

#!/bin/bash
docker buildx stop kamal-remote-ssh--yourbuilderusername-yourbuilderhostname

(Note: assumes Kamal 2 builder naming convention. Adjust for older Kamal 1 builder names like kamal-$service-native-remote)
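
To confirm the hook did its job, a quick check after a deploy might be (using the same placeholder builder name; docker buildx inspect reports the node status):

docker buildx inspect kamal-remote-ssh--yourbuilderusername-yourbuilderhostname | grep -i status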
