🐛 config_url is saved as part of the config during netboot #885

Closed
Tracked by #782 ...
mudler opened this issue Feb 14, 2023 · 6 comments · Fixed by #957
Assignees
Labels
area/agent bug Something isn't working lane/ux

Comments

@mudler
Member

mudler commented Feb 14, 2023

Kairos version:
1.5

CPU architecture, OS, and Version:

Describe the bug
When doing automated installs from the network, for instance with AuroraBoot, the cloud config file used for bootstrapping is passed as a config_url, which ends up being saved into the /oem/90_custom.yaml file. The problem is that if we kill AuroraBoot, kairos will try to fetch the URL again and fail, preventing the agent from starting.

To Reproduce
Run AuroraBoot as follows and, after the machine installation completes, kill the AuroraBoot process:

cat <<EOF | docker run --rm -i --net host quay.io/kairos/auroraboot \
                    --cloud-config - \
                    --set "container_image=quay.io/kairos/kairos-opensuse-leap:v1.5.1-k3sv1.21.14-k3s1"
#cloud-config
install:
 auto: true
 device: "auto"
 reboot: true

hostname: kairoslab-{{ trunc 4 .MachineID }}
users:
- name: kairos
  passwd: kairos
  ssh_authorized_keys:
  # Replace with your github user and un-comment the line below:
  - github:mudler
  - github:mauromorales

p2p:
 # Disabling DHT makes co-ordination to discover nodes only in the local network
 disable_dht: true #Enabled by default

 # network_token is the shared secret used by the nodes to co-ordinate with p2p.
 # Setting a network token implies auto.enable = true.
 # To disable, just set auto.enable = false
 network_token: "YOUR_TOKEN_HERE"
EOF

To work around this, add an empty config_url to the supplied cloud config, like so:

cat <<EOF | docker run --rm -i --net host quay.io/kairos/auroraboot \
                    --cloud-config - \
                    --set "container_image=quay.io/kairos/kairos-opensuse-leap:v1.5.1-k3sv1.21.14-k3s1"
#cloud-config

config_url: ""
#### WORKAROUND ^^^

install:
 auto: true
 device: "auto"
 reboot: true

hostname: kairoslab-{{ trunc 4 .MachineID }}
users:
- name: kairos
  passwd: kairos
  ssh_authorized_keys:
  # Replace with your github user and un-comment the line below:
  - github:mudler
  - github:mauromorales

p2p:
 # Disabling DHT makes co-ordination to discover nodes only in the local network
 disable_dht: true #Enabled by default

 # network_token is the shared secret used by the nodes to co-ordinate with p2p.
 # Setting a network token implies auto.enable = true.
 # To disable, just set auto.enable = false
 network_token: "YOUR_TOKEN_HERE"
EOF

Expected behavior
The kairos-agent should at least ignore such errors and print a warning instead so the deployment is not stopped. If there is something going wrong, a user having issues will already look into logs.

Open question: what about the config_url itself? Does it make sense to save it at all when it comes from the cmdline?
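
One possible answer to the open question, sketched in Python purely for illustration: drop config_url from the configuration before it is persisted when the value came from the kernel cmdline. The function name config_to_persist and the from_cmdline flag are hypothetical and are not part of kairos-agent; the config is modelled as a plain dict rather than parsed YAML.

```python
def config_to_persist(config: dict, from_cmdline: bool) -> dict:
    """Return a copy of `config` suitable for writing to disk.

    Hypothetical helper: when config_url was supplied on the cmdline
    (for example by a netboot server), it is removed from the copy so a
    later boot does not try to reach a server that may be gone.
    """
    persisted = dict(config)  # shallow copy; the caller's dict is untouched
    if from_cmdline:
        persisted.pop("config_url", None)
    return persisted
```

With this approach a config_url that the user wrote into their own cloud config would still be saved; only the one injected at netboot time would be dropped.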

@mudler mudler added the bug Something isn't working label Feb 14, 2023
@mudler mudler assigned mudler and unassigned mudler Feb 14, 2023
@mudler
Member Author

mudler commented Feb 14, 2023

Labeled as ux since it doesn't block anything (a workaround is known).

@mudler mudler mentioned this issue Feb 14, 2023
37 tasks
@jimmykarily
Contributor

As agreed during planning, we will make the agent print a warning when the config_url is not accessible and continue booting.

Ideally it can retry a couple of times before giving up.
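
A minimal sketch of the "retry a few times, then warn and carry on" behaviour, written in Python only to illustrate the idea. It is not the agent's actual code: fetch_with_retries, its parameters, and the exact warning text are assumptions. The fetch argument stands in for whatever downloads the config_url.

```python
import sys
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], str],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[str]:
    """Call fetch() up to `attempts` times.

    Returns the fetched text on success. If every attempt fails, prints
    a warning to stderr and returns None instead of raising, so the
    caller can continue with the local configuration.
    """
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:  # connection refused, timeouts, urllib's URLError
            errors.append(f"#{attempt}: {exc}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    print("WARNING: Couldn't fetch config_url, continuing without it:", file=sys.stderr)
    for line in errors:
        print(line, file=sys.stderr)
    return None
```

A caller would treat a None result as "no remote config" and keep booting with whatever is already on disk, rather than exiting with a non-zero status.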

Ideally, these kinds of warnings should be printed in the boot messages (no journalctl or rd.shell hacks should be needed), but we will do that in a separate issue because it also affects other kinds of failures (e.g. bad kcrypt configuration).

@jimmykarily
Contributor

I tried to reproduce it, but although the auroraboot server is not reachable, the system eventually boots. These are the relevant logs from sudo journalctl:

Feb 21 13:38:18 localhost systemd-logind[1353]: Removed session 1.
Feb 21 13:38:19 localhost kairos-agent[1545]: could not merge configs: All attempts fail:
Feb 21 13:38:19 localhost kairos-agent[1545]: #1: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #2: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #3: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #4: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #5: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #6: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #7: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #8: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #9: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost kairos-agent[1545]: #10: Get "http://192.168.1.192:8090/_/file?name=other-1": dial tcp 192.168.1.192:8090: connect: connection refused
Feb 21 13:38:19 localhost systemd[1]: kairos-agent.service: Main process exited, code=exited, status=1/FAILURE
Feb 21 13:38:19 localhost systemd[1]: kairos-agent.service: Failed with result 'exit-code'.
Feb 21 13:38:24 localhost systemd[1]: kairos-agent.service: Scheduled restart job, restart counter is at 1.
Feb 21 13:38:24 localhost systemd[1]: Stopped kairos agent.
Feb 21 13:38:24 localhost systemd[1]: Started kairos agent.

It tried 10 times and then quit.

@mudler
Member Author

mudler commented Feb 21, 2023

The system boots, but the service exits, so it never reaches the first-boot deployment code (and k3s doesn't start, etc.).

@mudler mudler mentioned this issue Feb 22, 2023
35 tasks
jimmykarily added a commit that referenced this issue Feb 23, 2023
but rather print a Warning.

Fixes #885

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
@jimmykarily
Contributor

The fix was simple: https://github.com/kairos-io/kairos/tree/885-flexible-config-url

but I'll check if we can have a test for this.

@jimmykarily
Contributor

A manual test for now:

[  OK  ] Finished Record Runlevel Change in UTMP.
WARNING: Couldn't fetch config_url: could not merge configs: All attempts fail:
#1: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#2: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#3: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#4: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#5: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#6: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#7: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#8: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#9: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
#10: Get "https://doesnotexist.com/testestest": x509: certificate is valid for *.netlify.app, netlify.app, not doesnotexist.com
INFO[2023-02-23T13:58:02Z] Starting elemental version 0.20230222.1+kairos 
INFO[2023-02-23T13:58:02Z] Install called                               
INFO[2023-02-23T13:58:02Z] Partitioning device...                       
[  151.181891][ T1173]  vda:

and the installation keeps going (turns out doesnotexist.com does exist :D).

The config.yaml:

#cloud-config

config_url: https://doesnotexist.com/testestest

users:
- name: "kairos"
  passwd: "kairos"
  lock_passwd: true
  groups: "admin"
  ssh_authorized_keys:
  - github:jimmykarily

and the command was: kairos-agent manual-install --device auto config.yaml
I will try to create an automated test for this now.

mudler pushed a commit that referenced this issue Feb 24, 2023
but rather print a Warning.

Fixes #885

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
jimmykarily added a commit that referenced this issue Feb 24, 2023
Don't fail when config_url is not accessible

but rather print a Warning.

Fixes #885

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
Co-authored-by: Dimitris Karakasilis <dimitris@karakasilis.me>

2 participants