Description
Describe the bug
zosv3light node 7403, running on a Hetzner dedicated server, was working fine until I deployed a workload via Terraform: a full VM with Mycelium and Yggdrasil (ygg) enabled.
I saw this on 2 other nodes as well, with exactly the same behavior. Once deployed, the VM works over Mycelium for a few minutes and then becomes unreachable; however, the issue does not seem to be related to Mycelium. In Loki you can see that dhcp-zos lost its carrier and therefore removed its default routes, etc.
As a result, the workload cannot be removed via Terraform, because the node is unreachable. I have to reboot the ZOS node and then remove the deployment. If you wait a few minutes after a reboot, the same pattern repeats until the deployment is removed. Deploying the same Terraform configuration on other nodes does not cause this issue.
To Reproduce
Steps to reproduce the behavior:
1. Deploy the Terraform configuration below on mainnet node 7403
2. Watch the Loki logs: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-5m%22,%22to%22:%22now%22%7D%7D
3. Watch the metrics stop: https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=now-1h&to=now&timezone=browser
4. Try to reach your deployment and check whether it stays online
Expected behavior
Normal node and deployment operation: both the node and the VM stay reachable.
Screenshots
terraform:
terraform {
  required_providers {
    grid = {
      source = "threefoldtech/grid"
    }
  }
}

provider "grid" {
}

resource "random_bytes" "mycelium_ip_seed" {
  length = 6
}

resource "random_bytes" "mycelium_key" {
  length = 32
}

resource "grid_network" "net1" {
  nodes         = [7403]
  ip_range      = "10.212.0.0/16"
  name          = "myceiperf2"
  description   = "myceiperf2"
  add_wg_access = true
  mycelium_keys = {
    format("%s", 7403) = random_bytes.mycelium_key.hex
  }
}

resource "grid_deployment" "d1" {
  node         = 7403
  network_name = grid_network.net1.name
  disks {
    name = "root"
    size = 25
  }
  vms {
    name      = "myceiperf2"
    flist     = "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist"
    cpu       = 4
    planetary = true
    publicip  = false
    publicip6 = false
    memory    = 8192
    # entrypoint = "/sbin/zinit init"
    mounts {
      name        = "root"
      mount_point = "/data"
    }
    env_vars = {
      SSH_KEY = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDYNeJXJV2FNEwuQz6e0jkKeqRbKWwftBKq+sjSTqa2x"
    }
    mycelium_ip_seed = random_bytes.mycelium_ip_seed.hex
  }
}

output "wg_config" {
  value = grid_network.net1.access_wg_config
}

output "node1_vm1_ip" {
  value = grid_deployment.d1.vms[0].ip
}

output "public_ip" {
  value = grid_deployment.d1.vms[0].computedip
}

output "public_ip6" {
  value = grid_deployment.d1.vms[0].computedip6
}

output "planetary_ip" {
  value = grid_deployment.d1.vms[0].planetary_ip
}

output "vm1_mycelium_ip" {
  value = grid_deployment.d1.vms[0].mycelium_ip
}
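Since the same configuration reportedly does not trigger the issue on other nodes, it may help to make the node id a single input so the identical workload can be deployed on 7403 and on a comparison node. This is only a sketch: the variable name `node_id` is hypothetical and not part of the original configuration; the rest of the config would stay as shown above.

```hcl
# Hypothetical variable (not in the original report): lifts the hardcoded
# node id so the same deployment can be re-run against another node.
variable "node_id" {
  type    = number
  default = 7403 # node where the carrier loss reproduces
}

# Same network as above, but pinned to var.node_id instead of 7403.
resource "grid_network" "net1" {
  nodes         = [var.node_id]
  ip_range      = "10.212.0.0/16"
  name          = "myceiperf2"
  description   = "myceiperf2"
  add_wg_access = true
  mycelium_keys = {
    format("%s", var.node_id) = random_bytes.mycelium_key.hex
  }
}

# In grid_deployment.d1, replace `node = 7403` with:
#   node = var.node_id
```

With this change, something like `terraform apply -var 'node_id=<other node>'` deploys the same workload on a different node, which makes the comparison with 7403 a one-flag change.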