Description
Here is a CI failure: https://github.com/oxidecomputer/omicron/runs/7887995648

This failure occurred shortly after we added sock to Buildomat to work alongside buskin for lab jobs. This job ran on sock, and ddmd picked up a prefix during startup:
```
SledAgent (RSS): Received prefixes from ddmd
    prefixes = {"fe80::8:20ff:fe9e:7b26": [Ipv6Prefix { addr: fd00:1122:3344:1::, mask: 64 }, Ipv6Prefix { addr: fdb0:18c0:4d0d:9fb2::, mask: 64 }, Ipv6Prefix { addr: fd00:1122:3344:101::, mask: 64 }]}
```
This prefix was advertised by ddmd on buskin, during https://github.com/oxidecomputer/omicron/runs/7887685952:
```
SledAgent: Sending prefix to ddmd for advertisement
    prefix = Ipv6Prefix { addr: fdb0:18c0:4d0d:9fb2::, mask: 64 }
```
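To make the conflict concrete: the prefix set sock learned mixes subnets of the rack's own ULA space (`fd00:1122:3344::/48` in these logs) with a `/64` advertised by a different installation. A minimal sketch of how one might flag such foreign prefixes, assuming the expected containing prefix is known up front (the `Ipv6Prefix` struct here is a stand-in modeled on the log output, not the actual omicron type):

```rust
use std::net::Ipv6Addr;

/// Stand-in for the Ipv6Prefix type seen in the logs (an assumption,
/// not the real omicron definition).
#[derive(Debug, PartialEq)]
struct Ipv6Prefix {
    addr: Ipv6Addr,
    mask: u8,
}

/// Returns true if `p` is a subnet of (or equal to) `outer`.
fn contained_in(p: &Ipv6Prefix, outer: &Ipv6Prefix) -> bool {
    if p.mask < outer.mask {
        return false;
    }
    if outer.mask == 0 {
        // /0 contains everything; also avoids a 128-bit shift overflow below.
        return true;
    }
    let pa = u128::from_be_bytes(p.addr.octets());
    let oa = u128::from_be_bytes(outer.addr.octets());
    let shift = 128 - u32::from(outer.mask);
    // Compare only the top `outer.mask` bits.
    (pa >> shift) == (oa >> shift)
}

fn main() {
    // Expected rack ULA space, inferred from the fd00:1122:3344:* addresses
    // in the logs above.
    let expected = Ipv6Prefix {
        addr: "fd00:1122:3344::".parse().unwrap(),
        mask: 48,
    };
    // The three prefixes sock received from ddmd.
    let learned = [
        Ipv6Prefix { addr: "fd00:1122:3344:1::".parse().unwrap(), mask: 64 },
        Ipv6Prefix { addr: "fdb0:18c0:4d0d:9fb2::".parse().unwrap(), mask: 64 },
        Ipv6Prefix { addr: "fd00:1122:3344:101::".parse().unwrap(), mask: 64 },
    ];
    for p in &learned {
        if !contained_in(p, &expected) {
            println!("foreign prefix learned: {:?}", p);
        }
    }
}
```

With these inputs, only `fdb0:18c0:4d0d:9fb2::/64` (buskin's advertisement) falls outside the expected space. Whether sled agent should actually reject such prefixes is the open question below.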
The result, I think, is that the job on sock created a plan for two sled agents, and then failed to start its sled agent because one was already running on buskin.
There is also a timestamp discrepancy in the logs: either sock's clock is one hour in the future, or... well, I'm not sure what else it could be. Some sort of lab network state being cached somewhere?
@jclulow is changing the lab network so that sock and buskin are on two separate VLANs, which seems like the correct configuration for CI, so not labeling this as a test flake. Instead I'd like to know if we want to prevent this kind of thing from happening elsewhere, e.g. someone testing two separate control plane installations at once on their home network without intending for them to talk to each other, or if that's just a bad idea.