Separate installations on the same network can interfere with each other #1639

Open
@iliana

Here is a CI failure: https://github.com/oxidecomputer/omicron/runs/7887995648

This failure occurred shortly after we added sock to Buildomat to work alongside buskin for lab jobs. This job ran on sock, and ddmd picked up a prefix during startup:

SledAgent (RSS): Received prefixes from ddmd
    prefixes = {"fe80::8:20ff:fe9e:7b26": [Ipv6Prefix { addr: fd00:1122:3344:1::, mask: 64 }, Ipv6Prefix { addr: fdb0:18c0:4d0d:9fb2::, mask: 64 }, Ipv6Prefix { addr: fd00:1122:3344:101::, mask: 64 }]}

One of these prefixes, fdb0:18c0:4d0d:9fb2::/64, was advertised by ddmd on buskin, during https://github.com/oxidecomputer/omicron/runs/7887685952:

SledAgent: Sending prefix to ddmd for advertisement
    prefix = Ipv6Prefix { addr: fdb0:18c0:4d0d:9fb2::, mask: 64 }
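
The match between the two log excerpts can also be checked mechanically. The following is a minimal sketch, not code from omicron: `Prefix`, `peers_advertising`, `prefix`, and `sock_received` are hypothetical names, and `Prefix` is only a stand-in that mirrors the `addr`/`mask` fields printed in the logs. It rebuilds the map that sled agent on sock received and asks which peers carried the prefix that buskin advertised.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

/// Hypothetical stand-in for the `Ipv6Prefix` printed in the logs;
/// this is not the real omicron type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Prefix {
    addr: Ipv6Addr,
    mask: u8,
}

/// Return, in sorted order, every peer whose advertised prefixes
/// include exactly `wanted` (same address and same mask).
fn peers_advertising(
    received: &HashMap<Ipv6Addr, Vec<Prefix>>,
    wanted: Prefix,
) -> Vec<Ipv6Addr> {
    let mut peers: Vec<Ipv6Addr> = received
        .iter()
        .filter(|(_, prefixes)| prefixes.contains(&wanted))
        .map(|(peer, _)| *peer)
        .collect();
    peers.sort();
    peers
}

/// Small helper for building a `Prefix` from a string literal.
fn prefix(addr: &str, mask: u8) -> Prefix {
    Prefix {
        addr: addr.parse().expect("valid IPv6 address"),
        mask,
    }
}

/// The prefixes from the "Received prefixes from ddmd" log line on sock,
/// keyed by the link-local address of the peer they came from.
fn sock_received() -> HashMap<Ipv6Addr, Vec<Prefix>> {
    let peer: Ipv6Addr = "fe80::8:20ff:fe9e:7b26"
        .parse()
        .expect("valid IPv6 address");
    let mut received = HashMap::new();
    received.insert(
        peer,
        vec![
            prefix("fd00:1122:3344:1::", 64),
            prefix("fdb0:18c0:4d0d:9fb2::", 64),
            prefix("fd00:1122:3344:101::", 64),
        ],
    );
    received
}
```

Calling `peers_advertising(&sock_received(), prefix("fdb0:18c0:4d0d:9fb2::", 64))` returns the single peer `fe80::8:20ff:fe9e:7b26`, which is the overlap described here. The comparison is an exact match on address and mask; it does not attempt any containment or longest-prefix logic.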

I think the result is that the job on sock created a plan for two sled agents, then failed to start a sled agent because it was already running on buskin.

There is a timestamp discrepancy in the logs; either sock's time is one hour in the future, or... well, I'm not sure what else it'd be. Some sort of lab network state being cached somewhere?

@jclulow is changing the lab network so that sock and buskin are on two separate VLANs. That seems like the correct configuration for CI, so I'm not labeling this as a test flake. Instead, I'd like to know whether we want to prevent this kind of thing from happening elsewhere (for example, someone testing two separate control plane installations at once on their home network without intending for them to talk to each other), or whether that's just a bad idea.
