Description
The best thing about getting smarter every day is that you realize how bad your decisions were yesterday. 😂
Context
With the multi-switch work, we needed to provide Nexus with a way to know which switch zone was managing which switch slot (the top switch or the bottom switch). This information was not available in DNS at the time. However, mgs is able to provide the information, and it is co-resident with the other switch zone services. The decided approach was to look up the dendrite instances via DNS, and then determine which physical switch they are managing via mgs. This seems to have worked out well.
The Problem
In my infinite wisdom, I stashed the generated clients with their location data in a HashMap that Nexus holds.
Lines 181 to 185 in 47a416f
    /// Mapping of SwitchLocations to their respective Dendrite Clients
    dpd_clients: HashMap<SwitchLocation, Arc<dpd_client::Client>>,
    /// Map switch location to maghemite admin clients.
    mg_clients: HashMap<SwitchLocation, Arc<mg_admin_client::Client>>,
The problem is that if (when) a customer swaps a scrimlet, this data is not updated: each client still points to the address of the old scrimlet. Since sleds keep their addresses with them, swapping a scrimlet with a non-scrimlet means the network configuration requests will now go to a non-scrimlet. Even wilder, if the two scrimlets are ever swapped with each other, the configuration for each switch will go to the wrong one. We could probably update the HashMap if we jumped through enough hoops, but I think we all want it to go away at this point.
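To make the failure mode concrete, here is a minimal sketch using only the standard library. `Rack`, `swap_scrimlets`, and `build_cache` are hypothetical names invented for illustration, the `SwitchLocation` variants are assumed, and plain addresses stand in for the real `dpd_client` / `mg_admin_client` clients. It is not how Nexus is actually structured.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

// Assumed shape of the location key; the variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

/// Hypothetical stand-in for the physical rack: the sled address that
/// currently sits behind each switch slot.
struct Rack {
    slots: HashMap<SwitchLocation, Ipv6Addr>,
}

impl Rack {
    /// Physically swap the two scrimlets. Sleds keep their addresses, so the
    /// addresses trade slots. Panics if either slot is missing from the map.
    fn swap_scrimlets(&mut self) {
        let a = self.slots[&SwitchLocation::Switch0];
        let b = self.slots[&SwitchLocation::Switch1];
        self.slots.insert(SwitchLocation::Switch0, b);
        self.slots.insert(SwitchLocation::Switch1, a);
    }
}

/// Snapshot taken once, analogous to the cached client maps: it is never
/// refreshed after the rack changes.
fn build_cache(rack: &Rack) -> HashMap<SwitchLocation, Ipv6Addr> {
    rack.slots.clone()
}
```

After `swap_scrimlets`, a cache built earlier still maps `Switch0` to the address that now sits behind `Switch1`, so a configuration intended for one switch would be sent to the other.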
Proposed Solutions
In the near term this can be mitigated in Nexus by:
- No longer using the clients stored in `HashMap<SwitchLocation, Client>` and instead querying `mgs` each time before sending any configurations. Since a majority of switch configurations are being moved to RPWs, this shouldn't add a lot of overhead.
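The near-term mitigation could look roughly like the sketch below: rebuild the location mapping on every use instead of holding it. The `SlotLookup` trait, `resolve_switch_zones`, and `FakeMgs` are hypothetical names made up for this example, and the `SwitchLocation` variants are assumed. The real `mgs` client API and the DNS lookup are not shown, and plain addresses again stand in for the clients.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

// Assumed shape of the location key; the variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

/// Hypothetical abstraction over "ask the mgs instance in this switch zone
/// which switch slot it is managing". Returns None if the address does not
/// answer as a switch zone.
trait SlotLookup {
    fn switch_slot(&self, zone_addr: Ipv6Addr) -> Option<SwitchLocation>;
}

/// Build a fresh location -> address mapping from the switch zone addresses
/// (assumed to come from DNS). Called before each configuration push, so
/// nothing is cached across hardware swaps.
fn resolve_switch_zones<L: SlotLookup>(
    lookup: &L,
    zone_addrs: &[Ipv6Addr],
) -> HashMap<SwitchLocation, Ipv6Addr> {
    zone_addrs
        .iter()
        .filter_map(|addr| lookup.switch_slot(*addr).map(|loc| (loc, *addr)))
        .collect()
}

/// In-memory test double standing in for mgs.
struct FakeMgs {
    slots: HashMap<Ipv6Addr, SwitchLocation>,
}

impl SlotLookup for FakeMgs {
    fn switch_slot(&self, zone_addr: Ipv6Addr) -> Option<SwitchLocation> {
        self.slots.get(&zone_addr).copied()
    }
}
```

Because the mapping is recomputed per call, a scrimlet swap is picked up on the next resolution, at the cost of one extra round of lookups per configuration push.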
In the long term this might be more elegantly solved by:
- Registering switch zone services in DNS with information that allows us to determine what rack and switch slot they are managing.