Add option to enable HA/CARP failover support to the os-netbird plugin#5067
myah-mitchell wants to merge 10 commits into opnsense:master
|
@fichtner can we get this reviewed? |
|
Hello, let me help here. I recently did the same thing: What's also important is not only the transition but also guarding the service so that it cannot start if the current host is not master; otherwise each HA sync will activate it again, even after a transition. The simplest way for me was an rc script condition: It's also very important that stdout does not block any start or stop here, as we recently had a bug here: The scripts run serialized, so if one script blocks, it blocks all other scripts during failover for 2-3 minutes. If all of that is taken into consideration, I will review this PR. |
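For illustration only, a guard of the kind described above might look like the following sketch. The function name `is_carp_master` is hypothetical and is not the rc condition referenced in the comment; it assumes FreeBSD-style `ifconfig` output with lines such as `carp: MASTER vhid 1 ...`, and it reads that output on stdin so it can be tried without a live firewall.

```shell
#!/bin/sh
# Hypothetical guard (an assumption, not the plugin's actual rc condition):
# succeed only if at least one CARP VHID is MASTER and none is BACKUP or INIT.
# Expects `ifconfig` output on stdin.
is_carp_master() {
    awk '
        /carp: MASTER/          { master = 1 }
        /carp: (BACKUP|INIT)/   { other = 1 }
        END { exit (master && !other) ? 0 : 1 }
    '
}

# Possible use inside a start routine:
#   ifconfig | is_carp_master || exit 0   # not master: skip the start
```

Treating a mixed MASTER/BACKUP host as "not master" is a design choice of this sketch; a real deployment may prefer a different rule.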
|
Thanks for the feedback! We’ll review and apply your suggestions |
…sync. Redirected stdout on carp hook to prevent any potential blocking.
|
I've tested the above (minus d56d16c) updates on 25.10. I don't currently have a test firewall or any installs on 26.1 to test the mwexec to mwexecfm change (d56d16c) but as far as I understand mwexecfm replaced mwexec in 26.1. I ended up using a postcmd instead of a precmd as Netbird should be running on the secondary firewall, we are just ensuring that Netbird is in a down ( I'll see about getting some firewalls spun up on 26.1 soon if no one else has ones they can test this on. |
|
After some more testing, I've determined that this still is not a full solution. Netbird up/down is actively creating/removing the wt0 interface. Without a "configctl filter reload", firewall rules applied to Netbird do not apply and traffic is blocked. I'll continue working on this next week. |
|
@myah-mitchell thanks for the update, keep us posted :) |
|
The updated NetBird syshook now uses lock files to ensure the This was needed specifically on the This means that the script does cause a blocking state for a few seconds (up to 10 seconds) once per group of CARP state changes. I did not think the alternative of something like the following was a better option.
mwexecfb(
'/bin/sh -c "'
. '/usr/local/bin/netbird up;'
. ' i=0; while [ $i -lt 10 ]; do'
. ' if [ -e /dev/wt0 ]; then'
. ' /usr/local/sbin/configctl filter reload;'
. ' exit 0;'
. ' fi;'
. ' sleep 1; i=$((i+1));'
. ' done;'
. '"'
); |
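As a rough sketch, the polling loop embedded in the PHP string above could also be written as a standalone sh helper. The function name `wait_for_path` and its parameters are hypothetical; only the `/dev/wt0` path and the `configctl filter reload` command come from the snippet above.

```shell
#!/bin/sh
# Hypothetical helper: poll until a path exists or the attempts run out.
#   $1 = path to wait for, $2 = max attempts (default 10),
#   $3 = seconds to sleep between attempts (default 1)
wait_for_path() {
    path=$1
    tries=${2:-10}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if [ -e "$path" ]; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Possible use, mirroring the loop above:
#   wait_for_path /dev/wt0 10 1 && /usr/local/sbin/configctl filter reload
```

The helper still blocks for up to tries × delay seconds, so it would only be suitable when run outside the serialized syshook path.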
|
Is "netbird up" not idempotent? I would like to prevent any locking in the carp syshooks. If it should only be called once for some reason, here is a non-blocking trampoline example that was recently used: Though not needing that would be preferred. |
|
As far as I know, yes, it is idempotent. However, the first time I'll take a look at what you linked. |
|
If I'm following, your main concern is the waiting for wt0 and then the filter reload being within the carp syshook. So, move those items to their own script that the syshook can call? Is python the preferred language to use in that case? |
|
Only the logic is important, less the language. You can also use sh or PHP; I don't mind. And yes, my main concern is any blocking in carp. Please execute as fast as possible, since if all other scripts are blocked people will come for us: "why is my failover taking n seconds before my services are back up". :) |
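One way to keep a hook from blocking, sketched here under assumptions, is to start the slow work in the background with all standard streams detached so the caller never waits on a pipe. The function name `run_detached` is hypothetical and is not taken from the PR.

```shell
#!/bin/sh
# Hypothetical launcher: run a command in the background with stdin, stdout
# and stderr detached, so the calling hook returns immediately.
run_detached() {
    "$@" </dev/null >/dev/null 2>&1 &
}

# Possible use from a syshook (helper path is a placeholder):
#   run_detached /path/to/helper-script
```

Because output is discarded, any diagnostics the helper needs would have to go to a log file or syslog instead.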
…cript folder paths. This change mirrors a commit in OPNsense/ports.
…e background so syshook runs non-blocking.
|
Alright, I've moved the blocking code into a separate script so that the syshook process is non-blocking. To prevent delays in NetBird starting after a CARP failover, I've configured the script to run on the first CARP MASTER event instead of the last. On small firewalls with just a few CARP addresses this will save a second or two, but on larger units (we have some firewalls with nearly 100 CARP addresses) this should save much more time, resulting in shorter downtime. The way I've set up the "debounce" is to trigger on the first event and then not trigger again for any event that occurs within 10 seconds of the previous event (the window resets with every event). After 10 seconds without events, the next MASTER event will trigger the whole process again. Let me know if you see any other issues with this or would like any other changes. |
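The debounce described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the function name `debounce_should_run`, the timestamp file, and passing the current time as an argument are all hypothetical, and the sketch does no locking, so concurrent hooks could race.

```shell
#!/bin/sh
# Hypothetical leading-edge debounce with a sliding quiet window.
#   $1 = timestamp file, $2 = current time in epoch seconds,
#   $3 = window in seconds (default 10)
# Returns 0 when the caller should run, 1 when the event is suppressed.
debounce_should_run() {
    stamp_file=$1
    now=$2
    window=${3:-10}
    last=0
    if [ -s "$stamp_file" ]; then
        read -r last < "$stamp_file" || last=0
    fi
    # Every event refreshes the stamp, so the quiet window slides forward.
    echo "$now" > "$stamp_file"
    [ $((now - last)) -ge "$window" ]
}

# Possible use (stamp path is a placeholder):
#   debounce_should_run /path/to/stamp "$(date +%s)" 10 && start_helper
```

With a 10-second window, events at t=1000, 1005 and 1012 trigger only once (at 1000), and an event at 1022 triggers again because 10 quiet seconds have passed since 1012.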
|
Thanks for taking care of this. Just tell us when you are finished with testing in your environment. You have to determine how fragile this is (try to swap between master and backup a few times in a short timeframe); short flaps like this can happen. If the service gets stuck in some way then you still have something to fix; if not, then it's good (imho)
This PR has the required changes to fix the issue I reported in #5023.
The goal of this PR is to add automated support for CARP failover. When the OPNsense firewall is "MASTER", a carp syshook will "netbird up" the Netbird interface, causing the peer to connect to the Netbird network. When an OPNsense firewall is "BACKUP", the same carp syshook will "netbird down" the Netbird interface, causing the peer to disconnect from the Netbird network. This resolves the issues reported in the original issue and allows Netbird to work in an HA OPNsense environment.
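The state-to-action mapping described above might be sketched like this. The function name `carp_netbird_action` is hypothetical, the assumption that the hook receives the CARP state as its second argument is not confirmed by this page, and ignoring states other than MASTER and BACKUP is a choice of the sketch rather than the PR's documented behavior.

```shell
#!/bin/sh
# Hypothetical mapping from a CARP state to a netbird subcommand.
# Prints "up" or "down"; returns 1 for any other state (for example INIT).
carp_netbird_action() {
    case "$1" in
        MASTER) echo up ;;
        BACKUP) echo down ;;
        *) return 1 ;;
    esac
}

# Possible use in a hook, assuming the state arrives as "$2":
#   action=$(carp_netbird_action "$2") && \
#       /usr/local/bin/netbird "$action" >/dev/null 2>&1
```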