Add option to enable HA/CARP failover support to the os-netbird plugin#5067

Open
myah-mitchell wants to merge 10 commits intoopnsense:masterfrom
INDIGEX:master

Conversation

@myah-mitchell

This PR contains the changes required to fix the issue I reported in #5023.

The goal of this PR is to add automated support for CARP failover. When the OPNsense firewall is "MASTER", a CARP syshook runs netbird up, which connects the peer to the NetBird network. When an OPNsense firewall is "BACKUP", the same CARP syshook runs netbird down, which disconnects the peer from the NetBird network. This resolves the problems reported in the original issue and allows NetBird to work in an HA OPNsense environment.
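As a rough illustration of the state handling described above, here is a minimal shell sketch. It is not the PR's code: the function names are hypothetical, and it assumes the hook receives the CARP subsystem as its first argument and the state (MASTER, BACKUP, INIT) as its second.

```shell
#!/bin/sh
# Hypothetical CARP syshook sketch: NetBird up on MASTER, down on BACKUP.
# Assumption: the hook is called as: <script> <subsystem> <state>

# Command used to control NetBird; overridable so the logic can be dry-run.
NETBIRD_BIN="${NETBIRD_BIN:-/usr/local/bin/netbird}"

carp_netbird_action() {
    # $1 = CARP state; prints the NetBird subcommand to run.
    case "$1" in
        MASTER) echo "up" ;;
        BACKUP) echo "down" ;;
        *)      return 1 ;;   # INIT or unknown: leave NetBird untouched
    esac
}

handle_carp_event() {
    # $1 = subsystem (unused in this sketch), $2 = state
    action=$(carp_netbird_action "$2") || return 0
    "$NETBIRD_BIN" "$action"
}

# In a real hook the last line would be: handle_carp_event "$@"
```

Keeping the state-to-action mapping in its own function makes it easy to check without touching a live NetBird instance, for example by setting NETBIRD_BIN=echo.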

@bcmmbaga
Contributor

@fichtner can we get this reviewed?

@Monviech
Member

Monviech commented Feb 12, 2026

Hello, let me help here.

I recently did the same thing:
#5108

What's also important is not only the transition, but also guarding the service so it cannot start if the current host is not master; otherwise each HA sync will activate it again even after a transition. The simplest way for me was an rc script condition:

https://github.com/opnsense/ports/blob/bbb4f5e3ba959f762ba4e614d9d8e1880952c686/opnsense/ndp-proxy-go/files/ndp-proxy-go.in#L50-L53
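A guard of this kind might look roughly like the sketch below. It is not the linked rc script: the function names are hypothetical, and it assumes FreeBSD ifconfig prints one line such as "carp: MASTER vhid 1 advbase 1 advskew 0" per CARP VHID.

```shell
#!/bin/sh
# Sketch of a "refuse to start unless CARP master" guard.
# Assumption: ifconfig output contains lines of the form
#   carp: MASTER vhid 1 advbase 1 advskew 0

carp_all_master() {
    # Reads ifconfig output on stdin. Succeeds only if at least one
    # "carp:" line exists and every such line reports MASTER.
    awk '
        $1 == "carp:" { seen = 1; if ($2 != "MASTER") bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

netbird_start_allowed() {
    # Hypothetical helper an rc script could call before starting the
    # service, and only when the HA option is enabled.
    ${IFCONFIG_CMD:-/sbin/ifconfig} | carp_all_master
}
```

Note that this sketch treats a host with no CARP addresses as "not master", so it only makes sense behind the HA option this PR adds.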

It's also very important that stdout is not blocking any start or stop here, as we recently had a bug here:
2cc2215

The scripts run serialized, so if one script blocks, it blocks all other scripts during failover for 2-3 minutes.
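One common way to keep a hook from blocking is to detach the slow command from the caller's file descriptors and run it in the background. The helper below is a generic sketch with a hypothetical name, not the fix from the commit referenced above.

```shell
#!/bin/sh
# Sketch: start a slow command without holding up the caller.

run_detached() {
    # Point stdin, stdout and stderr away from the caller and background
    # the command, so the hook returns immediately and no pipe stays open.
    "$@" </dev/null >/dev/null 2>&1 &
}
```

A hook could then call, for example, run_detached /usr/local/bin/netbird up and return at once, leaving the other serialized scripts free to run.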

If that is all taken into consideration I will review this PR.

@Monviech Monviech self-assigned this Feb 12, 2026
@bcmmbaga
Contributor

Thanks for the feedback! We’ll review and apply your suggestions

@myah-mitchell
Author

I've tested the above (minus d56d16c) updates on 25.10. I don't currently have a test firewall or any installs on 26.1 to test the mwexec to mwexecfm change (d56d16c) but as far as I understand mwexecfm replaced mwexec in 26.1.

I ended up using a postcmd instead of a precmd as Netbird should be running on the secondary firewall, we are just ensuring that Netbird is in a down (netbird down) state. Let me know if there are any other concerns.

I'll see about getting some firewalls spun up on 26.1 soon if no one else has ones they can test this on.

@myah-mitchell
Author

After some more testing, I've determined that this still is not a full solution. Netbird up/down is actively creating/removing the wt0 interface. Without a "configctl filter reload", firewall rules applied to Netbird do not apply and traffic is blocked. I'll continue working on this next week.

@fichtner
Member

@myah-mitchell thanks for the update, keep us posted :)

@myah-mitchell
Author

The updated NetBird syshook now uses lock files to ensure the netbird up or netbird down is only run once per CARP state change. So, no matter if there is just one CARP address, or 100 CARP addresses this code only runs once when changing from MASTER to BACKUP or BACKUP to MASTER.
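The PR's actual lock handling is not shown in this thread. As a generic sketch of the idea, an atomic mkdir can let only the first of many near-simultaneous events do the work; LOCK_DIR and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: let only one of many simultaneous CARP events do the work.
# LOCK_DIR is a hypothetical path, not one taken from the PR.
LOCK_DIR="${LOCK_DIR:-/tmp/netbird_carp.lock}"

acquire_once() {
    # mkdir is atomic: exactly one caller creates the directory and wins.
    mkdir "$LOCK_DIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

run_once() {
    # Run "$@" only if this caller got the lock; others return at once.
    if acquire_once; then
        "$@"
        status=$?
        release_lock
        return "$status"
    fi
    return 0
}
```

In this sketch the lock is held only while the command runs, so it covers a burst of events only if the command outlasts the burst.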

This was needed specifically on the netbird up side as we also need to reload the packet filter after the wt0 interface is created. This requires waiting a second or two after calling netbird up before running configctl filter reload. In my testing if I let each CARP state change reload the filter, we ended up with multiple running at the same time and the filter ended up in a broken state until manually reloaded.

This means that the script does cause a blocking state for a few seconds (up to 10 seconds) once per group of CARP state changes. I did not think the alternative of something like the following was a better option.

                mwexecfb(
                    '/bin/sh -c "'
                    . '/usr/local/bin/netbird up;'
                    // \$ stops the outer shell from expanding the loop variable early
                    . ' i=0; while [ \$i -lt 10 ]; do'
                    . '   if [ -e /dev/wt0 ]; then'
                    . '     /usr/local/sbin/configctl filter reload;'
                    . '     exit 0;'
                    . '   fi;'
                    . '   sleep 1; i=\$((i+1));'
                    . ' done;'
                    . '"'
                );

@myah-mitchell
Author

I should note that commit ad8bee6 has been tested on my stack of OPNsense 25.10 units and will be rolled out to our test sites this afternoon/tomorrow. Commit d5e0b8f has not been tested as I have still not installed 26.1 anywhere yet.

@Monviech
Member

Monviech commented Feb 26, 2026

Is "netbird up" not idempotent? I would like to prevent any locking in the carp syshooks.

If it should only be called once for some reason here is a non blocking trampoline example that was recently used:

opnsense/core@5423b72

Though not needing that would be preferred.

@myah-mitchell
Author

myah-mitchell commented Feb 26, 2026

As far as I know, yes, it is idempotent. However, the first time netbird up is run, it also creates wt0. Until configctl filter reload is run, none of the firewall rules applied to wt0 take effect. When testing with something like "netbird up; sleep 3; configctl filter reload" the filter was not working correctly and would not pass new traffic until a reload of the filter. Limiting this reload to only once per CARP state change resolved the issue.

I'll take a look at what you linked.

@myah-mitchell
Author

If I'm following, your main concern is that the wait for wt0 and the filter reload happen within the carp syshook. So, move those items to their own script that the syshook can call? Is Python the preferred language to use in that case?

@Monviech
Member

Monviech commented Feb 27, 2026

Only the logic is important, less the language. You can also use sh or PHP, I don't mind.

And yes my main concern is any blocking in carp, please execute as fast as possible since if all other scripts are blocked people will come for us "why is my failover taking n seconds before my services are back up". :)

myah-mitchell and others added 3 commits February 27, 2026 11:26
@myah-mitchell
Author

Alright, I've moved the blocking code to be in a separate script so that the syshook process is non-blocking.
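A separate worker along these lines could be sketched as follows. This is not the script from the PR: the function and variable names are hypothetical, it assumes `ifconfig wt0` succeeds once the interface exists, and the limit of 10 one-second tries mirrors the loop quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch of a worker the syshook could launch in the background.
# Assumption: "ifconfig wt0" exits 0 once the NetBird interface exists.

IFACE_CHECK="${IFACE_CHECK:-/sbin/ifconfig wt0}"
FILTER_RELOAD="${FILTER_RELOAD:-/usr/local/sbin/configctl filter reload}"

wait_for_iface() {
    # $1 = maximum number of tries, $2 = seconds between tries
    tries=0
    while [ "$tries" -lt "$1" ]; do
        if $IFACE_CHECK >/dev/null 2>&1; then
            return 0
        fi
        sleep "$2"
        tries=$((tries + 1))
    done
    return 1
}

netbird_up_and_reload() {
    # Bring NetBird up, then reload the filter only once wt0 is present.
    ${NETBIRD_UP:-/usr/local/bin/netbird up} >/dev/null 2>&1
    if wait_for_iface 10 1; then
        $FILTER_RELOAD >/dev/null 2>&1
    fi
}
```

Because the waiting happens in the worker, the hook itself only has to start it in the background and exit.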

To prevent delays in NetBird starting after a CARP failover, I've configured the script to run with the first CARP MASTER event instead of the last. On small firewalls with just a few CARP addresses this will save a second or two, but on larger units (we have some firewalls with nearly 100 CARP addresses) this should save much more time resulting in a shorter down time.

The way I've set up the "debounce" is to trigger on the first event and then to not trigger again for any event that occurs within an updating 10 second time window of the last event. After 10 seconds have passed the next MASTER event will trigger the whole process again.
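The debounce described above can be sketched as a leading-edge trigger with a sliding window. The stamp file path and function name are hypothetical, and this version is not safe against two events racing on the stamp file, so it would need to be combined with some form of locking.

```shell
#!/bin/sh
# Sketch of a leading-edge debounce with a sliding window.
# STAMP_FILE is a hypothetical path; WINDOW matches the 10 seconds described.
STAMP_FILE="${STAMP_FILE:-/tmp/netbird_carp.last}"
WINDOW="${WINDOW:-10}"

should_trigger() {
    # $1 = current time in epoch seconds (defaults to now).
    now="${1:-$(date +%s)}"
    last=0
    if [ -r "$STAMP_FILE" ]; then
        read -r last < "$STAMP_FILE" || last=0
    fi
    # Every event refreshes the stamp, so the window keeps sliding.
    echo "$now" > "$STAMP_FILE"
    # Trigger only if the previous event is at least WINDOW seconds old.
    [ $((now - last)) -ge "$WINDOW" ]
}
```

With events at t=1000, 1005, 1012 and 1022, only the first and last trigger: the event at 1012 is 7 seconds after the previous one, so the window is extended rather than restarted from the first event.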

Let me know if you see any other issues with this or would like any other changes.

@Monviech
Member

Monviech commented Feb 27, 2026

Thanks for taking care of this, just tell us when you are finished with testing in your environment.

You have to determine how fragile this is (try to swap between master and backup a few times in a short timeframe), short flaps like this can happen.

If the service gets stuck in some way, then you still have something to fix; if not, then it's good (imho).



5 participants