
Conversation

@ziggie1984 (Collaborator) commented Oct 8, 2025

Fixes #10259

The problem:

Timeline: Channel Disconnection and Reconnection Race Condition

Timeline: 0ms ────── 100ms ────── 800ms ────── 1100ms ────── 1300ms
          │          │           │            │             │
Channel:  [INACTIVE] → [PENDING] → [ACTIVE]   → [DISABLED]  → [ENABLED]
          │          │           │            │             │
Events:   Disconnect  Mark       Reconnect    Disable       Enable
          │          Pending     │            │             │
          │          │           │            │             │
          └──────────┼───────────┼────────────┼─────────────┘
                     │           │            │
Disable Timer:       └───────────┼────────────┘
                     (1000ms)    │
                                 │
Enable Timer:                    └────────────┼─────────────┘
                                 (500ms)     │
                                              │
                                              ❌ PROBLEM!


T=0ms:    Channel disconnects
T=100ms:  Marked as pending disabled → Disable timer starts (1000ms)
T=800ms:  Channel reconnects → Enable timer starts (500ms)
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED

The timings here are examples and are configurable in LND; the defaults are:

20 min for the disable timer (chan-disable-timeout)
19 min for the enable timer (chan-enable-timeout)
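
For reference, these are the same chan-disable-timeout and chan-enable-timeout options named in the itest comment quoted further down. A hedged lnd.conf sketch that spells out the defaults mentioned above (values taken from this description, not verified against lnd's sample config):

```
[Application Options]
; Wait this long after a channel goes inactive before announcing it disabled.
chan-disable-timeout=20m
; Wait this long after a channel becomes active again before announcing it enabled.
chan-enable-timeout=19m
```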

@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 4e1de15 to 72322e2 on October 8, 2025 08:02
@ziggie1984 ziggie1984 self-assigned this Oct 8, 2025
@ziggie1984 ziggie1984 marked this pull request as ready for review October 8, 2025 08:15
@ziggie1984 ziggie1984 added this to the v0.20.0 milestone Oct 8, 2025
We make sure we do not disable a channel if the channel is active
in the switch and the enable timer hasn't fired yet during
reconnection.
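
In other words, the disable pass re-checks the switch before acting on an elapsed timer. A minimal self-contained sketch of that shape (all types and helpers below are illustrative stand-ins rather than lnd's actual ChanStatusManager; only the IsChannelActive-style check and the enabled/pending-disabled bookkeeping mirror what this PR touches):

```go
package main

import "fmt"

// statusManager is a toy stand-in for lnd's channel status manager.
type statusManager struct {
	// isChannelActive mirrors the m.cfg.IsChannelActive check: it asks
	// the htlcswitch whether the channel's link is currently up.
	isChannelActive func(chanPoint string) bool
	pendingDisable  map[string]bool
	enabled         map[string]bool
}

// disableInactiveChannels disables every channel whose disable timer has
// elapsed, unless the channel became active again in the meantime (the
// race this PR fixes).
func (m *statusManager) disableInactiveChannels() {
	for chanPoint := range m.pendingDisable {
		delete(m.pendingDisable, chanPoint)

		if m.isChannelActive(chanPoint) {
			// The peer reconnected while the disable timer was
			// running: cancel the scheduled disable and keep the
			// channel enabled instead of announcing a disable.
			fmt.Printf("Channel(%s) became active, canceling scheduled disable\n", chanPoint)
			m.enabled[chanPoint] = true
			continue
		}

		// Still inactive: this is where the disabling channel_update
		// would be signed and broadcast.
		m.enabled[chanPoint] = false
		fmt.Printf("Channel(%s) disabled\n", chanPoint)
	}
}

func main() {
	m := &statusManager{
		// Simulate the race: the channel is already active again by
		// the time the disable pass runs.
		isChannelActive: func(string) bool { return true },
		pendingDisable:  map[string]bool{"txid:0": true},
		enabled:         map[string]bool{},
	}
	m.disableInactiveChannels()
}
```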
@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 72322e2 to 4c7572b on October 8, 2025 09:29
log.Debugf("Channel(%v) became active, canceling "+
"scheduled disable", outpoint)

m.chanStates.markEnabled(outpoint)
Member:

Still getting through this PR, but want to make sure that we don't end up with inconsistent state here.

Member:

E.g.: what if, right after this check, the channel actually goes inactive? This is why we have the fallow period, where we wait for the state to be consistent before taking an action.

Member:

An alternative here would be to just continue, i.e. not modify the state, and wait for the channel to be active long enough to flip in the other direction.

@ziggie1984 (Collaborator, Author) Oct 8, 2025:

Yes, we can do that as well, so that the actual enable timer flips the channel correctly.

@ziggie1984 (Collaborator, Author):

@gemini-code-assist what do you think about not changing the state here but rather letting the enableRequest flip it? That, however, would create an additional channel_update.

@gemini-code-assist (bot):

Thanks for the question, @ziggie1984. This is a good point to consider regarding the state transitions and network traffic.

Let's weigh the two approaches:

  1. Current PR's approach (calling m.chanStates.markEnabled(outpoint)):

    • Pros: This immediately corrects the internal state of the channel within chanStates from pending disabled to enabled as soon as IsChannelActive returns true. This directly "cancels" the pending disable from this function's perspective, ensuring internal consistency. Given the disableInactiveChannels function's role, it's robust for it to explicitly manage the state when it detects a channel that should not be disabled.
    • Cons: As you noted, if markEnabled always triggers a channel_update announcement, and the enableRequest mechanism also sends one, there could be two updates for the same state change.
  2. Alternative approach (just continue and let enableRequest handle it):

    • Pros: This would avoid a potentially redundant channel_update if the enableRequest is guaranteed to fire shortly after and handle the state transition and announcement.
    • Cons: The primary drawback here is that the channel's internal state within chanStates would remain pending disabled for a potentially significant period (up to 19 minutes based on the default enableTimer mentioned in the PR description), even though m.cfg.IsChannelActive(chanID) indicates it's active. This temporary inconsistency could lead to other parts of the system making decisions based on an outdated internal state. Relying solely on the enableRequest to flip the state introduces a delay that might not be acceptable for internal state accuracy.

My Recommendation:

I lean towards the current PR's approach of explicitly calling m.chanStates.markEnabled(outpoint). Ensuring immediate internal state consistency is generally more critical for system correctness and predictability than avoiding a potentially redundant network announcement. An extra channel_update is typically less harmful than prolonged internal state inconsistency, especially when dealing with long timers like the 19-minute enableTimer.

Ideally, the markEnabled function itself should be idempotent and intelligent enough to only issue a channel_update announcement if the publicly announced state of the channel genuinely changes. If the channel was already considered enabled by the enableRequest (which might have fired just before this check), then markEnabled should be a no-op in terms of external announcements. If markEnabled is not already designed this way, it might be worth reviewing its implementation to ensure it minimizes redundant announcements while maintaining internal state integrity.
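
That idempotency property can be sketched in a few lines; the type and constant names below are hypothetical stand-ins, not a claim about how lnd's markEnabled is actually implemented:

```go
package sketch

type chanStatus int

const (
	statusDisabled chanStatus = iota
	statusPendingDisabled
	statusEnabled
)

// chanStateMap is a toy stand-in for the manager's per-channel state table.
type chanStateMap struct {
	status map[string]chanStatus
}

// markEnabled records a channel as enabled. Written idempotently: if an
// enableRequest already flipped the state, calling this again from the
// disable pass changes nothing and triggers no further announcement.
func (s *chanStateMap) markEnabled(chanPoint string) {
	if s.status[chanPoint] == statusEnabled {
		return // already enabled, nothing to update
	}
	s.status[chanPoint] = statusEnabled
}
```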

@ziggie1984 (Collaborator, Author) Oct 8, 2025:

@gemini-code-assist I need to correct you though: there will be no channel_update message created when we update the state of this channel in the disable method.

@gemini-code-assist (bot):

Thank you for the clarification, @ziggie1984. That's a crucial detail that strengthens the argument for the current approach.

If m.chanStates.markEnabled(outpoint) does not generate a channel_update message when called within disableInactiveChannels, then the primary 'con' I identified for this approach is removed. This means we get the benefit of immediate internal state consistency (marking the channel as enabled in chanStates) without the overhead or potential redundancy of an extra network announcement.

This makes the current implementation even more robust, as it ensures the internal state accurately reflects the channel's active status without unnecessary external side effects. My recommendation to keep the explicit m.chanStates.markEnabled(outpoint) stands firm, as it prioritizes internal data integrity, which is paramount for preventing subtle bugs and ensuring predictable system behavior.

@ziggie1984 (Collaborator, Author):

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request effectively addresses the race condition where a channel could be disabled despite being active. The fix in chan_status_manager.go is direct and correct: re-verifying the channel's active status before disabling it is the right approach. The new test case in chan_status_manager_test.go accurately simulates the race condition and confirms the fix works as expected. The release note is also clear and helpful for users. I've only found a minor typo in the release notes.

@ziggie1984 (Collaborator, Author):

Worked for the regtest node:

.703 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:56:13.703 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #140)
2025-10-08 13:56:51.521 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:57:13.701 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:57:13.735 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:57:13.735 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #141)
2025-10-08 13:57:51.513 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:58:13.736 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:58:13.756 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:58:13.756 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #142)
2025-10-08 13:58:51.511 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:58:57.832 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Sending Ping(ping_bytes=00000030645c826a260c118ee37bf58fbe51cf98a33769c6130e0c6e8d36a4f6159c6e167a4dc1a6a9eeb03352169cde49ace3ce1fa9043dbd530f49a9a6e5aa3765d73a5943e668ffff7f2004000000) to 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.839 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Received Pong(len(pong_bytes)=981) from 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.937 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Received Ping(ping_bytes=00000030645c826a260c118ee37bf58fbe51cf98a33769c6130e0c6e8d36a4f6159c6e167a4dc1a6a9eeb03352169cde49ace3ce1fa9043dbd530f49a9a6e5aa3765d73a5943e668ffff7f2004000000) from 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.937 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Sending Pong(len(pong_bytes)=1161) to 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:59:13.758 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:59:13.838 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:59:13.838 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #143)
2025-10-08 13:59:51.522 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:59:57.829 [DBG] NANN: Channel(23a137da04ed053dad8aa397e2c2e1b127d8c781bab3fc4c1bb6052348110a98:0) already enabled, skipped announcement

@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 33c7f05 to 380f79f on October 8, 2025 17:06
@saubyk saubyk removed this from lnd v0.20 Oct 8, 2025
@yyforyongyu (Member) left a comment:

Looks like this is the case caught by the itest?

// testUpdateChanStatus checks that calls to the UpdateChanStatus RPC update
// the channel graph as expected, and that channel state is properly updated
// in the presence of interleaved node disconnects / reconnects.
//
// NOTE: this test can be flaky as we are testing the chan-enable-timeout and
// chan-disable-timeout flags here.
//
// For chan-enable-timeout, setting this value too small will cause an enabled
// channel never be considered active by our channel status manager. Upon
// reconnection, our Brontide will send a request to enable this channel after
// the "chan-enable-timeout" has passed. The request is handled by the channel
// status manager, which will check the channel's eligibility to forward links
// by asking htlcswitch/link. Meanwhile, the htlcswitch/link won't mark the
// link as eligible unless it has finished its initialization, which takes some
// time. Thus, if the Brontide sends a request too early it will get a false
// report saying the channel link is not eligible because that link hasn't
// finished its initialization.
//
// For chan-disable-timeout, setting this value too small will cause an already
// enabled channel being marked as disabled. For instance, if some operations
// take more than 5 seconds to finish, the channel will be marked as disabled,
// thus a following operation will fail if it relies on the channel being
// enabled.

Based on the description, I don't think you will have this issue if you have reasonable disable/enable timeout values, plus the channel will be re-enabled after chan-enable-timeout. Re this case,

T=0ms:    Channel disconnects
T=100ms:  Marked as pending disabled → Disable timer starts (1000ms)
T=800ms:  Channel reconnects → Enable timer starts (500ms) ❗ ---> we should cancel the disable timer before firing the enable timer
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED

I think the fix should be that we cancel the disable timer before we start the enable timer.
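
A sketch of that ordering, reusing the toy statusManager from the earlier sketch (and requiring an import of "time" there); this is illustrative of the suggestion, not lnd's actual request handling, and locking around the maps is omitted for brevity:

```go
// processEnableRequest sketches the suggested ordering: clear any pending
// disable first, and only then arm the enable timer, so a stale disable can
// never fire on a channel that has already reconnected.
func (m *statusManager) processEnableRequest(chanPoint string, enableTimeout time.Duration) {
	// Step 1: the peer is back, so a previously scheduled disable no
	// longer applies.
	delete(m.pendingDisable, chanPoint)

	// Step 2: start the enable timer; when it fires, re-enable the
	// channel only if it is still active.
	time.AfterFunc(enableTimeout, func() {
		if m.isChannelActive(chanPoint) {
			m.enabled[chanPoint] = true
		}
	})
}
```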

Successfully merging this pull request may close these issues.

[bug]: Most peers disabled channels ~15 minutes after restart