
Conversation

@ziggie1984 (Collaborator) commented Oct 8, 2025

Fixes #10259

The problem:

Timeline: Channel Disconnection and Reconnection Race Condition

Timeline: 0ms ────── 100ms ────── 800ms ────── 1100ms ────── 1300ms
          │          │           │            │             │
Channel:  [INACTIVE] → [PENDING] → [ACTIVE]   → [DISABLED]  → [ENABLED]
          │          │           │            │             │
Events:   Disconnect  Mark       Reconnect    Disable       Enable
          │          Pending     │            │             │
          │          │           │            │             │
          └──────────┼───────────┼────────────┼─────────────┘
                     │           │            │
Disable Timer:       └───────────┼────────────┘
                     (1000ms)    │
                                 │
Enable Timer:                    └────────────┼─────────────┘
                                 (500ms)     │
                                              │
                                              ❌ PROBLEM!


T=0ms:    Channel disconnects
T=100ms:  Marked as pending disabled → Disable timer starts (1000ms)
T=800ms:  Channel reconnects → Enable timer starts (500ms)
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED

The timings here are examples and are configurable in LND; the defaults are:

20 min for the disable timer (chan-disable-timeout)
19 min for the enable timer (chan-enable-timeout)
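
For reference, these are the same chan-disable-timeout and chan-enable-timeout options named in the itest comment quoted further down. A hedged lnd.conf sketch that spells out the defaults mentioned above (values taken from this description, not verified against lnd's sample config):

```
[Application Options]
; Wait this long after a channel goes inactive before announcing it disabled.
chan-disable-timeout=20m
; Wait this long after a channel becomes active again before announcing it enabled.
chan-enable-timeout=19m
```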

@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 4e1de15 to 72322e2 on October 8, 2025 08:02
@ziggie1984 ziggie1984 self-assigned this Oct 8, 2025
@ziggie1984 ziggie1984 marked this pull request as ready for review October 8, 2025 08:15
@ziggie1984 ziggie1984 added this to the v0.20.0 milestone Oct 8, 2025
We make sure we do not disable a channel if the channel is active
in the switch and the enable timer hasn't fired yet during
reconnection.
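
In other words, the disable pass re-checks the switch before acting on an elapsed timer. A minimal self-contained sketch of that shape (all types and helpers below are illustrative stand-ins rather than lnd's actual ChanStatusManager; only the IsChannelActive-style check and the enabled/pending-disabled bookkeeping mirror what this PR touches):

```go
package main

import "fmt"

// statusManager is a toy stand-in for lnd's channel status manager.
type statusManager struct {
	// isChannelActive mirrors the m.cfg.IsChannelActive check: it asks
	// the htlcswitch whether the channel's link is currently up.
	isChannelActive func(chanPoint string) bool
	pendingDisable  map[string]bool
	enabled         map[string]bool
}

// disableInactiveChannels disables every channel whose disable timer has
// elapsed, unless the channel became active again in the meantime (the
// race this PR fixes).
func (m *statusManager) disableInactiveChannels() {
	for chanPoint := range m.pendingDisable {
		delete(m.pendingDisable, chanPoint)

		if m.isChannelActive(chanPoint) {
			// The peer reconnected while the disable timer was
			// running: cancel the scheduled disable and keep the
			// channel enabled instead of announcing a disable.
			fmt.Printf("Channel(%s) became active, canceling scheduled disable\n", chanPoint)
			m.enabled[chanPoint] = true
			continue
		}

		// Still inactive: this is where the disabling channel_update
		// would be signed and broadcast.
		m.enabled[chanPoint] = false
		fmt.Printf("Channel(%s) disabled\n", chanPoint)
	}
}

func main() {
	m := &statusManager{
		// Simulate the race: the channel is already active again by
		// the time the disable pass runs.
		isChannelActive: func(string) bool { return true },
		pendingDisable:  map[string]bool{"txid:0": true},
		enabled:         map[string]bool{},
	}
	m.disableInactiveChannels()
}
```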
@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 72322e2 to 4c7572b on October 8, 2025 09:29
log.Debugf("Channel(%v) became active, canceling "+
"scheduled disable", outpoint)

m.chanStates.markEnabled(outpoint)
Member:

Still getting through this PR, but want to make sure that we don't end up with inconsistent state here.

Member:

E.g.: what if, right after this check, the channel actually goes inactive? This is why we have the fallow period, where we wait for the state to be consistent before taking an action.

Member:

An alternative here would be to just continue, i.e. not modify the state, and wait for the channel to be active long enough to flip in the other direction.

@ziggie1984 (Collaborator, Author) Oct 8, 2025:

Yes, we can do that as well, so that the actual enable timer flips the channel correctly.

@ziggie1984 (Collaborator, Author):

@gemini-code-assist what do you think about not changing the state here but rather letting the enableRequest flip it? That, however, would create an additional channel_update.

@gemini-code-assist (bot):

Thanks for the question, @ziggie1984. This is a good point to consider regarding the state transitions and network traffic.

Let's weigh the two approaches:

  1. Current PR's approach (calling m.chanStates.markEnabled(outpoint)):

    • Pros: This immediately corrects the internal state of the channel within chanStates from pending disabled to enabled as soon as IsChannelActive returns true. This directly "cancels" the pending disable from this function's perspective, ensuring internal consistency. Given the disableInactiveChannels function's role, it's robust for it to explicitly manage the state when it detects a channel that should not be disabled.
    • Cons: As you noted, if markEnabled always triggers a channel_update announcement, and the enableRequest mechanism also sends one, there could be two updates for the same state change.
  2. Alternative approach (just continue and let enableRequest handle it):

    • Pros: This would avoid a potentially redundant channel_update if the enableRequest is guaranteed to fire shortly after and handle the state transition and announcement.
    • Cons: The primary drawback here is that the channel's internal state within chanStates would remain pending disabled for a potentially significant period (up to 19 minutes based on the default enableTimer mentioned in the PR description), even though m.cfg.IsChannelActive(chanID) indicates it's active. This temporary inconsistency could lead to other parts of the system making decisions based on an outdated internal state. Relying solely on the enableRequest to flip the state introduces a delay that might not be acceptable for internal state accuracy.

My Recommendation:

I lean towards the current PR's approach of explicitly calling m.chanStates.markEnabled(outpoint). Ensuring immediate internal state consistency is generally more critical for system correctness and predictability than avoiding a potentially redundant network announcement. An extra channel_update is typically less harmful than prolonged internal state inconsistency, especially when dealing with long timers like the 19-minute enableTimer.

Ideally, the markEnabled function itself should be idempotent and intelligent enough to only issue a channel_update announcement if the publicly announced state of the channel genuinely changes. If the channel was already considered enabled by the enableRequest (which might have fired just before this check), then markEnabled should be a no-op in terms of external announcements. If markEnabled is not already designed this way, it might be worth reviewing its implementation to ensure it minimizes redundant announcements while maintaining internal state integrity.
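
That idempotency property can be sketched in a few lines; the type and constant names below are hypothetical stand-ins, not a claim about how lnd's markEnabled is actually implemented:

```go
package sketch

type chanStatus int

const (
	statusDisabled chanStatus = iota
	statusPendingDisabled
	statusEnabled
)

// chanStateMap is a toy stand-in for the manager's per-channel state table.
type chanStateMap struct {
	status map[string]chanStatus
}

// markEnabled records a channel as enabled. Written idempotently: if an
// enableRequest already flipped the state, calling this again from the
// disable pass changes nothing and triggers no further announcement.
func (s *chanStateMap) markEnabled(chanPoint string) {
	if s.status[chanPoint] == statusEnabled {
		return // already enabled, nothing to update
	}
	s.status[chanPoint] = statusEnabled
}
```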

@ziggie1984 (Collaborator, Author) Oct 8, 2025:

@gemini-code-assist I need to correct you though: there will be no channel_update message created when we update the state of this channel in the disable method.

@gemini-code-assist (bot):

Thank you for the clarification, @ziggie1984. That's a crucial detail that strengthens the argument for the current approach.

If m.chanStates.markEnabled(outpoint) does not generate a channel_update message when called within disableInactiveChannels, then the primary 'con' I identified for this approach is removed. This means we get the benefit of immediate internal state consistency (marking the channel as enabled in chanStates) without the overhead or potential redundancy of an extra network announcement.

This makes the current implementation even more robust, as it ensures the internal state accurately reflects the channel's active status without unnecessary external side effects. My recommendation to keep the explicit m.chanStates.markEnabled(outpoint) stands firm, as it prioritizes internal data integrity, which is paramount for preventing subtle bugs and ensuring predictable system behavior.

@ziggie1984 (Collaborator, Author):

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request effectively addresses the race condition where a channel could be disabled despite being active. The fix in chan_status_manager.go is direct and correct: re-verifying the channel's active status before disabling it is the right approach. The new test case in chan_status_manager_test.go accurately simulates the race condition and confirms the fix works as expected. The release note is also clear and helpful for users. I've only found a minor typo in the release notes.

@ziggie1984 (Collaborator, Author):

Worked for the regtest node:

.703 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:56:13.703 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #140)
2025-10-08 13:56:51.521 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:57:13.701 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:57:13.735 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:57:13.735 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #141)
2025-10-08 13:57:51.513 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:58:13.736 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:58:13.756 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:58:13.756 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #142)
2025-10-08 13:58:51.511 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:58:57.832 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Sending Ping(ping_bytes=00000030645c826a260c118ee37bf58fbe51cf98a33769c6130e0c6e8d36a4f6159c6e167a4dc1a6a9eeb03352169cde49ace3ce1fa9043dbd530f49a9a6e5aa3765d73a5943e668ffff7f2004000000) to 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.839 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Received Pong(len(pong_bytes)=981) from 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.937 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Received Ping(ping_bytes=00000030645c826a260c118ee37bf58fbe51cf98a33769c6130e0c6e8d36a4f6159c6e167a4dc1a6a9eeb03352169cde49ace3ce1fa9043dbd530f49a9a6e5aa3765d73a5943e668ffff7f2004000000) from 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.937 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Sending Pong(len(pong_bytes)=1161) to 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:59:13.758 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:59:13.838 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:59:13.838 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #143)
2025-10-08 13:59:51.522 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:59:57.829 [DBG] NANN: Channel(23a137da04ed053dad8aa397e2c2e1b127d8c781bab3fc4c1bb6052348110a98:0) already enabled, skipped announcement

@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 33c7f05 to 380f79f on October 8, 2025 17:06
@saubyk saubyk removed this from lnd v0.20 Oct 8, 2025
@yyforyongyu (Member) left a comment:

Looks like this is the case caught by the itest?

// testUpdateChanStatus checks that calls to the UpdateChanStatus RPC update
// the channel graph as expected, and that channel state is properly updated
// in the presence of interleaved node disconnects / reconnects.
//
// NOTE: this test can be flaky as we are testing the chan-enable-timeout and
// chan-disable-timeout flags here.
//
// For chan-enable-timeout, setting this value too small will cause an enabled
// channel never be considered active by our channel status manager. Upon
// reconnection, our Brontide will send a request to enable this channel after
// the "chan-enable-timeout" has passed. The request is handled by the channel
// status manager, which will check the channel's eligibility to forward links
// by asking htlcswitch/link. Meanwhile, the htlcswitch/link won't mark the
// link as eligible unless it has finished its initialization, which takes some
// time. Thus, if the Brontide sends a request too early it will get a false
// report saying the channel link is not eligible because that link hasn't
// finished its initialization.
//
// For chan-disable-timeout, setting this value too small will cause an already
// enabled channel being marked as disabled. For instance, if some operations
// take more than 5 seconds to finish, the channel will be marked as disabled,
// thus a following operation will fail if it relies on the channel being
// enabled.

Based on the description, I don't think you will have this issue if you have reasonable disable/enable timeout values, plus the channel will be re-enabled after chan-enable-timeout. Re this case,

T=0ms:    Channel disconnects
T=100ms:  Marked as pending disabled → Disable timer starts (1000ms)
T=800ms:  Channel reconnects → Enable timer starts (500ms) ❗ ---> we should cancel the disable timer before firing the enable timer
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED

I think the fix should be that we cancel the disable timer before we start the enable timer.
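
A sketch of that ordering, reusing the toy statusManager from the earlier sketch (and requiring an import of "time" there); this is illustrative of the suggestion, not lnd's actual request handling, and locking around the maps is omitted for brevity:

```go
// processEnableRequest sketches the suggested ordering: clear any pending
// disable first, and only then arm the enable timer, so a stale disable can
// never fire on a channel that has already reconnected.
func (m *statusManager) processEnableRequest(chanPoint string, enableTimeout time.Duration) {
	// Step 1: the peer is back, so a previously scheduled disable no
	// longer applies.
	delete(m.pendingDisable, chanPoint)

	// Step 2: start the enable timer; when it fires, re-enable the
	// channel only if it is still active.
	time.AfterFunc(enableTimeout, func() {
		if m.isChannelActive(chanPoint) {
			m.enabled[chanPoint] = true
		}
	})
}
```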

Successfully merging this pull request may close these issues.

[bug]: Most peers disabled channels ~15 minutes after restart