7 changes: 7 additions & 0 deletions docs/release-notes/release-notes-0.20.0.md
@@ -46,6 +46,13 @@
sweeper where some outputs would not be resolved due to an error string
mismatch.

- [Fixed](https://github.com/lightningnetwork/lnd/pull/10277) a case in the
  status manager where we would disable a channel even though it was already
  active in the switch. When a channel reconnects, it starts an enable timer
  that may only fire after the disable timer has already fired, a race
  condition that caused us to unnecessarily mark a channel as disabled even
  though it was active all along.

# New Features

* Use persisted [nodeannouncement](https://github.com/lightningnetwork/lnd/pull/8825)
18 changes: 18 additions & 0 deletions netann/chan_status_manager.go
@@ -568,6 +568,24 @@ func (m *ChanStatusManager) disableInactiveChannels() {
continue
}

// Re-verify that the channel is still inactive before
// disabling it. This prevents the race condition where a
// channel becomes active again after being marked pending
// disabled but before the disable timeout expires, in which
// case we would disable a channel that is in fact active.
chanID := lnwire.NewChanIDFromOutPoint(outpoint)
if m.cfg.IsChannelActive(chanID) {
// Channel became active again, cancel the pending
// disable.
log.Debugf("Channel(%v) became active, canceling "+
"scheduled disable", outpoint)

m.chanStates.markEnabled(outpoint)
Member: Still getting through this PR, but want to make sure that we don't end up with inconsistent state here.

Member: E.g., what if right after this check the channel actually goes inactive? This is why we have the fallow period where we wait for the state to be consistent before taking an action.

Member: An alternative here would be to just continue, i.e. not modify the state and wait for the channel to be active long enough to flip in the other direction.
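The suggested alternative can be sketched as a toy Go version of the disable sweep (hypothetical names and types, not lnd's actual code): active channels are simply skipped, and their recorded state is left for the enable path to flip later.

```go
package main

import "fmt"

type status int

const (
	enabled status = iota
	pendingDisabled
	disabled
)

// disablePass is a toy version of the disable sweep under the
// "just continue" alternative: channels that turn out to be active
// are skipped and their pending-disabled state is left untouched.
func disablePass(states map[string]status, isActive func(string) bool) {
	for op, st := range states {
		if st != pendingDisabled {
			continue
		}
		if isActive(op) {
			continue // leave state alone; the enable timer flips it later
		}
		states[op] = disabled
	}
}

func main() {
	states := map[string]status{"chan-1": pendingDisabled}
	disablePass(states, func(string) bool { return true })
	// The active channel stays pendingDisabled: prints true.
	fmt.Println(states["chan-1"] == pendingDisabled)
}
```

The trade-off, as discussed below in the thread, is that the internal state stays pending-disabled until the enable timer fires.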

ziggie1984 (Collaborator, Author) on Oct 8, 2025: Yes, we can do that as well, so the actual enable timer flips the channel correctly.

ziggie1984 (Collaborator, Author): @gemini-code-assist what do you think about not changing the state here but rather letting the enableRequest flip it? That, however, would create an additional channel_update.

gemini-code-assist: Thanks for the question, @ziggie1984. This is a good point to consider regarding the state transitions and network traffic.

Let's weigh the two approaches:

  1. Current PR's approach (calling m.chanStates.markEnabled(outpoint)):

    • Pros: This immediately corrects the internal state of the channel within chanStates from pending disabled to enabled as soon as IsChannelActive returns true. This directly "cancels" the pending disable from this function's perspective, ensuring internal consistency. Given the disableInactiveChannels function's role, it's robust for it to explicitly manage the state when it detects a channel that should not be disabled.
    • Cons: As you noted, if markEnabled always triggers a channel_update announcement, and the enableRequest mechanism also sends one, there could be two updates for the same state change.
  2. Alternative approach (just continue and let enableRequest handle it):

    • Pros: This would avoid a potentially redundant channel_update if the enableRequest is guaranteed to fire shortly after and handle the state transition and announcement.
    • Cons: The primary drawback here is that the channel's internal state within chanStates would remain pending disabled for a potentially significant period (up to 19 minutes based on the default enableTimer mentioned in the PR description), even though m.cfg.IsChannelActive(chanID) indicates it's active. This temporary inconsistency could lead to other parts of the system making decisions based on an outdated internal state. Relying solely on the enableRequest to flip the state introduces a delay that might not be acceptable for internal state accuracy.

My Recommendation:

I lean towards the current PR's approach of explicitly calling m.chanStates.markEnabled(outpoint). Ensuring immediate internal state consistency is generally more critical for system correctness and predictability than avoiding a potentially redundant network announcement. An extra channel_update is typically less harmful than prolonged internal state inconsistency, especially when dealing with long timers like the 19-minute enableTimer.

Ideally, the markEnabled function itself should be idempotent and intelligent enough to only issue a channel_update announcement if the publicly announced state of the channel genuinely changes. If the channel was already considered enabled by the enableRequest (which might have fired just before this check), then markEnabled should be a no-op in terms of external announcements. If markEnabled is not already designed this way, it might be worth reviewing its implementation to ensure it minimizes redundant announcements while maintaining internal state integrity.
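A minimal sketch of what such an idempotent `markEnabled` could look like (toy types and names, not lnd's implementation): the return value tells the caller whether a genuine transition happened, so only real state changes would ever trigger an announcement.

```go
package main

import "fmt"

// chanStatus mirrors the three states discussed above; the names
// are illustrative, not lnd's actual constants.
type chanStatus int

const (
	statusEnabled chanStatus = iota
	statusPendingDisabled
	statusDisabled
)

// chanStates is a toy stand-in for the manager's internal state map.
type chanStates map[string]chanStatus

// markEnabled flips a channel to enabled and reports whether the
// recorded state actually changed. Because it is a no-op when the
// channel is already enabled, calling it redundantly (e.g. from the
// disable path) cannot produce a second announcement.
func (s chanStates) markEnabled(outpoint string) bool {
	if s[outpoint] == statusEnabled {
		return false // already enabled: no-op, nothing to announce
	}
	s[outpoint] = statusEnabled
	return true
}

func main() {
	states := chanStates{"chan-1": statusPendingDisabled}
	fmt.Println(states.markEnabled("chan-1")) // genuine transition
	fmt.Println(states.markEnabled("chan-1")) // idempotent no-op
}
```

Under this contract, the caller would only emit a channel_update when the function reports a real transition.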

ziggie1984 (Collaborator, Author) on Oct 8, 2025: @gemini-code-assist I need to correct you, though: there will be no channel_update message created in case we update the state of this channel in the disable method.

gemini-code-assist: Thank you for the clarification, @ziggie1984. That's a crucial detail that strengthens the argument for the current approach.

If m.chanStates.markEnabled(outpoint) does not generate a channel_update message when called within disableInactiveChannels, then the primary 'con' I identified for this approach is removed. This means we get the benefit of immediate internal state consistency (marking the channel as enabled in chanStates) without the overhead or potential redundancy of an extra network announcement.

This makes the current implementation even more robust, as it ensures the internal state accurately reflects the channel's active status without unnecessary external side effects. My recommendation to keep the explicit m.chanStates.markEnabled(outpoint) stands firm, as it prioritizes internal data integrity, which is paramount for preventing subtle bugs and ensuring predictable system behavior.


continue
}

log.Infof("Announcing channel(%v) disabled "+
"[detected]", outpoint)

52 changes: 52 additions & 0 deletions netann/chan_status_manager_test.go
@@ -909,6 +909,58 @@ var stateMachineTests = []stateMachineTest{
)
},
},
{
name: "channel reconnects before disable",
startActive: true,
startEnabled: true,
fn: func(h testHarness) {
// This test demonstrates the race condition fix
// where a channel becomes inactive, gets marked as
// pending disabled, then becomes active again before
// the disable timeout expires.
// The channel should NOT be disabled in this case.

// Step 1: Simulate disconnection - channel becomes
// inactive.
h.markInactive(h.graph.chans())

// Step 2: Wait for the channel to be marked as
// pending disabled. Sample interval of the manager
// is 50ms, so we wait for 50ms + 50ms (buffer).
time.Sleep(50*time.Millisecond + 50*time.Millisecond)

// Step 3: Simulate reconnection - channel becomes
// active again.
//
// NOTE: This does not reflect the actual behavior of
// LND, because as soon as the channel becomes active
// it will start an enable timer and send an enable
// update. However, we want to avoid testing those
// timings here. What matters is that the channel does
// not get disabled if it reconnects before the disable
// timeout expires.
h.markActive(h.graph.chans())

// Step 4: Wait for the disable timeout to expire.
// The disable timeout (1 second) expires, but our fix
// should prevent the disable because the channel is
// active again.
time.Sleep(time.Second + 200*time.Millisecond)

// Step 5: Verify that NO disable update was sent.
// The channel should remain enabled because it became
// active again before the disable timeout expired,
// and our fix re-checked its status before disabling.
h.assertNoUpdates(500 * time.Millisecond)

// Step 6: Verify that the channel is still enabled by
// checking that we can still request enable without
// sending an update. This means the channel is still
// enabled.
h.assertEnables(h.graph.chans(), nil, false)
h.assertNoUpdates(500 * time.Millisecond)
},
},
}

// TestChanStatusManagerStateMachine tests the possible state transitions that