-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix channel disable/enable race condition #10277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix channel disable/enable race condition #10277
Conversation
4e1de15
to
72322e2
Compare
We make sure we do not disable a channel if the channel is active in the switch and the enable timer hasn't fired yet during reconnection.
72322e2
to
4c7572b
Compare
log.Debugf("Channel(%v) became active, canceling "+ | ||
"scheduled disable", outpoint) | ||
|
||
m.chanStates.markEnabled(outpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still getting through this PR, but want to make sure that we don't end up with inconsistent state here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eg: what if right after this check, the channel actually goes inactive? This is why we have the fallow period where we wait for the state to be consistent, before taking an action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative here would be to just continue, so not modify the state and wait for the channel to be active long enough to flip in the other direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we can do that as well, so the acutal enable timer flipped the channel correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gemini-code-assist what do you think about not changing the state here but rather let the enableRequest flip it? That however would create and additional channel_update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the question, @ziggie1984. This is a good point to consider regarding the state transitions and network traffic.
Let's weigh the two approaches:
-
Current PR's approach (calling
m.chanStates.markEnabled(outpoint)
):- Pros: This immediately corrects the internal state of the channel within
chanStates
frompending disabled
toenabled
as soon asIsChannelActive
returns true. This directly "cancels" the pending disable from this function's perspective, ensuring internal consistency. Given thedisableInactiveChannels
function's role, it's robust for it to explicitly manage the state when it detects a channel that should not be disabled. - Cons: As you noted, if
markEnabled
always triggers achannel_update
announcement, and theenableRequest
mechanism also sends one, there could be two updates for the same state change.
- Pros: This immediately corrects the internal state of the channel within
-
Alternative approach (just
continue
and letenableRequest
handle it):- Pros: This would avoid a potentially redundant
channel_update
if theenableRequest
is guaranteed to fire shortly after and handle the state transition and announcement. - Cons: The primary drawback here is that the channel's internal state within
chanStates
would remainpending disabled
for a potentially significant period (up to 19 minutes based on the defaultenableTimer
mentioned in the PR description), even thoughm.cfg.IsChannelActive(chanID)
indicates it's active. This temporary inconsistency could lead to other parts of the system making decisions based on an outdated internal state. Relying solely on theenableRequest
to flip the state introduces a delay that might not be acceptable for internal state accuracy.
- Pros: This would avoid a potentially redundant
My Recommendation:
I lean towards the current PR's approach of explicitly calling m.chanStates.markEnabled(outpoint)
. Ensuring immediate internal state consistency is generally more critical for system correctness and predictability than avoiding a potentially redundant network announcement. An extra channel_update
is typically less harmful than prolonged internal state inconsistency, especially when dealing with long timers like the 19-minute enableTimer
.
Ideally, the markEnabled
function itself should be idempotent and intelligent enough to only issue a channel_update
announcement if the publicly announced state of the channel genuinely changes. If the channel was already considered enabled
by the enableRequest
(which might have fired just before this check), then markEnabled
should be a no-op in terms of external announcements. If markEnabled
is not already designed this way, it might be worth reviewing its implementation to ensure it minimizes redundant announcements while maintaining internal state integrity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gemini-code-assist I need to correct you tho, there will be no channelUpdate msg created in case we update the state of this channel in the disable method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarification, @ziggie1984. That's a crucial detail that strengthens the argument for the current approach.
If m.chanStates.markEnabled(outpoint)
does not generate a channel_update
message when called within disableInactiveChannels
, then the primary 'con' I identified for this approach is removed. This means we get the benefit of immediate internal state consistency (marking the channel as enabled in chanStates
) without the overhead or potential redundancy of an extra network announcement.
This makes the current implementation even more robust, as it ensures the internal state accurately reflects the channel's active status without unnecessary external side effects. My recommendation to keep the explicit m.chanStates.markEnabled(outpoint)
stands firm, as it prioritizes internal data integrity, which is paramount for preventing subtle bugs and ensuring predictable system behavior.
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses the race condition where a channel could be disabled despite being active. The fix in chan_status_manager.go
is direct and correct: re-verifying the channel's active status before disabling it is the right approach. The new test case in chan_status_manager_test.go
accurately simulates the race condition and confirms the fix works as expected. The release note is also clear and helpful for users. I've only found a minor typo in the release notes.
worked for the regtest node:
|
33c7f05
to
380f79f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's this case caught by the itest?
lnd/itest/lnd_channel_graph_test.go
Lines 22 to 44 in 87aed73
// testUpdateChanStatus checks that calls to the UpdateChanStatus RPC update | |
// the channel graph as expected, and that channel state is properly updated | |
// in the presence of interleaved node disconnects / reconnects. | |
// | |
// NOTE: this test can be flaky as we are testing the chan-enable-timeout and | |
// chan-disable-timeout flags here. | |
// | |
// For chan-enable-timeout, setting this value too small will cause an enabled | |
// channel never be considered active by our channel status manager. Upon | |
// reconnection, our Brontide will send a request to enable this channel after | |
// the "chan-enable-timeout" has passed. The request is handled by the channel | |
// status manager, which will check the channel's eligibility to forward links | |
// by asking htlcswitch/link. Meanwhile, the htlcswitch/link won't mark the | |
// link as eligible unless it has finished its initialization, which takes some | |
// time. Thus, if the Brontide sends a request too early it will get a false | |
// report saying the channel link is not eligible because that link hasn't | |
// finished its initialization. | |
// | |
// For chan-disable-timeout, setting this value too small will cause an already | |
// enabled channel being marked as disabled. For instance, if some operations | |
// take more than 5 seconds to finish, the channel will be marked as disabled, | |
// thus a following operation will fail if it relies on the channel being | |
// enabled. |
Based on the description, I don't think you will have this issue if you have reasonable disable/enable timeout values, plus the channel will be re-enabled after chan-enable-timeout
. Re this case,
T=0ms: Channel disconnects
T=100ms: Marked as pending disabled → Disable timer starts (1000ms)
T=800ms: Channel reconnects → Enable timer starts (500ms) ❗ ---> we should cancel the disable timer before firing the enable timer
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED
I think the fix should be we cancel the disable timer before we start the enable timer.
Fixes #10259
The problem:
Timeline: Channel Disconnection and Reconnection Race Condition
The timing here are examples and configurable in LND, the defaults are
20min for disableTimer
19 for enableTimer