
Implement permanent node failure handling #1649


Merged

Conversation


dunxen (Contributor) commented Aug 4, 2022

We never had the `NetworkGraph::node_failed` method implemented. The scorer now handles non-permanent failures by downgrading nodes, so we don't want the non-permanent case handled in the graph.

The method is renamed to `node_failed_permanent` to explicitly indicate that this is the only case it handles.
There is a test case for non-permanent failure, which asserts that the graph's local storage does not change, and one for permanent failure, which checks that the node and its associated channels were removed.

Fixes #1644
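
In outline, the permanent-failure test looks something like this (a sketch only; `build_test_graph` and `seeded_node_id` are hypothetical stand-ins for the actual test utilities in `gossip.rs`):

```rust
// Hypothetical shape of the permanent-failure test; only
// node_failed_permanent() comes from this PR, the helpers are assumed.
#[test]
fn permanent_node_failure_removes_node_and_channels() {
    let network_graph = build_test_graph(); // graph pre-seeded with a node and its channels
    let node_id = seeded_node_id();

    network_graph.node_failed_permanent(&node_id);

    // The failed node and every channel it participated in should be gone
    // from local storage; a non-permanent failure would leave the graph
    // unchanged.
    let read_only = network_graph.read_only();
    assert!(read_only.nodes().get(&node_id).is_none());
}
```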


dunxen commented Aug 4, 2022

Oof, will fix up in a bit.

jkczyz self-requested a review August 4, 2022 14:11
TheBlueMatt (Collaborator)

One general issue - what will happen in practice here is we'll remove a node and then immediately re-add it when we resync gossip. Now, it's probably not actually a big deal because we never actually get perm-node-failures, but we should implement something to track "I removed node id X" and just remember that for like a week or two. We should also do the same for channels, which we currently don't.


dunxen commented Aug 7, 2022

> we should implement something to track "I removed node id X" and just remember that for like a week or two

Thanks for the insight here. Do we by chance track anything else in a similar way? I'm not sure what would be the most appropriate way to implement this.

TheBlueMatt (Collaborator)

Hmm, I don't think so. We should be able to track it just as a HashMap here, from the node_id/SCID to the time we decided to remove it. Then when we do expiration passes in the network graph we can do a similar pass on the map here and remove anything that was added more than a week ago.
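
A rough sketch of what that could look like (field and constant names here are illustrative, not the final API):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

/// Roughly two weeks, per the suggestion above (illustrative constant).
const REMOVAL_TRACKING_LIMIT_SECS: u64 = 60 * 60 * 24 * 14;

struct RemovalTracking {
    /// SCID (or node_id) -> when we decided to remove the entry.
    removed_channels: Mutex<HashMap<u64, Instant>>,
}

impl RemovalTracking {
    /// Run alongside the existing expiration passes over the network graph.
    fn expire_stale_entries(&self) {
        let now = Instant::now();
        self.removed_channels.lock().unwrap().retain(|_, removed_at| {
            now.duration_since(*removed_at).as_secs() < REMOVAL_TRACKING_LIMIT_SECS
        });
    }
}
```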

dunxen marked this pull request as draft August 9, 2022 20:01

dunxen commented Aug 15, 2022

Hmm. So I'm quite stumped by the "lock order violation" in the test. I've narrowed it down to `network_graph.channels.write()` in `node_failed_permanent()`, but can't see anything obvious here. I don't see any specific scoping issues, at least.

I'll get started on the expirations after the lock issue is resolved.

dunxen force-pushed the 2022-08-implementnodefailure branch from a01a791 to 3691cfa on August 16, 2022 08:50

dunxen commented Aug 16, 2022

> Hmm, I don't think so. We should be able to track it just as a HashMap here, from the node_id/SCID to the time we decided to remove it. Then when we do expiration passes in the network graph we can do a similar pass on the map here and remove anything that was added more than a week ago.

So I'm currently thinking of storing the HashMap in `NetworkGraph` behind a mutex, and then, since time is involved, I'm basically going to create extra no-std versions of the methods that take in UNIX time, like some of the other methods already do (unless that's unnecessary). `NetworkGraph::channel_failed` would also need one.

Then, of course, we'd ignore those nodes/channels for a while when resyncing gossip, so they're not re-added.

Is that somewhat what you had imagined?
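
Concretely, the split being proposed looks something like this (a sketch under the assumptions above; names and exact signatures are not the final API):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

pub struct NetworkGraph {
    /// SCID -> UNIX time (secs) at which we removed the channel.
    removed_channels: Mutex<HashMap<u64, u64>>,
}

impl NetworkGraph {
    /// std-only convenience wrapper that reads the wall clock itself.
    #[cfg(feature = "std")]
    pub fn channel_failed(&self, short_channel_id: u64, is_permanent: bool) {
        let now = std::time::SystemTime::now()
            .duration_since(std::time::SystemTime::UNIX_EPOCH)
            .expect("time went backwards")
            .as_secs();
        self.channel_failed_with_time(short_channel_id, is_permanent, now);
    }

    /// no-std variant: the caller supplies the current UNIX time.
    pub fn channel_failed_with_time(&self, short_channel_id: u64, is_permanent: bool, current_time_unix: u64) {
        if is_permanent {
            // ...remove the channel from local storage, then remember the
            // removal time so gossip resync ignores this SCID for a while...
            self.removed_channels.lock().unwrap().insert(short_channel_id, current_time_unix);
        }
    }
}
```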

TheBlueMatt (Collaborator)

Yea, that all sounds basically good. I wonder if we should just disable this behavior entirely for no-std - we don't currently have a decent way to measure the passing of time in the network graph, and while we could receive block header updates there to do it like we do elsewhere, I'm a bit hesitant to bother.


dunxen commented Aug 18, 2022

> I wonder if we should just disable this behaviour entirely for no-std - we don't currently have a decent way to measure the passing of time in the network graph, and while we could receive block header updates there to do it like we do elsewhere, I'm a bit hesitant to bother.

I guess it's also not super likely that path finding would be something that happens on (embedded) devices that use no-std. At least to me, it seems that would be much harder to do nicely anyway.

dunxen force-pushed the 2022-08-implementnodefailure branch from 3691cfa to ea8572c on August 18, 2022 14:35

dunxen commented Aug 18, 2022

Almost ready to be promoted from draft. Just need to add the tracking/expiry test and clean up. Pushed these changes for transparency into the current state of the work.

I went the route of not adding nodes to the HashMap or expiring them under `no_std`. It could probably be cleaned up a little too.

TheBlueMatt (Collaborator) left a comment

If we're not going to remove from the new maps on no-std, we can't insert into them either.


dunxen commented Aug 18, 2022

> If we're not going to remove from the new maps on no-std, we can't insert into them either.

Definitely missed this, thanks!

dunxen force-pushed the 2022-08-implementnodefailure branch from afa3249 to f3fd340 on August 26, 2022 10:16

dunxen commented Aug 26, 2022

Actually going to squash before opening this up for review, as a lot has changed and it's a bit cumbersome to review right now. Plus there are still some failing checks.

dunxen force-pushed the 2022-08-implementnodefailure branch from f3fd340 to 8551d98 on August 26, 2022 11:20

codecov-commenter commented Aug 26, 2022

Codecov Report

Base: 90.73% // Head: 90.85% // Increases project coverage by +0.11% 🎉

Coverage data is based on head (468814f) compared to base (5211bfd).
Patch coverage: 96.74% of modified lines in pull request are covered.

❗ Current head 468814f differs from pull request most recent head 243ca91. Consider uploading reports for the commit 243ca91 to get more accurate results

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1649      +/-   ##
==========================================
+ Coverage   90.73%   90.85%   +0.11%     
==========================================
  Files          87       87              
  Lines       46688    47601     +913     
  Branches    46688    47601     +913     
==========================================
+ Hits        42361    43246     +885     
- Misses       4327     4355      +28     
```

| Impacted Files | Coverage Δ |
|---|---|
| lightning-background-processor/src/lib.rs | 96.23% <ø> (ø) |
| lightning/src/routing/gossip.rs | 92.23% <96.74%> (+0.79%) ⬆️ |
| lightning/src/chain/onchaintx.rs | 94.71% <0.00%> (-0.92%) ⬇️ |
| lightning/src/util/events.rs | 38.13% <0.00%> (-0.27%) ⬇️ |
| lightning/src/ln/functional_tests.rs | 97.08% <0.00%> (+0.01%) ⬆️ |
| lightning-net-tokio/src/lib.rs | 77.03% <0.00%> (+0.30%) ⬆️ |
| lightning/src/ln/channelmanager.rs | 86.92% <0.00%> (+1.80%) ⬆️ |
| lightning/src/util/wakers.rs | 89.26% <0.00%> (+2.55%) ⬆️ |

☔ View full report at Codecov.

dunxen force-pushed the 2022-08-implementnodefailure branch from 8551d98 to 0326774 on August 26, 2022 12:03
dunxen marked this pull request as ready for review August 26, 2022 12:06

```rust
#[cfg(feature = "std")]
{
    self.removed_channels.lock().unwrap().retain(|_, time| time.elapsed().as_secs() < REMOVED_ENTRIES_TRACKING_AGE_LIMIT_SECS);
}
```
Collaborator

Not a huge deal, but it'd be nice to use `current_time_unix` here; it's a bit strange to pass it in and then use the wall-clock time.

Contributor Author

Hmm, yeah, it would be nice, but I'm not sure I quite get how to make that work when we have `Instant`s :(

Collaborator

Maybe we should swap the `Instant` for a `u64`? If not, we at least need to fetch the time only once in these retains, not for each element. I'm not sure if there's a platform where it actually is materially slow (i.e. a syscall), but not all platforms are as fast as rdtsc, so best not to, IMO.
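
For example, reading the clock once outside the closure (a sketch of the suggested fix, not the committed code):

```rust
use std::collections::HashMap;
use std::time::Instant;

fn expire_tracked_removals(removed: &mut HashMap<u64, Instant>, limit_secs: u64) {
    // Single clock read, instead of one `elapsed()` call per element.
    let now = Instant::now();
    removed.retain(|_, removed_at| {
        now.saturating_duration_since(*removed_at).as_secs() < limit_secs
    });
}
```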

Contributor

Could we convert to `Duration` using `Time::duration_since_epoch`?

```rust
fn duration_since_epoch() -> Duration {
    use std::time::SystemTime;
    SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap()
}
```

Then we can continue to use `Time` throughout and in tests. Or at least still use `Time` but convert to `Duration` before putting it into the HashMap.

Collaborator

No, since-epoch is the system clock, not the monotonic clock. We shouldn't use it for timekeeping.
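
To illustrate the distinction with plain standard-library code (not LDK code): `Instant` is the monotonic clock and only moves forward, while `SystemTime`/since-epoch is the wall clock and can be stepped backwards, e.g. by an NTP adjustment:

```rust
use std::time::{Instant, SystemTime, UNIX_EPOCH};

fn main() {
    // Monotonic: suitable for "how long since we removed this entry".
    let started = Instant::now();
    let elapsed = started.elapsed();

    // Wall clock: comparable across restarts (and serializable), but the OS
    // may step it backwards or forwards at any time.
    let since_epoch = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();

    println!("{:?} elapsed; {} secs since epoch", elapsed, since_epoch.as_secs());
}
```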

Contributor Author

I think he's also tracking for no-std. I can't remember why. Maybe we should just not do that in this patch too?

Anyway, let me do last fixups and get this ready for merge

Collaborator

Right, I guess that's the question - do we want to do all this tracking with no-std as well, or only std? I'm not sure any of this matters all that much - on a week timescale they should be equivalent, and I assume no-std users are unlikely to want a full graph... Either way, let's keep it consistent; I don't really care, just let's not query every iteration.

Collaborator

I guess if I had to pick, I'd pick supporting no-std and using the system clock over not, but... it's a wash.

Contributor Author

Ok well, supporting no-std and using system time will clear up a lot of the conditional compilation, and I suppose that on the order of a week, system time is alright.

Then we will also be using the `current_time_unix` arg and other nice things. If we ever decide to serialise and persist the maps later, it's easy as well.

Collaborator

Please leave a big fat comment explaining this :)

TheBlueMatt (Collaborator)

Feel free to squash, I think.

dunxen force-pushed the 2022-08-implementnodefailure branch from 9ca073c to b97e4ec on September 21, 2022 16:02
```rust
*entry.get_mut() = channel_info;
return Ok(());
```
Collaborator

Ah, hmm, I may take that back - I guess in principle the nodes on the channel can change in a reorg (i.e. because of a short-channel-id collision where another channel takes the old channel's place in a reorg). In that case we need to remove the entries for this channel on the old nodes and re-add the entries for the new nodes, which I guess is why the code was the way it was.

Contributor Author

I wouldn't have remembered why I did it that particular way, but the reasoning sounds familiar.

Collaborator

You did it this way because I asked you to change it and suggested this would be a totally fine and acceptable change... sorry about that.
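
For reference, a simplified sketch of the replacement handling described above: if an SCID is re-announced with different endpoints after a reorg, the SCID must be unlinked from the old endpoints' channel lists before being linked to the new ones (types and fields here are stripped down, not LDK's actual structs):

```rust
use std::collections::{BTreeMap, HashMap};

struct ChannelInfo { node_one: u64, node_two: u64 }
#[derive(Default)]
struct NodeInfo { channels: Vec<u64> }

fn replace_channel(
    channels: &mut HashMap<u64, ChannelInfo>,
    nodes: &mut BTreeMap<u64, NodeInfo>,
    scid: u64,
    new_info: ChannelInfo,
) {
    let (new_one, new_two) = (new_info.node_one, new_info.node_two);
    // If the SCID was already known, unlink it from its previous endpoints.
    if let Some(old) = channels.insert(scid, new_info) {
        for node_id in [old.node_one, old.node_two] {
            if let Some(node) = nodes.get_mut(&node_id) {
                node.channels.retain(|&c| c != scid);
            }
        }
    }
    // Link the SCID to the (possibly different) new endpoints.
    for node_id in [new_one, new_two] {
        nodes.entry(node_id).or_default().channels.push(scid);
    }
}
```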

TheBlueMatt previously approved these changes Sep 26, 2022
TheBlueMatt (Collaborator) left a comment

One last nit, then LGTM, for real this time.


```rust
#[cfg(feature = "std")]
{
    self.removed_channels.lock().unwrap().retain(|_, time| time.elapsed().as_secs() < REMOVED_ENTRIES_TRACKING_AGE_LIMIT_SECS);
}
```
Collaborator

Maybe we should swap the `Instant` for a `u64`? If not, we at least need to fetch the time only once in these retains, not for each element. I'm not sure if there's a platform where it actually is materially slow (i.e. a syscall), but not all platforms are as fast as rdtsc, so best not to, IMO.

TheBlueMatt (Collaborator)

@jkczyz wanna take another look?

```rust
for scid in node.channels.iter() {
    if let Some(chan_info) = channels.remove(scid) {
        let other_node_id = if node_id == chan_info.node_one { chan_info.node_two } else { chan_info.node_one };
        if let BtreeEntry::Occupied(mut node_entry) = nodes.entry(other_node_id) {
```
Contributor

s/node_entry/other_node_entry

```rust
        }
    }
    #[cfg(feature = "std")]
    removed_nodes.insert(node_id, time);
```
Contributor

If the other node has no more channels, should it also be removed? Or will that be a problem if another channel is advertised for it?

Contributor

Ah, actually, I guess this is only for permanent failures of the passed-in node.

Contributor Author

Yeah, exactly. We don't want to track it as removed, as it wasn't a permanent failure.


```rust
#[cfg(feature = "std")]
{
    self.removed_channels.lock().unwrap().retain(|_, time| time.elapsed().as_secs() < REMOVED_ENTRIES_TRACKING_AGE_LIMIT_SECS);
}
```
Contributor

Could we convert to `Duration` using `Time::duration_since_epoch`?

```rust
fn duration_since_epoch() -> Duration {
    use std::time::SystemTime;
    SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap()
}
```

Then we can continue to use `Time` throughout and in tests. Or at least still use `Time` but convert to `Duration` before putting it into the HashMap.

Comment on lines 1228 to 1231
```rust
#[cfg(feature = "std")]
removed_nodes: Mutex::new(HashMap::new()),
#[cfg(feature = "std")]
removed_channels: Mutex::new(HashMap::new()),
```
Contributor

Should we write these out as `Duration` like we do in the scorer? Related: see the other comment about using `Time` but putting a `Duration` in the HashMap.


dunxen commented Sep 28, 2022

Ok, so there was enough change to warrant an extra commit just for review. I'll squash when everyone is happy :)

Comment on lines 1581 to 1582
```rust
#[cfg(feature = "std")]
/// Marks a channel in the graph as failed if a corresponding HTLC fail was sent.
/// If permanent, removes a channel from the local storage.
/// May cause the removal of nodes too, if this was their last channel.
/// If not permanent, makes channels unavailable for routing.
///
/// This method is only available with the `std` feature. See
/// [`NetworkGraph::channel_failed_with_time`] for `no-std` use.
pub fn channel_failed(&self, short_channel_id: u64, is_permanent: bool) {
```
Contributor

This is called in the `NetworkGraph`'s event handler, which doesn't have any conditioning on std.

Contributor Author

Ahh right. I think we run into the issue of not being able to pass through `current_time_unix` from somewhere for no-std. I'll think about it for a bit.

Contributor Author

So there's no way for us to pass through the current UNIX time in the event handler for no-std without changing the `EventHandler` and `EventsProvider` traits to maybe support passing it in from the public API :/

Contributor

Where would someone using no-std get the current Unix time from? If they could implement `util::time::Time` to return it, then `NetworkGraph` could be parameterized by it if we desire this behavior in no-std.

Collaborator

Right, duh, uhm... so let's do two things - let's track the last timestamp at which the user ran the cleanup method (and if one is set, use that for the removal time), and if not, we'll have to not track removals in no-std. Sorry about the mess here.

Contributor Author

No worries. I kind of didn't see the whole picture until going back and forth on this one.
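
A sketch of the two-part approach Matt describes (field names are illustrative): remember the time supplied to the last cleanup pass, stamp new removals with it when available, and otherwise skip tracking on no-std:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct RemovalTracker {
    /// UNIX time the user last ran the cleanup method, if ever.
    last_cleanup_time: Mutex<Option<u64>>,
    /// SCID -> UNIX time at which we removed it.
    removed_channels: Mutex<HashMap<u64, u64>>,
}

impl RemovalTracker {
    fn on_channel_removed(&self, scid: u64) {
        // With no clock of our own (no-std), we can only track the removal if
        // a cleanup pass has previously told us the approximate current time.
        if let Some(time) = *self.last_cleanup_time.lock().unwrap() {
            self.removed_channels.lock().unwrap().insert(scid, time);
        }
    }

    fn cleanup(&self, current_time_unix: u64, limit_secs: u64) {
        *self.last_cleanup_time.lock().unwrap() = Some(current_time_unix);
        self.removed_channels.lock().unwrap().retain(|_, removed_at| {
            current_time_unix.saturating_sub(*removed_at) < limit_secs
        });
    }
}
```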

dunxen force-pushed the 2022-08-implementnodefailure branch from 468814f to 243ca91 on October 1, 2022 10:35

dunxen commented Oct 1, 2022

Ok, let's try that again, squashed to avoid a review mess :)

Edit: Oh, stupid me. Will refactor not to use `std::mem::swap` and instead use `core::mem::swap`.

dunxen force-pushed the 2022-08-implementnodefailure branch 2 times, most recently from 818a5c3 to a80331d, on October 1, 2022 17:00
```rust
// week).
//
/// Keeps track of short channel IDs for channels we have explicitly removed due to permanent
/// failure so that we don't resync them from gossip. Each SCID is mapped to the time in seconds
```
Contributor

nit: I think this comment is too verbose and could be compressed by about 2x:

Keeps track of short channel IDs for channels we have explicitly removed due to permanent failure so that we don't resync them from gossip. Each SCID is mapped to the time (in seconds) it was removed so that once some time passes, we can potentially resync it from gossip again.

Contributor Author

Thanks, I've added these to the latest (squashed) changes :)

naumenkogs (Contributor)

ACK a80331d

TheBlueMatt previously approved these changes Oct 4, 2022
We never had the `NetworkGraph::node_failed` method implemented. The
scorer now handles non-permanent failures by downgrading nodes, so we
don't want that implemented.

The method is renamed to `node_failed_permanent` to explicitly indicate
that this is the only case it handles. We also add tracking in the form
of two maps as fields of `NetworkGraph`, namely, `removed_nodes` and
`removed_channels`. We track these removed graph entries to ensure we
don't just resync them right away from gossip soon after removing them.
We stop tracking these removed nodes whenever `remove_stale_channels_and_tracking()`
is called and the entries have been tracked for longer than
`REMOVED_ENTRIES_TRACKING_AGE_LIMIT_SECS` which is currently set to one
week.
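
Roughly how the merged API is meant to be used (a usage sketch based on the commit message; exact signatures may differ across releases, and `network_graph`, `node_pubkey`, and `current_time_unix` are assumed to be set up by the caller):

```rust
// On learning a node has permanently failed:
network_graph.node_failed_permanent(&node_pubkey);

// Periodically (e.g. from the background processor), prune stale channels and
// drop removal-tracking entries older than
// REMOVED_ENTRIES_TRACKING_AGE_LIMIT_SECS (one week):
#[cfg(feature = "std")]
network_graph.remove_stale_channels_and_tracking();

// On no-std, the caller supplies the current UNIX time instead:
#[cfg(not(feature = "std"))]
network_graph.remove_stale_channels_and_tracking_with_time(current_time_unix);
```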
TheBlueMatt merged commit a626828 into lightningdevkit:main Oct 7, 2022
dunxen deleted the 2022-08-implementnodefailure branch January 20, 2025 08:33

Successfully merging this pull request may close these issues.

Implement marking node in network graph as failed
5 participants