
neutrino: improve perf in high latency networks #300

Conversation

@djkazic (Contributor) commented May 29, 2024

After working with @hsjoberg and @niteshbalusu11 to troubleshoot Blixt performance, we realized that two things consistently went wrong on networks with latencies above roughly 300 ms:

  1. Retries were exhausted quickly.
  2. Because of that, the doubling of the timeout on each retry did not help much.

After investigating, we decided to experiment with changing some of neutrino's constants to handle these networks better.

This PR addresses both issues by:

  • Decreasing maxBatch to 10
  • Increasing query retries from 2 to 8
  • Increasing rescan retries from 1 to 2
  • Increasing broadcast query retries from 1 to 8

The result is that Blixt now works on networks where it previously could not.
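
One way to read the maxBatch change: if requests are grouped into batches and a timed-out batch has to be retried wholesale, a smaller batch bounds how much work is lost to a single slow peer. The sketch below is a generic illustration of that idea, using hypothetical names (chunk, batchSize, the filter-header example) rather than neutrino's actual batching code.

package main

import "fmt"

// chunk splits items into batches of at most batchSize elements. Smaller
// batches mean that a batch lost to a slow or timed-out peer wastes less
// work and can be retried (or handed to another peer) more cheaply.
// Illustrative sketch only; not neutrino's batching code.
func chunk(items []uint32, batchSize int) [][]uint32 {
	var batches [][]uint32
	for len(items) > 0 {
		n := batchSize
		if len(items) < n {
			n = len(items)
		}
		batches = append(batches, items[:n])
		items = items[n:]
	}
	return batches
}

func main() {
	// Pretend we need to fetch 100 filters.
	heights := make([]uint32, 100)
	for i := range heights {
		heights[i] = uint32(i)
	}
	// With a batch size of 10, one slow peer timing out only stalls
	// 10 outstanding requests instead of a much larger group.
	fmt.Println("number of batches:", len(chunk(heights, 10)))
}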

@djkazic requested a review from ellemouton May 29, 2024 11:29
@djkazic self-assigned this May 29, 2024
@djkazic force-pushed the improve-neutrino-performance-with-slow-networks branch 2 times, most recently from dee0742 to d4c7630 on May 29, 2024 16:23
@djkazic force-pushed the improve-neutrino-performance-with-slow-networks branch from d4c7630 to 33c1869 on May 29, 2024 23:30
@djkazic (Contributor, Author) commented May 30, 2024

OK, it appears that queryAllPeers breaks the unit test when the numRetries of the defaultQueryOptions struct is > 1.

@djkazic force-pushed the improve-neutrino-performance-with-slow-networks branch from 33c1869 to c8fd04d on May 30, 2024 15:40
@djkazic (Contributor, Author) commented May 30, 2024

Cool, so I fixed the unit test.

Now the Blixt team will do some testing to make sure the fix hasn't also rolled back the perf improvements.

@@ -956,7 +956,7 @@ func (rs *rescanState) handleBlockConnected(ntfn *blockntfns.Connected) error {

 	// Otherwise, we'll attempt to fetch the filter to retrieve the relevant
 	// transactions and notify them.
-	queryOptions := NumRetries(0)
+	queryOptions := NumRetries(2)
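
For readers unfamiliar with the NumRetries call in the hunk above, here is a minimal sketch of the functional-option pattern it follows. The queryOptions fields and default values shown are assumptions for illustration, not neutrino's exact definitions.

package main

import (
	"fmt"
	"time"
)

// queryOptions collects per-query tuning knobs. The field set here is an
// illustrative assumption; the real struct has more options.
type queryOptions struct {
	timeout    time.Duration
	numRetries uint8
}

// QueryOption is a functional option that mutates queryOptions.
type QueryOption func(*queryOptions)

// NumRetries returns an option that sets how many times a query is retried
// after a timeout before it is abandoned.
func NumRetries(num uint8) QueryOption {
	return func(qo *queryOptions) {
		qo.numRetries = num
	}
}

func defaultQueryOptions() *queryOptions {
	// Placeholder defaults for this sketch only.
	return &queryOptions{timeout: 2 * time.Second, numRetries: 2}
}

func main() {
	qo := defaultQueryOptions()
	for _, opt := range []QueryOption{NumRetries(2)} {
		opt(qo)
	}
	fmt.Printf("timeout=%v retries=%d\n", qo.timeout, qo.numRetries)
}
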
A Member commented:
would it not make sense for high latency environments to make the timeout longer instead of bumping num retries? (or both lol)

also - wondering if we should make this configurable rather than hard code it?

@djkazic (Contributor, Author) replied May 30, 2024:

  1. The timeouts and retries are directly connected: when a query times out, we do a retry (bounded by the retry count). What we found from trace logs was that these networks didn't consistently time out; rather, during periods of higher bandwidth utilization there was a chance of hitting the low retry-count limit. That's why we raised the retries, since raising the timeout would also indirectly increase the extra timeout added on each attempt (it doubles every time a query fails).

In my highly non-scientific testing there was a trend where bumping timeouts alone still left reliability poor, so this is the solution we converged on.

  2. Yes, I've thought about making these configurable settings. However, we did a grid search and these were the best values we found for balancing perf across slow and fast networks. I figured a follow-up PR that breaks them out as configurable options would be good, but let me know if we should bundle the changes here. This PR is mostly focused on "saner" defaults.
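
To make the timeout/retry interaction described above concrete, here is a minimal sketch of a retry loop whose timeout doubles after each failed attempt. It is a generic illustration with assumed names and an arbitrary 2-second base timeout, not neutrino's actual query loop. With base timeout T and R retries, the worst-case wait is roughly T * (2^(R+1) - 1), which is why raising the retry count and raising the base timeout change the budget very differently.

package main

import (
	"errors"
	"fmt"
	"time"
)

// queryWithRetries runs fn until it succeeds or the retry budget is spent,
// doubling the timeout handed to fn after every failed attempt. Sketch of
// the behaviour discussed above, not neutrino's implementation.
func queryWithRetries(fn func(timeout time.Duration) error,
	baseTimeout time.Duration, numRetries int) error {

	timeout := baseTimeout
	for attempt := 0; attempt <= numRetries; attempt++ {
		if err := fn(timeout); err == nil {
			return nil
		}
		// Double the timeout before the next attempt.
		timeout *= 2
	}
	return errors.New("retry budget exhausted")
}

// worstCaseBudget is the total time spent if every attempt times out:
// base + 2*base + 4*base + ... = base * (2^(retries+1) - 1).
func worstCaseBudget(base time.Duration, retries int) time.Duration {
	total, t := time.Duration(0), base
	for i := 0; i <= retries; i++ {
		total += t
		t *= 2
	}
	return total
}

func main() {
	// 2s is an arbitrary example base timeout, not neutrino's default.
	fmt.Println("2 retries:", worstCaseBudget(2*time.Second, 2)) // 14s
	fmt.Println("8 retries:", worstCaseBudget(2*time.Second, 8)) // 17m2s

	err := queryWithRetries(func(timeout time.Duration) error {
		// Simulate a peer that never answers within the timeout.
		return fmt.Errorf("timed out after %v", timeout)
	}, 2*time.Second, 2)
	fmt.Println("result:", err)
}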

A Contributor replied:

> would it not make sense for high latency environments to make the timeout longer instead of bumping num retries? (or both lol)

Increasing the timeout is kind of a bigger change, as the number gets increased exponentially after each try. I think increasing both the retry count and the timeout makes sense, but I could look into that in a follow-up PR instead.

> also - wondering if we should make this configurable rather than hard code it?

Yeah, I've thought of that too. It would be a very nice thing to have.

The Member replied:
gotcha gotcha

@ellemouton (Member) left a comment:

cool LGTM - before the final ✅ though, can you open an LND PR that points to this so that we can just check the CI on that side?

@djkazic (Contributor, Author) commented May 30, 2024

> cool LGTM - before the final ✅ though, can you open an LND PR that points to this so that we can just check the CI on that side?

Created here: lightningnetwork/lnd#8797

@ellemouton (Member) commented:

cool thanks - will ACK this once the neutrino test on that side is green

@lightninglabs-deploy commented:

@djkazic, remember to re-request review from reviewers when ready

@ellemouton (Member) left a comment:

Looking good on the LND side - thanks!

@guggero (Member) left a comment:

One question about disk performance on fast systems, otherwise LGTM.

A review thread on neutrino.go was resolved.
@guggero added this pull request to the merge queue Jun 20, 2024
Merged via the queue into lightninglabs:master with commit a50a92a, Jun 20, 2024
4 checks passed
@djkazic deleted the improve-neutrino-performance-with-slow-networks branch June 20, 2024 13:10