
neutrino: improve perf in high latency networks #300

Conversation

@djkazic (Contributor) commented May 29, 2024

After working with @hsjoberg and @niteshbalusu11 to troubleshoot Blixt performance, we realized that two things consistently went wrong on networks with latencies above roughly 300 ms:

  1. Retries were exhausted quickly.
  2. Because of that, the doubling of the timeout on each retry did not help much.

After investigating, we decided to experiment with changing some of neutrino's constants to handle these networks better.

This PR addresses both issues by:

  • Decreasing maxBatch to 10
  • Increasing query retries from 2 to 8
  • Increasing rescan retries from 1 to 2
  • Increasing broadcast query retries from 1 to 8

The result is that Blixt now works on networks where it previously could not.
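
One way to read the maxBatch change: if requests are grouped into batches and a timed-out batch has to be retried wholesale, a smaller batch bounds how much work is lost to a single slow peer. The sketch below is a generic illustration of that idea, using hypothetical names (chunk, batchSize, the filter-header example) rather than neutrino's actual batching code.

package main

import "fmt"

// chunk splits items into batches of at most batchSize elements. Smaller
// batches mean that a batch lost to a slow or timed-out peer wastes less
// work and can be retried (or handed to another peer) more cheaply.
// Illustrative sketch only; not neutrino's batching code.
func chunk(items []uint32, batchSize int) [][]uint32 {
	var batches [][]uint32
	for len(items) > 0 {
		n := batchSize
		if len(items) < n {
			n = len(items)
		}
		batches = append(batches, items[:n])
		items = items[n:]
	}
	return batches
}

func main() {
	// Pretend we need to fetch 100 filters.
	heights := make([]uint32, 100)
	for i := range heights {
		heights[i] = uint32(i)
	}
	// With a batch size of 10, one slow peer timing out only stalls
	// 10 outstanding requests instead of a much larger group.
	fmt.Println("number of batches:", len(chunk(heights, 10)))
}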

@djkazic requested a review from ellemouton May 29, 2024 11:29
@djkazic self-assigned this May 29, 2024
@djkazic force-pushed the improve-neutrino-performance-with-slow-networks branch 2 times, most recently from dee0742 to d4c7630 on May 29, 2024 16:23
@djkazic force-pushed the improve-neutrino-performance-with-slow-networks branch from d4c7630 to 33c1869 on May 29, 2024 23:30
@djkazic (Contributor, Author) commented May 30, 2024

OK, it appears that queryAllPeers breaks the unit test when the numRetries of the defaultQueryOptions struct is > 1.

@djkazic force-pushed the improve-neutrino-performance-with-slow-networks branch from 33c1869 to c8fd04d on May 30, 2024 15:40
@djkazic (Contributor, Author) commented May 30, 2024

Cool, so I fixed the unit test.

Now the Blixt team will do some testing to make sure the fix hasn't also rolled back the perf improvements.

@@ -956,7 +956,7 @@ func (rs *rescanState) handleBlockConnected(ntfn *blockntfns.Connected) error {

 	// Otherwise, we'll attempt to fetch the filter to retrieve the relevant
 	// transactions and notify them.
-	queryOptions := NumRetries(0)
+	queryOptions := NumRetries(2)
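
For readers unfamiliar with the NumRetries call in the hunk above, here is a minimal sketch of the functional-option pattern it follows. The queryOptions fields and default values shown are assumptions for illustration, not neutrino's exact definitions.

package main

import (
	"fmt"
	"time"
)

// queryOptions collects per-query tuning knobs. The field set here is an
// illustrative assumption; the real struct has more options.
type queryOptions struct {
	timeout    time.Duration
	numRetries uint8
}

// QueryOption is a functional option that mutates queryOptions.
type QueryOption func(*queryOptions)

// NumRetries returns an option that sets how many times a query is retried
// after a timeout before it is abandoned.
func NumRetries(num uint8) QueryOption {
	return func(qo *queryOptions) {
		qo.numRetries = num
	}
}

func defaultQueryOptions() *queryOptions {
	// Placeholder defaults for this sketch only.
	return &queryOptions{timeout: 2 * time.Second, numRetries: 2}
}

func main() {
	qo := defaultQueryOptions()
	for _, opt := range []QueryOption{NumRetries(2)} {
		opt(qo)
	}
	fmt.Printf("timeout=%v retries=%d\n", qo.timeout, qo.numRetries)
}
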
A Member commented:
would it not make sense for high latency environments to make the timeout longer instead of bumping num retries? (or both lol)

also - wondering if we should make this configurable rather than hard code it?

@djkazic (Contributor, Author) replied May 30, 2024:

  1. The timeouts and retries are directly connected: when a query times out, we do a retry (bounded by the retry count). What we found from trace logs was that these networks didn't consistently time out; rather, during periods of higher bandwidth utilization there was a chance of hitting the low retry-count limit. That's why we raised the retries, since raising the timeout would also indirectly increase the extra timeout added on each attempt (it doubles every time a query fails).

In my highly non-scientific testing there was a trend where bumping timeouts alone still left reliability poor, so this is the solution we converged on.

  2. Yes, I've thought about making these configurable settings. However, we did a grid search and these were the best values we found for balancing perf across slow and fast networks. I figured a follow-up PR that breaks them out as configurable options would be good, but let me know if we should bundle the changes here. This PR is mostly focused on "saner" defaults.
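
To make the timeout/retry interaction described above concrete, here is a minimal sketch of a retry loop whose timeout doubles after each failed attempt. It is a generic illustration with assumed names and an arbitrary 2-second base timeout, not neutrino's actual query loop. With base timeout T and R retries, the worst-case wait is roughly T * (2^(R+1) - 1), which is why raising the retry count and raising the base timeout change the budget very differently.

package main

import (
	"errors"
	"fmt"
	"time"
)

// queryWithRetries runs fn until it succeeds or the retry budget is spent,
// doubling the timeout handed to fn after every failed attempt. Sketch of
// the behaviour discussed above, not neutrino's implementation.
func queryWithRetries(fn func(timeout time.Duration) error,
	baseTimeout time.Duration, numRetries int) error {

	timeout := baseTimeout
	for attempt := 0; attempt <= numRetries; attempt++ {
		if err := fn(timeout); err == nil {
			return nil
		}
		// Double the timeout before the next attempt.
		timeout *= 2
	}
	return errors.New("retry budget exhausted")
}

// worstCaseBudget is the total time spent if every attempt times out:
// base + 2*base + 4*base + ... = base * (2^(retries+1) - 1).
func worstCaseBudget(base time.Duration, retries int) time.Duration {
	total, t := time.Duration(0), base
	for i := 0; i <= retries; i++ {
		total += t
		t *= 2
	}
	return total
}

func main() {
	// 2s is an arbitrary example base timeout, not neutrino's default.
	fmt.Println("2 retries:", worstCaseBudget(2*time.Second, 2)) // 14s
	fmt.Println("8 retries:", worstCaseBudget(2*time.Second, 8)) // 17m2s

	err := queryWithRetries(func(timeout time.Duration) error {
		// Simulate a peer that never answers within the timeout.
		return fmt.Errorf("timed out after %v", timeout)
	}, 2*time.Second, 2)
	fmt.Println("result:", err)
}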

A Contributor replied:

> would it not make sense for high latency environments to make the timeout longer instead of bumping num retries? (or both lol)

Increasing the timeout is kind of a bigger change, as the number gets increased exponentially after each try. I think increasing both the retry count and the timeout makes sense, but I could look into that in a follow-up PR instead.

> also - wondering if we should make this configurable rather than hard code it?

Yeah, I've thought of that too. It would be a very nice thing to have.

The Member replied:
gotcha gotcha

@ellemouton (Member) left a comment:

cool LGTM - before the final ✅ though, can you open an LND PR that points to this so that we can just check the CI on that side?

@djkazic (Contributor, Author) commented May 30, 2024

> cool LGTM - before the final ✅ though, can you open an LND PR that points to this so that we can just check the CI on that side?

Created here: lightningnetwork/lnd#8797

@ellemouton (Member) commented:

cool thanks - will ACK this once the neutrino test on that side is green

@lightninglabs-deploy commented:

@djkazic, remember to re-request review from reviewers when ready

@ellemouton (Member) left a comment:

Looking good on the LND side - thanks!

@guggero (Member) left a comment:

One question about disk performance on fast systems, otherwise LGTM.

A review thread on neutrino.go was resolved.
@guggero added this pull request to the merge queue Jun 20, 2024
Merged via the queue into lightninglabs:master with commit a50a92a, Jun 20, 2024
4 checks passed
@djkazic deleted the improve-neutrino-performance-with-slow-networks branch June 20, 2024 13:10