Conversation

@onetechnical
Contributor

@onetechnical onetechnical commented Oct 25, 2021

catchup: ignore benign evaluator failures (#3135)
txnsync: delete-able incoming message queue (#3126)
txnsync: Improve Bandwidth Estimation in Txnsync (#3096)
network: update the websocket library. (#3131)
txnsync: Add err to solicitedAsyncTxHandler.loop log message (#3123)
network: add initial support for latency tracking (#3028)
txnsync: fix potential race during TestBasicCatchpointCatchup (#3033)

tsachiherman and others added 8 commits October 25, 2021 10:32
## Summary

During fast catchup, we restart the transaction sync service very quickly.
This can leave a network message in flight, whose response is then delivered to the "restarted" txnsync.

Since we don't want to disconnect the network connection itself (it could still have messages enqueued), the transaction sync now stores the reply channel before sending the message. This avoids the data race and lets it safely ignore the stale incoming message.
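The register-before-send idea can be sketched as follows. This is a minimal, hypothetical illustration (the `requestTracker` type and its methods are invented for this sketch, not the actual txnsync types): the reply channel is stored before the request goes out, so a response arriving after a restart finds no registration and is ignored instead of racing.

```go
package main

import (
	"fmt"
	"sync"
)

// requestTracker stores reply channels for outstanding requests.
// Hypothetical sketch, not the actual txnsync implementation.
type requestTracker struct {
	mu      sync.Mutex
	pending map[uint64]chan string
}

func newRequestTracker() *requestTracker {
	return &requestTracker{pending: make(map[uint64]chan string)}
}

// register stores the reply channel FIRST; only then may the caller send
// the network message, so the response can never outrun the registration.
func (t *requestTracker) register(id uint64) chan string {
	ch := make(chan string, 1)
	t.mu.Lock()
	t.pending[id] = ch
	t.mu.Unlock()
	return ch
}

// deliver routes a response to its channel, or reports false when the
// request is no longer known (e.g. the service restarted in between).
func (t *requestTracker) deliver(id uint64, msg string) bool {
	t.mu.Lock()
	ch, ok := t.pending[id]
	delete(t.pending, id)
	t.mu.Unlock()
	if !ok {
		return false // stale response: safely ignored
	}
	ch <- msg
	return true
}

// reset drops all pending requests, as a restart would.
func (t *requestTracker) reset() {
	t.mu.Lock()
	t.pending = make(map[uint64]chan string)
	t.mu.Unlock()
}

func main() {
	t := newRequestTracker()
	ch := t.register(1)
	t.deliver(1, "response")
	fmt.Println(<-ch)

	t.register(2)
	t.reset()                                  // simulate a fast-catchup restart
	fmt.Println(t.deliver(2, "late response")) // stale response is ignored
}
```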

## Test Plan

Rely on the existing tests and confirm they still pass.
## Summary

This PR gives the node (both client and server) the ability to measure the time it takes to establish an outgoing connection (excluding the TCP connection time).
This duration is captured as the initial latency estimate, which is then refined via ping/pong-style logic.
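The seed-then-refine scheme might look like the following sketch. The `latencyTracker` type, its method names, and the 1/4 smoothing factor are all assumptions made for illustration, not the actual network package code: the connection-establishment time seeds the estimate, and each ping/pong round trip folds a new sample in.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const ms = time.Millisecond

// latencyTracker holds a peer latency estimate in nanoseconds.
// Hypothetical sketch, not the actual network package types.
type latencyTracker struct {
	latencyNanos int64 // accessed atomically
}

// setInitial records the time spent establishing the outgoing connection
// (excluding the TCP connect) as the first latency estimate.
func (lt *latencyTracker) setInitial(d time.Duration) {
	atomic.StoreInt64(&lt.latencyNanos, int64(d))
}

// updateFromPing folds a measured round-trip time into the estimate,
// weighting the new sample by 1/4 (an assumed smoothing factor).
func (lt *latencyTracker) updateFromPing(rtt time.Duration) {
	old := atomic.LoadInt64(&lt.latencyNanos)
	atomic.StoreInt64(&lt.latencyNanos, old+(int64(rtt)-old)/4)
}

func (lt *latencyTracker) latency() time.Duration {
	return time.Duration(atomic.LoadInt64(&lt.latencyNanos))
}

func main() {
	lt := &latencyTracker{}
	lt.setInitial(100 * ms)    // the handshake took 100ms
	lt.updateFromPing(60 * ms) // a ping round trip measured 60ms
	// 100ms + (60ms - 100ms)/4 = 90ms
	fmt.Println(lt.latency())
}
```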

## Test Plan

- [x] Extend existing unit tests
- [x] mainnet-model testing is needed as well to confirm correctness
## Summary

I noticed log messages like this:
```
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.220089Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.220226Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.220300Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228731Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228828Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228893Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228946Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.229012Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.229080Z"}
```

## Test Plan

No test changes.
## Summary

The previous websocket library version was not flushing the write buffer when writing a Close control message.
As a result, one side would disconnect the connection correctly while the other side kept it in a zombie state.
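The failure mode is generic to buffered writes, not specific to websockets: a frame written into a buffer but never flushed simply never reaches the peer. A minimal illustration with the standard library's `bufio` (the `flushOnClose` helper and the `"CLOSE"` marker are stand-ins, not the websocket library's internals):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// flushOnClose writes a close marker through a buffered writer and
// returns what actually reached the underlying connection.
// Generic sketch; not the websocket library's code.
func flushOnClose(flush bool) string {
	var conn bytes.Buffer // stands in for the TCP connection
	w := bufio.NewWriter(&conn)
	w.WriteString("CLOSE") // the Close control message sits in the buffer
	if flush {
		w.Flush()
	}
	return conn.String()
}

func main() {
	// Without the flush the peer never sees the Close frame and keeps the
	// connection alive in a zombie state; with it, the frame goes out.
	fmt.Printf("without flush: %q\n", flushOnClose(false))
	fmt.Printf("with flush:    %q\n", flushOnClose(true))
}
```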

## Test Plan

Use existing tests. `TestSlowPeerDisconnection` already verifies that the issue is resolved.

## Summary


Improve the bandwidth estimation within the transaction sync by having the estimation account for latency, transaction compression time, and time spent waiting in the incoming queue.
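The core idea is to subtract the overheads that are not actual transfer time before dividing bytes by elapsed time. The sketch below is a guess at the shape of that computation (the `estimateBandwidth` function is hypothetical, not the txnsync implementation):

```go
package main

import (
	"fmt"
	"time"
)

const ms = time.Millisecond

// estimateBandwidth divides the payload size by the elapsed time that was
// actually spent transferring, i.e. with latency, compression time, and
// incoming-queue wait subtracted. Hypothetical sketch.
func estimateBandwidth(payloadBytes int64, total, latency, compression, queueWait time.Duration) float64 {
	transfer := total - latency - compression - queueWait
	if transfer <= 0 {
		return 0 // measurement too noisy to use
	}
	return float64(payloadBytes) / transfer.Seconds() // bytes per second
}

func main() {
	// 1 MiB observed over 900ms, of which 200ms was latency, 100ms was
	// compression, and 100ms was queue wait: 500ms of real transfer.
	bw := estimateBandwidth(1<<20, 900*ms, 200*ms, 100*ms, 100*ms)
	fmt.Printf("%.0f bytes/sec\n", bw)
}
```

Without the subtractions, every sample would be biased low (the naive divisor is too large), which is consistent with the estimate drifting toward the minimum over time.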

## Test Plan


Wrote unit tests for correctness, ran a network on the mainnet model, and observed the measured bandwidths. Previously the estimate would converge to the minimum over time and showed erratic inaccuracies. Now the numbers are much closer to expectations, at most a factor of 2 off.
## Summary

Running betanet, we've seen errors such as
```
unable to enqueue incoming message into peer incoming message backlog. disconnecting from peer.
```
as well as multiple
```
Peer 162.202.32.72:56543 disconnected: SlowConnection
```

The goal of this PR is to remove from the pending incoming message queue any entries associated with network peers that have been disconnected (by the network package) or scheduled for disconnection (by the transaction sync package).

Removing these entries increases the available space in the incoming message queue and prevents redundant disconnections.
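A delete-able queue of this kind can be sketched in a few lines. The `msgQueue` type below is an invented simplification (the real `incomingMsgQ` is more involved): each entry is tagged with its peer, and a disconnect erases every entry for that peer.

```go
package main

import "fmt"

type peerID int

type incomingMsg struct {
	peer peerID
	data string
}

// msgQueue is a simplified sketch of a queue that supports removing all
// entries belonging to a disconnected peer. Not the actual incomingMsgQ.
type msgQueue struct {
	msgs []incomingMsg
}

func (q *msgQueue) enqueue(m incomingMsg) { q.msgs = append(q.msgs, m) }

// erasePeer drops every queued message from the given peer, freeing
// backlog space and avoiding redundant disconnections of other peers.
func (q *msgQueue) erasePeer(p peerID) {
	kept := q.msgs[:0]
	for _, m := range q.msgs {
		if m.peer != p {
			kept = append(kept, m)
		}
	}
	q.msgs = kept
}

func main() {
	q := &msgQueue{}
	q.enqueue(incomingMsg{1, "a"})
	q.enqueue(incomingMsg{2, "b"})
	q.enqueue(incomingMsg{1, "c"})
	q.erasePeer(1) // peer 1 disconnected
	fmt.Println(len(q.msgs))
}
```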

## Test Plan

- [x] unit tests updated.
- [x] ran s1 network with and without load, successfully.
## Summary

The catchup was occasionally reporting
```
(1): fetchAndWrite(13932148): ledger write failed: block evaluation for round 13932148 requires sequential evaluation while the latest round is 13932148
```

This error indicates that the catchup was attempting to validate a block whose round is not latest+1. As the log above shows, the block round was at or below the ledger's latest round, meaning the block had already been added to the ledger, so we can safely ignore the error and skip applying this block.
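The benign-case check might reduce to a round comparison, as in this sketch (the function name and the exact condition are assumptions based on the log above, not the catchup package's code):

```go
package main

import "fmt"

// shouldIgnoreNonSequentialError sketches the benign case: if the round we
// tried to write is at or below the ledger's latest round, the block is
// already in the ledger and the evaluator failure can be ignored.
// Hypothetical; not the actual catchup package logic.
func shouldIgnoreNonSequentialError(blockRound, latestRound uint64) bool {
	return blockRound <= latestRound
}

func main() {
	// Matches the log above: block round 13932148, latest round 13932148.
	fmt.Println(shouldIgnoreNonSequentialError(13932148, 13932148))
	// A genuinely future block is still an error.
	fmt.Println(shouldIgnoreNonSequentialError(13932150, 13932148))
}
```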

## Test Plan

Tested manually.
@codecov-commenter

codecov-commenter commented Oct 25, 2021

Codecov Report

Merging #3139 (615fee2) into rel/beta (53cf013) will increase coverage by 0.07%.
The diff coverage is 81.88%.

Impacted file tree graph

```
@@             Coverage Diff              @@
##           rel/beta    #3139      +/-   ##
============================================
+ Coverage     43.84%   43.91%   +0.07%
============================================
  Files           385      386       +1
  Lines         86323    86471     +148
============================================
+ Hits          37849    37978     +129
- Misses        42519    42520       +1
- Partials       5955     5973      +18
```
| Impacted Files | Coverage Δ |
|---|---|
| catchup/service.go | 68.81% <0.00%> (-0.54%) ⬇️ |
| config/config.go | 47.82% <ø> (ø) |
| data/txHandler.go | 9.13% <0.00%> (ø) |
| network/wsNetwork.go | 64.70% <ø> (+1.93%) ⬆️ |
| node/txnSyncConn.go | 0.00% <0.00%> (ø) |
| txnsync/interfaces.go | 100.00% <ø> (ø) |
| network/wsPeer.go | 70.93% <60.00%> (-3.16%) ⬇️ |
| network/latencyTracker.go | 65.95% <65.95%> (ø) |
| txnsync/incoming.go | 94.82% <76.00%> (-5.18%) ⬇️ |
| txnsync/incomingMsgQ.go | 89.38% <89.38%> (ø) |

... and 11 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53cf013...615fee2. Read the comment docs.

@algojohnlee algojohnlee merged commit 378816d into rel/beta Oct 26, 2021
@algobarb algobarb deleted the rel/beta-3.1.3 branch January 5, 2022 16:13
6 participants