Add new time sliced overlay survey #4275

Merged 1 commit on May 21, 2024
2 changes: 2 additions & 0 deletions Builds/VisualStudio/stellar-core.vcxproj
@@ -613,6 +613,7 @@ exit /b 0
<ClCompile Include="..\..\src\overlay\PeerManager.cpp" />
<ClCompile Include="..\..\src\overlay\PeerSharedKeyId.cpp" />
<ClCompile Include="..\..\src\overlay\RandomPeerSource.cpp" />
<ClCompile Include="..\..\src\overlay\SurveyDataManager.cpp" />
<ClCompile Include="..\..\src\overlay\SurveyManager.cpp" />
<ClCompile Include="..\..\src\overlay\SurveyMessageLimiter.cpp" />
<ClCompile Include="..\..\src\overlay\TCPPeer.cpp" />
@@ -1031,6 +1032,7 @@ exit /b 0
<ClInclude Include="..\..\src\overlay\PeerSharedKeyId.h" />
<ClInclude Include="..\..\src\overlay\RandomPeerSource.h" />
<ClInclude Include="..\..\src\overlay\StellarXDR.h" />
<ClInclude Include="..\..\src\overlay\SurveyDataManager.h" />
<ClInclude Include="..\..\src\overlay\SurveyManager.h" />
<ClInclude Include="..\..\src\overlay\SurveyMessageLimiter.h" />
<ClInclude Include="..\..\src\overlay\TCPPeer.h" />
6 changes: 6 additions & 0 deletions Builds/VisualStudio/stellar-core.vcxproj.filters
@@ -1065,6 +1065,9 @@
<ClCompile Include="..\..\src\overlay\RandomPeerSource.cpp">
<Filter>overlay</Filter>
</ClCompile>
<ClCompile Include="..\..\src\overlay\SurveyDataManager.cpp">
<Filter>overlay</Filter>
</ClCompile>
<ClCompile Include="..\..\src\overlay\SurveyManager.cpp">
<Filter>overlay</Filter>
</ClCompile>
@@ -2156,6 +2159,9 @@
<ClInclude Include="..\..\src\overlay\StellarXDR.h">
<Filter>overlay</Filter>
</ClInclude>
<ClInclude Include="..\..\src\overlay\SurveyDataManager.h">
<Filter>overlay</Filter>
</ClInclude>
<ClInclude Include="..\..\src\overlay\SurveyManager.h">
<Filter>overlay</Filter>
</ClInclude>
4 changes: 4 additions & 0 deletions docs/metrics.md
@@ -133,8 +133,12 @@ overlay.outbound.establish | meter | outbound connection esta
overlay.recv.<X> | timer | received message <X>
overlay.send.<X> | meter | sent message <X>
overlay.timeout.idle | meter | idle peer timeout
overlay.recv.start-survey-collecting | timer | time spent in processing request to start survey collecting phase
overlay.recv.stop-survey-collecting | timer | time spent in processing request to stop survey collecting phase
overlay.recv.survey-request | timer | time spent in processing survey request
overlay.recv.survey-response | timer | time spent in processing survey response
overlay.send.start-survey-collecting | meter | sent request to start survey collecting phase
overlay.send.stop-survey-collecting | meter | sent request to stop survey collecting phase
overlay.send.survey-request | meter | sent survey request
overlay.send.survey-response | meter | sent survey response
process.action.queue | counter | number of items waiting in internal action-queue
25 changes: 23 additions & 2 deletions docs/software/admin.md
@@ -764,12 +764,20 @@ There is a survey mechanism in the overlay that allows a validator to request co

By default, a node will relay or respond to a survey message if the message originated from a node in the receiving node's transitive quorum. This behavior can be overridden by setting `SURVEYOR_KEYS` in the config file to a more restrictive set of nodes to relay or respond to.

The survey works in two phases: the collecting phase, and the reporting phase. During the collecting phase, nodes record information about themselves and their peers, such as the number of messages sent to a given peer. During the reporting phase, the surveyor requests the results of the collecting phase from nodes on the network.

The surveyor begins the collecting phase by broadcasting a `TimeSlicedSurveyStartCollectingMessage`. The surveyor ends the collecting phase and initiates the reporting phase by broadcasting a `TimeSlicedSurveyStopCollectingMessage`. These start/stop collecting messages ensure that the collecting phase is roughly equal for all nodes present for the duration of the collecting phase.

During the reporting phase, the surveyor sends `TimeSlicedSurveyRequestMessage`s to individual nodes to gather the information the node recorded during the collecting phase.
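
The two phases described above can be pictured as a small state machine. The following is a minimal sketch only: `SurveyPhase` and `SurveyPhaseTracker` are hypothetical names, not stellar-core types, and the rule that a stop message must carry the same nonce as the start message is an assumption made for illustration.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical names for illustration; these are not stellar-core types.
enum class SurveyPhase
{
    Inactive,
    Collecting,
    Reporting
};

class SurveyPhaseTracker
{
  public:
    // Handles a start-collecting message. Ignored when a survey is already
    // active, since only one survey may run at a time.
    bool
    startCollecting(uint32_t nonce)
    {
        if (mPhase != SurveyPhase::Inactive)
        {
            return false;
        }
        mPhase = SurveyPhase::Collecting;
        mNonce = nonce;
        return true;
    }

    // Handles a stop-collecting message and moves to the reporting phase.
    // Assumption: the nonce must match the one that started the survey.
    bool
    stopCollecting(uint32_t nonce)
    {
        if (mPhase != SurveyPhase::Collecting || mNonce != nonce)
        {
            return false;
        }
        mPhase = SurveyPhase::Reporting;
        return true;
    }

    // Survey requests are only answered during the reporting phase.
    bool
    canReport() const
    {
        return mPhase == SurveyPhase::Reporting;
    }

    SurveyPhase
    phase() const
    {
        return mPhase;
    }

  private:
    SurveyPhase mPhase{SurveyPhase::Inactive};
    std::optional<uint32_t> mNonce;
};
```

The sketch leaves out phase expiry; a real node would also need to time out each phase.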

##### Example survey command

In this example, we have three nodes `GBBN`, `GDEX`, and `GBUI` (we'll refer to them by the first four letters of their public keys). We will execute the commands below from `GBUI`, and note that `GBBN` has `SURVEYOR_KEYS=["$self"]` in its config file, so `GBBN` will not relay or respond to any survey messages.

1. `$ stellar-core http-command 'surveytopology?duration=1000&node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL'`
2. `$ stellar-core http-command 'surveytopology?duration=1000&node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4'`
1. `$ stellar-core http-command 'startsurveycollecting?nonce=1234'`
1. `$ stellar-core http-command 'stopsurveycollecting?nonce=1234'`
1. `$ stellar-core http-command 'surveytopologytimesliced?node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL&inboundpeerindex=0&outboundpeerindex=0'`
2. `$ stellar-core http-command 'surveytopologytimesliced?node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4&inboundpeerindex=0&outboundpeerindex=0'`
3. `$ stellar-core http-command 'getsurveyresult'`

Once the responses are received, the `getsurveyresult` command will return a result like this:
@@ -821,6 +829,12 @@ Once the responses are received, the `getsurveyresult` command will return a res
"numTotalOutboundPeers" : 0,
"maxInboundPeerCount" : 64,
"maxOutboundPeerCount" : 8,
"addedAuthenticatedPeers" : 0,
"droppedAuthenticatedPeers" : 0,
"p75SCPFirstToSelfLatencyNs" : 121042,
"p75SCPSelfToOtherLatencyNs" : 112452,
"lostSyncCount" : 0,
"isValidator" : false,
"outboundPeers" : null
}
}
@@ -835,6 +849,7 @@ Notable field definitions
* `badResponseNodes` : List of nodes that sent a malformed response
* `topology` : Map of nodes to connection information
* `inboundPeers`/`outboundPeers` : List of connection information by nodes
* `averageLatencyMs` : Average latency with this peer in milliseconds.
* `bytesRead`: The total number of bytes read from this peer.
* `bytesWritten`: The total number of bytes written to this peer.
* `duplicateFetchBytesRecv`: The number of bytes received that were duplicate transaction sets and quorum sets.
@@ -853,6 +868,12 @@ Notable field definitions

* `numTotalInboundPeers`/`numTotalOutboundPeers` : The number of total inbound and outbound peers this node is connected to. The response will have a random subset of 25 connected peers per direction (inbound/outbound). These fields tell you if you're missing nodes so you can send another request out to get another random subset of nodes.
* `maxInboundPeerCount`/`maxOutboundPeerCount` : The number of total inbound and outbound peers that this node can accept. These fields correspond to stellar-core configurations `MAX_ADDITIONAL_PEER_CONNECTIONS` and `TARGET_PEER_CONNECTIONS`, respectively.
* `addedAuthenticatedPeers` : The number of authenticated peers added.
* `droppedAuthenticatedPeers` : The number of authenticated peers dropped.
* `p75SCPFirstToSelfLatencyNs` : 75th percentile latency to hear about new SCP messages in nanoseconds.
* `p75SCPSelfToOtherLatencyNs` : 75th percentile latency for other nodes to hear this node's SCP messages in nanoseconds.
* `lostSyncCount` : The number of times this node lost sync.
* `isValidator` : Is this node a validator?
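
To make the `p75...LatencyNs` fields concrete, here is a minimal sketch of a 75th percentile computed with the nearest-rank method over raw samples. This is an assumption for illustration: `percentileNearestRank` is a hypothetical helper, and the histogram implementation used by stellar-core may sample and interpolate differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Hypothetical helper, not part
// of stellar-core.
int64_t
percentileNearestRank(std::vector<int64_t> samples, double p)
{
    assert(!samples.empty());
    assert(p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    // Rank is 1-based, so subtract one when indexing.
    auto rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(samples.size())));
    return samples[rank - 1];
}
```

With latency samples of 100, 200, 300, and 400 nanoseconds, the 75th percentile under this method is the third smallest sample.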

### Quorum Health

55 changes: 48 additions & 7 deletions docs/software/commands.md
@@ -357,25 +357,66 @@ format.

* **surveytopology**
`surveytopology?duration=DURATION&node=NODE_ID`<br>
**This command is deprecated and will be removed in a future release. Use the
new time sliced survey interface instead (`startsurveycollecting`,
`stopsurveycollecting`, `surveytopologytimesliced`, and `getsurveyresult`).**
Starts a survey that will request peer connectivity information from nodes
in the backlog. `DURATION` is the number of seconds this survey will run
for, and `NODE_ID` is the public key you will add to the backlog to survey.
Running this command while the survey is running will add the node to the
backlog and reset the timer to run for `DURATION` seconds. By default, this
node will respond to/relay a survey message if the message originated
from a node in it's transitive quorum. This behaviour can be overridden by adding
keys to `SURVEYOR_KEYS` in the config file, which will be the set of keys to check
instead of the transitive quorum. If you would like to opt-out of this survey mechanism,
just set `SURVEYOR_KEYS` to `$self` or a bogus key
backlog and reset the timer to run for `DURATION` seconds. See [Changing
default survey behavior](#changing-default-survey-behavior) for details about
the default survey behavior, as well as how to change that behavior or opt-out
entirely.

* **stopsurvey**
`stopsurvey`<br>
**This command is deprecated and will be removed in a future release. It is no
longer necessary to explicitly stop a survey in the new time sliced survey
interface as these surveys expire automatically.**
Will stop the survey if one is running. Noop if no survey is running

* **startsurveycollecting**
`startsurveycollecting?nonce=NONCE`<br>
Start a survey in the collecting phase with a given nonce. Does nothing if a
survey is already running on the network as only one survey may run at a time.
See [Changing default survey behavior](#changing-default-survey-behavior) for
details about the default survey behavior, as well as how to change that
behavior or opt-out entirely.

* **stopsurveycollecting**
`stopsurveycollecting`<br>
Stop the collecting phase of the survey started in the previous
`startsurveycollecting` command. Moves the survey into the reporting phase.
Does nothing if no survey is running, or if a different node is running the
active survey.

* **surveytopologytimesliced**
`surveytopologytimesliced?node=NODE_ID&inboundpeerindex=INBOUND_INDEX&outboundpeerindex=OUTBOUND_INDEX`<br>
During the reporting phase of a survey, invoke this command to request
information recorded during the collecting phase from `NODE_ID`. This command
adds the survey request to a backlog; it does not immediately send the
request. Use `getsurveyresult` to see the response. A response will include
information about up to 25 inbound and outbound peers respectively. If a node
has more than 25 inbound and/or outbound peers, you will need to survey the
node multiple times to get the complete peer list. You can request peers
starting from a specific index in each peer list by setting `INBOUND_INDEX`
and `OUTBOUND_INDEX` appropriately. See [Changing default survey
behavior](#changing-default-survey-behavior) for details about the default
survey behavior, as well as how to change that behavior or opt-out entirely.
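
The paging described above can be sketched as follows. Only the page size of 25 and the query parameter names come from the documentation; `pageStartIndices` and `timeSlicedQuery` are hypothetical helpers, and the sketch assumes the total peer count is already known from an earlier response (`numTotalInboundPeers`/`numTotalOutboundPeers`).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Maximum number of peers returned per direction in one response.
constexpr uint32_t PEERS_PER_RESPONSE = 25;

// Starting indices needed to cover `totalPeers` peers in one direction.
// Always returns at least index 0, since one request is needed to learn the
// totals in the first place.
std::vector<uint32_t>
pageStartIndices(uint32_t totalPeers)
{
    std::vector<uint32_t> starts{0};
    // 64-bit loop variable avoids wrap-around for very large inputs.
    for (uint64_t i = PEERS_PER_RESPONSE; i < totalPeers;
         i += PEERS_PER_RESPONSE)
    {
        starts.push_back(static_cast<uint32_t>(i));
    }
    return starts;
}

// Builds the argument passed to `stellar-core http-command`.
std::string
timeSlicedQuery(std::string const& nodeId, uint32_t inboundIndex,
                uint32_t outboundIndex)
{
    return "surveytopologytimesliced?node=" + nodeId +
           "&inboundpeerindex=" + std::to_string(inboundIndex) +
           "&outboundpeerindex=" + std::to_string(outboundIndex);
}
```

For a node with 60 inbound peers, this yields requests starting at inbound indices 0, 25, and 50.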

* **getsurveyresult**
`getsurveyresult`<br>
Returns the current survey results. The results will be reset every time a new survey
is started
is started. Use this command for both the time sliced survey interface and
the old deprecated survey interface.

#### Changing default survey behavior
By default, this node will respond to/relay a survey message if the message
originated from a node in its transitive quorum. This behavior can be overridden
by adding keys to `SURVEYOR_KEYS` in the config file, which will be the set of
keys to check instead of the transitive quorum. If you would like to opt-out of
this survey mechanism, just set `SURVEYOR_KEYS` to `$self` or a bogus key.
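
The decision described above can be sketched as a single predicate. This is an illustration under stated assumptions, not the stellar-core implementation: `shouldHandleSurveyMessage` is a hypothetical function, keys are modeled as plain strings, and `$self` is assumed to have been expanded to the node's own key already.

```cpp
#include <set>
#include <string>

// Returns true if a survey message that originated from `surveyor` should be
// relayed or answered. Hypothetical helper for illustration.
bool
shouldHandleSurveyMessage(std::string const& surveyor,
                          std::set<std::string> const& surveyorKeys,
                          std::set<std::string> const& transitiveQuorum)
{
    // A non-empty SURVEYOR_KEYS replaces the transitive quorum check.
    auto const& allowed =
        surveyorKeys.empty() ? transitiveQuorum : surveyorKeys;
    return allowed.count(surveyor) > 0;
}
```

A node that sets `SURVEYOR_KEYS` to only its own key therefore ignores every other surveyor, even one in its transitive quorum.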

### The following HTTP commands are exposed on test instances
* **generateload** `generateload[?mode=
7 changes: 7 additions & 0 deletions docs/stellar-core_example.cfg
@@ -495,6 +495,13 @@ ALLOW_LOCALHOST_FOR_TESTING=false
# before applying transactions.
CATCHUP_WAIT_MERGES_TX_APPLY_FOR_TESTING=false

# ARTIFICIALLY_SET_SURVEY_PHASE_DURATION_FOR_TESTING (in minutes), defaults to
# no override. Overrides the maximum survey phase duration for both the
# collecting and reporting phase to the specified value. Performs no override if
# set to 0. Do not use in production. This option is ignored in builds without
# tests enabled.
ARTIFICIALLY_SET_SURVEY_PHASE_DURATION_FOR_TESTING=0
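
The override rule in the comment above amounts to "0 means use the built-in maximum". A minimal sketch, assuming a hypothetical helper `effectivePhaseDuration`; the built-in maximum is passed in as a parameter because its value is not stated here.

```cpp
#include <chrono>

// Returns the phase duration limit to apply. An override of 0 minutes means
// "no override". Hypothetical helper for illustration.
std::chrono::minutes
effectivePhaseDuration(std::chrono::minutes defaultMax,
                       std::chrono::minutes overrideForTesting)
{
    return overrideForTesting.count() == 0 ? defaultMax : overrideForTesting;
}
```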

# PEER_READING_CAPACITY defaults to 200
# Controls how many messages from a particular peer
# core can process simultaneously, and throttles reading from a peer when at
8 changes: 6 additions & 2 deletions hash-xdrs.sh
@@ -25,7 +25,11 @@ namespace stellar {
extern const std::vector<std::pair<std::filesystem::path, std::string>> XDR_FILES_SHA256 = {
EOF

sha256sum -b $1/xdr/*.x | grep -v Stellar-internal | perl -pe 's/([a-f0-9]+)[ \*]+(.*)/{"$2", "$1"},/'
# Hashes to ignore
IGNORE="Stellar-internal\|Stellar-overlay\|Stellar-contract-spec\|Stellar-contract-meta\|Stellar-contract-env-meta"

echo '{"", ""}};'
sha256sum -b $1/xdr/*.x | grep -v "${IGNORE}" | perl -pe 's/([a-f0-9]+)[ \*]+(.*)/{"$2", "$1"},/'

# Add empty entries for the 5 skipped files
echo '{"", ""}, {"", ""}, {"", ""}, {"", ""}, {"", ""}};'
echo '}'
16 changes: 16 additions & 0 deletions src/herder/HerderSCPDriver.cpp
@@ -13,6 +13,8 @@
#include "ledger/LedgerManager.h"
#include "main/Application.h"
#include "main/ErrorMessages.h"
#include "overlay/OverlayManager.h"
#include "overlay/SurveyManager.h"
#include "scp/SCP.h"
#include "scp/Slot.h"
#include "util/Logging.h"
@@ -1034,6 +1036,13 @@ HerderSCPDriver::recordSCPExternalizeEvent(uint64_t slotIndex, NodeID const& id,
mSCPMetrics.mFirstToSelfExternalizeLag,
"first to self externalize lag",
std::chrono::nanoseconds::zero(), slotIndex);
mApp.getOverlayManager().getSurveyManager().modifyNodeData(
[&](CollectingNodeData& nd) {
nd.mSCPFirstToSelfLatencyNsHistogram.Update(
std::chrono::duration_cast<std::chrono::nanoseconds>(
now - *timing.mFirstExternalize)
.count());
});
}
if (!timing.mSelfExternalize || forceUpdateSelf)
{
@@ -1052,6 +1061,13 @@ HerderSCPDriver::recordSCPExternalizeEvent(uint64_t slotIndex, NodeID const& id,
fmt::format(FMT_STRING("self to {} externalize lag"),
toShortString(id)),
std::chrono::nanoseconds::zero(), slotIndex);
mApp.getOverlayManager().getSurveyManager().modifyNodeData(
[&](CollectingNodeData& nd) {
nd.mSCPSelfToOtherLatencyNsHistogram.Update(
std::chrono::duration_cast<std::chrono::nanoseconds>(
now - *timing.mSelfExternalize)
.count());
});
}

// Record lag for other nodes