Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new time sliced overlay survey #4275

Merged
merged 1 commit into from
May 21, 2024
Merged

Conversation

bboston7
Copy link
Contributor

@bboston7 bboston7 commented Apr 5, 2024

Description

Resolves #4169, #4164. Partially resolves #2592 (script update remains).

This pull request adds a new survey mode that aims to be more deterministic and easier to compare across runs. It also adds some additional fields to the survey schema. It is very faithful to the design outlined in the Unified Network Survey Work Proposal.

While this PR includes all of the new functionality, it is still missing a few things I'd like to do before it's completely done:

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

Copy link
Contributor

@marta-lokhova marta-lokhova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkpointing here: I'm still reviewing data collection/relaying portion. Thanks for the changes! Appreciate how clean and meticulous the code is :) Most of my suggestions are minor; I did leave a few questions around backward compatibility when connecting to old versions of core as well as the future of old-style survey.

src/main/CommandHandler.cpp Outdated Show resolved Hide resolved
src/main/CommandHandler.cpp Show resolved Hide resolved
src/main/CommandHandler.cpp Show resolved Hide resolved
src/overlay/SurveyDataManager.h Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.h Outdated Show resolved Hide resolved
@@ -221,6 +222,7 @@ class Application
virtual WorkScheduler& getWorkScheduler() = 0;
virtual BanManager& getBanManager() = 0;
virtual StatusManager& getStatusManager() = 0;
virtual SurveyDataManager& getSurveyDataManager() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to bump OVERLAY_PROTOCOL_VERSION to indicate a version of core that supports new-style survey.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a108620

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if this change lands in v21.1.0, we will need to bump overlay version again from 33 to 34, as we expect to bump it to 33 in v21.0.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder to bump the version to 34.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumped in 0859d27

src/overlay/SurveyManager.cpp Show resolved Hide resolved
src/overlay/Peer.cpp Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.h Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Show resolved Hide resolved
src/main/ApplicationImpl.h Outdated Show resolved Hide resolved
src/overlay/SurveyManager.cpp Show resolved Hide resolved
src/overlay/SurveyMessageLimiter.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Show resolved Hide resolved
src/overlay/SurveyManager.cpp Show resolved Hide resolved
@bboston7
Copy link
Contributor Author

bboston7 commented May 2, 2024

I just squashed and rebased this on top of master to pull in the new overlay changes.

src/overlay/SurveyDataManager.cpp Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyManager.h Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.h Outdated Show resolved Hide resolved
src/overlay/SurveyDataManager.h Outdated Show resolved Hide resolved
@@ -221,6 +222,7 @@ class Application
virtual WorkScheduler& getWorkScheduler() = 0;
virtual BanManager& getBanManager() = 0;
virtual StatusManager& getStatusManager() = 0;
virtual SurveyDataManager& getSurveyDataManager() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if this change lands in v21.1.0, we will need to bump overlay version again from 33 to 34, as we expect to bump it to 33 in v21.0.0.

Copy link
Contributor

@marta-lokhova marta-lokhova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the updates! I spent a bit more time thinking through non-happy-path scenarios, and left a few suggestions there. I'd say we should merge this PR as-is, then just do a follow-up with the last few tweaks (to make PR management easier). Looks like there are some merge conflicts, so let's just address those and squash commits. Thanks again!

src/overlay/test/OverlayManagerTests.cpp Show resolved Hide resolved
@@ -221,6 +222,7 @@ class Application
virtual WorkScheduler& getWorkScheduler() = 0;
virtual BanManager& getBanManager() = 0;
virtual StatusManager& getStatusManager() = 0;
virtual SurveyDataManager& getSurveyDataManager() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder to bump the version to 34.

src/overlay/SurveyDataManager.cpp Outdated Show resolved Hide resolved
SurveyPhase phase();

// Reset survey data. Intended to be called when survey data expires.
void reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is expiration only detected when we call phase? If so, I think that means we don't know when survey data is going to be reset exactly. I think this is fine now since we call it every time we call modifyNodeData, but that seems like a footgun in case implementation of modifyNodeData changes. To be safe, can we check expiration and reset survey on a periodic basis? Simplest i think is to add the check to OverlayManager::tick, which is called every few seconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a good point about this being easy to screw up if/when we modify this in the future. I got rid of the phase() function and replaced it with a member variable mPhase that the various start/stop functions update. I also added a function updateSurveyPhase that tick calls that serves to reset the survey on expiration.

bool
SurveyDataManager::nonceIsReporting(uint32_t nonce)
{
return phase() == SurveyPhase::REPORTING && mNonce == nonce;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these two conditions be swapped? This way we won't reset the current survey if another surveyor erroneously sends a survey request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(actually, this probably suggests that phase shouldn't be modifying state, and is related to my question around periodically checking state)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the removal of the phase() function I don't think order matters here anymore

{
ZoneScoped;

if (phase() == SurveyPhase::REPORTING && mNonce == request.nonce &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, should we check nonce first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also no longer matters with the removal of the phase() function

@@ -16,6 +18,74 @@ namespace stellar

uint32_t const SurveyManager::SURVEY_THROTTLE_TIMEOUT_MULT(3);

uint32_t constexpr TIME_SLICED_SURVEY_MIN_OVERLAY_PROTOCOL_VERSION = 33;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder: this should be 34.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0859d27

// Collecting phase is limited to 2 hours. If 2 hours pass without receiving
// a StopSurveyCollecting message the `SurveyDataManager` will reset all data
// and transition to the `INACTIVE` phase.
constexpr std::chrono::hours COLLECTING_PHASE_MAX_DURATION{2};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in using this timeout to mark the end of collecting phase, and still allow surveyors to query results? Wondering if this should set to something we'd expect to use in reality, like 30 minutes? This may help maximize chances of more nodes responding to survey, as I imagine there'll be situations where messages are dropped due to our filtering policies (we'd still use the stop message to end survey earlier when needed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I shortened the max collecting phase duration to 30 minutes and changed the code to automatically transition to the reporting phase if the collecting phase expires in 160d1f8

@bboston7 bboston7 marked this pull request as ready for review May 16, 2024 23:19
@bboston7
Copy link
Contributor Author

I've rebased and squashed this down to a single commit in preparation for merging!

@marta-lokhova
Copy link
Contributor

r+ 075df81

@latobarita latobarita merged commit 7676070 into stellar:master May 21, 2024
15 checks passed
@bboston7 bboston7 deleted the survey branch September 16, 2024 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider improving network survey schema overlay survey should look and act more deterministic
4 participants