
streamer send destination metrics for repair, gossip #21564

Merged: 16 commits from repair-tracking into solana-labs:master, Dec 17, 2021

Conversation

@jbiseda (Contributor) commented Dec 2, 2021

Problem

There may be unnecessary outgoing repair traffic. To investigate, determine the distribution of target hosts for nodes with a high outgoing packet count.

Summary of Changes

Track outgoing packet destination metrics in streamer::recv_send for both repair and gossip traffic, recording the distribution of target addresses.

Fixes #
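Conceptually, the change keeps a per-destination map in the send path and periodically reports its distribution. A minimal sketch of the idea (type and field names are assumptions, not the PR's exact code):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical per-destination counters; the real change also feeds a
// histogram of per-host byte counts when reporting.
#[derive(Default)]
struct SendStats {
    bytes: u64,
    count: u64,
}

#[derive(Default)]
struct StreamerSendStats {
    host_map: HashMap<SocketAddr, SendStats>,
}

impl StreamerSendStats {
    // Called once per outgoing packet from the send path.
    fn record(&mut self, dest: SocketAddr, bytes: u64) {
        let entry = self.host_map.entry(dest).or_default();
        entry.bytes += bytes;
        entry.count += 1;
    }
}
```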

```diff
@@ -171,6 +279,7 @@ pub fn responder(
             last_print = now;
             errors = 0;
         }
+        stats.maybe_submit(name);
```
Contributor:

There is a datapoint already using the name. I'm not sure if it conflicts; seems like it should be ok though.

Contributor Author:

It's using `name` to log "errors" in the responder loop. There shouldn't be a conflict.
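For context, solana-metrics datapoints pair a measurement name with named fields, so two submissions can share a name without clobbering each other; a rough illustration (the field names here are made up):

```rust
use solana_metrics::datapoint_info;

fn report(name: &'static str, errors: i64, host_bytes_max: i64) {
    // Same measurement name, different fields: each call emits its own
    // point, so the existing "errors" datapoint and the new streamer
    // stats should not clash.
    datapoint_info!(name, ("errors", errors, i64));
    datapoint_info!(name, ("host_bytes_max", host_bytes_max, i64));
}
```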

@codecov (bot) commented Dec 2, 2021

Codecov Report

Merging #21564 (cfc0514) into master (5a28f61) will increase coverage by 0.0%.
The diff coverage is 92.6%.

```
@@           Coverage Diff            @@
##           master   #21564    +/-   ##
========================================
  Coverage    81.3%    81.3%
========================================
  Files         516      517     +1
  Lines      144185   144278    +93
========================================
+ Hits       117237   117346   +109
+ Misses      26948    26932    -16
```

Comment on lines 139 to 144:

```rust
const SUBMIT_CADENCE: Duration = Duration::from_secs(10);
const MAX_REPORT_ENTRIES: usize = 5;
let elapsed = self.since.as_ref().map(Instant::elapsed);
if elapsed.map(|e| e < SUBMIT_CADENCE).unwrap_or_default() {
    return;
}
```
Contributor:

This probably also needs to submit stats if, for example, `host_map.len() > 1000`, so that the hashmap does not grow too large.
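A sketch of the suggested guard, combining the existing cadence check with a size cap (`MAX_HOSTS` and `should_submit` are hypothetical names):

```rust
use std::time::{Duration, Instant};

const SUBMIT_CADENCE: Duration = Duration::from_secs(10);
// Assumed cap: flush early so host_map cannot grow without bound.
const MAX_HOSTS: usize = 1_000;

fn should_submit(since: Option<Instant>, num_hosts: usize) -> bool {
    // Submit when the cadence has elapsed (or nothing has been
    // submitted yet), or when the map has grown past the cap.
    since.map_or(true, |t| t.elapsed() >= SUBMIT_CADENCE) || num_hosts > MAX_HOSTS
}
```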

@behzadnouri (Contributor) left a comment:

looks good. just wondering if the added overhead is tolerable

```rust
),
(
    "streamer-send-host_bytes_10pct",
    hist.percentile(10.0).unwrap_or_default(),
```
Contributor:

Wondering if these hist.percentile calculations are efficient enough not to cause too much overhead/stalling here.

@jbiseda (Contributor Author) commented Dec 3, 2021:

I don't think it's concerning. For a histogram with 1000 datapoints, I'm seeing ~600µs cumulative for all the histogram samples taken in datapoint_info.
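A back-of-envelope way to reproduce that measurement, assuming the `histogram` crate's 0.6-era API (which the `hist.percentile(10.0).unwrap_or_default()` call above suggests):

```rust
use std::time::Instant;

fn main() {
    // Build a histogram with ~1000 datapoints, then time the
    // percentile reads a report would take.
    let mut hist = histogram::Histogram::new();
    for i in 0..1_000u64 {
        let _ = hist.increment(i);
    }
    let start = Instant::now();
    for pct in [10.0, 25.0, 50.0, 75.0, 90.0, 100.0] {
        let _ = hist.percentile(pct);
    }
    println!("percentile reads took {:?}", start.elapsed());
}
```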

Contributor:

What does that mean for the streamer throughput?

Contributor Author:

The reporting is run on a 10s interval, but I could move this up a level out of the streamer.

Contributor Author:

Updated the changes with a new reporting thread to minimize overhead in the streamer thread. Thoughts?
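A sketch of that shape: the send loop stays cheap by pushing records over a channel, while a dedicated thread owns the map and the periodic submit (names illustrative):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::mpsc::{channel, Sender};
use std::thread;

fn spawn_reporter() -> (Sender<(SocketAddr, u64)>, thread::JoinHandle<()>) {
    let (sender, receiver) = channel::<(SocketAddr, u64)>();
    let handle = thread::spawn(move || {
        let mut host_map: HashMap<SocketAddr, u64> = HashMap::new();
        // Drains until all senders are dropped; a cadence-based
        // maybe_submit-style flush would also run inside this loop.
        for (addr, bytes) in receiver {
            *host_map.entry(addr).or_default() += bytes;
        }
    });
    (sender, handle)
}
```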

```rust
let ent = self
    .host_map
    .entry(pkt.meta.addr)
    .or_insert_with(SendStats::default);
```

Contributor Author:

updated
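As a general note, `.or_insert_with(SendStats::default)` can be shortened to `.or_default()` whenever the value type implements `Default`, so the updated line presumably reads something like:

```rust
let ent = self.host_map.entry(pkt.meta.addr).or_default();
```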

@behzadnouri (Contributor):

lgtm, plz confirm with Stephan that there is no concern re overhead added to streamer

@jbiseda requested a review from sakridge, December 7, 2021 08:41
@jbiseda (Contributor Author) commented Dec 15, 2021

Collected some measurements (each a sum over the 10s reporting period, in microseconds):

- duration_recv_send: total duration of all recv_send() calls.
- duration_record: total duration of recording per-packet data in host_map.
- duration_submit: total duration of calls to maybe_submit().

Overall this comes to ~2% overhead.
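Such interval sums are typically collected by timing each call and adding into a counter that is drained at report time; a sketch (the counter name mirrors the metric above, not the PR's exact code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Accumulates microseconds across calls; read and reset every 10s
// when the datapoint is submitted.
static DURATION_RECV_SEND_US: AtomicU64 = AtomicU64::new(0);

fn timed_recv_send() {
    let start = Instant::now();
    // ... the recv_send() body would run here ...
    DURATION_RECV_SEND_US
        .fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
}
```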

[Screenshot of the collected metrics, Dec 14, 2021]

@sakridge thoughts?

@sakridge (Contributor) left a comment:

ok, lgtm

@jbiseda merged commit 97a1fa1 into solana-labs:master, Dec 17, 2021
@jbiseda deleted the repair-tracking branch December 17, 2021 23:27