Flume, Metrics and Message control when overloaded #367

Merged (23 commits) on Aug 13, 2021

Conversation

@jsdw (Collaborator) commented Aug 12, 2021

  • Make use of flume throughout. It seems to benchmark well against the futures channels I've been using (see https://github.com/zesterer/flume/blob/master/misc/benchmarks.png), but importantly it also exposes a len() fn, so we can see how many messages are queued in each channel (see the sketch below this description).
  • Gather and expose metrics. The above gives us access to some useful metrics on top of what we'd already have. These are exposed in a format compatible with Prometheus (tested against a local Prometheus instance).
  • Test util updates. Remove the "worse" soak test runner, and add options to make it easier to test a few different things, and easier to scale the test runner to use more cores (which is needed for larger scale tests).

Builds on https://github.com/paritytech/substrate-telemetry/tree/jsdw-minor-fixes. Merge this after that.
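
A minimal sketch of the flume property mentioned above; this is illustrative code, not part of the PR:

```rust
// Minimal sketch: unlike the futures/tokio mpsc channels, both the flume Sender
// and Receiver expose len(), so queue depth can be read off and reported as a metric.
fn main() {
    let (tx, rx) = flume::unbounded::<&str>();

    tx.send("hello").unwrap();
    tx.send("world").unwrap();

    // Both ends report how many messages are currently queued.
    assert_eq!(tx.len(), 2);
    assert_eq!(rx.len(), 2);

    let _ = rx.recv().unwrap();
    assert_eq!(rx.len(), 1);
}
```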

@jsdw jsdw changed the base branch from master to jsdw-minor-fixes August 12, 2021 15:07
@jsdw jsdw marked this pull request as ready for review August 12, 2021 15:16
@jsdw (Collaborator, Author) commented Aug 12, 2021

Note to self: change base to master, and rebase/merge this after the jsdw-minor-fixes stuff merges

@jsdw jsdw force-pushed the jsdw-sharding-gatekeeper branch from 5c4b3cb to 05a3ba3 Compare August 12, 2021 15:57
@jsdw (Collaborator, Author) commented Aug 12, 2021

(force push to undo a commit I put on the wrong branch :))

@dvdplm (Contributor) left a comment

LGTM. Left a few nits and suggestions (likely for future work).

use super::*;

#[test]
fn len_doesnt_panic_if_lots_of_retired() {
Contributor:

Did you mean "removed" here?

@jsdw (Collaborator, Author) commented Aug 13, 2021

retired is referring to the list of retired IDs in the densemap struct :)

I could probably have made the grammar in that name a bit less awful!

Collaborator Author:

(I made the name less awful!)

use futures::{Stream, StreamExt};
use std::sync::Arc;

/// Receive messages out of a connection
pub struct Receiver {
pub(super) inner: mpsc::UnboundedReceiver<Result<RecvMessage, RecvError>>,
pub(super) inner: flume::r#async::RecvStream<'static, Result<RecvMessage, RecvError>>,
Contributor:

Eew, that is an unfortunate choice of module name.

Collaborator Author:

Yeah, I thought so! If I had to write that too much I'd probably use and rename it!
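
For illustration, the kind of rename alluded to above might look like this (hypothetical, not code from the PR):

```rust
// Hypothetical import rename so the awkward r#async raw-identifier path only
// has to be written once; fields can then use FlumeRecvStream<'static, ...> directly.
use flume::r#async::RecvStream as FlumeRecvStream;
```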

// in to the aggregator. If nobody is tolding the tx side of the channel
// any more, this task will gracefully end.
/// This is spawned into a separate task and handles any messages coming
/// in to the aggregator. If nobody is tolding the tx side of the channel
Contributor:

Suggested change
/// in to the aggregator. If nobody is tolding the tx side of the channel
/// in to the aggregator. If nobody is holding the tx side of the channel

pub chains_subscribed_to: usize,
/// How many feeds are currently subscribed to something.
pub subscribed_feeds: usize,
/// How many feeds have asked for finality information, too.
Contributor:

too?

Collaborator Author:

What "too" was hinting at is that it's the number of feeds that are subscribed to a chain and also have asked for finality (because you can't ask for finality without being subscribed)

Contributor:

I see. How about this: "Number of subscribed feeds that also asked for finality information."?

Collaborator Author:

I like it!

pub total_messages_to_feeds: usize,
/// How many messages are queued waiting to be handled by this aggregator.
pub total_messages_to_aggregator: usize,
/// How many nodes are currently known about by this aggregator.
Contributor:

Suggested change
/// How many nodes are currently known about by this aggregator.
/// How many nodes are currently known to this aggregator.

?

@@ -100,6 +101,30 @@ pub enum FromFeedWebsocket {
Disconnected,
}

/// A set of metrics returned when we ask for metrics
#[derive(Clone, Debug, Default)]
pub struct Metrics {
Contributor:

Thinking out loud here: it would be interesting to have a few histograms too, e.g. nodes and their verbosity level (to get an inkling of how much data we're receiving), message payload size in and out, and maybe connection longevity too (how long nodes stay connected).
Another data point I wouldn't mind having is the current message rate (is that what they call a "gauge"?).

Collaborator Author:

I think I return the current message rates already (I think gauge is the right term... there was a word for numbers that only ever increase and a word for values that can go up or down, and I think gauge is the latter!). I think we can see the bandwidth from outside the process anyway, so I wonder how useful the bytes in/out are? But that histogram sounds like a good idea; roughly breaking down the bytes in/out per node/feed could be interesting!

When this merges, perhaps we should create an issue to track any additional metrics we'd like to add?

Contributor:

When this merges, perhaps we should create an issue to track any additional metrics we'd like to add?

You read my mind.

msg,
ToAggregator::FromShardWebsocket(.., FromShardWebsocket::Update { .. })
) {
continue;
Contributor:

I'm tempted to ask for a log here, but maybe that's a bad thing in this scenario, logging when the node is already struggling?

Either way, I think collecting metrics on the number of dropped messages might prove useful.

Collaborator Author:

Ooh yeah, that's a really good number to have! I might sneak that into this PR as I don't think it'll be that hard to gather and it would be useful to see reported!

Contributor:

Please do!

Collaborator Author:

(done!)
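
A rough sketch of the drop-and-count behaviour agreed on here; the message enum, threshold, and handler are illustrative stand-ins, not the PR's actual types:

```rust
use flume::Receiver;

// Illustrative message type; the real code matches on
// ToAggregator::FromShardWebsocket(.., FromShardWebsocket::Update { .. }).
enum Msg {
    Update,
    Other,
}

// Hypothetical threshold; a real value would be tuned for the deployment.
const MAX_QUEUE_LEN: usize = 100_000;

async fn run(rx: Receiver<Msg>) {
    let mut dropped_messages: u64 = 0;
    while let Ok(msg) = rx.recv_async().await {
        // When the queue grows too long, shed low-priority Update messages rather
        // than falling further behind, and count how many were dropped so that
        // the metrics can report it.
        if rx.len() > MAX_QUEUE_LEN && matches!(msg, Msg::Update) {
            dropped_messages += 1;
            continue;
        }
        // ... handle `msg` here ...
        let _ = msg;
    }
    println!("dropped {} low-priority messages in total", dropped_messages);
}
```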

Comment on lines 489 to 490
// we just split out the text format that prometheus expects ourselves, using whatever the latest metrics that we've
// captured so far from the aggregators are. See:
Contributor:

Suggested change
// we just split out the text format that prometheus expects ourselves, using whatever the latest metrics that we've
// captured so far from the aggregators are. See:
// we just split out the text format that prometheus expects ourselves, and use the latest metrics that we've
// captured so far from the aggregators. See:
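
For anyone unfamiliar with the format referred to in this comment, here is a rough sketch of hand-writing the Prometheus text exposition format; the metric names below are made up for illustration and aren't necessarily the ones this PR uses:

```rust
// Sketch of the plain-text exposition format Prometheus scrapes: an optional
// `# TYPE name kind` line per metric, followed by `name value` lines.
fn render_metrics(connected_nodes: usize, queued_to_aggregator: usize) -> String {
    let mut out = String::new();
    out.push_str("# TYPE telemetry_connected_nodes gauge\n");
    out.push_str(&format!("telemetry_connected_nodes {}\n", connected_nodes));
    out.push_str("# TYPE telemetry_messages_to_aggregator gauge\n");
    out.push_str(&format!("telemetry_messages_to_aggregator {}\n", queued_to_aggregator));
    out
}
```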


// Sleep *at least* 10 seconds. If it takes a while to get metrics back, we'll
// end up waiting longer between requests.
tokio::time::sleep_until(now + tokio::time::Duration::from_secs(10)).await;
@insipx commented Aug 12, 2021

Why sleep_until instead of just sleep? Do we want to sleep 10 seconds from the start of execution, or at least 10 seconds after the metrics update?

Collaborator Author:

I could have gone either way really, but I went for "aim to have 10 seconds between each update to metrics, unless it takes more than 10 seconds for an update to come through"

Collaborator Author:

(To explain the reasoning a bit more: the actual logic to gather metrics is fairly quick, but the message could potentially sit in a queue for a while if we happen to be under heavy load. So if it takes e.g. 9 seconds to get a response, I'd rather put another message into the queue 1 second later than wait another 10 seconds before queueing the next one.)
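
To make the difference concrete, a small sketch of the loop shape being described (gather_metrics is a hypothetical stand-in for whatever actually requests the metrics):

```rust
use tokio::time::{sleep_until, Duration, Instant};

// Hypothetical stand-in for the real metrics request.
async fn gather_metrics() {}

async fn metrics_loop() {
    loop {
        let started = Instant::now();
        gather_metrics().await;
        // Aim for ~10 seconds between the *starts* of successive updates:
        // if gathering took e.g. 9 seconds, this only waits about 1 more second;
        // if it took longer than 10 seconds, the loop carries on immediately.
        sleep_until(started + Duration::from_secs(10)).await;
    }
}
```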

@@ -39,6 +37,9 @@ pub enum ToAggregator {
FromShardWebsocket(ConnId, FromShardWebsocket),
FromFeedWebsocket(ConnId, FromFeedWebsocket),
FromFindLocation(NodeId, find_location::Location),
/// Hand back some metrics. The provided sender is expected not to block when
/// a message it sent into it.

Suggested change
/// a message it sent into it.
/// a message is sent into it.

@insipx commented Aug 13, 2021

Looks good to me apart from a couple of questions. Overall I've had a good experience with flume; I like how sync/async methods are both available on the same Sender/Receiver objects.

Base automatically changed from jsdw-minor-fixes to master August 13, 2021 09:51
@jsdw (Collaborator, Author) commented Aug 13, 2021

Looks good to me apart from a couple of questions. Overall I've had a good experience with flume; I like how sync/async methods are both available on the same Sender/Receiver objects.

That's good to hear! I like the flexibility it has there too. Unlike the tokio channels it also has Stream/Sink impls, and unlike the futures/tokio channels you can see queue lengths (and on top of all of that, it's supposed to be pretty quick) :)
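
A small sketch of the flume ergonomics being praised here (again illustrative, not code from the PR):

```rust
use futures::StreamExt;

async fn demo() {
    let (tx, rx) = flume::unbounded::<u32>();

    tx.send(1).unwrap();             // sync send...
    tx.send_async(2).await.unwrap(); // ...and async send on the very same Sender

    assert_eq!(rx.len(), 2);         // queue length is visible, unlike futures/tokio mpsc

    let mut stream = rx.into_stream(); // Stream impl, ready for StreamExt combinators
    assert_eq!(stream.next().await, Some(1));
    assert_eq!(stream.next().await, Some(2));
}
```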

@@ -138,7 +138,7 @@ mod test {
use super::*;

#[test]
fn len_doesnt_panic_if_lots_of_retired() {
fn len_doesnt_panic_if_lots_of_ids_are_retired() {
Contributor:

❤️

@dvdplm (Contributor) commented Aug 13, 2021

Last changes still lgtm! Merge at will.

@jsdw jsdw merged commit 502fd2e into master Aug 13, 2021
@jsdw jsdw deleted the jsdw-sharding-gatekeeper branch August 13, 2021 13:35