
Remove intent queue warnings and dynamically adjust max queue size #3705

Closed

slackpad opened this issue Nov 21, 2017 · 2 comments

Labels: theme/operator-usability, type/enhancement
Milestone: 1.0.2

Comments


slackpad commented Nov 21, 2017

This will be fixed in Serf, but opening it here for tracking and to make sure we pull the updated Serf into Consul's vendored dependencies.

Serf uses a hard-coded threshold of 128 for its intent queue depth warning, checked by a goroutine that runs at 1 Hz, so the warning can be incredibly spammy. In large clusters, adding or removing many nodes at once can produce huge runs of these warnings even when the cluster is healthy:

2015/06/26 13:28:39 [WARN] serf: Intent queue depth: 4256

Work was done under #1062 to improve this, but the raw broadcast queue is unchanged, so it's still possible to get lots of warnings, and because they are tied to broadcasts they are likely to fire across the whole cluster at once, overwhelming the logs.
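
For context, here is a minimal sketch of the check loop described above; the identifiers are illustrative, not Serf's actual names. The point is that a 1 Hz poll warns whenever the queue is deeper than the hard-coded 128, so the warning repeats every second while a big join or leave is still being broadcast.

```go
package sketch

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// checkQueueDepth is a simplified stand-in for the current behavior: poll the
// intent queue once per second and warn past a hard-coded threshold of 128.
func checkQueueDepth(name string, q *memberlist.TransmitLimitedQueue,
	logger *log.Logger, shutdownCh <-chan struct{}) {
	const queueDepthWarning = 128
	for {
		select {
		case <-time.After(1 * time.Second):
			if depth := q.NumQueued(); depth > queueDepthWarning {
				// Fires on every check while the queue stays deep, which is
				// what makes large joins/leaves so noisy in the logs.
				logger.Printf("[WARN] serf: %s queue depth: %d", name, depth)
			}
		case <-shutdownCh:
			return
		}
	}
}
```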

Here are the proposed fixes:

  1. Remove the queue depth warning entirely and simply emit the gauge metric from the goroutine, which will allow operators to observe the queue behavior and warn if needed based on the size of the cluster.

  2. Make the max queue depth (used for dropping messages) proportional to the size of the cluster. It should have a minimum of 4096 to support small clusters and early operations, then expand past that as 2 * cluster size. This allows large bursts of messages for any cluster size while still bounding the RAM used by Serf (see the sketch after this list).

  3. We can also relax the check interval for the goroutine to something like 30 seconds to let the queue go over the max briefly if needed.
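
A minimal sketch of how items 1–3 could fit together, assuming the armon/go-metrics package for telemetry; the function names, the gauge key, and the Prune-style trim on the queue are illustrative rather than the final Serf/memberlist API.

```go
package sketch

import (
	"time"

	metrics "github.com/armon/go-metrics"
	"github.com/hashicorp/memberlist"
)

// maxQueueDepth is the dynamic drop threshold: at least 4096 so small or
// newly-forming clusters can absorb bursts, growing as 2x the cluster size.
func maxQueueDepth(clusterSize int) int {
	const minDepth = 4096
	if d := 2 * clusterSize; d > minDepth {
		return d
	}
	return minDepth
}

// monitorQueue emits a gauge instead of warning (item 1) and enforces the
// dynamic limit on a relaxed 30-second interval (items 2 and 3), which lets
// the queue exceed the max briefly between checks.
func monitorQueue(name string, q *memberlist.TransmitLimitedQueue,
	numNodes func() int, shutdownCh <-chan struct{}) {
	for {
		select {
		case <-time.After(30 * time.Second):
			depth := q.NumQueued()
			metrics.SetGauge([]string{"serf", "queue", name}, float32(depth))
			if limit := maxQueueDepth(numNodes()); depth > limit {
				// Placeholder for trimming the queue back down to the limit;
				// the oldest entries are the ones to drop.
				q.Prune(limit)
			}
		case <-shutdownCh:
			return
		}
	}
}
```

Operators can then alert on the emitted gauge with a threshold appropriate to their own cluster, rather than relying on a hard-coded warning inside Serf.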

slackpad added the type/enhancement and theme/operator-usability labels on Nov 21, 2017
slackpad added this to the 1.0.2 milestone on Nov 21, 2017
slackpad changed the title from "Clean up intent queue warnings and dynamically adjust size" to "Clean up intent queue warnings and dynamically adjust max queue size" on Nov 21, 2017
slackpad changed the title from "Clean up intent queue warnings and dynamically adjust max queue size" to "Remove intent queue warnings and dynamically adjust max queue size" on Nov 21, 2017

slackpad commented Dec 7, 2017

Keep a timestamp for each entry, similar to what we did for the node-name-based data structure, and purge the oldest entries when we are dropping messages. Those entries are much more likely to be stale, and we should do everything we can to avoid spurious rebroadcasts.

^ Removed this from the third item above, since the queued items are already in order and we prune the ones at the front.


slackpad commented Dec 7, 2017

Also realized that we scan the whole queue when we do TransmitLimitedQueue.QueueBroadcast, which might be a little spendy if we let the queue get too big. I'm thinking we could limit that search to the last N (~128) entries, which is probably a reasonable heuristic; a rough sketch is below.
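
Here is what bounding that scan could look like, written against a plain slice of memberlist.Broadcast values; this is illustrative only, not memberlist's actual internal representation, and the limit constant is the ~128 suggested above.

```go
package sketch

import "github.com/hashicorp/memberlist"

// invalidateScanLimit bounds how far back we look for older broadcasts that
// a newly queued message supersedes.
const invalidateScanLimit = 128

// queueBroadcast appends b after checking only the newest entries for
// broadcasts it invalidates, keeping the enqueue path cheap even when the
// queue is allowed to grow with the cluster.
func queueBroadcast(queue []memberlist.Broadcast, b memberlist.Broadcast) []memberlist.Broadcast {
	start := 0
	if len(queue) > invalidateScanLimit {
		start = len(queue) - invalidateScanLimit
	}
	for i := start; i < len(queue); i++ {
		if b.Invalidates(queue[i]) {
			// Let the superseded broadcast know it will never be sent, then
			// remove it. Entries older than the scan window are left alone,
			// trading a little duplication for bounded work per enqueue.
			queue[i].Finished()
			queue = append(queue[:i], queue[i+1:]...)
			i--
		}
	}
	return append(queue, b)
}
```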
