Skip to content

Conversation

joshua-kim
Copy link
Contributor

@joshua-kim joshua-kim commented Aug 23, 2023

Why this should be merged

Implements a sliding window throttler to the SDK: https://blog.cloudflare.com/counting-things-a-lot-of-different-things/

This works by having the caller configure an evaluation period and a maximum amount of calls per evaluation period. A current and previous window of calls are tracked, where each window contains a map of nodes to the amount of calls they made. To determine if a call for a node should be throttled or not, we estimate how many calls they've made in the last window of time.

This is equivalent to the amount of calls they made in the current evaluation period (always accurate), plus a weighted sum of the amount of calls they made in the previous evaluation period (estimation).

How this works

See article. We track two windows, the current and previous evaluation period. We approximate TPS of a request by adding the current window (exact number), plus a weighted sum of the previous evaluation number (estimation).

How this was tested

Added unit test

@joshua-kim joshua-kim changed the base branch from master to sdk-noop-handler August 23, 2023 18:16
@joshua-kim joshua-kim self-assigned this Aug 23, 2023
@joshua-kim joshua-kim added the sdk This involves SDK tooling or frameworks label Aug 23, 2023
@joshua-kim joshua-kim requested a review from danlaine August 23, 2023 18:44
Base automatically changed from sdk-noop-handler to dev August 23, 2023 19:06
@joshua-kim
Copy link
Contributor Author

joshua-kim commented Aug 24, 2023

Had a chat w/ Stephen and realized that it's not really possible to use token buckets here since we need a way to clean up old buckets that aren't used anymore, but also make it not possible for people to reset their throttling bucket by just disconnecting and reconnecting. We probably only want to throttle people if they're a validator, we can use the validator handler we're implementing to drop messages before passing them into this throttling handler.

@joshua-kim joshua-kim marked this pull request as draft August 24, 2023 16:03
weight := float64(s.period-offset) / float64(s.period)

if weight*float64(s.previous.hits[nodeID])+float64(s.current.hits[nodeID]) < float64(s.limit) {
s.current.hits[nodeID]++
Copy link
Contributor Author

@joshua-kim joshua-kim Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to think about if this can overflow. I don't think it can since the amount of hits in a window should always be less than limit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is unlikely to overflow an int count given the periods we'll configure. I'd just document that we don't check for overflow here and move on.

Copy link
Contributor

@abi87 abi87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits and requests for clarifications.
Love the reference, makes context super clear

// check if the current evaluation period is over
now := s.clock.Time()
if now.After(s.current.start.Add(s.period)) {
// discard the current window if it's too old
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really to have a correct previous window right?
Current will become previous in this if. Only if system has been very quiet for some time, we should reset previous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We always want to rotate out our current window to the previous window. If the current window is too old however (i.e we haven't seen any traffic in a while), we need to reset the current window before rotating it to avoid over-throttling the node.


func TestSlidingWindowThrottlerHandle(t *testing.T) {
period := time.Minute
previousWindowStartTime := time.Time{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could this be time.Now(). I like it slightly changes from test to test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just didn't do time.Now() to avoid the variance in tests. I don't think it matters in this test specifically but I just generally try to avoid using current time/rng/anything that might change from test-to-test in unit tests. I think for something like an integration test it's fine to do though.

}

// NewSlidingWindowThrottler returns a new instance of SlidingWindowThrottler
func NewSlidingWindowThrottler(period time.Duration, limit int) *SlidingWindowThrottler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we return Throttler interface instead of the struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied below


// SlidingWindowThrottler is an implementation of the sliding window throttling
// algorithm.
type SlidingWindowThrottler struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to export this struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to. We could return the interface but my personal opinion is that it's good to just return the struct so the caller can choose whether to use it as a struct or as an interface. Returning as an interface gets rid of what the underlying type is to the caller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accept interface, return structs

joshua-kim and others added 3 commits August 28, 2023 13:46
Co-authored-by: Ceyhun Onur <ceyhun.onur@avalabs.org>
Signed-off-by: Joshua Kim <20001595+joshua-kim@users.noreply.github.com>
offset := now.Sub(s.current.start)
weight := float64(s.period-offset) / float64(s.period)

if weight*float64(s.previous.hits[nodeID])+float64(s.current.hits[nodeID]) < float64(s.limit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't weight apply to the current period? if so, should we add a test case to cover this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could have documented this better. The weight is applied to only the previous window. When we calculate how many calls a node issued in an evaluation period, we have to add the current + weighted sum of the previous window. This is because the current window is always ongoing, so if we're 75% through the current window, we know we still have to add up 25% of the previous window. The current window's number is always accurate, but the previous window is always an estimate since we throw a weight at it that assumes uniform distribution of traffic in the window. Cloudflare's blog post shows that this assumption is good enough and it only results in mis-throttling in less than a fraction of a percent of cases.

joshua-kim and others added 3 commits August 28, 2023 23:43
Co-authored-by: Darioush Jalali <darioush.jalali@avalabs.org>
Signed-off-by: Joshua Kim <20001595+joshua-kim@users.noreply.github.com>
// Handle returns true if the amount of calls received in the last [s.period]
// time is less than [s.limit]
//
// This is calculated by adding the current period's count to a weighted count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it would probably be clearer if we phrase this as "estimated count of the previous period" instead of "weighted count".
I can see how weight can make us think of a linear combination of current period and previous period (which we don't have since current period is "weighted by physics"). Stil maybe use estimate instead of weight may help readability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct in that it's an estimate, but it's an estimate of the amount of calls made in X% of the previous window (so kind of like a partial estimate?)... so I felt like a weighted sum was more correct than calling it an estimate. When I wrote this up I was unhappy at the wording but wasn't sure how to make it any more clear.

}

offset := now.Sub(s.current.start)
weight := float64(s.period-offset) / float64(s.period)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we specify / enforce that s.period > 0 to avoid division by 0?

Copy link
Contributor Author

@joshua-kim joshua-kim Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could validate that period and limit are both positive ints at creation for sure. I feel like it's something that's pretty much always gonna be a hard-coded constant and look wrong (passing in -time.Second sure does look strange...) and returning an error is annoying which is why I didn't do it.

For now I'll just document that it should be positive? If you feel strongly I can return an error in the constructor but I felt like maybe it was overly verbose.

Copy link
Contributor

@StephenButtolph StephenButtolph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@StephenButtolph StephenButtolph added this to the v1.10.10 milestone Aug 29, 2023
@StephenButtolph StephenButtolph merged commit a476781 into dev Aug 30, 2023
@StephenButtolph StephenButtolph deleted the sdk-throttler branch August 30, 2023 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sdk This involves SDK tooling or frameworks
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants