api: Use exponential backoff in call_on_each_event. #586
Conversation
# NOTE: Back off exponentially to cover against potential bugs in this
# library causing a DoS attack against a server when getting errors
# (explicit values listed for clarity)
backoff = RandomExponentialBackoff(maximum_retries=10,
Maybe we can go with `maximum_retries=50`, i.e. basically arranging it so that it will eventually give up, but only if it's really clear the server is down for good and not just having temporary downtime.

I also notice there's a `time.sleep(1)` in `do_register`; probably that should be changed as well. (Or maybe `do_register` should be folded into this loop?)
With this PR, the behavior is changed from before, which is a little concerning, though arguably useful: the handler will return after some time, rather than loop forever, so the caller can decide whether to retry?

We could put the backoff parameters into the method call, or something like them; that could also enable the previous behavior (the infinite loop), but with backoff?
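Concretely, a parameterized version might look like the hypothetical sketch below; the `maximum_retries` keyword and the "None means retry forever" convention are assumptions for illustration, not an API anyone has committed to.

```python
import sys
from zulip import RandomExponentialBackoff

def call_on_each_event(self, callback, event_types=None, narrow=None,
                       maximum_retries=10):
    # maximum_retries=None would preserve the old behavior (never return),
    # while still backing off between consecutive failures.
    retries = sys.maxsize if maximum_retries is None else maximum_retries
    backoff = RandomExponentialBackoff(maximum_retries=retries)
    while backoff.keep_going():
        ...  # register / poll / dispatch as in the PR
```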
I think it is correct to potentially fail, so the caller knows something is horribly wrong. We have a bunch of clients that are forever looping and getting 401s during registration, and I don't think their owners have any sign that anything is going wrong.

I think we should raise an exception if we hit the max retries, since existing code is likely not expecting the function to ever return; raising an exception means it won't unexpectedly fall through to other code.

This behavior change certainly merits a documentation update as well.
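For illustration, raising on exhausted retries might look like the sketch below. `do_one_poll` is a hypothetical stand-in for one register/poll attempt, and whether to reuse the library's `ZulipError` or add a dedicated subclass is an open choice.

```python
from zulip import RandomExponentialBackoff, ZulipError

def do_one_poll() -> bool:
    """Hypothetical helper: one register/get_events attempt; True on success."""
    ...

backoff = RandomExponentialBackoff(maximum_retries=50)
while backoff.keep_going():
    if do_one_poll():
        backoff.succeed()
    else:
        backoff.fail()
# keep_going() returned False: retries are exhausted. Fail loudly rather
# than silently returning to callers that expect this never to return.
raise ZulipError("Gave up contacting the server after repeated failures")
```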
Previously we paused 1s after each failure.
Force-pushed from 36588c4 to dc4404f.
@timabbott I've added some followup commits which first do backoff for the […]. There's also a 1s sleep in the […]
# library causing a DoS attack against a server when getting errors
# (explicit values listed for clarity)
backoff = RandomExponentialBackoff(maximum_retries=10,
                                   timeout_success_equivalent=300,
I don't think we want a `timeout_success_equivalent` on this. That's for cases where we don't have the ability to put an explicit `success()` call in, and we want to be able to reset our backoff after 5 minutes on the assumption that that's a "success". Here, we have places we can mark as successful. Having a `timeout_success_equivalent` muddies the logic -- and I think it won't kick in unless we have 5-minute HTTP requests, which should really not be treated as successes. :)

I think we certainly should. If we start to fall over due to overload and start returning 5xx, I don't think we want every client switching from one request every 10s (longpoll) to one every second -- which is what the current logic does.
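To illustrate the first point: with an explicit reset available after each successful long-poll, the timer heuristic has no role to play. A hypothetical sketch (`poll_forever` and `handle` are stand-in names):

```python
from zulip import RandomExponentialBackoff

def poll_forever(client, queue_id, last_event_id, handle):
    # No timeout_success_equivalent: reset the backoff explicitly on each
    # successful long-poll instead of assuming success after 5 idle minutes.
    backoff = RandomExponentialBackoff(maximum_retries=10)
    while backoff.keep_going():
        res = client.get_events(queue_id=queue_id, last_event_id=last_event_id)
        if res["result"] == "success":
            backoff.succeed()  # explicit reset; no timer heuristic needed
            for event in res["events"]:
                last_event_id = max(last_event_id, event["id"])
                handle(event)
        else:
            backoff.fail()  # sleeps with a randomized, growing delay
```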
Heads up @neiljp, we just merged some commits that conflict with the changes you made in this pull request! You can review this repository's recent commits to see where the conflicts occur. Please rebase your feature branch against the upstream branch and resolve your pull request's conflicts accordingly.
A long-term TODO here has been to move from a 1s pause after each failure towards an exponential backoff approach, as proposed here.
@timabbott See the `last_event_id` handling topic on #zulip-terminal from last August. I'm unsure of the best parameter set to use here; I've taken values from other uses of the backoff and filled the default values in explicitly to aid in reasoning.
I'm not sure whether we want to switch to `backoff.keep_going` or rather stay with the previous `while True` approach, since this is essentially an unending event loop.
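The two options side by side, as a hypothetical sketch (`poll` stands in for one register/poll attempt returning True on success):

```python
from zulip import RandomExponentialBackoff

def loop_with_retry_limit(poll) -> None:
    # This PR's approach: the loop ends once retries are exhausted.
    backoff = RandomExponentialBackoff(maximum_retries=10)
    while backoff.keep_going():
        backoff.succeed() if poll() else backoff.fail()

def loop_forever(poll) -> None:
    # Previous semantics: never give up, but still let backoff.fail()
    # sleep with growing delays between consecutive failures.
    backoff = RandomExponentialBackoff(maximum_retries=10)
    while True:
        backoff.succeed() if poll() else backoff.fail()
```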