
api: Use exponential backoff in call_on_each_event. #586


Open · wants to merge 3 commits into main
Conversation

@neiljp (Contributor) commented Apr 18, 2020

A long-term TODO here has been to move from a 1s pause after each failure towards an exponential backoff approach, as proposed here.

@timabbott See last_event_id handling topic on #zulip-terminal from last August.

I'm unsure of the best parameter set to use here; I've taken values from other uses of the backoff and filled the default values in explicitly to aid in reasoning.

I'm not sure we want to switch to backoff.keep_going(); we may rather stay with the previous while True approach, since this is essentially an unending event loop.
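The loop shape under discussion can be sketched as follows. This is a minimal self-contained illustration, not the library's actual implementation: the real zulip bindings provide a RandomExponentialBackoff class, but the parameter names, defaults, and loop structure below are assumptions made for clarity.

```python
import random
import time

class RandomExponentialBackoff:
    """Minimal sketch of a counting backoff (parameter names and
    defaults are assumptions, not the library's exact signature)."""

    def __init__(self, maximum_retries=10, delay_cap=90):
        self.maximum_retries = maximum_retries
        self.delay_cap = delay_cap
        self.number_of_retries = 0

    def keep_going(self):
        # True while the consecutive-failure budget is not exhausted.
        return self.number_of_retries < self.maximum_retries

    def succeed(self):
        # A successful request resets the consecutive-failure count.
        self.number_of_retries = 0

    def fail(self):
        # Wait 2**retries seconds (capped at delay_cap), jittered down
        # to 75-100% so many clients do not retry in lockstep.
        self.number_of_retries += 1
        delay = min(self.delay_cap, 2 ** self.number_of_retries)
        time.sleep(delay * random.uniform(0.75, 1.0))

def run_event_loop(get_events, callback, backoff):
    """Sketch of the proposed call_on_each_event shape: instead of
    `while True` with a fixed 1s pause, back off on each error and
    stop once keep_going() reports the retries are exhausted."""
    while backoff.keep_going():
        try:
            events = get_events()
        except IOError:
            backoff.fail()
            continue
        backoff.succeed()
        for event in events:
            callback(event)
```

Note that succeed() resets the counter, so a healthy connection keeps the loop running indefinitely; only a run of consecutive failures ends it.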

```python
# NOTE: Back off exponentially to cover against potential bugs in this
# library causing a DoS attack against a server when getting errors
# (explicit values listed for clarity)
backoff = RandomExponentialBackoff(maximum_retries=10,
```
Member:

Maybe we can go with maximum_retries=50, i.e. basically arranging it so that it will eventually give up, but only if it's really clear the server is down for good and not just having temporary downtime.

I also notice there's a time.sleep(1) in do_register; probably that should be changed as well.

(Or maybe do_register should be folded into this loop?)

Contributor Author:

With this PR the behavior changes from before, which is a little concerning, though arguably useful: the handler will eventually return, rather than loop forever, so the caller can decide whether to retry.

We could expose the backoff parameters in the method call; that would also let callers opt into the previous infinite-loop behavior, but with backoff.

Contributor:

I think it is correct to potentially fail, so the caller knows something is horribly wrong. We have a bunch of clients that are forever looping and getting 401s during registration, and I don't think that their owners have any sign that anything is going wrong.

I think we should raise an exception if we hit the max retries, since existing code is likely not expecting the function to ever return. Raising an exception means it won't unexpectedly fall through to other code.

This behavior change certainly merits a documentation update as well.
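The raise-on-exhaustion suggestion above can be sketched like this. The exception name and the wrapper are hypothetical, introduced only to illustrate the idea; the source does not specify what the actual exception type would be called.

```python
class RetriesExceededError(Exception):
    """Hypothetical exception type (the name is an assumption): signals
    that the loop gave up after repeated failures, so existing callers
    that never expected the function to return fail loudly instead of
    unexpectedly falling through to other code."""

def run_with_retries(attempt, max_retries=50):
    """Sketch: retry attempt() on IOError, and raise after max_retries
    consecutive failures rather than returning normally."""
    last_error = None
    failures = 0
    while failures < max_retries:
        try:
            return attempt()
        except IOError as err:
            failures += 1
            last_error = err
    raise RetriesExceededError(
        f"gave up after {max_retries} consecutive failures: {last_error}")
```

A caller that wants the old behavior could catch the exception and re-enter the loop; one that does not will at least see a clear failure in its logs.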

@neiljp force-pushed the 2020-04-exponential-backoff-in-call_on_each_event branch from 36588c4 to dc4404f on Apr 26, 2020 08:27
@zulipbot zulipbot added size: M and removed size: S labels Apr 26, 2020
@neiljp (Contributor Author) commented Apr 26, 2020

@timabbott I've added some followup commits which first add backoff for do_register as well, and then inline it as you mentioned, which I think is cleaner. I'm not sure whether we want the intermediate refactoring or not.

There's also a 1s sleep in the error_retry in do_api_query, but I'm not sure if we want that to have backoff too?

```python
# library causing a DoS attack against a server when getting errors
# (explicit values listed for clarity)
backoff = RandomExponentialBackoff(maximum_retries=10,
                                   timeout_success_equivalent=300,
```
Contributor:

I don't think we want a timeout_success_equivalent on this. That's for cases where we don't have the ability to put an explicit success() call in, and we want to be able to reset our backoff after 5 minutes on the assumption that that's a "success." Here, we have places we can mark as successful. Having a timeout_success_equivalent muddies the logic -- and I think won't kick in unless we have 5-minute HTTP requests, which should really not be treated as successes. :)
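The distinction between the two reset mechanisms can be sketched as follows. Attribute and parameter names are assumptions for illustration, and the sleep-on-failure is omitted to keep the focus on when the failure counter resets.

```python
import time

class CountingBackoff:
    """Sketch contrasting the two ways a backoff's failure count can
    reset: implicitly after a quiet period (timeout_success_equivalent),
    or explicitly via succeed(). Names are assumptions."""

    def __init__(self, timeout_success_equivalent=None):
        self.timeout_success_equivalent = timeout_success_equivalent
        self.number_of_retries = 0
        self.last_attempt_time = 0.0

    def _check_success_timeout(self):
        # Implicit reset: if enough wall-clock time has passed since
        # the last attempt, treat it as if a success had occurred.
        if (self.timeout_success_equivalent is not None
                and self.last_attempt_time > 0
                and time.time() - self.last_attempt_time
                    > self.timeout_success_equivalent):
            self.number_of_retries = 0

    def fail(self):
        self._check_success_timeout()
        self.number_of_retries += 1
        self.last_attempt_time = time.time()

    def succeed(self):
        # Explicit reset: preferred in call_on_each_event, since there
        # is a clear point where a request is known to have succeeded.
        self.number_of_retries = 0
        self.last_attempt_time = time.time()
```

As the comment above notes, when succeed() can be called at a known-good point, the timeout-based reset adds nothing and can only misfire.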

@alexmv (Contributor) commented Aug 1, 2020

> There's also a 1s sleep in the error_retry in do_api_query, but I'm not sure if we want that to have backoff too?

I think we certainly should. If we start to fall over due to overload and start returning 5xx, I don't think we want every client switching from 1 request every 10s (longpoll) to every second -- which is what the current logic does.
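The change alexmv is describing, applied to a do_api_query-style retry path, might look roughly like this. do_request is a hypothetical callable standing in for the HTTP request; the retriable-status check and parameter names are assumptions, not the library's actual code.

```python
import random
import time

def api_query_with_backoff(do_request, max_retries=5, delay_cap=90):
    """Sketch: on a retriable (here, 5xx) response, wait 2**attempt
    seconds with jitter instead of a fixed time.sleep(1), so an
    overloaded server sees client request rates fall off rather than
    spike to one request per second."""
    for attempt in range(max_retries):
        status, body = do_request()
        if status < 500:
            return status, body
        delay = min(delay_cap, 2 ** attempt) * random.uniform(0.75, 1.0)
        time.sleep(delay)
    # Out of retries: surface the last failing response to the caller.
    return status, body
```

Under sustained 5xx responses, successive waits grow roughly 1s, 2s, 4s, ... up to the cap, which is exactly the longpoll-friendly behavior the comment above argues for.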

@timabbott (Member) commented:
@neiljp do you have time to update this? With #611 I'd like to get this resolved and then do a release.

@zulipbot (Member) commented:
Heads up @neiljp, we just merged some commits that conflict with the changes you made in this pull request! You can review this repository's recent commits to see where the conflicts occur. Please rebase your feature branch against the upstream/main branch and resolve your pull request's merge conflicts accordingly.

4 participants