
api: Use exponential backoff in call_on_each_event. #586


Open · wants to merge 3 commits into main
Conversation

@neiljp (Contributor) commented Apr 18, 2020

A long-term TODO here has been to move from a 1s pause after each failure towards an exponential backoff approach, as proposed here.

@timabbott See last_event_id handling topic on #zulip-terminal from last August.

I'm unsure of the best parameter set to use here; I've taken values from other uses of the backoff and filled the default values in explicitly to aid in reasoning.

I'm not sure we want to switch to backoff.keep_going(); we may rather stay with the previous while True approach, since this is essentially an unending event loop.
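The loop shape under discussion can be sketched as follows. This is a minimal self-contained illustration, not the library's actual implementation: the real zulip bindings provide a RandomExponentialBackoff class, but the parameter names, defaults, and loop structure below are assumptions made for clarity.

```python
import random
import time

class RandomExponentialBackoff:
    """Minimal sketch of a counting backoff (parameter names and
    defaults are assumptions, not the library's exact signature)."""

    def __init__(self, maximum_retries=10, delay_cap=90):
        self.maximum_retries = maximum_retries
        self.delay_cap = delay_cap
        self.number_of_retries = 0

    def keep_going(self):
        # True while the consecutive-failure budget is not exhausted.
        return self.number_of_retries < self.maximum_retries

    def succeed(self):
        # A successful request resets the consecutive-failure count.
        self.number_of_retries = 0

    def fail(self):
        # Wait 2**retries seconds (capped at delay_cap), jittered down
        # to 75-100% so many clients do not retry in lockstep.
        self.number_of_retries += 1
        delay = min(self.delay_cap, 2 ** self.number_of_retries)
        time.sleep(delay * random.uniform(0.75, 1.0))

def run_event_loop(get_events, callback, backoff):
    """Sketch of the proposed call_on_each_event shape: instead of
    `while True` with a fixed 1s pause, back off on each error and
    stop once keep_going() reports the retries are exhausted."""
    while backoff.keep_going():
        try:
            events = get_events()
        except IOError:
            backoff.fail()
            continue
        backoff.succeed()
        for event in events:
            callback(event)
```

Note that succeed() resets the counter, so a healthy connection keeps the loop running indefinitely; only a run of consecutive failures ends it.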

```python
# NOTE: Back off exponentially to cover against potential bugs in this
# library causing a DoS attack against a server when getting errors
# (explicit values listed for clarity)
backoff = RandomExponentialBackoff(maximum_retries=10,
```
Member:

Maybe we can go with maximum_retries=50, i.e. basically arranging it so that it will eventually give up, but only if it's really clear the server is down for good and not just having temporary downtime.

I also notice there's a time.sleep(1) in do_register; probably that should be changed as well.

(Or maybe do_register should be folded into this loop?)

Contributor Author:

With this PR the behavior changes from before, which is a little concerning, though arguably useful: the handler will eventually return, rather than loop forever, so the caller can decide whether to retry.

We could expose the backoff parameters in the method call; that would also let callers opt into the previous infinite-loop behavior, but with backoff.

Contributor:

I think it is correct to potentially fail, so the caller knows something is horribly wrong. We have a bunch of clients that are forever looping and getting 401s during registration, and I don't think that their owners have any sign that anything is going wrong.

I think we should raise an exception if we hit the max retries, since existing code is likely not expecting the function to ever return. Raising an exception means it won't unexpectedly fall through to other code.

This behavior change certainly merits a documentation update as well.
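The raise-on-exhaustion suggestion above can be sketched like this. The exception name and the wrapper are hypothetical, introduced only to illustrate the idea; the source does not specify what the actual exception type would be called.

```python
class RetriesExceededError(Exception):
    """Hypothetical exception type (the name is an assumption): signals
    that the loop gave up after repeated failures, so existing callers
    that never expected the function to return fail loudly instead of
    unexpectedly falling through to other code."""

def run_with_retries(attempt, max_retries=50):
    """Sketch: retry attempt() on IOError, and raise after max_retries
    consecutive failures rather than returning normally."""
    last_error = None
    failures = 0
    while failures < max_retries:
        try:
            return attempt()
        except IOError as err:
            failures += 1
            last_error = err
    raise RetriesExceededError(
        f"gave up after {max_retries} consecutive failures: {last_error}")
```

A caller that wants the old behavior could catch the exception and re-enter the loop; one that does not will at least see a clear failure in its logs.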

@neiljp force-pushed the 2020-04-exponential-backoff-in-call_on_each_event branch from 36588c4 to dc4404f on Apr 26, 2020 08:27
@zulipbot zulipbot added size: M and removed size: S labels Apr 26, 2020
@neiljp (Contributor Author) commented Apr 26, 2020

@timabbott I've added some followup commits which first add backoff for do_register as well, and then inline it as you mentioned, which I think is cleaner. I'm not sure whether we want the intermediate refactoring or not.

There's also a 1s sleep in the error_retry in do_api_query, but I'm not sure if we want that to have backoff too?

```python
# library causing a DoS attack against a server when getting errors
# (explicit values listed for clarity)
backoff = RandomExponentialBackoff(maximum_retries=10,
                                   timeout_success_equivalent=300,
```
Contributor:

I don't think we want a timeout_success_equivalent on this. That's for cases where we don't have the ability to put an explicit success() call in, and we want to be able to reset our backoff after 5 minutes on the assumption that that's a "success." Here, we have places we can mark as successful. Having a timeout_success_equivalent muddies the logic -- and I think won't kick in unless we have 5-minute HTTP requests, which should really not be treated as successes. :)
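The distinction between the two reset mechanisms can be sketched as follows. Attribute and parameter names are assumptions for illustration, and the sleep-on-failure is omitted to keep the focus on when the failure counter resets.

```python
import time

class CountingBackoff:
    """Sketch contrasting the two ways a backoff's failure count can
    reset: implicitly after a quiet period (timeout_success_equivalent),
    or explicitly via succeed(). Names are assumptions."""

    def __init__(self, timeout_success_equivalent=None):
        self.timeout_success_equivalent = timeout_success_equivalent
        self.number_of_retries = 0
        self.last_attempt_time = 0.0

    def _check_success_timeout(self):
        # Implicit reset: if enough wall-clock time has passed since
        # the last attempt, treat it as if a success had occurred.
        if (self.timeout_success_equivalent is not None
                and self.last_attempt_time > 0
                and time.time() - self.last_attempt_time
                    > self.timeout_success_equivalent):
            self.number_of_retries = 0

    def fail(self):
        self._check_success_timeout()
        self.number_of_retries += 1
        self.last_attempt_time = time.time()

    def succeed(self):
        # Explicit reset: preferred in call_on_each_event, since there
        # is a clear point where a request is known to have succeeded.
        self.number_of_retries = 0
        self.last_attempt_time = time.time()
```

As the comment above notes, when succeed() can be called at a known-good point, the timeout-based reset adds nothing and can only misfire.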

@alexmv (Contributor) commented Aug 1, 2020

> There's also a 1s sleep in the error_retry in do_api_query, but I'm not sure if we want that to have backoff too?

I think we certainly should. If we start to fall over due to overload and start returning 5xx, I don't think we want every client switching from 1 request every 10s (longpoll) to every second -- which is what the current logic does.
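The change alexmv is describing, applied to a do_api_query-style retry path, might look roughly like this. do_request is a hypothetical callable standing in for the HTTP request; the retriable-status check and parameter names are assumptions, not the library's actual code.

```python
import random
import time

def api_query_with_backoff(do_request, max_retries=5, delay_cap=90):
    """Sketch: on a retriable (here, 5xx) response, wait 2**attempt
    seconds with jitter instead of a fixed time.sleep(1), so an
    overloaded server sees client request rates fall off rather than
    spike to one request per second."""
    for attempt in range(max_retries):
        status, body = do_request()
        if status < 500:
            return status, body
        delay = min(delay_cap, 2 ** attempt) * random.uniform(0.75, 1.0)
        time.sleep(delay)
    # Out of retries: surface the last failing response to the caller.
    return status, body
```

Under sustained 5xx responses, successive waits grow roughly 1s, 2s, 4s, ... up to the cap, which is exactly the longpoll-friendly behavior the comment above argues for.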

@timabbott (Member) commented:
@neiljp do you have time to update this? With #611 I'd like to get this resolved and then do a release.

@zulipbot (Member) commented:
Heads up @neiljp, we just merged some commits that conflict with the changes you made in this pull request! You can review this repository's recent commits to see where the conflicts occur. Please rebase your feature branch against the upstream/main branch and resolve your pull request's merge conflicts accordingly.

4 participants