
Increase timeout to 80s. Backoff poll frequency. #902

Merged
mbuechse merged 4 commits into main from feat/backoff-pollfreq-adjust-timeout-volume-backup
Apr 29, 2025

Conversation

@garloff (Member) commented Apr 10, 2025

Timeout was 60s, increase to 80s. This may help on heavily loaded storage environments.

We see ~60 debug messages for each timed-out wait. This is more than necessary and puts needless extra load on the control plane. So change the approach: start with a 0.5s polling frequency, but increase the wait time by 0.1s each time, slowly backing off. This way we produce only 35 debug lines and API calls before running into the timeout.

Sidenote: Exponential backoff is a well-known approach to deal with congestion. We kept it simple (linear back-off), as our timeout is not huge.
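The linear back-off described above can be sketched roughly like this (a minimal sketch, not the actual test code; `check_backup_ready`, `wait_for_backup`, and the `timeout` parameter are names I've assumed for illustration):

```python
import time

TIMEOUT = 80.0       # overall timeout, raised from 60s
INITIAL_WAIT = 0.5   # first poll interval in seconds
BACKOFF_STEP = 0.1   # linear increase per poll

def wait_for_backup(check_backup_ready, timeout=TIMEOUT):
    """Poll check_backup_ready() with a linearly growing interval.

    With 0.5s + 0.1s/step, the cumulative sleep of the first 35 polls is
    sum(0.5 + 0.1*k for k in range(35)) = 77s, so only ~35 debug lines
    and API calls happen before the 80s timeout is hit.
    """
    deadline = time.monotonic() + timeout
    wait = INITIAL_WAIT
    polls = 0
    while time.monotonic() < deadline:
        if check_backup_ready():
            return polls  # number of unsuccessful polls before success
        polls += 1
        time.sleep(wait)
        wait += BACKOFF_STEP
    raise TimeoutError(f"backup not ready after {timeout}s ({polls} polls)")
```

Compared with a fixed poll interval, this schedule stays responsive at first (when a backup may finish almost immediately) while thinning out the API calls as the wait drags on.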

@garloff garloff requested a review from mbuechse April 10, 2025 15:46
@garloff garloff self-assigned this Apr 10, 2025
@garloff garloff added the enhancement (New feature or request) and standards (Issues / ADR / pull requests relevant for standardization & certification) labels Apr 10, 2025
@garloff (Member, Author) commented Apr 10, 2025

This is only a minor improvement.
It won't address the failures on cnds; maybe we'll have more luck on aov.

@mbuechse (Contributor) commented Apr 22, 2025

To be frank, I don't really like that we use polling here. But if it must be, maybe we can decide in advance what a reasonable amount of time for this operation would be. Then we might use an approach such as this (taken from an earlier version of the entropy test):

import logging
import time

logger = logging.getLogger(__name__)

def retry(func, exc_type, timeouts=(8, 7, 15, 10, 20, 30, 60)):
    if isinstance(exc_type, str):
        exc_type = exc_type.split(',')
    timeout_iter = iter(timeouts)
    # do an initial sleep because func is known to fail at first anyway
    time.sleep(next(timeout_iter))
    retries = 0
    while True:
        try:
            func()
        except Exception as e:
            retries += 1
            timeout = next(timeout_iter, None)
            if timeout is None or e.__class__.__name__ not in exc_type:
                raise
            logger.debug(f"Initiating retry in {timeout} s due to {e!r} during {func!r}")
            time.sleep(timeout)
        else:
            break
    if retries:
        logger.debug(f"Operation {func!r} successful after {retries} retries")

It uses a sequence of times instead of a fixed function. Depending on how long we expect the backup to take, we could start with a high initial number, such as 30, then continue with 15 and 10, and then increase again, maybe 15, 20. In sum, this would be 90 seconds?

@garloff (Member, Author) commented Apr 22, 2025

Well, it's hard to predict the time it will take.
As the storage system can do things like copy-on-write for backups, this may well be done within 2s (which is basically just the call overhead).
If, on the other hand, the backup cluster is separate from the main storage, it will have to copy data over. For 1GB, this could reasonably take up to 10s (100MB/s effective copy rate). In a busy system, there may also be congestion or internal retries, so we should not give up too early.
So we could do some waiting, 2+3+5+10+15+25+40 = 100s, and then consider it failed.
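The arithmetic of that schedule checks out; as a quick sketch (the tuple name and the commented call are illustrative assumptions, not actual code from the PR):

```python
# Short waits first, since copy-on-write backups can finish within ~2s
# (basically the call overhead); longer waits later for setups that must
# copy data over to a separate backup cluster.
TIMEOUTS = (2, 3, 5, 10, 15, 25, 40)
print(sum(TIMEOUTS))  # -> 100 seconds total before considering it failed

# A 1 GB volume at an effective 100 MB/s copy rate needs about 10s:
print(1000 // 100)  # -> 10

# Hypothetical usage with the retry() helper from the earlier comment:
# retry(check_backup, 'ConflictException', timeouts=TIMEOUTS)
```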

Increase timeout to 80s. Backoff poll frequency.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
@mbuechse mbuechse force-pushed the feat/backoff-pollfreq-adjust-timeout-volume-backup branch from d5da9ce to 9d721f2 on April 29, 2025 12:10
make this consistent with the other tests; prevent confidential info from being disclosed

Signed-off-by: Matthias Büchse <matthias.buechse@alasca.cloud>
@mbuechse mbuechse merged commit 2e182fb into main Apr 29, 2025
8 checks passed
@mbuechse mbuechse deleted the feat/backoff-pollfreq-adjust-timeout-volume-backup branch April 29, 2025 13:48
