
Increase timeout to 80s. Backoff poll frequency. #902

Merged
mbuechse merged 4 commits into main from feat/backoff-pollfreq-adjust-timeout-volume-backup
Apr 29, 2025

Conversation

@garloff (Member) commented Apr 10, 2025

Timeout was 60s, increase to 80s. This may help on heavily loaded storage environments.

We see ~60 debug messages for each timed-out wait. This is more than necessary and puts needless extra load on the control plane. So change the approach: start with a 0.5s polling frequency, but increase the wait time by 0.1s each time, slowly backing off. This way we produce only 35 debug lines and API calls before running into the timeout.

Sidenote: Exponential backoff is a well-known approach to deal with congestion. We kept it simple (linear back-off), as our timeout is not huge.
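The linear back-off described above can be sketched roughly like this (a minimal sketch, not the actual test code; `check_backup_ready`, `wait_for_backup`, and the `timeout` parameter are names I've assumed for illustration):

```python
import time

TIMEOUT = 80.0       # overall timeout, raised from 60s
INITIAL_WAIT = 0.5   # first poll interval in seconds
BACKOFF_STEP = 0.1   # linear increase per poll

def wait_for_backup(check_backup_ready, timeout=TIMEOUT):
    """Poll check_backup_ready() with a linearly growing interval.

    With 0.5s + 0.1s/step, the cumulative sleep of the first 35 polls is
    sum(0.5 + 0.1*k for k in range(35)) = 77s, so only ~35 debug lines
    and API calls happen before the 80s timeout is hit.
    """
    deadline = time.monotonic() + timeout
    wait = INITIAL_WAIT
    polls = 0
    while time.monotonic() < deadline:
        if check_backup_ready():
            return polls  # number of unsuccessful polls before success
        polls += 1
        time.sleep(wait)
        wait += BACKOFF_STEP
    raise TimeoutError(f"backup not ready after {timeout}s ({polls} polls)")
```

Compared with a fixed poll interval, this schedule stays responsive at first (when a backup may finish almost immediately) while thinning out the API calls as the wait drags on.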

@garloff garloff requested a review from mbuechse April 10, 2025 15:46
@garloff garloff self-assigned this Apr 10, 2025
@garloff garloff added the enhancement (New feature or request) and standards (Issues / ADR / pull requests relevant for standardization & certification) labels Apr 10, 2025
@garloff (Member, Author) commented Apr 10, 2025

This is only a minor improvement.
It won't address the failures on cnds; maybe we'll have more luck on aov.

@mbuechse (Contributor) commented Apr 22, 2025

To be frank, I don't really like that we use polling here. But if it must be, maybe we can decide in advance what a reasonable amount of time for this operation would be. Then we might use an approach such as this (taken from an earlier version of the entropy test):

import logging
import time

logger = logging.getLogger(__name__)

def retry(func, exc_type, timeouts=(8, 7, 15, 10, 20, 30, 60)):
    if isinstance(exc_type, str):
        exc_type = exc_type.split(',')
    timeout_iter = iter(timeouts)
    # do an initial sleep because func is known to fail at first anyway
    time.sleep(next(timeout_iter))
    retries = 0
    while True:
        try:
            func()
        except Exception as e:
            retries += 1
            timeout = next(timeout_iter, None)
            if timeout is None or e.__class__.__name__ not in exc_type:
                raise
            logger.debug(f"Initiating retry in {timeout} s due to {e!r} during {func!r}")
            time.sleep(timeout)
        else:
            break
    if retries:
        logger.debug(f"Operation {func!r} successful after {retries} retries")

It uses a sequence of times instead of a fixed function. Depending on how long we expect the backup to take, we could start with a high initial number, such as 30, then continue with 15 and 10, and then increase again, maybe 15, 20. In sum, this would be 90 seconds?

@garloff (Member, Author) commented Apr 22, 2025

Well, it's hard to predict the time it will take.
As the storage system can do things like copy-on-write for backups, this may well be done within 2s (which is basically just the call overhead).
If, on the other hand, the backup cluster is separate from the main storage, it will have to copy data over. For 1GB, this could reasonably take up to 10s (100MB/s effective copy rate). In a busy system, there may also be congestion or internal retries, so we should not give up too early.
So we could do some waiting, 2+3+5+10+15+25+40 = 100s, and then consider it failed.
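The arithmetic of that schedule checks out; as a quick sketch (the tuple name and the commented call are illustrative assumptions, not actual code from the PR):

```python
# Short waits first, since copy-on-write backups can finish within ~2s
# (basically the call overhead); longer waits later for setups that must
# copy data over to a separate backup cluster.
TIMEOUTS = (2, 3, 5, 10, 15, 25, 40)
print(sum(TIMEOUTS))  # -> 100 seconds total before considering it failed

# A 1 GB volume at an effective 100 MB/s copy rate needs about 10s:
print(1000 // 100)  # -> 10

# Hypothetical usage with the retry() helper from the earlier comment:
# retry(check_backup, 'ConflictException', timeouts=TIMEOUTS)
```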

Increase timeout to 80s. Backoff poll frequency.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
@mbuechse mbuechse force-pushed the feat/backoff-pollfreq-adjust-timeout-volume-backup branch from d5da9ce to 9d721f2 on April 29, 2025 12:10
make this consistent with the other tests; prevent confidential info from being disclosed

Signed-off-by: Matthias Büchse <matthias.buechse@alasca.cloud>
@mbuechse mbuechse merged commit 2e182fb into main Apr 29, 2025
8 checks passed
@mbuechse mbuechse deleted the feat/backoff-pollfreq-adjust-timeout-volume-backup branch April 29, 2025 13:48
