Describe the bug
When acquiring leadership, a Vault replica starts restoring leases. If a single restore fails because GCS is unavailable, that Vault instance gets sealed. In a fairly large environment with lots of leases, there is a chance this happens to all of the Vault replicas within a short period of time, causing the whole cluster to seal. This has happened to us in a production cluster with GCS returning just a few (5) 503 responses, which is considered a retryable status code; the golang GCS client used in this Vault version (client v1.30.1) should retry it by default unless the context is canceled, which I could not spot happening.
To Reproduce
Hard to reproduce as it depends on a third party being unavailable.
Expected behavior
The golang GCS client library should retry the request, since both the operation (an idempotent GET) and the status code (503) are considered retryable.
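As a minimal illustration (this is not Vault code; it assumes a recent version of cloud.google.com/go/storage, which exports ShouldRetry, the client's default retry predicate):

package main

import (
	"fmt"

	"cloud.google.com/go/storage"
	"google.golang.org/api/googleapi"
)

func main() {
	// ShouldRetry is the default predicate the storage client uses to
	// decide whether an error is retryable. HTTP 503 is one of the
	// transient codes it accepts, so an idempotent object read that
	// hits a 503 should be retried rather than surfaced to the caller.
	err := &googleapi.Error{Code: 503, Message: "Service Unavailable"}
	fmt.Println(storage.ShouldRetry(err)) // prints: true
}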
Environment:
Vault Server Version (retrieve with vault status): 1.15.4
Vault CLI Version (retrieve with vault version): v1.15.6 (irrelevant)
Server Operating System/Architecture: Vault cluster of 3 replicas running on Kubernetes, using GCS as the storage backend and KMS to encrypt the keys
Additional context
The golang client library for GCS supports retries by default. The response code (503) is a retryable error and the HTTP method (GET) is considered idempotent, so the client should be retrying unless the context is canceled, which I could not spot (and it does not seem to be the case, as the Restore function's error would then be different, per the code:
defer func() {
	// Turn off restore mode. We can do this safely without the lock because
	// if restore mode finished successfully, restore mode was already
	// disabled with the lock. In an error state, this will allow the
	// Stop() function to shut everything down.
	atomic.StoreInt32(m.restoreMode, 0)

	switch {
	case retErr == nil:
	case strings.Contains(retErr.Error(), context.Canceled.Error()):
		// Don't run error func because we lost leadership
		m.logger.Warn("context canceled while restoring leases, stopping lease loading")
		retErr = nil
	case errwrap.Contains(retErr, ErrBarrierSealed.Error()):
		// Don't run error func because we're likely already shutting down
		m.logger.Warn("barrier sealed while restoring leases, stopping lease loading")
		retErr = nil
	default:
		m.logger.Error("error restoring leases", "error", retErr)
		if errorFunc != nil {
			errorFunc()
		}
	}
}()
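For reference, the client's retry behavior can also be configured explicitly instead of relying on the defaults. A sketch, assuming the storage backend has access to the *storage.Client it creates; newRetryingClient and the backoff values are illustrative, not Vault's actual configuration:

package main

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

// newRetryingClient returns a GCS client with the retry policy made
// explicit: idempotent operations (such as the GET used to read a
// lease entry) are retried with exponential backoff on transient
// errors like HTTP 503.
func newRetryingClient(ctx context.Context) (*storage.Client, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	client.SetRetry(
		storage.WithBackoff(gax.Backoff{
			Initial:    time.Second,
			Max:        30 * time.Second,
			Multiplier: 2,
		}),
		storage.WithPolicy(storage.RetryIdempotent),
	)
	return client, nil
}

Note that retries still stop as soon as the passed-in context is canceled, which is why the context.Canceled branch in the defer above matters.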
Google Cloud Support confirmed we did not receive a lot of 503s, and our own monitoring confirms the same. During the time window in which each of the three replicas went through the loop of:

for i in [0, 1, 2]:
- acquire the lock
- restore leases
- fail to read a lease with a 503

we only see 3 503s, 1 per replica.