Describe the bug
When Vault is deployed in a cluster that uses etcd for HA management, losing the first etcd unit in the 'address' list causes Vault to return HTTP 500 errors.
To Reproduce
- Deploy 3 units of Vault. In my test case I used 1 MySQL unit for storage and 3 etcd units for ha_storage.
- Initialise and unseal Vault, then validate the API with "vault status".
- Shut down the etcd unit that corresponds to the first IP in the 'address' list (in the 'ha_storage "etcd"' section).
- Run "vault status" with VAULT_ADDR pointing at any of the Vault units. The command fails with:
Error checking leader status: Error making API request.
URL: GET http://10.53.82.200:8200/v1/sys/leader
Code: 500. Errors:
* context deadline exceeded
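The last step can be repeated against every Vault unit with a small script. The following is a minimal sketch, not part of the original report: it queries the /v1/sys/leader endpoint shown in the error above on each unit and records either the HTTP status code or the connection failure. The helper names (leader_url, fetch_status, check_units) are hypothetical, and the list of unit addresses has to be supplied by the caller.

```python
import urllib.error
import urllib.request


def leader_url(vault_addr):
    """Build the leader-status URL for one Vault unit."""
    return vault_addr.rstrip("/") + "/v1/sys/leader"


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or an error string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Vault answered, but with an error status such as 500.
        return err.code
    except (urllib.error.URLError, OSError) as err:
        # The unit could not be reached at all (refused, timed out, ...).
        return "unreachable: {}".format(err)


def check_units(vault_addrs, fetch=fetch_status):
    """Map each Vault address to the result of querying its leader endpoint."""
    return {addr: fetch(leader_url(addr)) for addr in vault_addrs}
```

Calling check_units with the addresses of all three Vault units (for example "http://10.53.82.200:8200" from the error above) shows whether every unit returns 500 or only some of them. The fetch parameter exists only so the HTTP call can be swapped out in tests.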
Expected behavior
I would expect vault to continue to function and respond to API requests, without interruption, in the event of losing an etcd unit.
Environment:
- Vault Server Version: 0.10.3
- Vault Client Version: 0.10.1
- Server OS: Ubuntu 16.04.5 (Xenial) x86_64
Vault server configuration file(s):
api_addr = "http://10.53.82.113:8200"
cluster_addr = "http://10.53.82.113:8201"
disable_mlock = true
storage "mysql" {
  username = "vault"
  password = "kpzLkcjfmX9w45n542zwLyrJBppfg5rP"
  database = "vault"
  address = "10.53.82.226:3306"
}
ha_storage "etcd" {
  ha_enabled = "true"
  address = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
  tls_ca_file = "/var/snap/vault/common/etcd-ca.pem"
  tls_cert_file = "/var/snap/vault/common/etcd-cert.pem"
  tls_key_file = "/var/snap/vault/common/etcd.key"
  etcd_api = "v3"
}
listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 1
}
# Localhost only listener for charm access to vault.
listener "tcp" {
  address = "127.0.0.1:8220"
  tls_disable = 1
}
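To confirm which endpoint in the ha_storage 'address' list is actually down, and that it is the first one, the list can be split and each endpoint probed in configured order. This is a rough sketch under stated assumptions: it only tests whether a TCP connection can be opened, not etcd health or TLS, and it assumes a shut-down etcd unit refuses or times out connections. The function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit


def parse_endpoints(address):
    """Split a comma-separated etcd address string into (host, port) pairs."""
    endpoints = []
    for item in address.split(","):
        parts = urlsplit(item.strip())
        endpoints.append((parts.hostname, parts.port))
    return endpoints


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable hosts.
        return False


def probe(address, check=is_reachable):
    """Return (host, port, reachable) for each endpoint, in configured order."""
    return [(host, port, check(host, port)) for host, port in parse_endpoints(address)]
```

Passing the 'address' value from the configuration above to probe returns one tuple per etcd endpoint, with the first tuple corresponding to the unit whose loss triggers the failure described in this report.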
Additional context
On the active vault unit the following entries appear in its log when the etcd unit goes down:
2018-07-20T08:24:16.614Z [WARN ] core: leadership lost, stopping active operation
2018-07-20T08:24:16.614Z [INFO ] core: pre-seal teardown starting
2018-07-20T08:24:16.614Z [INFO ] core: stopping cluster listeners
2018-07-20T08:24:16.614Z [INFO ] core: shutting down forwarding rpc listeners
2018-07-20T08:24:16.614Z [INFO ] core: forwarding rpc listeners stopped
2018-07-20T08:24:16.907Z [INFO ] core: rpc listeners successfully shut down
2018-07-20T08:24:16.907Z [INFO ] core: cluster listeners successfully shut down
2018-07-20T08:24:16.907Z [INFO ] rollback: stopping rollback manager
2018-07-20T08:24:16.908Z [INFO ] core: pre-seal teardown complete
The standby vault units do not log anything as a result of the lost etcd unit.