Vault is inaccessible if an etcd unit is lost #4961

Closed
@gnuoy

Describe the bug
When Vault is deployed in a cluster using etcd for HA management, losing the first etcd unit in the 'address' list causes Vault to return HTTP 500 errors.

To Reproduce

  • Deploy 3 units of vault. In my test case I used 1 mysql unit for storage and 3 units of etcd for ha_storage.
  • Initialise and unseal vault, then validate the API with "vault status"
  • Shut down the etcd unit that corresponds to the first IP in the 'address' list (in the 'ha_storage "etcd" ' section)
  • Run "vault status" with VAULT_ADDR pointing at any of the vault units. The command fails with:
Error checking leader status: Error making API request.

URL: GET http://10.53.82.200:8200/v1/sys/leader
Code: 500. Errors:

* context deadline exceeded
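
The last two reproduction steps can be sketched as a small shell script. This is only an illustration: the helper names `first_etcd_endpoint` and `check_vault_units` are hypothetical, and the Vault unit addresses passed to `check_vault_units` have to be supplied by whoever runs it.

```shell
#!/bin/sh
# Hypothetical helper: print the first endpoint of a comma-separated
# etcd 'address' value, i.e. the unit whose loss triggers the problem.
first_etcd_endpoint() {
    # Strip everything from the first comma onwards.
    printf '%s\n' "${1%%,*}"
}

# Hypothetical helper: run "vault status" against every Vault unit
# address passed as an argument and report the ones that fail.
check_vault_units() {
    for addr in "$@"; do
        echo "== $addr"
        VAULT_ADDR="$addr" vault status || echo "vault status failed for $addr"
    done
}

# 'address' value copied from the ha_storage "etcd" section of this report.
ETCD_ADDRESSES="https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
first_etcd_endpoint "$ETCD_ADDRESSES"
```

After shutting down the etcd unit printed by the script, something like `check_vault_units http://10.53.82.200:8200` (plus the other unit addresses) should show whether every Vault unit is affected.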

Expected behavior
I would expect vault to continue to function and respond to API requests, without interruption, in the event of losing an etcd unit.

Environment:

  • Vault Server Version: 0.10.3
  • Vault Client Version: 0.10.1
  • Server OS: Ubuntu 16.04.5 (Xenial) x86_64

Vault server configuration file(s):

api_addr = "http://10.53.82.113:8200"
cluster_addr = "http://10.53.82.113:8201"
disable_mlock = true
storage "mysql" {
  username = "vault"
  password = "kpzLkcjfmX9w45n542zwLyrJBppfg5rP"
  database = "vault"
  address = "10.53.82.226:3306"
}
ha_storage "etcd" {
  ha_enabled = "true"
  address = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
  tls_ca_file = "/var/snap/vault/common/etcd-ca.pem"
  tls_cert_file = "/var/snap/vault/common/etcd-cert.pem"
  tls_key_file = "/var/snap/vault/common/etcd.key"
  etcd_api = "v3"
}
listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 1
}

# Localhost only listener for charm access to vault.
listener "tcp" {
  address = "127.0.0.1:8220"
  tls_disable = 1
}
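
To confirm that only one etcd unit is down while the rest of the etcd cluster stays healthy, each endpoint from the configuration above can be probed individually. The sketch below assumes `etcdctl` is installed on the machine running it and that the TLS files from the `ha_storage` section are readable there; the helper names `split_endpoints` and `check_etcd_endpoints` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a comma-separated 'address' value into
# one endpoint per line.
split_endpoints() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Hypothetical helper: ask each etcd endpoint for its health on its own,
# so a single dead unit is reported without hiding the healthy ones.
# The TLS paths are the ones from the ha_storage "etcd" section above.
check_etcd_endpoints() {
    split_endpoints "$1" | while read -r ep; do
        ETCDCTL_API=3 etcdctl --endpoints="$ep" \
            --cacert=/var/snap/vault/common/etcd-ca.pem \
            --cert=/var/snap/vault/common/etcd-cert.pem \
            --key=/var/snap/vault/common/etcd.key \
            endpoint health </dev/null || echo "unhealthy or unreachable: $ep"
    done
}
```

Calling `check_etcd_endpoints` with the 'address' string from the configuration would be expected to report the stopped unit as unreachable and the other two as healthy.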

Additional context

On the active vault unit the following entries appear in its log when the etcd unit goes down:

2018-07-20T08:24:16.614Z [WARN ] core: leadership lost, stopping active operation
2018-07-20T08:24:16.614Z [INFO ] core: pre-seal teardown starting
2018-07-20T08:24:16.614Z [INFO ] core: stopping cluster listeners
2018-07-20T08:24:16.614Z [INFO ] core: shutting down forwarding rpc listeners
2018-07-20T08:24:16.614Z [INFO ] core: forwarding rpc listeners stopped
2018-07-20T08:24:16.907Z [INFO ] core: rpc listeners successfully shut down
2018-07-20T08:24:16.907Z [INFO ] core: cluster listeners successfully shut down
2018-07-20T08:24:16.907Z [INFO ] rollback: stopping rollback manager
2018-07-20T08:24:16.908Z [INFO ] core: pre-seal teardown complete

The standby vault units do not log anything as a result of the lost etcd unit.

Labels

bug, ha/etcd