consul checks showing standby vault instance as active and Check serfHealth is in critical state #10463

Open
vikramhansawat opened this issue Jun 23, 2021 · 1 comment
Labels
theme/consul-vault: Relating to Consul & Vault interactions
theme/internals: Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
type/question: Not an "enhancement" or "bug". Please post on discuss.hashicorp

Comments

@vikramhansawat

vikramhansawat commented Jun 23, 2021


Overview of the Issue

We run Vault in HA mode with Consul KV as the backend. We frequently see the error "Check 'serfHealth' is in critical state". It occurs intermittently and makes the Vault cluster unavailable for a few seconds each time.
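One way to confirm the symptom from the Consul side is to read the node's `serfHealth` check directly. The sketch below is a minimal illustration, not a fix: the function names `serf_health_status` and `fetch_node_checks` are hypothetical, and it assumes the local agent listens on `http://localhost:8500` and that the ACL token, if needed, is passed in the `X-Consul-Token` header. It queries `/v1/health/node/<node>` and picks out the check whose `CheckID` is `serfHealth`.

```python
import json
import urllib.request


def serf_health_status(node_checks):
    """Return the Status of the serfHealth check, or None if it is absent.

    node_checks is the decoded JSON list returned by /v1/health/node/<node>.
    """
    for check in node_checks:
        if check.get("CheckID") == "serfHealth":
            return check.get("Status")
    return None


def fetch_node_checks(node, base_url="http://localhost:8500", token=None):
    """Fetch all checks registered for a node (base_url and token are assumptions)."""
    req = urllib.request.Request(f"{base_url}/v1/health/node/{node}")
    if token:
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

For example, `serf_health_status(fetch_node_checks("prd-vault-vault-2.fflive.local"))` could be polled while the errors occur to see whether the node flips to `critical`.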

Reproduction Steps

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 1
	services = 2
build:
	prerelease = 
	revision = d149d7e9
	version = 1.7.4
consul:
	acl = enabled
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 67
	max_procs = 4
	os = linux
	version = go1.13.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 48
	failed = 79
	health_score = 0
	intent_queue = 0
	left = 51
	member_time = 121470
	members = 428
	query_queue = 0
	query_time = 1

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 2
build:
	prerelease = 
	revision = d149d7e9
	version = 1.7.4
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.60.1.7:8300
	server = true
raft:
	applied_index = 144198015
	commit_index = 144198015
	fsm_pending = 0
	last_contact = 0
	last_log_index = 144198016
	last_log_term = 182
	last_snapshot_index = 144188744
	last_snapshot_term = 182
	latest_configuration = [{Suffrage:Voter ID:ec2f4346-d696-5ceb-80c0-6ef827b733be Address:10.60.151.7:8300} {Suffrage:Voter ID:e7159002-5dd9-5c64-87ff-77d8c383f86f Address:10.60.116:8300} {Suffrage:Voter ID:d9e1ff29-5c17-5678-8b77-e80ecbb40e88 Address:10.60.1.4:8300} {Suffrage:Voter ID:453e702b-70e6-5b04-8472-e745128e736a Address:10.60.1.3:8300} {Suffrage:Voter ID:7c5fa0e5-74ec-5f8e-a6ab-f4b876649f7f Address:10.60.1.5:8300}]
	latest_configuration_index = 0
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 182
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 3928
	max_procs = 4
	os = linux
	version = go1.13.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 48
	failed = 75
	health_score = 0
	intent_queue = 0
	left = 52
	member_time = 121470
	members = 428
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 182
	members = 5
	query_queue = 0
	query_time = 1

Operating system and Environment details

Linux prd-consul-server-1 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

$ curl -v http://localhost:8500/v1/agent/checks | jq .

{
  "infrastructure-core-vault:prd-vault:8200:vault-sealed-check": {
    "Node": "prd-vault-vault-2.fflive.local",
    "CheckID": "infrastructure-core-vault:prd-vault:8200:vault-sealed-check",
    "Name": "Vault Sealed Status",
    "Status": "passing",
    "Notes": "Vault service is healthy when Vault is in an unsealed status and can become an active Vault server",
    "Output": "Vault Unsealed",
    "ServiceID": "infrastructure-core-vault:prd-vault:8200",
    "ServiceName": "infrastructure-core-vault",
    "ServiceTags": [
      "vault_version=1.2.2",
      "active"
    ],
    "Type": "ttl",
    "Definition": {},
    "CreateIndex": 0,
    "ModifyIndex": 0
  }
}
$ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.2.2
Cluster Name             vault-cluster-29b9195a
Cluster ID               63028197-2703-ebfe-44bf-6140c2104c8f
HA Enabled               true
HA Cluster               https://10.131.3.6:8201
HA Mode                  standby
Active Node Address      http://prd-vault.ff.net:8200
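The two outputs above disagree: the Consul check for this node carries the `active` service tag, while `vault status` on the same node reports `HA Mode standby`. A rough sketch of how that mismatch could be detected automatically is shown here. The helper names are hypothetical; the inputs are assumed to be the decoded JSON from `/v1/agent/checks` and the plain-text output of `vault status` in the table format shown above.

```python
def tagged_active(agent_checks, service_name):
    """True if any check for service_name carries the 'active' service tag."""
    return any(
        check.get("ServiceName") == service_name
        and "active" in (check.get("ServiceTags") or [])
        for check in agent_checks.values()
    )


def ha_mode_from_status(status_text):
    """Extract the 'HA Mode' value from `vault status` table output, or None."""
    for line in status_text.splitlines():
        parts = line.split()
        if parts[:2] == ["HA", "Mode"] and len(parts) >= 3:
            return parts[2]
    return None


def is_stale_active_tag(agent_checks, service_name, status_text):
    """True when Consul tags the service 'active' but Vault reports standby."""
    return (
        tagged_active(agent_checks, service_name)
        and ha_mode_from_status(status_text) == "standby"
    )
```

With the data from this issue, `is_stale_active_tag(checks, "infrastructure-core-vault", status_text)` would return `True` on the vault-2 node.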
$ agent log
Jun 23 04:50:17 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T04:50:17.663Z [ERROR] agent.client: RPC failed to server: method=KVS.Apply server=10.60.151.16:8300 error="rpc error making call: rpc error making call: invalid session "f839a61a-b03d-46bb-0458-ff6bcfbde51e""
Jun 23 04:50:17 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T04:50:17.663Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/infrastructure/vault/core/lock?acquire=f839a61a-b03d-46bb-0458-ff6bcfbde51e&flags=3304740253564472344 from=172.17.0.2:52004 error="rpc error making call: rpc error making call: invalid session "f839a61a-b03d-46bb-0458-ff6bcfbde51e""
Jun 23 05:16:27 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T05:16:27.106Z [ERROR] agent.client: RPC failed to server: method=KVS.Apply server=10.60.151.14:8300 error="rpc error making call: rpc error making call: invalid session "25a58f48-a66e-3d73-8021-0ebe0ba7ff93""
Jun 23 05:16:27 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T05:16:27.106Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/infrastructure/vault/core/lock?acquire=25a58f48-a66e-3d73-8021-0ebe0ba7ff93&flags=3304740253564472344 from=172.17.0.2:52004 error="rpc error making call: rpc error making call: invalid session "25a58f48-a66e-3d73-8021-0ebe0ba7ff93""
Jun 23 06:35:56 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T06:35:56.652Z [ERROR] agent.client: RPC failed to server: method=KVS.Apply server=10.60.151.7:8300 error="rpc error making call: invalid session "e57069d2-cbdf-890f-ef6d-affaf33ce861""
Jun 23 06:35:56 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T06:35:56.652Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/infrastructure/vault/core/lock?acquire=e57069d2-cbdf-890f-ef6d-affaf33ce861&flags=3304740253564472344 from=172.17.0.2:56352 error="rpc error making call: invalid session "e57069d2-cbdf-890f-ef6d-affaf33ce861""
Jun 23 07:22:26 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T07:22:26.380Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.60.151.15:8300 error="rpc error making call: rpc error making call: Check 'serfHealth' is in critical state"
Jun 23 07:22:26 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T07:22:26.380Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.17.0.2:47562 error="rpc error making call: rpc error making call: Check 'serfHealth' is in critical state"
Jun 23 09:44:05 we1-prd-infrastructure-vault-vault-2 consul[10606]: 2021-06-23T09:44:05.300Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.60.151.16:8300 error="rpc error making call: rpc error making call: Check 'serfHealth' is in critical state"
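The log lines above show two error kinds: `invalid session` on `KVS.Apply` (the lock acquire) and `Check 'serfHealth' is in critical state` on `Session.Apply` (session creation). As far as I understand, a Consul session is tied to the node's `serfHealth` check by default, so a node that is briefly marked failed would lose its session and lock, which may explain both errors. To see how often each occurs, the agent log can be tallied with a small script. This is a sketch under assumptions: `summarize` and `classify` are hypothetical names, and the regular expression is written against the log format shown in this issue only. It counts `agent.client` lines only, since most failures are also logged a second time by `agent.http`.

```python
import re
from collections import Counter

# Matches the agent.client RPC failure lines in the format shown above.
LINE_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T[\d:.]+Z \[ERROR\] agent\.client: "
    r"RPC failed to server: method=(?P<method>\S+) "
    r"server=(?P<server>\S+) error=\"(?P<error>.*)\"$"
)


def classify(error):
    """Map an error string to a coarse category."""
    if "invalid session" in error:
        return "invalid_session"
    if "Check 'serfHealth' is in critical state" in error:
        return "serf_health_critical"
    return "other"


def summarize(lines):
    """Count agent.client RPC failures by (method, category)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            counts[(match.group("method"), classify(match.group("error")))] += 1
    return counts
```

Feeding it the output of the agent's journal (for example, the lines of a saved log file) gives a quick view of whether the `serfHealth` failures and the invalidated sessions cluster around the same times.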
@vikramhansawat
Author

vikramhansawat commented Jun 23, 2021

vault config file:
listener "tcp" {
  address          = "0.0.0.0:8200"
  cluster_address  = "0.0.0.0:8201"
  tls_disable      = "true"
}

listener "tcp" {
  address     = "0.0.0.0:8202"
  tls_cert_file = "certificate.pem"
  tls_key_file = "key.pem"
  tls_disable_client_certs = "true"
}

storage "azure" {
  accountName = "xxxxxxxx"
  accountKey  = "xxxxxxxxx"
  container   = "xxxxxxxx"
  environment = "AzurePublicCloud"
}

ha_storage "consul" {
  address = "10.131.3.9:8500"
  path    = "infrastructure/vault/"
  service = "infrastructure-core-vault"
  service_tags = "vault_version=1.2.2"
  token   = "xxxx"
}

seal "azurekeyvault" {
  client_id      = "xxxxxxxxxxxxxxx"
  client_secret  = "xxxxxxxxxxxxx"
  tenant_id      = "xxxxxxxxxx"
  environment    = "AzurePublicCloud"
  vault_name     = "xxxxxxxxx"
  key_name       = "xxxxxxxxx"
}

api_addr = "http://prd-vault.ff.net:8200"
disable_mlock = true

max_lease_ttl = "17520h"
ui = true

telemetry {
  prometheus_retention_time = "30s",
  disable_hostname = true
}

log_format = "json"
log_level = "Info"

@vikramhansawat changed the title from "consul checks showing standby vault instance as active" to "consul checks showing standby vault instance as active and Check serfHealth is in critical state" on Jun 23, 2021
@jsosulska jsosulska added theme/consul-vault Relating to Consul & Vault interactions theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp labels Jun 28, 2021