Skip to content
This repository was archived by the owner on Feb 22, 2022. It is now read-only.

[incubator/vault] Use httpGet instead of TCP Socket for liveness check #9462

Merged
merged 3 commits into from
Mar 7, 2019
Merged

[incubator/vault] Use httpGet instead of TCP Socket for liveness check #9462

merged 3 commits into from
Mar 7, 2019

Conversation

jbialy
Copy link
Contributor

@jbialy jbialy commented Nov 21, 2018

What this PR does / why we need it:

Currently, the liveness probe uses tcpSocket when determining if the pod is alive or not. This works OK when Vault is listening on HTTP. Although, when TLS is enabled, the liveness probe continues to work but a warning is logged for every check interval:

2018-06-12 14:25:47.867270 I | http: TLS handshake error from 100.127.0.0:16184: EOF 
...

This PR changes the liveness probe to use httpGet where the protocol scheme can be configured to HTTPS depending on the values specified in values.yaml. This avoids filling the logs with the above warning.

The probe checks Vault's /v1/sys/health endpoint in accordance with:

https://www.vaultproject.io/api/system/health.html#standbyok

Checklist

  • DCO signed
  • Chart Version bumped
  • Variables are documented in the README.md

@helm-bot helm-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 21, 2018
@jbialy
Copy link
Contributor Author

jbialy commented Nov 21, 2018

/assign @linki

@helm-bot helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). labels Nov 21, 2018
@jbialy
Copy link
Contributor Author

jbialy commented Dec 5, 2018

/assign @seanknox

@helm-bot helm-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 17, 2018
@jbialy
Copy link
Contributor Author

jbialy commented Dec 17, 2018

👀 PTAL

@jbialy
Copy link
Contributor Author

jbialy commented Dec 27, 2018

/assign

@jbialy
Copy link
Contributor Author

jbialy commented Jan 4, 2019

There seem to be incubator/vault PRs that are 2 months old, are these still being looked at / considered?

@cpuspellcaster
Copy link

My team is encountering a slightly different issue that would have the same fix: On EKS, with the tcpSocket configuration, the liveness probe configures a AWS ELB health check to perform a TCP ping on the API endpoint. However, the health checks are failing even the service is healthy (when checked via port-forward). (The nearest explanation that I can find is the TCP ping times out because the socket connection is kept while the service waits for a HTTP request.)

Regardless, switching to httpGet would resolve our issue.

@helm-bot helm-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 16, 2019
@jbialy
Copy link
Contributor Author

jbialy commented Jan 16, 2019

Bump. Rebased, PTAL 👀 🙇

@helm-bot helm-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 21, 2019
@jbialy
Copy link
Contributor Author

jbialy commented Feb 4, 2019

/assign @viglesiasce

@stale
Copy link

stale bot commented Mar 6, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2019
@mattfarina
Copy link
Contributor

cc @jpds

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2019
@mattfarina
Copy link
Contributor

/ok-to-test

@helm-bot helm-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 7, 2019
@mattfarina
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbialy, mattfarina

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 7, 2019
@k8s-ci-robot k8s-ci-robot merged commit 0ecd993 into helm:master Mar 7, 2019
@jbialy jbialy deleted the vault-liveness-check branch March 7, 2019 18:38
Copy link

@mkozjak mkozjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not right. You shouldn't kill a pod when you receive an error like 501 or 503, since it means Vault is in a sealed state.
I'm specifically getting a 503, since it's a fresh installation, obviously.

https://www.vaultproject.io/api/system/health.html

So with 0.15.0 I cannot unseal Vault since it gets restarted, ending up in a CrashLoopBackOff.

Check describe output:

Warning  Unhealthy  3m (x7 over 4m)  kubelet, 159.69.17.129  \
Liveness probe failed: HTTP probe failed with statuscode: 503

Warning  Unhealthy  3m (x7 over 4m)  kubelet, 159.69.17.129  \
Readiness probe failed: HTTP probe failed with statuscode: 503

@jbialy
Copy link
Contributor Author

jbialy commented Mar 9, 2019

Ah, good catch! I didn't foresee this. One way to fix it would be to change the sealedcode and uninitcode to return 200 response code.

@mkozjak
Copy link

mkozjak commented Mar 9, 2019

@jbialy In upstream? Hmm, doubt it that they'll do it easily. Reverting back to 0.14.9, locally.
I guess you should revert this bugfix, too, as it'll only work in dev env.

@jbialy
Copy link
Contributor Author

jbialy commented Mar 9, 2019

sealedcode and uninitcode are params that can be passed to the /health endpoint and can be used to overwrite the response from Vault, so it should do the job, the problem is that the review process for charts takes a while. I had this PR open for about four months.

alvgarvilla pushed a commit to alvgarvilla/charts that referenced this pull request Mar 13, 2019
helm#9462)

* use httpGet instead of TCP socket

Signed-off-by: Janusz Bialy <janusz.bialy@qlik.com>

* bump chart version

Signed-off-by: Janusz Bialy <jbialy@gmail.com>
rmccorm4 pushed a commit to rmccorm4/charts that referenced this pull request Mar 26, 2019
helm#9462)

* use httpGet instead of TCP socket

Signed-off-by: Janusz Bialy <janusz.bialy@qlik.com>

* bump chart version

Signed-off-by: Janusz Bialy <jbialy@gmail.com>
devnulled pushed a commit to devnulled/charts that referenced this pull request Apr 25, 2019
helm#9462)

* use httpGet instead of TCP socket

Signed-off-by: Janusz Bialy <janusz.bialy@qlik.com>

* bump chart version

Signed-off-by: Janusz Bialy <jbialy@gmail.com>
@jbialy jbialy mentioned this pull request Jun 18, 2019
3 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). lgtm Indicates that a PR is ready to be merged. ok-to-test size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.