
Network Experiment status check done on Wrong Daemon Set pod #516

Open
NishantG01 opened this issue Jun 18, 2021 · 0 comments
Labels
type/bug Something isn't working
NishantG01 commented Jun 18, 2021

Issue Description

Type: bug report/question

Describe what happened (or what feature you want)

During a network experiment with a timeout set, when the number of target application pods is high, the operator tries to run the status check on the wrong DaemonSet (DS) pod and gets the following error:
{"code":406,"success":false,"error":"data not found"}

Describe what you expected to happen

The experiment should start and, after the timeout period, the operator should trigger a destroy command to the correct Daemon Set pod once the status check is done.

How to reproduce it (as minimally and precisely as possible)

  1. Inject network latency on 50 pods with evict-percent 100

Tell us your environment

K8s environment - application with 50 pods

Anything else we need to know?

Below is a log snippet from chaosblade:
time="2021-06-17T07:25:06Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" container=chaosblade-tool podName=chaosblade-tool-nlpjb podNamespace=sre-chaos
time="2021-06-17T07:25:07Z" level=info msg="get err message" command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" container=chaosblade-tool err="{"code":406,"success":false,"error":"data not found"}" out= podName=chaosblade-tool-nlpjb podNamespace=sre-chaos
time="2021-06-17T07:25:07Z" level=error msg="exec: k8s exec failed, err: {"code":406,"success":false,"error":"data not found"}\n" location=github.com/chaosblade-io/chaosblade-operator/exec/model.checkExperimentStatus.func1.1 uid=

Snippet from the experiment status:

{
  "id": "6da0140f5db1fc4a",
  "identifier": "sre-test/worker-eus2-lab-aasretest-vmss00000z/nginx-deployment-65c64b578d-9srs9/nginx/d562650fbf7d",
  "kind": "container",
  "state": "Success",
  "success": true
},

The experiment ran on worker node worker-***00000z, which hosts the DS pod chaosblade-tool-dlkj8. Instead of checking chaosblade-tool-dlkj8, the operator checked chaosblade-tool-nlpjb and got "data not found", even though the data was available in the correct DS pod.

The wrong DS pod was selected for the status check; only the node-local pod has the record:

➜ k exec -it chaosblade-tool-dlkj8 -- /bin/bash [17/06/21 | 1:16:36]
bash-4.4# /opt/chaosblade/blade status 6da0140f5db1fc4a
{
  "code": 200,
  "success": true,
  "result": {
    "Uid": "6da0140f5db1fc4a",
    "Command": "docker",
    "SubCommand": "network delay",
    "Flag": " --image-version=1.2.0 --interface=eth0 --image-repo=***/chaosbladeio/chaosblade-tool --offset=1000 --time=500 --container-id=d562650fbf7d --timeout=100 --local-port=8080",
    "Status": "Destroyed",
    "Error": "",
    "CreateTime": "2021-06-17T07:22:16.023395837Z",
    "UpdateTime": "2021-06-17T07:24:20.821470897Z"
  }
}
bash-4.4#

➜ k exec -it chaosblade-tool-nlpjb -- /bin/bash [17/06/21 | 1:18:41]
bash-4.4# /opt/chaosblade/blade status 6da0140f5db1fc4a
{"code":406,"success":false,"error":"data not found"}
bash-4.4#
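The sessions above show that the status record exists only in the DS pod on the target pod's node, so the fix is to exec the status check in the DS pod whose node matches the target. A minimal sketch of that node-matched selection, assuming a simplified pod model (the `Pod` struct and `selectToolPodByNode` are hypothetical names, not the operator's actual API):

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod; only the fields
// relevant to DaemonSet pod selection are modeled here.
type Pod struct {
	Name     string
	NodeName string
}

// selectToolPodByNode returns the chaosblade-tool DaemonSet pod scheduled on
// the same node as the target pod. Querying any other DS pod yields the
// 406 "data not found" error seen in this issue.
func selectToolPodByNode(targetNodeName string, toolPods []Pod) (Pod, bool) {
	for _, p := range toolPods {
		if p.NodeName == targetNodeName {
			return p, true
		}
	}
	return Pod{}, false
}

func main() {
	// Illustrative node names; the real ones are the vmss worker nodes above.
	toolPods := []Pod{
		{Name: "chaosblade-tool-nlpjb", NodeName: "worker-node-a"},
		{Name: "chaosblade-tool-dlkj8", NodeName: "worker-node-z"},
	}
	// The target pod ran on worker-node-z, so the status check
	// must be exec'd in chaosblade-tool-dlkj8.
	pod, ok := selectToolPodByNode("worker-node-z", toolPods)
	fmt.Println(pod.Name, ok) // chaosblade-tool-dlkj8 true
}
```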

After 22 retries, the operator stopped logging anything to stdout, but the experiment was in fact destroyed on the target containers (the status of the destroyed experiment is attached above). It was deleted after 120 s, whereas the timeout was 100 s.

@xcaspar xcaspar added the type/bug Something isn't working label Jun 24, 2021
@xcaspar xcaspar added this to the v1.3.0 milestone Jun 24, 2021