
Network Experiment status check done on Wrong Daemon Set pod #516

Open
NishantG01 opened this issue Jun 18, 2021 · 0 comments
Labels
type/bug Something isn't working
NishantG01 commented Jun 18, 2021

Issue Description

Type: bug report/question

Describe what happened (or what feature you want)

During a network experiment with a timeout set, when the number of target application pods is high, the operator tries to run the status check on the wrong DaemonSet (DS) pod and gets the following error:
{"code":406,"success":false,"error":"data not found"}

Describe what you expected to happen

The experiment should start and, after the timeout period, the operator should trigger a destroy command to the correct Daemon Set pod once the status check is done.

How to reproduce it (as minimally and precisely as possible)

  1. Inject network latency on 50 pods with evict-percent 100

Tell us your environment

K8s environment - application with 50 pods

Anything else we need to know?

Below is a log snippet from chaosblade:
time="2021-06-17T07:25:06Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" container=chaosblade-tool podName=chaosblade-tool-nlpjb podNamespace=sre-chaos
time="2021-06-17T07:25:07Z" level=info msg="get err message" command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" container=chaosblade-tool err="{"code":406,"success":false,"error":"data not found"}" out= podName=chaosblade-tool-nlpjb podNamespace=sre-chaos
time="2021-06-17T07:25:07Z" level=error msg="exec: k8s exec failed, err: {"code":406,"success":false,"error":"data not found"}\n" location=github.com/chaosblade-io/chaosblade-operator/exec/model.checkExperimentStatus.func1.1 uid=

Snippet from the experiment status:

{
  "id": "6da0140f5db1fc4a",
  "identifier": "sre-test/worker-eus2-lab-aasretest-vmss00000z/nginx-deployment-65c64b578d-9srs9/nginx/d562650fbf7d",
  "kind": "container",
  "state": "Success",
  "success": true
},

The experiment ran on worker node worker-***00000z, which hosts the DS pod chaosblade-tool-dlkj8. Instead of checking chaosblade-tool-dlkj8, the operator checked chaosblade-tool-nlpjb and got "data not found", even though the data was available in the correct DS pod.

The wrong DS pod was selected for the status check; only the node-local pod has the record:

➜ k exec -it chaosblade-tool-dlkj8 -- /bin/bash [17/06/21 | 1:16:36]
bash-4.4# /opt/chaosblade/blade status 6da0140f5db1fc4a
{
  "code": 200,
  "success": true,
  "result": {
    "Uid": "6da0140f5db1fc4a",
    "Command": "docker",
    "SubCommand": "network delay",
    "Flag": " --image-version=1.2.0 --interface=eth0 --image-repo=***/chaosbladeio/chaosblade-tool --offset=1000 --time=500 --container-id=d562650fbf7d --timeout=100 --local-port=8080",
    "Status": "Destroyed",
    "Error": "",
    "CreateTime": "2021-06-17T07:22:16.023395837Z",
    "UpdateTime": "2021-06-17T07:24:20.821470897Z"
  }
}
bash-4.4#

➜ k exec -it chaosblade-tool-nlpjb -- /bin/bash [17/06/21 | 1:18:41]
bash-4.4# /opt/chaosblade/blade status 6da0140f5db1fc4a
{"code":406,"success":false,"error":"data not found"}
bash-4.4#
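The sessions above show that the status record exists only in the DS pod on the target pod's node, so the fix is to exec the status check in the DS pod whose node matches the target. A minimal sketch of that node-matched selection, assuming a simplified pod model (the `Pod` struct and `selectToolPodByNode` are hypothetical names, not the operator's actual API):

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod; only the fields
// relevant to DaemonSet pod selection are modeled here.
type Pod struct {
	Name     string
	NodeName string
}

// selectToolPodByNode returns the chaosblade-tool DaemonSet pod scheduled on
// the same node as the target pod. Querying any other DS pod yields the
// 406 "data not found" error seen in this issue.
func selectToolPodByNode(targetNodeName string, toolPods []Pod) (Pod, bool) {
	for _, p := range toolPods {
		if p.NodeName == targetNodeName {
			return p, true
		}
	}
	return Pod{}, false
}

func main() {
	// Illustrative node names; the real ones are the vmss worker nodes above.
	toolPods := []Pod{
		{Name: "chaosblade-tool-nlpjb", NodeName: "worker-node-a"},
		{Name: "chaosblade-tool-dlkj8", NodeName: "worker-node-z"},
	}
	// The target pod ran on worker-node-z, so the status check
	// must be exec'd in chaosblade-tool-dlkj8.
	pod, ok := selectToolPodByNode("worker-node-z", toolPods)
	fmt.Println(pod.Name, ok) // chaosblade-tool-dlkj8 true
}
```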

After 22 retries, the operator stopped logging anything to stdout, but the experiment was in fact destroyed on the target containers (the status of the destroyed experiment is attached above). It was deleted after 120 s, whereas the timeout was 100 s.

@xcaspar xcaspar added the type/bug Something isn't working label Jun 24, 2021
@xcaspar xcaspar added this to the v1.3.0 milestone Jun 24, 2021