Issue Description
Type: bug report/question
Describe what happened (or what feature you want)
During a network experiment with a timeout set, when the number of target application pods is high, the operator runs the status check against the wrong DaemonSet (DS) pod and gets the following error:
{"code":406,"success":false,"error":"data not found"}
Describe what you expected to happen
The experiment should start and, after the timeout period, the operator should send a destroy command to the DaemonSet pods once the status check is done.
How to reproduce it (as minimally and precisely as possible)
Inject network latency on 50 pods with evict-percent 100
Tell us your environment
K8s environment - application with 50 pods
Anything else we need to know?
Below is the log snippet from chaosblade
time="2021-06-17T07:25:06Z" level=info msg="Exec command in pod" command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" container=chaosblade-tool podName=chaosblade-tool-nlpjb podNamespace=sre-chaos
time="2021-06-17T07:25:07Z" level=info msg="get err message" command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" container=chaosblade-tool err="{"code":406,"success":false,"error":"data not found"}" out= podName=chaosblade-tool-nlpjb podNamespace=sre-chaos
time="2021-06-17T07:25:07Z" level=error msg="exec: k8s exec failed, err: {"code":406,"success":false,"error":"data not found"}\n" location=github.com/chaosblade-io/chaosblade-operator/exec/model.checkExperimentStatus.func1.1 uid=
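To confirm which DaemonSet pod the operator actually ran the status check in, the pod names can be pulled out of log lines like the ones above. This is only a minimal sketch, assuming the logs keep the `key=value` layout shown here; `pods_checked` is a hypothetical helper and not part of chaosblade.

```python
import re
from collections import Counter

# Matches the podName=<value> field in the operator's key=value log lines.
POD_NAME = re.compile(r"\bpodName=(\S+)")


def pods_checked(log_lines, experiment_uid):
    """Count how often each pod was used for `blade status <experiment_uid>`."""
    counts = Counter()
    for line in log_lines:
        # Only consider lines that refer to the status command for this UID.
        if f"blade status {experiment_uid}" not in line:
            continue
        match = POD_NAME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# One line copied from the log snippet in this issue.
sample = [
    'time="2021-06-17T07:25:06Z" level=info msg="Exec command in pod" '
    'command="[/opt/chaosblade/blade status 6da0140f5db1fc4a]" '
    'container=chaosblade-tool podName=chaosblade-tool-nlpjb podNamespace=sre-chaos',
]
print(pods_checked(sample, "6da0140f5db1fc4a"))
```

Comparing the pod names it reports with the DS pod on the node where the experiment ran shows whether the status check went to the wrong pod.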
Snippet from Experiment status
The experiment ran on worker node worker-***00000z, which hosts DS pod chaosblade-tool-dlkj8. Instead of checking in chaosblade-tool-dlkj8, the operator checked in chaosblade-tool-nlpjb and got "data not found", even though the data was available in the correct DS pod.
Wrong DS selected for Status check
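The expected behaviour amounts to picking the chaosblade-tool pod that runs on the same node as the experiment. Below is a minimal sketch of that selection, assuming each pod is known by its name and node name. `Pod` and `tool_pod_for_node` are hypothetical names for illustration and do not reflect the operator's actual Go code; the node names `worker-a` and `worker-b` are placeholders, since the issue does not state which node chaosblade-tool-nlpjb runs on.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Pod:
    name: str
    node_name: str


def tool_pod_for_node(tool_pods: Iterable[Pod], node_name: str) -> Optional[Pod]:
    """Return the DaemonSet pod scheduled on node_name, or None if there is none."""
    for pod in tool_pods:
        if pod.node_name == node_name:
            return pod
    return None


# Placeholder node names; only the pod names come from the issue.
tool_pods = [
    Pod("chaosblade-tool-nlpjb", "worker-a"),
    Pod("chaosblade-tool-dlkj8", "worker-b"),
]

selected = tool_pod_for_node(tool_pods, "worker-b")
print(selected.name if selected else "no tool pod on that node")
```

Keying the lookup on the node name, rather than taking whichever DS pod comes first, is what keeps the status check and the later destroy command on the pod that holds the experiment data.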
After 22 retries the operator stopped logging anything to stdout, but I saw that the experiment had been destroyed on the target containers. The status of a destroyed experiment is attached above. It was deleted after 120 seconds, although the timeout was 100 seconds.