
Network-loss tests do not work with minikube #979

Closed

ksatchit opened this issue Dec 4, 2019 · 9 comments

Comments

@ksatchit
Member

ksatchit commented Dec 4, 2019

What happened:

Running a pod-network-loss experiment (and, in all probability, the network-delay test as well) on minikube does not inject the desired chaos. Thanks to @LaumiH for discovering this.

This was observed with the following versions:

  • Minikube v1.2, Docker 18.09.06 / K8s 1.15.0
  • Minikube v1.5.2, Docker 18.09.9 / K8s 1.16.2

The test involved pinging public IPs from inside the pod, as well as pinging the pod IP itself from a cluster node (a rough sketch of this check is below).
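A sketch of that verification, with a hypothetical pod name and IP; note that the stock debian image may need iputils-ping installed first:

# from inside the target pod, ping a public IP
kubectl exec -it <busy1-pod> -- ping -c 5 8.8.8.8

# from a cluster node (e.g. via minikube ssh), ping the pod IP
kubectl get pod <busy1-pod> -o wide   # note the pod IP
ping -c 5 <pod-ip>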

What you expected to happen:

  • The network chaos should be injected successfully.

How to reproduce it (as minimally and precisely as possible):

  • Set up a sample deployment and run the litmuschaos pod-network-loss experiment. Sample YAMLs are provided in the comments below.

Anything else we need to know?:

  • These tests run successfully with the same Docker/K8s versions on non-minikube clusters (for example, kubeadm-based clusters and older GKE clusters).
@ksatchit
Member Author

ksatchit commented Dec 4, 2019

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: busy1
  name: busy1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busy1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: busy1
    spec:
      containers:
        - image: debian
          name: busy-1
          command: [ "/bin/bash", "-c", "sleep 10000;exit 0" ]

ChaosEngine:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine
  namespace: default
spec:
  jobCleanUpPolicy: delete
  monitoring: false
  appinfo:
    # app namespace
    appns: default
    # to see app labels, run: kubectl get pods --show-labels
    applabel: 'app=busy1'
    # supported kinds: deployment, statefulset
    appkind: deployment
  chaosServiceAccount: 'nginx'
  experiments:
    - name: pod-network-loss
      spec:
        components:
        - name: TARGET_CONTAINER
          value: 'busy-1'
        - name: NETWORK_INTERFACE
          value: 'eth0'
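A rough sketch of applying these manifests and checking the outcome; the file names and the ChaosResult name are assumptions, and litmus plus the pod-network-loss experiment/RBAC are assumed to be installed already:

# apply the sample application and the chaos engine shown above
kubectl apply -f deployment.yaml
kubectl apply -f chaosengine.yaml

# watch the experiment runner pods come up
kubectl get pods -n default -w

# check the verdict once the experiment finishes
kubectl describe chaosresult engine-pod-network-loss -n default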

@LaumiH
Member

LaumiH commented Dec 4, 2019

I opened an issue in the pumba repo; it also explains in detail what has been tested so far.

Pumba netem itself runs successfully on my laptop with Docker 19.

@ksatchit ksatchit changed the title Network-loss tests do not work with minkube Network-loss tests do not work with minikube Dec 4, 2019
@LaumiH
Member

LaumiH commented Dec 4, 2019

I have made a little progress: someone with the same problem in a pumba chat said that minikube might be missing the required kernel module, sch_netem. If I get a shell into the debian container inside minikube and run the bare netem command tc qdisc add dev eth0 root netem loss random 100, I get RTNETLINK answers: Operation not permitted as the answer.
I will continue looking into this. Do not expect any help from the pumba developer, as he closed my issue saying it has something to do with VirtualBox, so he can't help.
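A minimal way to check whether the container is even allowed to modify qdiscs (tc qdisc add needs CAP_NET_ADMIN), assuming the busy1 sample deployment from above:

# find the busy1 pod
POD=$(kubectl get pod -l app=busy1 -o jsonpath='{.items[0].metadata.name}')

# effective capability bitmask of PID 1 in the container;
# CAP_NET_ADMIN is bit 12, so it must be set for tc to succeed
kubectl exec "$POD" -- grep CapEff /proc/1/status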

@LaumiH
Member

LaumiH commented Dec 4, 2019

I will experiment with running containers in privileged mode on Monday; let's see what that gives me. It seems the kernel module is there, but the container does not have the privileges to execute the netem command.

@LaumiH
Member

LaumiH commented Dec 4, 2019

I now ran the container with

securityContext:
  privileged: true

and get Error: Specified qdisc not found. Got the idea from here.
At least something else happens 🤣.

Running the container in privileged mode changed nothing for pumba; it still does not work.

@LaumiH
Member

LaumiH commented Dec 4, 2019

Okay, I kept searching.
ip addr says eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default. The qdisc noqueue part seemed interesting, so I googled a bit and found that it is the default for virtual devices (link). Maybe this is the reason. I just want to document what I am doing so as not to forget until next week ^^.

Edit: I can run tc -d qdisc show dev eth0; the eth0 interface is found and gives qdisc noqueue 0: root refcnt 2. Hm.

@LaumiH
Member

LaumiH commented Dec 4, 2019

Maybe the netem kernel module is really missing. tc qdisc add dev eth0 root pfifo_fast works, but anything with netem fails, saying it is an invalid qdisc name. Maybe someone knows more than I do. (A quick check for the module is sketched below.)
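One way to check from inside the minikube VM whether the sch_netem module is present at all (a sketch; module paths depend on the minikube ISO):

minikube ssh

# inside the VM:
lsmod | grep sch_netem                              # is it loaded?
find /lib/modules/$(uname -r) -name 'sch_netem*'    # is it shipped at all?
sudo modprobe sch_netem                             # fails if the module was not built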

@LaumiH
Member

LaumiH commented Dec 9, 2019

I need to test it a bit further, as it only works with privileged containers, but for now my PR should fix the issue.

@LaumiH
Member

LaumiH commented Dec 14, 2019

Minikube has been patched with the missing kernel module; use version >= 1.6.0, released on 2019-12-10. PR #991 from @ksatchit also means that containers no longer have to be privileged for netem to work, as netem is now executed in a separate container. As far as I have tested, minikube has no further limitations in netem-related experiments, so this issue can be closed.
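To double-check on a patched minikube (>= 1.6.0), a quick sanity check is to confirm that the module now loads in the VM:

minikube start

minikube ssh
# inside the VM:
sudo modprobe sch_netem && lsmod | grep sch_netem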
