device name change due to azure disk host cache setting #60344

Closed
andyzhangx opened this issue Feb 24, 2018 · 0 comments · Fixed by #60346
@andyzhangx (Member) commented:

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
Since v1.7, the default host cache setting for azure disks changed from `None` to `ReadWrite`. This can cause device names to change after attaching multiple disks to an azure VM, ultimately making the disks inaccessible from pods.
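
For context, this host cache setting surfaces in the Kubernetes API as the `cachingMode` field of the `azureDisk` volume source. A minimal sketch of an inline azure disk volume that pins it explicitly (the disk name and URI below are placeholders):

```
volumes:
- name: mydata
  azureDisk:
    # placeholders - substitute a real disk name and URI
    diskName: example-disk
    diskURI: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/example-disk
    # valid values: None, ReadOnly, ReadWrite
    cachingMode: None
```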

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
The following StatefulSet with 8 replicas will always fail due to the device name change.

```
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: myvm
spec:
  selector:
    matchLabels:
      app: myvm # has to match .spec.template.metadata.labels
  serviceName: "myvm"
  replicas: 8
  template:
    metadata:
      labels:
        app: myvm 
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: ubuntu
        image: ubuntu:xenial
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;" ]
        livenessProbe:
          exec:
            command:
            - ls
            - /mnt/data
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 100m
            memory: 250Mi
        volumeMounts:
        - name: mydata
          mountPath: /mnt/data
  volumeClaimTemplates:
  - metadata:
      name: mydata
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: default
      resources:
        requests:
          storage: 1Gi
```
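
To reproduce, apply the manifest and watch the pods (the file name here is arbitrary):

```
kubectl apply -f myvm-statefulset.yaml
# watch the replicas come up; the later ones start crash looping
# once the 6th data disk is attached to the same node
kubectl get pods -l app=myvm -w
```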

Anything else we need to know?:
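As a possible workaround until the fix lands, the caching mode can be pinned to `None` through a custom StorageClass instead of the built-in `default` one (a sketch; `cachingmode` is a documented parameter of the `kubernetes.io/azure-disk` provisioner):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-disk-nocache
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Standard_LRS
  # match the azure portal default and the pre-v1.7 in-tree default
  cachingmode: None
```

Then reference it from the claim template via `storageClassName: azure-disk-nocache`.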

Environment:

  • Kubernetes version (use kubectl version): v1.7 - v1.10
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/sig azure
/assign

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/azure needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 24, 2018
@andyzhangx andyzhangx changed the title fix device name change due to azure disk host cache setting device name change due to azure disk host cache setting Feb 25, 2018
k8s-github-robot pushed a commit that referenced this issue Feb 25, 2018
Automatic merge from submit-queue (batch tested with PRs 60346, 60135, 60289, 59643, 52640). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix device name change issue for azure disk

**What this PR does / why we need it**:
fixes the device name change issue for azure disk: since v1.7, the default host cache setting changed from `None` to `ReadWrite`, while the default host cache setting in the azure portal is `None`

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #60344, #57444
It also fixes the following issues:
Azure/acs-engine#1918
Azure/AKS#201

**Special notes for your reviewer**:
Since v1.7, the default host cache setting changed from `None` to `ReadWrite`. This can cause device names to change after attaching multiple disks to an azure VM, ultimately making the disks inaccessible from pods.
For example:
a statefulset with 8 replicas (each with an azure disk) on one node will always fail. According to my observation, attaching the 6th data disk always makes the device names change, and some pods cannot access their data disks after that.
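
To check whether the kernel device names still line up with the azure LUNs on a node, the udev-maintained symlinks can be compared against the SCSI addresses, e.g.:

```
# LUN -> device symlinks created by the azure udev rules
ls -l /dev/disk/azure/scsi1/
# each block device with its SCSI host:channel:target:lun address
lsblk -o NAME,HCTL,SIZE,MOUNTPOINT
```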

I have verified this fix on v1.8.4.
Without this PR on one node (device names change):
```
azureuser@k8s-agentpool2-40588258-0:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdk
    ├── lun1 -> ../../../sdj
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

With this PR on one node (no device name change):
```
azureuser@k8s-agentpool2-40588258-1:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

In the following output, `myvm-0` and `myvm-1` are crashing due to the device name change; after the controller manager replacement, the `myvm2-x` pods work well.

```
Every 2.0s: kubectl get po                                                                                                                                                   Sat Feb 24 04:16:26 2018

NAME      READY     STATUS             RESTARTS   AGE
myvm-0    0/1       CrashLoopBackOff   13         41m
myvm-1    0/1       CrashLoopBackOff   11         38m
myvm-2    1/1       Running            0          35m
myvm-3    1/1       Running            0          33m
myvm-4    1/1       Running            0          31m
myvm-5    1/1       Running            0          29m
myvm-6    1/1       Running            0          26m

myvm2-0   1/1       Running            0          17m
myvm2-1   1/1       Running            0          14m
myvm2-2   1/1       Running            0          12m
myvm2-3   1/1       Running            0          10m
myvm2-4   1/1       Running            0          8m
myvm2-5   1/1       Running            0          5m
myvm2-6   1/1       Running            0          3m
```

**Release note**:

```
fix device name change issue for azure disk
```
/assign @karataliu 
/sig azure
@feiskyer could you mark it for the v1.10 milestone?
@brendandburns @khenidak @rootfs @jdumars FYI

Since it's a critical bug, I will cherry-pick this fix to v1.7-v1.9. Note that v1.6 does not have this issue since the default cachingmode there is `None`.