id | title | sidebar_label |
---|---|---|
t-cstor | Troubleshooting OpenEBS - cStor | cStor |
OpenEBS Documentation is now migrated to https://openebs.io/docs. The page you are currently viewing is a static snapshot and will be removed in the upcoming releases.
General guidelines for troubleshooting
- Contact OpenEBS Community for support.
- Search for similar issues added in this troubleshooting section.
- Search for any reported issues on StackOverflow under the OpenEBS tag.
cStor volume goes into read-only state
cStor pools, volumes are offline and pool manager pods are stuck in pending state
Pool Operation Hung Due to Bad Disk
Volume Migration when the underlying cStor pool is lost
One of the cStorVolumeReplica(CVR) will have its status as `Invalid` after corresponding pool pod gets recreated
When a user deletes a cStor pool pod, there is a high chance that the CVRs associated with that pool go into the `Invalid` state.
Following is a sample output of kubectl get cvr -n openebs
Troubleshooting
Sample logs of `cstor-pool-mgmt` when the issue happens:
From the highlighted logs above, we can confirm that `cstor-pool-mgmt` in the new pod is communicating with `cstor-pool` in the old pod: the first highlighted log says `cstor pool found`, and the next highlighted one says `pool is really imported`.
Possible Reason:
When a cStor pool pod is deleted, there is a high chance that two pool pods for the same pool are present at once: the old pool pod is in the `Terminating` state (not all of its containers have terminated yet) and the new pool pod is in the `Running` state (some of its containers may be running, but not all). In this scenario, the `cstor-pool-mgmt` container in the new pool pod communicates with the `cstor-pool` container in the old pool pod. This can cause the CVR resource to be set to `Invalid`.
Note: This issue has been observed in all OpenEBS versions up to 1.2.
Resolution:
Edit the `Phase` of the cStorVolumeReplica (CVR) from `Invalid` to `Offline`. After a few seconds, the CVR moves to the `Healthy` or `Degraded` state, depending on the rebuild progress.
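As a non-interactive alternative to editing the resource by hand, the phase change could be sketched as a merge patch. This is only a sketch under assumptions, not the documented procedure: the helper names are hypothetical, it assumes the phase is stored at `status.phase` of the CVR, and it assumes the resource accepts a plain merge patch on that field (which may not hold if the CRD serves status through a subresource). The helpers only print the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper: build a merge patch that sets status.phase.
# Assumption: the CVR keeps its phase at .status.phase.
cvr_phase_patch() {
  printf '{"status":{"phase":"%s"}}' "$1"
}

# Hypothetical helper: print (do not run) the kubectl command for one CVR.
# $1 = CVR name, $2 = namespace where OpenEBS is installed.
cvr_offline_cmd() {
  printf "kubectl patch cvr %s -n %s --type=merge -p '%s'\n" \
    "$1" "$2" "$(cvr_phase_patch Offline)"
}

# Example (prints the command for review):
#   cvr_offline_cmd <cvr_name> openebs
```

After applying the change, `kubectl get cvr -n openebs` can be used to watch the replica move to `Healthy` or `Degraded`.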
Application mount point running on cStor volume went into read only state.
Possible Reason:
If the `cStorVolume` is `Offline`, or the corresponding target pod is unavailable for more than 120 seconds (the iSCSI timeout), the PV is mounted as a `read-only` filesystem. For more details on the different states of a cStor volume, see here.
Troubleshooting
Check the status of corresponding cStor volume using the following command:
kubectl get cstorvolume -n <openebs_installed_namespace> -l openebs.io/persistent-volume=<PV_NAME>
If the cStor volume is in the `Healthy` or `Degraded` state, restarting the application pod alone brings the cStor volume back to `RW` mode. If the cStor volume is `Offline`, reach out to OpenEBS Community for assistance.
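The decision described above can be written down as a small lookup. This is a minimal sketch: the function name is hypothetical, and the commented usage line assumes the volume state is exposed at `.status.phase` of the `cstorvolume` resource.

```shell
#!/bin/sh
# Hypothetical helper: map a cStor volume status to the next step
# described in the troubleshooting text above.
cstor_ro_next_step() {
  case "$1" in
    Healthy|Degraded) echo "restart the application pod" ;;
    Offline)          echo "contact the OpenEBS Community" ;;
    *)                echo "unknown status: $1" ;;
  esac
}

# Example usage (assumes the status is stored at .status.phase):
#   status=$(kubectl get cstorvolume -n <openebs_installed_namespace> \
#     -l openebs.io/persistent-volume=<PV_NAME> \
#     -o jsonpath='{.items[0].status.phase}')
#   cstor_ro_next_step "$status"
```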
cStor pools, volumes are offline and pool manager pods are stuck in pending state
The cStor pools and volumes are offline, and the pool manager pods are stuck in a `pending` state, as shown below:
$ kubectl get po -n openebs -l app=cstor-pool
Sample Output:
NAME READY STATUS RESTARTS AGE
cstor-cspc-chjg-85f65ff79d-pq9d2 0/3 Pending 0 16m
cstor-cspc-h99x-57888d4b5-kh42k 0/3 Pending 0 15m
cstor-cspc-xs4b-85dbbbb59b-wvhmr 0/3 Pending 0 18m
One scenario that can lead to this situation is when the nodes have been scaled down and then scaled up. The nodes then come up with a different hostName and node name, i.e., they are new nodes and not the same nodes that existed earlier. As a result, the disks that were attached to the older nodes are now attached to the newer nodes.
Troubleshooting
To bring the cStor pool back online, carry out the following steps:
-
Update validatingwebhookconfiguration resource's failurePolicy:
Update the `validatingwebhookconfiguration` resource's failure policy to `Ignore`. It would previously have been set to `Fail`. This tells the kube-apiserver to ignore the error if the cStor admission server is not reachable. To edit, execute:
$ kubectl edit validatingwebhookconfiguration openebs-cstor-validation-webhook
Sample Output with updated `failurePolicy`:
kind: ValidatingWebhookConfiguration
metadata:
  name: openebs-cstor-validation-webhook
  ...
  ...
webhooks:
- admissionReviewVersions:
  - v1beta1
  failurePolicy: Ignore
  name: admission-webhook.cstor.openebs.io
  ...
  ...
-
Scale down the admission server:
The OpenEBS admission server needs to be scaled down so that the validations performed by the cStor admission server are skipped when the CSPC spec is updated with the new node details.
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=0
Sample Output:
deployment.extensions/openebs-cstor-admission-server scaled
-
Update the CSPC spec nodeSelector:
The `CStorPoolCluster` needs to be updated with the new `nodeSelector` values, so that the updated CSPC points to the new nodes instead of the old ones. Update `kubernetes.io/hostname` with the new values.
Sample Output:
apiVersion: cstor.openebs.io/v1
kind: CStorPoolCluster
metadata:
  name: cstor-cspc
  namespace: openebs
spec:
  pools:
    - nodeSelector:
        kubernetes.io/hostname: "ip-192-168-25-235"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "blockdevice-798dbaf214f355ada15d097d87da248c"
      poolConfig:
        dataRaidGroupType: "stripe"
    - nodeSelector:
        kubernetes.io/hostname: "ip-192-168-33-15"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "blockdevice-4505d9d5f045b05995a5654b5493f8e0"
      poolConfig:
        dataRaidGroupType: "stripe"
    - nodeSelector:
        kubernetes.io/hostname: "ip-192-168-75-156"
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: "blockdevice-c783e51a80bc51065402e5473c52d185"
      poolConfig:
        dataRaidGroupType: "stripe"
To apply the above configuration, execute:
$ kubectl apply -f cspc.yaml
-
Update nodeSelectors, labels and NodeName:
Next, the CSPI needs to be updated with the correct node details. Find the node to which the previously used blockdevice is now attached, then update the `hostName`, the `nodeSelector` values, and the `kubernetes.io/hostname` value in the labels of the CSPI with the new details. To update, execute:
kubectl edit cspi <cspi_name> -n openebs
NOTE: The same process needs to be repeated for every other CSPI that is in the pending state and belongs to the updated CSPC.
-
Verification:
On successful completion of the above steps, the updated CSPI generates a `Pool Imported` event, which confirms that the pool has been imported successfully. To check, execute:
kubectl describe cspi cstor-cspc-xs4b -n openebs
Sample Output:
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pool Imported 2m48s CStorPoolInstance Pool Import successful: cstor-07c4bfd1-aa1a-4346-8c38-f81d33070ab7
-
Scale-up the cStor admission server and update validatingwebhookconfiguration:
This brings the cStor admission server back to the running state. The admission server is also required to validate future modifications to the CSPC API.
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=1
Sample Output:
deployment.extensions/openebs-cstor-admission-server scaled
Now, update the `failurePolicy` back to `Fail` under validatingwebhookconfiguration. To edit, execute:
$ kubectl edit validatingwebhookconfiguration openebs-cstor-validation-webhook
Sample Output:
validatingwebhookconfiguration.admissionregistration.k8s.io/openebs-cstor-validation-webhook edited
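The two `failurePolicy` changes in the steps above (to `Ignore` before the repair, back to `Fail` afterwards) could also be scripted instead of edited interactively. The following is a minimal sketch with hypothetical helper names. It assumes the cStor webhook is the first entry (index 0) of `.webhooks`, as in the sample output, and it only prints the `kubectl patch` command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: JSON patch that replaces failurePolicy on the
# first webhook entry.
# Assumption: the cStor webhook sits at index 0 of .webhooks.
failure_policy_patch() {
  printf '[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"%s"}]' "$1"
}

# Hypothetical helper: print the kubectl command for a given policy
# ("Ignore" or "Fail").
webhook_policy_cmd() {
  printf "kubectl patch validatingwebhookconfiguration openebs-cstor-validation-webhook --type=json -p '%s'\n" \
    "$(failure_policy_patch "$1")"
}

# Example:
#   webhook_policy_cmd Ignore   # before scaling down the admission server
#   webhook_policy_cmd Fail     # after scaling it back up
```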
Pool Operation Hung Due to Bad Disk
cStor scans all the devices on the node when it tries to import a pool after a pool manager pod restart. An import is always attempted before a pool is created. On pool creation, all of the devices are scanned and, as there is no existing pool, a new pool is created. Once the pool is created, the participating devices are cached for faster import of the pool (in case of a pool manager pod restart). If the import uses the cache, this issue is not hit, but an import can also happen without the cache (when the pool is being created for the first time).
When a pool import happens without the cache file and any device on the node (even one that is not part of the cStor pool) is bad and not responding, the command issued by cStor keeps waiting and gets stuck. As a result, the pool manager pod cannot issue any further commands to reconcile the state of the cStor pools, or even perform IO for the volumes placed on that pool.
Troubleshooting
This might be encountered because of one of the following situations:
- The device that has gone bad is part of the cStor pool on the node. In this case, block device replacement needs to be done; the detailed steps can be found here.
Note: Block device replacement is not supported for the stripe raid configuration. Please visit this link for some use cases and solutions.
- The device that has gone bad is not part of the cStor pool on the node. In this case, removing the bad disk from the node and restarting the pool manager pod will fix the problem.
Volume Migration when the underlying cStor pool is lost
A cStor pool can be lost in the following scenarios:
- If the node is lost.
- If one or more disks participating in the cStor pool are lost. This applies when the pool configuration is set to stripe.
- If all the disks participating in any raid group are lost. This applies when the pool configuration is set to mirror.
- If the cStor pool configuration is raidz and more than 1 disk in any raid group is lost.
- If the cStor pool configuration is raidz2 and more than 2 disks in any raid group are lost.
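The disk-loss conditions listed above can be expressed as a small check per raid group. This is an illustrative sketch only: the function name and its arguments are hypothetical, and the node-loss case is not modelled because it does not depend on the raid type.

```shell
#!/bin/sh
# Hypothetical helper: pool_is_lost RAID_TYPE LOST_DISKS GROUP_SIZE
# LOST_DISKS and GROUP_SIZE refer to a single raid group.
# Returns 0 (true) when the listed conditions say the pool is lost,
# 1 when it is not, and 2 for an unknown raid type.
pool_is_lost() {
  raid=$1
  lost=$2
  size=$3
  case "$raid" in
    stripe) [ "$lost" -ge 1 ] ;;        # any lost disk loses the pool
    mirror) [ "$lost" -ge "$size" ] ;;  # all disks in the group lost
    raidz)  [ "$lost" -gt 1 ] ;;        # more than 1 disk lost
    raidz2) [ "$lost" -gt 2 ] ;;        # more than 2 disks lost
    *)      return 2 ;;
  esac
}

# Example:
#   pool_is_lost raidz 2 3 && echo "pool lost, migrate the replicas"
```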
This situation is often encountered in Kubernetes clusters that have the autoscale feature enabled, where nodes scale down and scale up.
If the volume replica that resided on the lost pool was configured in high availability mode, then the volume replica can be migrated to a new cStor pool.
NOTE: The CStorVolume associated with the volume replicas that have to be migrated should be in the Healthy state.
STEP 1:
Remove the cStorVolumeReplicas from the lost pool:
To remove the pool, the `CStorVolumeConfig` needs to be updated. The `poolName` for the corresponding pool needs to be removed from `replicaPoolInfo`. This ensures that the admission server accepts the scale down request.
NOTE: Ensure that the cstorvolume and target pods are in running state.
A sample CVC resource (corresponding to the volume) that has 3 pools:
...
...
policy:
  provision:
    replicaAffinity: false
  replica: {}
  replicaPoolInfo:
  - poolName: cstor-cspc-4tr5 # This pool needs to be removed
  - poolName: cstor-cspc-xnxx
  - poolName: cstor-cspc-zdvk
...
...
Now edit the CVC and remove the desired poolName.
$ kubectl edit cvc pvc-81746e7a-a29d-423b-a048-76edab0b0826 -n openebs
...
...
policy:
  provision:
    replicaAffinity: false
  replica: {}
  replicaPoolInfo:
  - poolName: cstor-cspc-xnxx
  - poolName: cstor-cspc-zdvk
...
...
In the above spec, the `cstor-cspc-4tr5` CSPI entry has been removed. This needs to be repeated for all the volumes that have cStor volume replicas on the lost pool. To get the list of volume replicas in the lost pool, execute:
$ kubectl get cvr -n openebs -l cstorpoolinstance.openebs.io/name=<CSPI_name>
NAME USED ALLOCATED STATUS AGE
pvc-81746e7a-a29d-423b-a048-76edab0b0826-cstor-cspc-bf9h 6K 6K Healthy 4m7s
STEP 2:
Remove the finalizer from cStor volume replicas
The CVRs need to be deleted from etcd. This requires the `cstorvolumereplica.openebs.io/finalizer` finalizer to be removed from the CVRs that were present on the lost cStor pool.
Usually the finalizer is removed by the pool-manager pod, but since that pod is not running in this case, manual intervention is required.
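One way to remove the finalizer without opening an editor is a merge patch. This is a sketch under a loud assumption: setting `metadata.finalizers` to `null` clears every finalizer on the object, so it is only equivalent to the step above if `cstorvolumereplica.openebs.io/finalizer` is the only finalizer present. The helper name is hypothetical, and it only prints the command for review.

```shell
#!/bin/sh
# Hypothetical helper: print a kubectl command that clears
# metadata.finalizers with a merge patch.
# CAUTION: this removes ALL finalizers on the object, not just one.
# $1 = resource kind (e.g. cvr), $2 = resource name, $3 = namespace
clear_finalizers_cmd() {
  printf "kubectl patch %s %s -n %s --type=merge -p '%s'\n" \
    "$1" "$2" "$3" '{"metadata":{"finalizers":null}}'
}

# Example:
#   clear_finalizers_cmd cvr <cvr_name> openebs
```

Inspecting the object first with `kubectl get cvr <cvr_name> -n openebs -o yaml` shows which finalizers are actually set.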
To get the list of CVRs, execute:
$ kubectl get cvr -n openebs
Sample Output:
NAME USED ALLOCATED STATUS AGE
pvc-81746e7a-a29d-423b-a048-76edab0b0826-cstor-cspc-xnxx 6K 6K Healthy 52m
pvc-81746e7a-a29d-423b-a048-76edab0b0826-cstor-cspc-zdvk 6K 6K Healthy 52m
After this step, CStorVolume will scale down. To verify, execute:
$ kubectl describe cvc <pv_name> -n openebs
Sample Output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingVolumeReplicas 6m10s cstorvolumeclaim-controller successfully scaled volume replicas to 2
STEP 3:
Remove the pool spec that belongs to the lost node from the CSPC
Next, the corresponding CSPC needs to be edited, and the pool spec that belongs to the node that no longer exists needs to be removed. To edit the CSPC, execute:
kubectl edit cspc <cspc_name> -n openebs
This updates the number of desired instances.
To verify, execute:
$ kubectl get cspc -n openebs
Sample Output:
NAME HEALTHYINSTANCES PROVISIONEDINSTANCES DESIREDINSTANCES AGE
cstor-cspc 2 3 2 56m
Since the CSPI has the pool protection finalizer `openebs.io/pool-protection`, the CSPC operator was unable to delete the CSPI. For this reason, the count of provisioned instances still remains 3.
To fix this, the `openebs.io/pool-protection` finalizer must be removed from the CSPI that was present on the lost node.
To edit, execute:
kubectl edit cspi <cspi_name> -n openebs
After the finalizer is removed, the CSPI count goes to the desired number.
$ kubectl get cspc -n openebs
NAME HEALTHYINSTANCES PROVISIONEDINSTANCES DESIREDINSTANCES AGE
cstor-cspc 2 2 2 68m
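The comparison between provisioned and desired instances can be automated by parsing the table. This is a minimal sketch with a hypothetical function name. It assumes the column order shown in the sample output (PROVISIONEDINSTANCES in column 3, DESIREDINSTANCES in column 4).

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get cspc -n openebs` output on stdin.
# Exits 0 when PROVISIONEDINSTANCES equals DESIREDINSTANCES on every row,
# and 1 otherwise. The header row (NR == 1) is skipped.
cspc_settled() {
  awk 'NR > 1 && $3 != $4 { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Example:
#   kubectl get cspc -n openebs | cspc_settled && echo "instance count settled"
```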
STEP 4:
Scale the cStorVolumeReplicas back to the original number
Scale the CStorVolumeReplicas back to the desired number on a new or existing cStor pool where a replica of the same volume does not already exist.
NOTE: A CStorVolume is a collection of 1 or more volume replicas and no two replicas of a CStorVolume should reside on the same CStorPoolInstance. CStorVolume is a custom resource and a logical aggregated representation of all the underlying cStor volume replicas for this particular volume.
To get the list of CSPIs, execute:
$ kubectl get cspi -n openebs
Sample Output:
NAME HOSTNAME ALLOCATED FREE CAPACITY READONLY PROVISIONEDREPLICAS HEALTHYREPLICAS TYPE STATUS AGE
cstor-cspc-bf9h ip-192-168-49-174 230k 9630M 9630230k false 0 0 stripe ONLINE 66s
Next, add the newly created CStorPoolInstance under `replicaPoolInfo` in the CVC spec. In this example we are adding `cstor-cspc-bf9h`.
To edit, execute:
$ kubectl edit cvc pvc-81746e7a-a29d-423b-a048-76edab0b0826 -n openebs
Sample YAML:
...
...
spec:
  policy:
    provision:
      replicaAffinity: false
    replica: {}
    replicaPoolInfo:
    - poolName: cstor-cspc-bf9h
    - poolName: cstor-cspc-xnxx
    - poolName: cstor-cspc-zdvk
...
...
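The same addition could be expressed as a JSON patch instead of an interactive edit. This is a minimal sketch with hypothetical helper names. It assumes the list lives at `spec.policy.replicaPoolInfo`, as in the sample YAML, and inserts the new pool at index 0 to mirror the sample. The helper only prints the command for review.

```shell
#!/bin/sh
# Hypothetical helper: JSON patch that inserts a pool at the front of
# spec.policy.replicaPoolInfo (an "add" at an array index inserts before it).
add_replica_pool_patch() {
  printf '[{"op":"add","path":"/spec/policy/replicaPoolInfo/0","value":{"poolName":"%s"}}]' "$1"
}

# Hypothetical helper: print the kubectl command.
# $1 = CVC name, $2 = pool (CSPI) name, $3 = namespace
add_replica_pool_cmd() {
  printf "kubectl patch cvc %s -n %s --type=json -p '%s'\n" \
    "$1" "$3" "$(add_replica_pool_patch "$2")"
}

# Example:
#   add_replica_pool_cmd pvc-81746e7a-a29d-423b-a048-76edab0b0826 cstor-cspc-bf9h openebs
```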
The same needs to be repeated for all the scaled down cStor volumes. Next, verify the status of the newly provisioned CStorVolumeReplicas (CVRs).
To get the list of CVRs, execute:
$ kubectl get cvr -n openebs
Sample Output:
NAME USED ALLOCATED STATUS AGE
pvc-81746e7a-a29d-423b-a048-76edab0b0826-cstor-cspc-bf9h 6K 6K Healthy 11m
pvc-81746e7a-a29d-423b-a048-76edab0b0826-cstor-cspc-xnxx 6K 6K Healthy 96m
pvc-81746e7a-a29d-423b-a048-76edab0b0826-cstor-cspc-zdvk 6K 6K Healthy 96m
To get the list of cspi, execute:
$ kubectl get cspi -n openebs
Sample Output:
NAME HOSTNAME ALLOCATED FREE CAPACITY READONLY PROVISIONEDREPLICAS HEALTHYREPLICAS TYPE STATUS AGE
cstor-cspc-bf9h ip-192-168-49-174 230k 9630M 9630230k false 1 1 stripe ONLINE 66s
cstor-cspc-xnxx ip-192-168-79-76 101k 9630M 9630101k false 1 1 stripe ONLINE 4m25s
cstor-cspc-zdvk ip-192-168-29-217 98k 9630M 9630098k false 1 1 stripe ONLINE 4m25s