Resynchronize system clocks after Ceph reports a clock skew.
Systems use chronyd to synchronize their system clocks. If systems are not able to communicate, the clocks can drift, causing clock skew. Clock skew can also occur when a person or an automated task changes a clock directly. In that case, chronyd may need a series of steps (time adjustments) to resynchronize the clocks.
Major time jumps where the clock is set back in time will require a full restart of all Ceph services.
Clock skew can cause issues with Kubernetes operations, etcd, node responsiveness, and more.
This procedure requires admin privileges.
-
Verify that the system is impacted by clock skew.
Ceph provides block storage and requires a clock skew of less than 0.05 seconds to report back healthy.
ceph -s
Example output:
  cluster:
    id:     b6d509e6-772e-4785-a421-e4a138b1780c
    health: HEALTH_WARN
            clock skew detected on mon.ncn-m002, mon.ncn-m003

  services:
    mon: 3 daemons, quorum ncn-s001,ncn-s002,ncn-s003 (age 20h)
    mgr: ncn-s003(active, since 9d), standbys: ncn-s001, ncn-s002
    mds: cephfs:1 {0=ncn-s002=up:active} 2 up:standby
    osd: 18 osds: 18 up (since 20h), 18 in (since 9d)
    rgw: 3 daemons active (ncn-s001.rgw0, ncn-s002.rgw0, ncn-s003.rgw0)

  data:
    pools:   10 pools, 224 pgs
    objects: 19.41k objects, 59 GiB
    usage:   167 GiB used, 274 GiB / 441 GiB avail
    pgs:     224 active+clean

  io:
    client:   919 KiB/s wr, 0 op/s rd, 16 op/s wr
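To pick out only the affected monitors from that output, a small filter can help. The following is a minimal sketch, not part of the original procedure: `skewed_mons` is a hypothetical helper name, and it assumes the warning line has the form shown in the example output and that monitor IDs are short host names without dots.

```shell
# Hypothetical helper: print the monitors named in a
# "clock skew detected on ..." line, one per line.
# Reads `ceph -s` or `ceph health detail` text on stdin.
skewed_mons() {
    grep 'clock skew detected on' \
        | grep -oE 'mon\.[A-Za-z0-9_-]+' \
        | sort -u
}

# Usage, on a node with the ceph CLI and admin privileges:
#   ceph -s | skewed_mons
```

The helper prints nothing when no skew warning is present.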
IMPORTANT:
If the Ceph logs contain the message unable to obtain rotating service keys; retrying, that also indicates clock skew. If the logs have rolled over, you may have to run xzgrep skew *.xz to see the skew.
-
View the Ceph health details.
-
View the Ceph logs.
If looking back to earlier logs, use the xzgrep command on ceph.log or ceph-mon*.log. In some cases the skew appears in the MGR and OSD logs but not in the ceph-mon logs. This indicates that the skew was very drastic and sudden, causing the ceph-mon process to panic and not log the issue.
grep skew /var/log/ceph/*.log
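The current and rolled-over logs can be searched in one pass. This is a sketch under assumptions: `find_skew` is a hypothetical helper name, rotated logs are assumed to be xz-compressed files ending in .xz in the same directory, and xzgrep is assumed to be installed wherever such files exist.

```shell
# Hypothetical helper: search current and rotated Ceph logs for skew messages.
# $1: log directory (defaults to /var/log/ceph, the path used in this procedure).
find_skew() {
    dir="${1:-/var/log/ceph}"
    # Current, uncompressed logs; -H prefixes each match with its file name.
    grep -H skew "$dir"/*.log 2>/dev/null || true
    # Rotated logs compressed with xz, if any exist.
    for f in "$dir"/*.xz; do
        if [ -e "$f" ]; then
            xzgrep skew "$f" | sed "s|^|$f:|" || true
        fi
    done
    return 0
}

# Usage:
#   find_skew
#   find_skew /var/log/ceph
```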
-
View the system time.
ansible ceph_all -m shell -a date
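The date command only has one-second resolution, which is coarse next to the 0.05 second limit. As a rough complement, the "System time" line printed by chronyc tracking gives each node's offset from NTP time. The sketch below is an illustration only: `offset_ok` is a hypothetical helper name, and a node's offset from its time source is only a proxy for the monitor-to-monitor skew that Ceph actually measures.

```shell
# Hypothetical helper: read `chronyc tracking` output on stdin, print the
# absolute offset from NTP time in seconds, and return:
#   0 if the offset is below the limit in $1 (default 0.05),
#   1 if it is at or above the limit,
#   2 if no "System time" line was found.
offset_ok() {
    awk -v limit="${1:-0.05}" '
        /^System time/ { offset = $4; found = 1 }
        END {
            if (!found) exit 2
            print offset
            exit ((offset + 0 < limit + 0) ? 0 : 1)
        }'
}

# Usage, on the node being checked:
#   chronyc tracking | offset_ok && echo "offset within limit"
```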
-
-
Sync the clocks to fix the issue.
systemctl restart chronyd.service
After the restart, give chronyd time to bring the clock back into sync; the Ceph alert clears once the skew is back within the limit. If the alert does not clear, restart the Ceph mon service on the affected node.
-
Check Ceph health to verify the clock skew issue is resolved.
It may take up to 15 minutes for this warning to resolve.
ceph -s
Example output:
  cluster:
    id:     5f3b4031-d6c0-4118-94c0-bffd90b534eb
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ncn-s001,ncn-s002,ncn-s003 (age 20h)
    mgr: ncn-s003(active, since 9d), standbys: ncn-s001, ncn-s002
    mds: cephfs:1 {0=ncn-s002=up:active} 2 up:standby
    osd: 18 osds: 18 up (since 20h), 18 in (since 9d)
    rgw: 3 daemons active (ncn-s001.rgw0, ncn-s002.rgw0, ncn-s003.rgw0)

  data:
    pools:   11 pools, 240 pgs
    objects: 3.12k objects, 11 GiB
    usage:   45 GiB used, 39 GiB / 84 GiB avail
    pgs:     240 active+clean
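Because the warning can take up to 15 minutes to resolve, polling may be more convenient than rerunning the command by hand. The following is a sketch, not part of the original procedure: `wait_for_health_ok` and the `HEALTH_CMD` variable are hypothetical names, and the 30 second poll interval is an arbitrary choice.

```shell
# Hypothetical helper: poll until the health command prints HEALTH_OK
# or the time limit expires.
# $1: time limit in seconds (default 900, the 15 minutes noted in this step)
# $2: seconds between polls (default 30)
# HEALTH_CMD: command to run (default "ceph health"); override for dry runs.
wait_for_health_ok() {
    limit="${1:-900}"
    interval="${2:-30}"
    elapsed=0
    while :; do
        status="$(${HEALTH_CMD:-ceph health} 2>/dev/null || true)"
        case "$status" in
            HEALTH_OK*) echo "$status"; return 0 ;;
        esac
        if [ "$elapsed" -ge "$limit" ]; then
            break
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out after ${elapsed}s: ${status:-no output}" >&2
    return 1
}

# Usage:
#   wait_for_health_ok          # up to 15 minutes, polling every 30 seconds
#   wait_for_health_ok 300 10   # up to 5 minutes, polling every 10 seconds
```

A non-zero return means the cluster did not report HEALTH_OK within the limit; the last status seen is written to standard error.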
If the clocks are in sync and Ceph is still reporting skew, refer to Manage Ceph Services for instructions on restarting services.