Description
This is not purely a LeoFS problem, as an external factor was involved. It may not need to be solved directly, but I wanted to report it in case there actually is some error (or error handling) in the code that can be improved.
We were updating and rebooting a cluster of storage nodes (firmware and kernel updates against recent vulnerabilities). We have N=3, W=2, so at most a single node was touched at a time. The procedure was as follows: the new kernel was installed, the node was "suspended", then the firmware update was applied via IPMI without system interruption (current Supermicro systems support this). Once it was done, the system was rebooted, coming back up with the new firmware and new kernel, and after the storage node had started it was "resumed" from the manager node.
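For reference, a minimal sketch of the per-node steps as leofs-adm commands (the node name below is a placeholder and the exact invocation depends on the setup):

# run against the manager; "storage_2@stor02" is a placeholder node name
leofs-adm suspend storage_2@stor02
# ... install the new kernel, flash the firmware via IPMI, reboot the node ...
leofs-adm status          # wait until the node comes back as "restarted"
leofs-adm resume storage_2@stor02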
During the firmware update itself the system was working fine, aside from possible timer issues; the PTP daemons were complaining in the system logs:
Jan 10 18:56:53 stor02 ptp4l: [7085841.907] clockcheck: clock jumped backward or running slower than expected!
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] reconfiguring after port state change
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] selecting CLOCK_REALTIME for synchronization
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] selecting eth1 as the master clock
Jan 10 18:56:53 stor02 ptp4l: [7085842.406] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] reconfiguring after port state change
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] master clock not ready, waiting...
Jan 10 18:56:54 stor02 ptp4l: [7085843.405] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Jan 10 18:56:55 stor02 ptp4l: [7085843.912] clockcheck: clock jumped forward or running faster than expected!
Jan 10 18:56:55 stor02 phc2sys: [7085844.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] reconfiguring after port state change
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] selecting CLOCK_REALTIME for synchronization
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] selecting eth1 as the master clock
Jan 10 18:56:55 stor02 ptp4l: [7085844.404] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] reconfiguring after port state change
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] master clock not ready, waiting...
There were seemingly no other problems; beam.smp of the storage node ("suspended" at that time from the manager, so under no load, but obviously still running as a process) was rotating logs at the hour change and so on. It also shut down properly after the firmware update was completed (during the system reboot). I mention this to make clear that an online firmware update through the IPMI web interface usually (at least for other systems) does not cause any service interruption, and the storage nodes were not affected either, as long as there was no external load. We updated nodes stor01-05 this way.
By mistake, on the last node (stor06) the firmware update was started without "suspending" the node first. The firmware update (and the timer errors) started at around 20:19:
Jan 10 20:19:35 stor06 ptp4l: [7090787.328] clockcheck: clock jumped backward or running slower than expected!
Jan 10 20:19:36 stor06 ptp4l: [7090787.987] clockcheck: clock jumped forward or running faster than expected!
Jan 10 20:19:36 stor06 ptp4l: [7090788.087] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 20:19:36 stor06 phc2sys: [7090788.386] port 0cc47a.fffe.e1d37b-1 changed state
Jan 10 20:19:36 stor06 phc2sys: [7090788.387] reconfiguring after port state change
Jan 10 20:19:36 stor06 phc2sys: [7090788.387] master clock not ready, waiting...
Jan 10 20:19:37 stor06 ptp4l: [7090789.086] clockcheck: clock jumped backward or running slower than expected!
Jan 10 20:19:38 stor06 ptp4l: [7090790.085] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Jan 10 20:19:38 stor06 phc2sys: [7090790.387] port 0cc47a.fffe.e1d37b-1 changed state
Four minutes later, at 20:23, the node started to produce very strange errors. In stdout/stderr (as captured by journalctl):
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs4/bodies/log/leo_object_storage_30023.20180110.20.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * closing log file is /mnt/avs4/bodies/log/leo_object_storage_30018.20180110.19.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs4/bodies/log/leo_object_storage_30018.20180110.20.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * closing log file is /mnt/avs2/bodies/log/leo_object_storage_10007.20180110.19.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs2/bodies/log/leo_object_storage_10007.20180110.20.1
янв 10 20:23:45 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_10019, path:"/mnt/avs2/bodies/log/", filename:"leo_object_storage_10019"
янв 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_14, path:"/mnt/avs1/bodies/log/", filename:"leo_object_storage_14"
янв 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_10057, path:"/mnt/avs2/bodies/log/", filename:"leo_object_storage_10057"
янв 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_48, path:"/mnt/avs1/bodies/log/", filename:"leo_object_storage_48"
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_sup:131 - leo_metadata_63 {error,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {{db_open,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs1/bodies/metadata/63/LOCK: already held by process"},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {child,undefined,leo_metadata_63,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {leo_backend_db_server,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: start_link,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_metadata_63,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_eleveldb,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "/mnt/avs1/bodies/metadata/63",
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: false]},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: permanent,2000,worker,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_backend_db_server]}}}
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_sup:131 - leo_metadata_10063 {error,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {{db_open,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs2/bodies/metadata/10063/LOCK: already held by process"},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {child,undefined,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_metadata_10063,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {leo_backend_db_server,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: start_link,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_metadata_10063,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_eleveldb,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "/mnt/avs2/bodies/metadata/10063",
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: false]},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: permanent,2000,worker,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_backend_db_server]}}}
янв 10 20:23:59 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_20063, path:"/mnt/avs3/bodies/log/", filename:"leo_object_storage_20063"
янв 10 20:23:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs3/bodies/log/leo_object_storage_20063.20180110.20.2
янв 10 20:24:07 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_30063, path:"/mnt/avs4/bodies/log/", filename:"leo_object_storage_30063"
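(As a side note, in case it helps with diagnosing the lock errors above: a hypothetical way to check which process still holds a LevelDB LOCK file, using the path from the error message, would be

sudo lsof /mnt/avs1/bodies/metadata/63/LOCK

since LevelDB keeps the LOCK file open for as long as the database is open.)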
crash.log was created at that time as well, plus lots of errors in error.log and info.log; there are far too many to quote here, so I am attaching the files.
Two interesting notes. First, there were lots of "slow operation" errors like:
[I] bodies@stor06.selectel.cloud.lan 2018-01-10 20:23:48.291096 +0300 1515605028 null:null 0 {module,"leo_object_storage_event"},{function,"handle_event/2"},{line,54},{body,[{cause,"slow operation"},{method,head},{key,<<"body/8e/cc/80/8ecc80a03c2eb51fb5c90a07835dcfb321124310dbb6a27a14e4e6abc3db336d68a01a8ad7029b606fb3aa5ccbc9f13be756010000000000.xz">>},{processing_time,53244}]}
[I] bodies@stor06.selectel.cloud.lan 2018-01-10 20:23:48.292162 +0300 1515605028 null:null 0 {module,"leo_object_storage_event"},{function,"handle_event/2"},{line,54},{body,[{cause,"slow operation"},{method,put},{key,<<"body/66/66/a0/6666a03a49997a3938c432fb46291fa76cd88518fa4912fc93b888c775b1f7c4634804c8fef8d65e7c2f38dc0c3f6a3b00e2060000000000.xz">>},{processing_time,52191}]}
Second, I was unable to execute the "mq-stats stor06" command from the manager node. "status stor06" worked fine, and "version all" worked fine as well, but "mq-stats" produced no result (it exited immediately, without output or error). Also, on the other nodes, messages started to appear in the replication queue. At this point (after observing the situation long enough) the node was suspended, then rebooted.
"status" output for stor06:
--------------------------------------+--------------------------------------
Item | Value
--------------------------------------+--------------------------------------
Config-1: basic
--------------------------------------+--------------------------------------
version | 1.3.8
number of vnodes | 168
object containers | - path:[/mnt/avs1/bodies], # of containers:64
| - path:[/mnt/avs2/bodies], # of containers:64
| - path:[/mnt/avs3/bodies], # of containers:64
| - path:[/mnt/avs4/bodies], # of containers:64
log directory | /var/log/leofs/leo_storage/erlang
log level | info
--------------------------------------+--------------------------------------
Config-2: watchdog
--------------------------------------+--------------------------------------
[rex(rpc-proc)] |
check interval(s) | 10
threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
[cpu] |
enabled/disabled | disabled
check interval(s) | 10
threshold cpu load avg | 5.0
threshold cpu util(%) | 100
--------------------------------------+--------------------------------------
[disk] |
enabled/disalbed | disabled
check interval(s) | 10
threshold disk use(%) | 85
threshold disk util(%) | 90
threshold rkb(kb) | 98304
threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
Config-3: message-queue
--------------------------------------+--------------------------------------
number of procs/mq | 4
number of batch-procs of msgs | max:3000, regular:1600
interval between batch-procs (ms) | max:3000, regular:500
--------------------------------------+--------------------------------------
Config-4: autonomic operation
--------------------------------------+--------------------------------------
[auto-compaction] |
enabled/disabled | disabled
warning active size ratio (%) | 70
threshold active size ratio (%) | 60
number of parallel procs | 1
exec interval | 3600
--------------------------------------+--------------------------------------
Config-5: data-compaction
--------------------------------------+--------------------------------------
limit of number of compaction procs | 4
number of batch-procs of objs | max:1500, regular:1000
interval between batch-procs (ms) | max:3000, regular:500
--------------------------------------+--------------------------------------
Status-1: RING hash
--------------------------------------+--------------------------------------
current ring hash | 4bb5757b
previous ring hash | 4bb5757b
--------------------------------------+--------------------------------------
Status-2: Erlang VM
--------------------------------------+--------------------------------------
vm version | 8.3.5.3
total mem usage | 89009224
system mem usage | 62449984
procs mem usage | 26667536
ets mem usage | 6122632
procs | 1023/1048576
kernel_poll | true
thread_pool_size | 128
--------------------------------------+--------------------------------------
Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
replication messages | 206
vnode-sync messages | 0
rebalance messages | 0
--------------------------------------+--------------------------------------
Another (maybe unrelated) note is that the startup of stor06 after that took a very long time (> 40 min to start the node). The symptoms were the usual ones (highly fragmented leveldb .log files, startup disk-IO bound), but it was even slower than before; maybe it didn't shut down correctly after all and was trying to start up without state files or something. The important point here is that it took up to 20 seconds per 3-4 MB log file (with the backend_db.eleveldb.write_buf_size = 5242880 setting), so in this configuration the 30-minute startup timeout wasn't enough and systemd kept killing the node until the startup timeout was increased (see the sketch after the log excerpt below). Since 30 minutes is the default startup timeout in the current repo, maybe it's best to increase it for everybody if there is no way to make the node actually start up faster.
янв 10 21:44:09 stor06.selectel.cloud.lan leo_storage[17587]: * opening log file is /mnt/avs3/bodies/log/leo_object_storage_20031.20180110.21.2
янв 10 21:44:29 stor06.selectel.cloud.lan leo_storage[17587]: id:leo_diagnosis_log_30031, path:"/mnt/avs4/bodies/log/", filename:"leo_object_storage_30031"
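For reference, this is roughly how the startup timeout was raised, as a systemd drop-in (a minimal sketch; the unit name leofs-storage.service and the 3600-second value are assumptions, adjust to your packaging):

sudo mkdir -p /etc/systemd/system/leofs-storage.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/leofs-storage.service.d/timeout.conf
[Service]
TimeoutStartSec=3600
EOF
sudo systemctl daemon-reload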
Full logs when this problem happened: stor06-error-logs.tar.gz