Description
This is not purely a LeoFS problem, as an external factor was involved. It may not need to be solved directly, but I wanted to report it in case there actually is some error (or error handling) in the code that can be improved.
We were updating and rebooting a cluster of storage nodes (firmware and kernel updates against recent vulnerabilities). We have N=3, W=2, so at most a single node was touched at a time. The procedure was as follows: the new kernel was installed, the node was "suspended", then the firmware update was applied via IPMI without system interruption (current Supermicro systems support this). Once it was done, the system was rebooted, coming back up with the new firmware and new kernel, and after the storage node had started it was "resumed" from the manager node.
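For reference, a minimal sketch of the per-node steps as leofs-adm commands (the node name below is a placeholder and the exact invocation depends on the setup):

# run against the manager; "storage_2@stor02" is a placeholder node name
leofs-adm suspend storage_2@stor02
# ... install the new kernel, flash the firmware via IPMI, reboot the node ...
leofs-adm status          # wait until the node comes back as "restarted"
leofs-adm resume storage_2@stor02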
During the firmware update itself the system was working fine, aside from possible timer issues; the PTP daemons were complaining in the system logs:
Jan 10 18:56:53 stor02 ptp4l: [7085841.907] clockcheck: clock jumped backward or running slower than expected!
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] reconfiguring after port state change
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] selecting CLOCK_REALTIME for synchronization
Jan 10 18:56:53 stor02 phc2sys: [7085842.007] selecting eth1 as the master clock
Jan 10 18:56:53 stor02 ptp4l: [7085842.406] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] reconfiguring after port state change
Jan 10 18:56:54 stor02 phc2sys: [7085843.007] master clock not ready, waiting...
Jan 10 18:56:54 stor02 ptp4l: [7085843.405] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Jan 10 18:56:55 stor02 ptp4l: [7085843.912] clockcheck: clock jumped forward or running faster than expected!
Jan 10 18:56:55 stor02 phc2sys: [7085844.007] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] reconfiguring after port state change
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] selecting CLOCK_REALTIME for synchronization
Jan 10 18:56:55 stor02 phc2sys: [7085844.008] selecting eth1 as the master clock
Jan 10 18:56:55 stor02 ptp4l: [7085844.404] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] port ac1f6b.fffe.212073-1 changed state
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] reconfiguring after port state change
Jan 10 18:56:56 stor02 phc2sys: [7085845.008] master clock not ready, waiting...
There were seemingly no other problems; beam.smp of the storage node ("suspended" at that time from the manager, so under no load, but obviously still running as a process) was rotating logs at the hour change and so on. It also shut down properly after the firmware update was completed (during the system reboot). I mention this to make clear that an online firmware update through the IPMI web interface usually (at least for other systems) does not cause any service interruption, and the storage nodes were not affected either, as long as there was no external load. We updated nodes stor01-05 this way.
By mistake, on the last node (stor06) the firmware update was started without "suspending" the node first. The firmware update (and the timer errors) started at around 20:19:
Jan 10 20:19:35 stor06 ptp4l: [7090787.328] clockcheck: clock jumped backward or running slower than expected!
Jan 10 20:19:36 stor06 ptp4l: [7090787.987] clockcheck: clock jumped forward or running faster than expected!
Jan 10 20:19:36 stor06 ptp4l: [7090788.087] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Jan 10 20:19:36 stor06 phc2sys: [7090788.386] port 0cc47a.fffe.e1d37b-1 changed state
Jan 10 20:19:36 stor06 phc2sys: [7090788.387] reconfiguring after port state change
Jan 10 20:19:36 stor06 phc2sys: [7090788.387] master clock not ready, waiting...
Jan 10 20:19:37 stor06 ptp4l: [7090789.086] clockcheck: clock jumped backward or running slower than expected!
Jan 10 20:19:38 stor06 ptp4l: [7090790.085] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Jan 10 20:19:38 stor06 phc2sys: [7090790.387] port 0cc47a.fffe.e1d37b-1 changed state
Four minutes later, at 20:23, the node started to produce very strange errors. In stdout/stderr (as captured by journalctl):
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs4/bodies/log/leo_object_storage_30023.20180110.20.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * closing log file is /mnt/avs4/bodies/log/leo_object_storage_30018.20180110.19.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs4/bodies/log/leo_object_storage_30018.20180110.20.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * closing log file is /mnt/avs2/bodies/log/leo_object_storage_10007.20180110.19.1
янв 10 20:00:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs2/bodies/log/leo_object_storage_10007.20180110.20.1
янв 10 20:23:45 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_10019, path:"/mnt/avs2/bodies/log/", filename:"leo_object_storage_10019"
янв 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_14, path:"/mnt/avs1/bodies/log/", filename:"leo_object_storage_14"
янв 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_10057, path:"/mnt/avs2/bodies/log/", filename:"leo_object_storage_10057"
янв 10 20:23:48 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_48, path:"/mnt/avs1/bodies/log/", filename:"leo_object_storage_48"
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_sup:131 - leo_metadata_63 {error,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {{db_open,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs1/bodies/metadata/63/LOCK: already held by process"},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {child,undefined,leo_metadata_63,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {leo_backend_db_server,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: start_link,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_metadata_63,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_eleveldb,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "/mnt/avs1/bodies/metadata/63",
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: false]},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: permanent,2000,worker,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_backend_db_server]}}}
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_sup:131 - leo_metadata_10063 {error,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {{db_open,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "IO error: lock /mnt/avs2/bodies/metadata/10063/LOCK: already held by process"},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {child,undefined,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_metadata_10063,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: {leo_backend_db_server,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: start_link,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_metadata_10063,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: leo_backend_db_eleveldb,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: "/mnt/avs2/bodies/metadata/10063",
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: false]},
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: permanent,2000,worker,
янв 10 20:23:49 stor06.selectel.cloud.lan leo_storage[27782]: [leo_backend_db_server]}}}
янв 10 20:23:59 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_20063, path:"/mnt/avs3/bodies/log/", filename:"leo_object_storage_20063"
янв 10 20:23:59 stor06.selectel.cloud.lan leo_storage[27782]: * opening log file is /mnt/avs3/bodies/log/leo_object_storage_20063.20180110.20.2
янв 10 20:24:07 stor06.selectel.cloud.lan leo_storage[27782]: id:leo_diagnosis_log_30063, path:"/mnt/avs4/bodies/log/", filename:"leo_object_storage_30063"
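(As a side note, in case it helps with diagnosing the lock errors above: a hypothetical way to check which process still holds a LevelDB LOCK file, using the path from the error message, would be

sudo lsof /mnt/avs1/bodies/metadata/63/LOCK

since LevelDB keeps the LOCK file open for as long as the database is open.)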
crash.log was created at that time as well, plus lots of errors in error.log and info.log; there are far too many to quote here, so I am attaching the files.
Two interesting notes. First, there were lots of "slow operation" errors like:
[I] bodies@stor06.selectel.cloud.lan 2018-01-10 20:23:48.291096 +0300 1515605028 null:null 0 {module,"leo_object_storage_event"},{function,"handle_event/2"},{line,54},{body,[{cause,"slow operation"},{method,head},{key,<<"body/8e/cc/80/8ecc80a03c2eb51fb5c90a07835dcfb321124310dbb6a27a14e4e6abc3db336d68a01a8ad7029b606fb3aa5ccbc9f13be756010000000000.xz">>},{processing_time,53244}]}
[I] bodies@stor06.selectel.cloud.lan 2018-01-10 20:23:48.292162 +0300 1515605028 null:null 0 {module,"leo_object_storage_event"},{function,"handle_event/2"},{line,54},{body,[{cause,"slow operation"},{method,put},{key,<<"body/66/66/a0/6666a03a49997a3938c432fb46291fa76cd88518fa4912fc93b888c775b1f7c4634804c8fef8d65e7c2f38dc0c3f6a3b00e2060000000000.xz">>},{processing_time,52191}]}
Second, I was unable to execute the "mq-stats stor06" command from the manager node. "status stor06" worked fine, and "version all" worked fine as well, but "mq-stats" produced no result (it exited immediately, without output or error). Also, on the other nodes, messages started to appear in the replication queue. At this point (after observing the situation long enough) the node was suspended, then rebooted.
"status" output for stor06:
--------------------------------------+--------------------------------------
Item | Value
--------------------------------------+--------------------------------------
Config-1: basic
--------------------------------------+--------------------------------------
version | 1.3.8
number of vnodes | 168
object containers | - path:[/mnt/avs1/bodies], # of containers:64
| - path:[/mnt/avs2/bodies], # of containers:64
| - path:[/mnt/avs3/bodies], # of containers:64
| - path:[/mnt/avs4/bodies], # of containers:64
log directory | /var/log/leofs/leo_storage/erlang
log level | info
--------------------------------------+--------------------------------------
Config-2: watchdog
--------------------------------------+--------------------------------------
[rex(rpc-proc)] |
check interval(s) | 10
threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
[cpu] |
enabled/disabled | disabled
check interval(s) | 10
threshold cpu load avg | 5.0
threshold cpu util(%) | 100
--------------------------------------+--------------------------------------
[disk] |
enabled/disalbed | disabled
check interval(s) | 10
threshold disk use(%) | 85
threshold disk util(%) | 90
threshold rkb(kb) | 98304
threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
Config-3: message-queue
--------------------------------------+--------------------------------------
number of procs/mq | 4
number of batch-procs of msgs | max:3000, regular:1600
interval between batch-procs (ms) | max:3000, regular:500
--------------------------------------+--------------------------------------
Config-4: autonomic operation
--------------------------------------+--------------------------------------
[auto-compaction] |
enabled/disabled | disabled
warning active size ratio (%) | 70
threshold active size ratio (%) | 60
number of parallel procs | 1
exec interval | 3600
--------------------------------------+--------------------------------------
Config-5: data-compaction
--------------------------------------+--------------------------------------
limit of number of compaction procs | 4
number of batch-procs of objs | max:1500, regular:1000
interval between batch-procs (ms) | max:3000, regular:500
--------------------------------------+--------------------------------------
Status-1: RING hash
--------------------------------------+--------------------------------------
current ring hash | 4bb5757b
previous ring hash | 4bb5757b
--------------------------------------+--------------------------------------
Status-2: Erlang VM
--------------------------------------+--------------------------------------
vm version | 8.3.5.3
total mem usage | 89009224
system mem usage | 62449984
procs mem usage | 26667536
ets mem usage | 6122632
procs | 1023/1048576
kernel_poll | true
thread_pool_size | 128
--------------------------------------+--------------------------------------
Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
replication messages | 206
vnode-sync messages | 0
rebalance messages | 0
--------------------------------------+--------------------------------------
Another (maybe unrelated) note is that the startup of stor06 after that took a very long time (> 40 min to start the node). The symptoms were the usual ones (highly fragmented leveldb .log files, startup disk-IO bound), but it was even slower than before; maybe it didn't shut down correctly after all and was trying to start up without state files or something. The important point here is that it took up to 20 seconds per 3-4 MB log file (with the backend_db.eleveldb.write_buf_size = 5242880 setting), so in this configuration the 30-minute startup timeout wasn't enough and systemd kept killing the node until the startup timeout was increased (see the sketch after the log excerpt below). Since 30 minutes is the default startup timeout in the current repo, maybe it's best to increase it for everybody if there is no way to make the node actually start up faster.
янв 10 21:44:09 stor06.selectel.cloud.lan leo_storage[17587]: * opening log file is /mnt/avs3/bodies/log/leo_object_storage_20031.20180110.21.2
янв 10 21:44:29 stor06.selectel.cloud.lan leo_storage[17587]: id:leo_diagnosis_log_30031, path:"/mnt/avs4/bodies/log/", filename:"leo_object_storage_30031"
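For reference, this is roughly how the startup timeout was raised, as a systemd drop-in (a minimal sketch; the unit name leofs-storage.service and the 3600-second value are assumptions, adjust to your packaging):

sudo mkdir -p /etc/systemd/system/leofs-storage.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/leofs-storage.service.d/timeout.conf
[Service]
TimeoutStartSec=3600
EOF
sudo systemctl daemon-reload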
Full logs when this problem happened: stor06-error-logs.tar.gz