Description
Situation: in cluster of 6 nodes running 1.4.2 a new node (node02) is getting recovered in takeover operation (node09 -> node02). This appears in log of the node02:
[E] node02@10.3.15.112 2018-07-30 16:32:20.639683 +0300 1532957540leo_storage_handler_object:get/4 191 [{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,3952335},{etag,13886495139313085388585308446383676488},{cause,"Recover failure"}]
To check what's going on I execute whereis and it is very slow to respond, after some time it produces the output:
[root@leofs-mgr01 ~]# leofs-adm whereis production/parts/used/23/03/80/36038005/1.jpg
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
del? | node | ring address | size | checksum | has children | total chunks | clock | when
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
| node08@10.3.15.108 | 888f50b0ec29fca904f07d77d0823543 | 119K | a72711203d | false | 0 | 5702b35286880 | 2018-07-04 15:02:41 +0300
| node05@10.3.15.105 | | | | | | |
| node02@10.3.15.112 | | | | | | |
I execute whereis again and get this output:
[root@leofs-mgr01 ~]# leofs-adm whereis production/parts/used/23/03/80/36038005/1.jpg
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
del? | node | ring address | size | checksum | has children | total chunks | clock | when
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
| node08@10.3.15.108 | 888f50b0ec29fca904f07d77d0823543 | 119K | a72711203d | false | 0 | 5702b35286880 | 2018-07-04 15:02:41 +0300
| node05@10.3.15.105 | 888f50b0ec29fca904f07d77d0823543 | 119K | a72711203d | false | 0 | 5702b35286880 | 2018-07-04 15:02:41 +0300
| node02@10.3.15.112 | | | | | | |
There are two problems. First (minor) is that object didn't get recovered on node02 yet - however I suspect that it will be recovered at some point in the future (maybe it's in some queue on node02 right now which will be processed later, or maybe it will be pushed to this node by node05 or node08 in the future, I have no idea). The queues on all nodes (except for node02) are filled with millions of messages as part of recovery process at the moment so maybe read-repair just didn't have time to work yet.
Second, main problem is that first output of whereis was lying. I'm pretty sure that the object was present on node05 at that moment, it's just that request from manager to check its state failed. The output of "whereis" should clearly differentiate between missing object and inability to contact node (like if it's not running at all), or timed out request and produce different output for these two cases.
Just in case, logs related to this object (the timeout I had during first "whereis" operation is probably there somewhere as well). Error logs from node02 (nothing in info logs):
[W] node02@10.3.15.112 2018-07-30 16:31:50.638010 +0300 1532957510leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node02@10.3.15.112 2018-07-30 16:31:56.281774 +0300 1532957516leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node02@10.3.15.112 2018-07-30 16:32:20.638857 +0300 1532957540leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E] node02@10.3.15.112 2018-07-30 16:32:20.639683 +0300 1532957540leo_storage_handler_object:get/4 191 [{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,3952335},{etag,13886495139313085388585308446383676488},{cause,"Recover failure"}]
[W] node02@10.3.15.112 2018-07-30 16:32:26.90911 +0300 1532957546 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node02@10.3.15.112 2018-07-30 16:32:26.282894 +0300 1532957546leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E] node02@10.3.15.112 2018-07-30 16:32:26.283730 +0300 1532957546leo_storage_handler_object:get/4 191 [{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,57890444},{etag,13886495139313085388585308446383676488},{cause,"Recover failure"}]
[W] node02@10.3.15.112 2018-07-30 16:32:56.91867 +0300 1532957576 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E] node02@10.3.15.112 2018-07-30 16:32:56.92533 +0300 1532957576 leo_storage_handler_object:get/3 159 [{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,35201166},{cause,"Recover failure"}]
[W] node02@10.3.15.112 2018-07-30 16:33:10.514741 +0300 1532957590leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node02@10.3.15.112 2018-07-30 16:33:40.515824 +0300 1532957620leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E] node02@10.3.15.112 2018-07-30 16:33:40.516669 +0300 1532957620leo_storage_handler_object:get/3 159 [{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,70841817},{cause,"Recover failure"}]
Error logs from node05:
[W] node05@10.3.15.105 2018-07-30 16:31:20.790656 +0300 1532957480leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node05@10.3.15.105 2018-07-30 16:31:26.435556 +0300 1532957486leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node05@10.3.15.105 2018-07-30 16:31:56.243559 +0300 1532957516leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node05@10.3.15.105 2018-07-30 16:32:40.666584 +0300 1532957560leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
Info logs from node05:
[I] node05@10.3.15.105 2018-07-30 16:34:14.196621 +0300 1532957654null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,143404}]}
[I] node05@10.3.15.105 2018-07-30 16:34:29.497739 +0300 1532957669null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,153253}]}
[I] node05@10.3.15.105 2018-07-30 16:34:44.65305 +0300 1532957684 null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,167629}]}
[I] node05@10.3.15.105 2018-07-30 16:34:44.93413 +0300 1532957684 null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,137848}]}
[I] node05@10.3.15.105 2018-07-30 16:34:44.129532 +0300 1532957684null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,123462}]}
[I] node05@10.3.15.105 2018-07-30 16:34:44.143682 +0300 1532957684null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,102028}]}
[I] node05@10.3.15.105 2018-07-30 16:34:44.169612 +0300 1532957684null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,93501}]}
[I] node05@10.3.15.105 2018-07-30 16:34:44.184279 +0300 1532957684null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,91182}]}
Info logs from node08 (nothing in error logs):
[I] node08@10.3.15.108 2018-07-30 16:31:44.46812 +0300 1532957504 null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,83260}]}
[I] node08@10.3.15.108 2018-07-30 16:32:17.765815 +0300 1532957537null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,111333}]}
[I] node08@10.3.15.108 2018-07-30 16:32:31.449430 +0300 1532957551null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,100660}]}
[I] node08@10.3.15.108 2018-07-30 16:32:31.450507 +0300 1532957551null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,95209}]}
[I] node08@10.3.15.108 2018-07-30 16:32:31.451198 +0300 1532957551null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,95017}]}
[I] node08@10.3.15.108 2018-07-30 16:32:59.654923 +0300 1532957579null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,98865}]}
[I] node08@10.3.15.108 2018-07-30 16:32:59.656474 +0300 1532957579null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,93414}]}
[I] node08@10.3.15.108 2018-07-30 16:32:59.657418 +0300 1532957579null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,93222}]}
[I] node08@10.3.15.108 2018-07-30 16:32:59.661405 +0300 1532957579null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,78997}]}
[I] node08@10.3.15.108 2018-07-30 16:32:59.665188 +0300 1532957579null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,63422}]}
[I] node08@10.3.15.108 2018-07-30 16:33:12.936779 +0300 1532957592null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,62271}]}
[I] node08@10.3.15.108 2018-07-30 16:33:12.972406 +0300 1532957592null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,40859}]}
[I] node08@10.3.15.108 2018-07-30 16:33:12.997936 +0300 1532957592null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,32331}]}
[I] node08@10.3.15.108 2018-07-30 16:33:13.263 +0300 1532957593 null:null 0 {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,19107}]}
Activity