"whereis" output can be wrong when the storage node is slow #1090

Closed
@vstax

Description

Situation: in a cluster of 6 nodes running 1.4.2, a new node (node02) is being recovered via a takeover operation (node09 -> node02). This appears in the log of node02:

[E] node02@10.3.15.112  2018-07-30 16:32:20.639683 +0300    1532957540leo_storage_handler_object:get/4  191 [{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,3952335},{etag,13886495139313085388585308446383676488},{cause,"Recover failure"}]

To check what's going on, I execute whereis. It is very slow to respond; after some time it produces this output:

[root@leofs-mgr01 ~]# leofs-adm whereis production/parts/used/23/03/80/36038005/1.jpg
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node           |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when  
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | node08@10.3.15.108      | 888f50b0ec29fca904f07d77d0823543     |       119K |   a72711203d | false          |              0 | 5702b35286880  | 2018-07-04 15:02:41 +0300
       | node05@10.3.15.105      |                                      |            |              |                |                |                |
       | node02@10.3.15.112      |                                      |            |              |                |                |                |

I execute whereis again and get this output:

[root@leofs-mgr01 ~]# leofs-adm whereis production/parts/used/23/03/80/36038005/1.jpg
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node           |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | node08@10.3.15.108      | 888f50b0ec29fca904f07d77d0823543     |       119K |   a72711203d | false          |              0 | 5702b35286880  | 2018-07-04 15:02:41 +0300       
       | node05@10.3.15.105      | 888f50b0ec29fca904f07d77d0823543     |       119K |   a72711203d | false          |              0 | 5702b35286880  | 2018-07-04 15:02:41 +0300
       | node02@10.3.15.112      |                                      |            |              |                |                |                |

There are two problems. The first (minor) one is that the object hasn't been recovered on node02 yet. However, I suspect it will be recovered at some point in the future (maybe it's in some queue on node02 right now that will be processed later, or maybe it will be pushed to this node by node05 or node08 later on; I have no idea). The queues on all nodes (except node02) are currently filled with millions of messages as part of the recovery process, so maybe read-repair just hasn't had time to work yet.

The second, main problem is that the first whereis output was lying. I'm pretty sure the object was present on node05 at that moment; it's just that the request from the manager to check its state failed. The output of "whereis" should clearly differentiate between a missing object and an inability to contact the node (e.g. because it isn't running at all, or because the request timed out) and produce different output for these cases.
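To illustrate the suggestion above, here is a minimal sketch (in Python, not LeoFS code; all names like ReplicaStatus and format_row are hypothetical) of how a whereis-style report could print an explicit per-node failure marker instead of an ambiguous blank row:

```python
# Hypothetical sketch of per-replica status reporting for a whereis-like
# command. A blank row is ambiguous; an explicit marker distinguishes
# "node answered: object absent" from "node did not answer at all".
from enum import Enum
from typing import Optional


class ReplicaStatus(Enum):
    FOUND = "found"
    NOT_FOUND = "not_found"      # node responded: object is absent
    TIMEOUT = "timeout"          # node did not respond in time
    UNREACHABLE = "unreachable"  # node is down / RPC failed outright


def format_row(node: str, status: ReplicaStatus, meta: Optional[dict]) -> str:
    """Render one table row; failures get an explicit marker."""
    if status is ReplicaStatus.FOUND and meta:
        return f"{node:<24} | {meta['ring_addr']} | {meta['size']}"
    return f"{node:<24} | *** {status.value} ***"


if __name__ == "__main__":
    # Example replica states resembling the first whereis output above.
    rows = [
        ("node08@10.3.15.108", ReplicaStatus.FOUND,
         {"ring_addr": "888f50b0ec29fca904f07d77d0823543", "size": "119K"}),
        ("node05@10.3.15.105", ReplicaStatus.TIMEOUT, None),
        ("node02@10.3.15.112", ReplicaStatus.NOT_FOUND, None),
    ]
    for node, status, meta in rows:
        print(format_row(node, status, meta))
```

With something like this, the first whereis run would have shown "timeout" for node05 rather than an empty row that looks identical to a missing object.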

Just in case, here are the logs related to this object (the timeout I got during the first "whereis" operation is probably in there somewhere as well). Error logs from node02 (nothing in the info logs):

[W]	node02@10.3.15.112	2018-07-30 16:31:50.638010 +0300	1532957510leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W]	node02@10.3.15.112	2018-07-30 16:31:56.281774 +0300	1532957516leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W]	node02@10.3.15.112	2018-07-30 16:32:20.638857 +0300	1532957540leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E]	node02@10.3.15.112	2018-07-30 16:32:20.639683 +0300	1532957540leo_storage_handler_object:get/4	191	[{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,3952335},{etag,13886495139313085388585308446383676488},{cause,"Recover failure"}]
[W]	node02@10.3.15.112	2018-07-30 16:32:26.90911 +0300	1532957546	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W]	node02@10.3.15.112	2018-07-30 16:32:26.282894 +0300	1532957546leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E]	node02@10.3.15.112	2018-07-30 16:32:26.283730 +0300	1532957546leo_storage_handler_object:get/4	191	[{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,57890444},{etag,13886495139313085388585308446383676488},{cause,"Recover failure"}]
[W]	node02@10.3.15.112	2018-07-30 16:32:56.91867 +0300	1532957576	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E]	node02@10.3.15.112	2018-07-30 16:32:56.92533 +0300	1532957576	leo_storage_handler_object:get/3	159	[{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,35201166},{cause,"Recover failure"}]
[W]	node02@10.3.15.112	2018-07-30 16:33:10.514741 +0300	1532957590leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W]	node02@10.3.15.112	2018-07-30 16:33:40.515824 +0300	1532957620leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[E]	node02@10.3.15.112	2018-07-30 16:33:40.516669 +0300	1532957620leo_storage_handler_object:get/3	159	[{from,gateway},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{req_id,70841817},{cause,"Recover failure"}]

Error logs from node05 (nothing in the info logs until later, see below):

[W] node05@10.3.15.105  2018-07-30 16:31:20.790656 +0300    1532957480leo_storage_handler_object:read_and_repair_3/3    1434    [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node05@10.3.15.105  2018-07-30 16:31:26.435556 +0300    1532957486leo_storage_handler_object:read_and_repair_3/3    1434    [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node05@10.3.15.105  2018-07-30 16:31:56.243559 +0300    1532957516leo_storage_handler_object:read_and_repair_3/3    1434    [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]
[W] node05@10.3.15.105  2018-07-30 16:32:40.666584 +0300    1532957560leo_storage_handler_object:read_and_repair_3/3    1434    [{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{cause,timeout}]

Info logs from node05:

[I] node05@10.3.15.105  2018-07-30 16:34:14.196621 +0300    1532957654null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,143404}]}
[I] node05@10.3.15.105  2018-07-30 16:34:29.497739 +0300    1532957669null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,153253}]}
[I] node05@10.3.15.105  2018-07-30 16:34:44.65305 +0300 1532957684  null:null   0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,167629}]}
[I] node05@10.3.15.105  2018-07-30 16:34:44.93413 +0300 1532957684  null:null   0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,137848}]}
[I] node05@10.3.15.105  2018-07-30 16:34:44.129532 +0300    1532957684null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,123462}]}
[I] node05@10.3.15.105  2018-07-30 16:34:44.143682 +0300    1532957684null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,102028}]}
[I] node05@10.3.15.105  2018-07-30 16:34:44.169612 +0300    1532957684null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,93501}]}
[I] node05@10.3.15.105  2018-07-30 16:34:44.184279 +0300    1532957684null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,91182}]}

Info logs from node08 (nothing in error logs):

[I] node08@10.3.15.108  2018-07-30 16:31:44.46812 +0300 1532957504  null:null   0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,83260}]}
[I] node08@10.3.15.108  2018-07-30 16:32:17.765815 +0300    1532957537null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,111333}]}
[I] node08@10.3.15.108  2018-07-30 16:32:31.449430 +0300    1532957551null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,100660}]}
[I] node08@10.3.15.108  2018-07-30 16:32:31.450507 +0300    1532957551null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,95209}]}
[I] node08@10.3.15.108  2018-07-30 16:32:31.451198 +0300    1532957551null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,95017}]}
[I] node08@10.3.15.108  2018-07-30 16:32:59.654923 +0300    1532957579null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,98865}]}
[I] node08@10.3.15.108  2018-07-30 16:32:59.656474 +0300    1532957579null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,93414}]}
[I] node08@10.3.15.108  2018-07-30 16:32:59.657418 +0300    1532957579null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,93222}]}
[I] node08@10.3.15.108  2018-07-30 16:32:59.661405 +0300    1532957579null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,78997}]}
[I] node08@10.3.15.108  2018-07-30 16:32:59.665188 +0300    1532957579null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,63422}]}
[I] node08@10.3.15.108  2018-07-30 16:33:12.936779 +0300    1532957592null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,62271}]}
[I] node08@10.3.15.108  2018-07-30 16:33:12.972406 +0300    1532957592null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,40859}]}
[I] node08@10.3.15.108  2018-07-30 16:33:12.997936 +0300    1532957592null:null 0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,get},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,32331}]}
[I] node08@10.3.15.108  2018-07-30 16:33:13.263 +0300   1532957593  null:null   0   {module,"leo_object_storage_msg_collector"},{function,"notify/4"},{line,107},{body,[{cause,"slow operation"},{method,head},{key,<<"production/parts/used/23/03/80/36038005/1.jpg">>},{processing_time,19107}]}
