
Recover-node fails to recover all data on storage node #880

Closed
@vstax

Description

There are 6 storage nodes with N=3, W=2, R=1; everything is consistent at the start of the experiment. Each node has 4 AVS directories (one per drive). There are no deleted or overwritten objects on the cluster (just a tiny amount of multipart headers), and the ratio of active size is 100% on each node. The distribution of data is not even between nodes, but very even between the AVS directories on each node (less than 0.3% difference). The cluster is running with the #876 changes.

The experiment is as follows:

  1. I start uploading data to the cluster. I will be uploading 500 GB of data, which should increase disk usage on each node by roughly 500 * 3 / 6 = 250 GB, or ~62 GB per AVS directory on each node. In reality, due to uneven distribution, there will be up to a 16% difference between nodes, so it will be roughly 57 to 67 GB per AVS directory (but the increase should be the same for each AVS directory on any given node).
  2. I suspend and stop the first node, stor01, and remove one of its AVS directories, simulating a drive failure. I start the node, resume it and execute "recover-node stor01" (the rough command sequence is sketched after this list). The queues on the other nodes start filling and lots of traffic flows to stor01. The lost AVS directory starts filling up.
  3. I suspend and stop stor03, simulating another node failure in the middle of the recover-node operation. N=3, so I'm not supposed to lose any data over this. Since the queues on stor03 are large and trying to fill fast, stopping takes some time (plus tons of badarg errors from leveldb right before it stops) - but this is expected.
  4. I wipe all data on stor03, including all AVS files and all queues (including membership). I start it and resume it - it has the same name, so it automatically attaches to the cluster, gets the RING and starts receiving the data that is being uploaded to the cluster. The upload doesn't stop because W=2 is satisfied at all times.
  5. I execute "recover-node stor03" (with "recover-node stor01" still in progress). I can see both stor01 and stor03 receiving lots of data. I wait till all queues are empty.
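For reference, roughly the admin command sequence behind steps 2-5 (a sketch, not the exact session: the leofs-adm suspend/resume/recover-node commands are the standard ones, while the way the storage process itself is stopped/started and the directories are wiped depends on the local setup):

# expected growth from step 1, as a sanity check:
#   500 GB uploaded * N=3 replicas / 6 nodes = ~250 GB per node, ~62 GB per AVS directory

# step 2: simulate a single drive failure on stor01 and start recovering it
leofs-adm suspend bodies01@stor01.selectel.cloud.lan
# ... stop the storage process on stor01, remove one AVS directory (/mnt/avs3), start it again ...
leofs-adm resume bodies01@stor01.selectel.cloud.lan
leofs-adm recover-node bodies01@stor01.selectel.cloud.lan

# steps 3-5: simulate a full node failure on stor03 while the above is still running
leofs-adm suspend bodies03@stor03.selectel.cloud.lan
# ... stop stor03, wipe all AVS files and all queues (including membership), start it again ...
leofs-adm resume bodies03@stor03.selectel.cloud.lan
leofs-adm recover-node bodies03@stor03.selectel.cloud.lan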

There are "slow operation" messages in info logs, example log:

[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:10:54.586854 +0300	1507749054	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,get},{key,<<"bod1/00/de/f1/00def1df121b056aba83f3dad7fa75538010002d93e62e31d3b122eb01a8863b426425d77f844c208439fb6f8043ab7688240a0000000000.xz">>},{processing_time,6839}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:10:54.587046 +0300	1507749054	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod1/3f/66/4d/3f664df2f427a0ac57cd3b4dcae7ae2d36a2a295e8f292372f0ea476ce0859ea8a139d2fe5f1b427d01c59c4de1d939de8b40d0000000000.xz">>},{processing_time,6410}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:10:59.729838 +0300	1507749059	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<>>},{processing_time,5197}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:18.728820 +0300	1507749078	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<>>},{processing_time,6609}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:18.728972 +0300	1507749078	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod1/55/55/ba/5555bad0ba80004a65f2be8e8740797517db050e471786913ffb05b5f8d8fdea183844e623cb960cada725e418072bdb009e090000000000.xz">>},{processing_time,5167}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.181764 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<>>},{processing_time,7453}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.181938 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod1/3f/f5/87/3ff5870173ffbd88eb9af762bffc495992ef49b91ecf2568a574d77172fb86084c1b400cec0e3e95fcdbd9317e4e16e828f4600000000000.xz">>},{processing_time,7372}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.182116 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{processing_time,6307}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.182272 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{processing_time,6297}]
[I]	bodies01@stor01.selectel.cloud.lan	2017-10-11 22:11:26.182399 +0300	1507749086	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,put},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{processing_time,6840}]

This was during step 3, while stor03 was stopped; at other times the number of such messages was much smaller (really insignificant amounts). There are some timeouts in the gateway logs as well.

Soon after stor03 started, I got a bunch of messages like this in the error log on stor01/02/04/05/06:

[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:07:21.616638 +0300        1507748841      leo_mq_consumer:consume/4       526     [{module,leo_storage_mq},{id,leo_per_object_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_4_0,{get,{216869700495053906819207869156722907209,<<"bod1/00/42/c7/0042c78f42247f361f70be868be52e8ddbcbad83b57235175f4687be9f20bd1fb95158ffa5a0edb62200a9008c1eda5b00ad070000000000.xz">>},-1,-1,true,666268},30000]}}}]
[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:07:21.757639 +0300        1507748841      leo_mq_consumer:consume/4       526     [{module,leo_storage_mq},{id,leo_per_object_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_4_1,{get,{233468761328939195850605299707763487896,<<"bod1/00/23/b8/0023b899c8affee881b520131c1c1f5c70f8a5a598c31460dd4fc93b4b6a44e95f800257d0c9506f823ae2ec2fe73cfd00da070000000000.xz">>},-1,-1,true,666409},30000]}}}]
[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:07:21.820630 +0300        1507748841      leo_mq_consumer:consume/4       526     [{module,leo_storage_mq},{id,leo_per_object_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_4_1,{get,{188914636427983509554839109418181432768,<<"bod1/00/14/b4/0014b4aa8dd8721b33c1bf871c65a27292390bee5670f31749cfe8c66326ecd24086db43c8c14e90a574d90a67a9631c00a6070000000000.xz">>},-1,-1,true,666472},30000]}}}]
[E]     bodies01@stor01.selectel.cloud.lan     2017-10-11 22:11:26.181940 +0300        1507749086      leo_storage_handler_object:put/4        424     [{from,storage},{method,delete},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{req_id,11017595},{cause,not_found}]

There weren't terribly many of them; they started to appear when stor03 finished starting up and stopped about 15 minutes later.

Logs on stor03 have just these two warnings:

[W]     bodies03@stor03.selectel.cloud.lan     2017-10-12 00:05:32.20685 +0300 1507755932      leo_storage_read_repairer:compare/4     167     [{node,'bodies01@stor01.selectel.cloud.lan'},{addr_id,7946999318617476916794463740239994466},{key,<<"bod13/08/a9/cc/08a9cce0df7a51a65bc1da075a03714f8146cdf732308de9fc24262e67332bad1539086398b1688f8d622a6fc83db077b06e090100000000.xz\neca1aedfb5c7f8bf816f93c4c7f0e848">>},{clock,1507755930845926},{cause,not_found}]
[W]     bodies03@stor03.selectel.cloud.lan     2017-10-11 22:11:26.316603 +0300        1507749086      leo_storage_read_repairer:compare/4     167     [{node,'bodies01@stor01.selectel.cloud.lan'},{addr_id,59684934647226860873027859410263418647},{key,<<"bod13/31/1b/c3/311bc313abc833bada8a84829ccdf5ab34ae18cefb8dae65f503fbb3274083d60e99823d56302d65e898e6d218cc3c8cb06e090100000000.xz\ncb7e5b94e5564cc2c6f312a2b9276c45">>},{clock,1507749079342341},{cause,not_found}]

Expected, I suppose, given that for some objects the copies on both stor01 and stor03 were lost and only the third copy exists.

I don't see any real signs of problems up to this point. However, this is the end result after recovery was over; df output on stor03:

/dev/sdd2                    5,5T         805G  4,7T           15% /mnt/avs4
/dev/sdc2                    5,5T         650G  4,8T           12% /mnt/avs3
/dev/sda4                    5,5T         806G  4,7T           15% /mnt/avs1
/dev/sdb4                    5,5T         807G  4,7T           15% /mnt/avs2

This doesn't make sense: I'm clearly missing data in avs3. Each of the 64 AVS files there is around 10-11 GB, while it's supposed to be around 13 GB like on the other nodes. This is the first problem.
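A rough check of that gap (assuming avs3 on stor03 should have ended up around the size of avs1/avs2/avs4, i.e. ~805-807 GB):

# each of the 64 AVS files in avs3 is ~10-11 GB instead of the expected ~13 GB:
awk 'BEGIN { printf "~%d GB present vs ~%d GB expected\n", 64*10.5, 64*13 }'
# -> ~672 GB present vs ~832 GB expected
# consistent with the df output above: avs1/2/4 hold ~806 GB while avs3 holds 650 GB,
# so roughly 150-160 GB of data is missing from avs3 on stor03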

On stor01, where I removed avs3, I get this:

/dev/sdd2                    5,5T         888G  4,6T           17% /mnt/avs4
/dev/sdc2                    5,5T         465G  4,4T            9% /mnt/avs3
/dev/sda4                    5,5T         888G  4,6T           17% /mnt/avs1
/dev/sdb4                    5,5T         889G  4,6T           17% /mnt/avs2

At first glance, this is expected: since I stopped stor03 in the middle of it preparing data to be pushed to stor01 to fill /mnt/avs3, and deleted the queues there, it couldn't restore everything on stor01. However, technically, since N=3, some other node could have pushed that data in place of stor03. Anyway, the real problem here is that the amount of lost data - over 400 GB - can't possibly match the amount of data that stor03 was supposed to push to stor01! With 6 nodes, it should have been at most 1/6 of the directory, not nearly half of it.
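A back-of-the-envelope check of that claim, taking ~888 GB (the size of the other three directories) as what avs3 on stor01 should have reached:

# stor03 held at most 1/6 of the cluster's copies of the data belonging to stor01's avs3,
# so losing stor03's queues should cost ~888/6 GB at worst - far less than what is actually missing:
awk 'BEGIN { printf "worst case ~%d GB missing, observed %d GB missing\n", 888/6, 888-465 }'
# -> worst case ~148 GB missing, observed 423 GB missing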

Another fact is that stor03 is lacking data in /mnt/avs3 as well, which is... strange. /mnt/avs3 is what I removed on stor01, while on stor03 I removed all AVS directories. I double-checked it just now - the creation dates of the surrounding directories and such - and there should be nothing special about avs3 on stor03. No idea whether this is a coincidence or not.

That said, I didn't notice this discrepancy at the time (and didn't notice that stor03 was lacking some data until now), and executed "recover-node stor01" again. There was no other load on the cluster. After it was over, stor01 looked like this:

/dev/sdd2                    5,5T         888G  4,6T           17% /mnt/avs4
/dev/sdc2                    5,5T         777G  4,7T           15% /mnt/avs3
/dev/sda4                    5,5T         888G  4,6T           17% /mnt/avs1
/dev/sdb4                    5,5T         889G  4,6T           17% /mnt/avs2

Which is wrong as well. Some data that's supposed to be there clearly isn't, even though there were no problems during "recover-node". stor03 is missing some data as well, but I don't know whether that can affect this.

The amount of data missing on stor01 or stor03 doesn't match the amount of data I was uploading (as calculated earlier, that should have been 57-67 GB per AVS directory). It also by no means matches the number of objects mentioned in the error/info logs during these experiments, which is relatively small.

I can show that data is missing in another way as well. This is the current "du" output:

[vm@bodies-master ~]$ leofs-adm du bodies01@stor01.selectel.cloud.lan
 active number of objects: 5770311
  total number of objects: 5861284
   active size of objects: 3691180624254
    total size of objects: 3691207752413
     ratio of active size: 100.0%
    last compaction start: 2017-09-24 19:17:31 +0300
      last compaction end: 2017-09-24 19:18:24 +0300

[vm@bodies-master ~]$ leofs-adm du bodies03@stor03.selectel.cloud.lan
 active number of objects: 5144047
  total number of objects: 5149233
   active size of objects: 3291583488114
    total size of objects: 3291585038728
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[vm@bodies-master ~]$ leofs-adm du bodies06@stor06.selectel.cloud.lan
 active number of objects: 5254085
  total number of objects: 5359937
   active size of objects: 3360352546015
    total size of objects: 3360384109837
     ratio of active size: 100.0%
    last compaction start: 2017-09-24 19:08:15 +0300
      last compaction end: 2017-09-24 19:09:02 +0300

From #846 I know that before the delete and recovery stor01 had 113% of the objects of stor06, and stor03 had 103% of the objects of stor06. The upload process doesn't change these ratios. But now the distribution is: stor01 has 110% of the objects compared to stor06, and stor03 has 98%. While the counters used for "du" might be wrong under some conditions, here they match the amount of data missing in the AVS files (e.g. stor01 is missing ~110 GB of data, which is 3%).
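The arithmetic behind those percentages, using the du numbers above and the 113% baseline from #846 (a rough check, not exact accounting):

awk 'BEGIN {
  s01 = 5770311; s03 = 5144047; s06 = 5254085               # active object counts from du above
  printf "stor01/stor06 = %.0f%%, stor03/stor06 = %.0f%%\n", 100*s01/s06, 100*s03/s06
  expected = 3360352546015 * 1.13                            # stor01 used to hold 113% of stor06
  actual   = 3691180624254                                   # current active size on stor01
  printf "stor01 is missing ~%.0f GB (%.1f%%)\n", (expected - actual)/1e9, 100*(expected - actual)/expected
}'
# -> stor01/stor06 = 110%, stor03/stor06 = 98%
# -> stor01 is missing ~106 GB (2.8%)
# which is also in line with the df output on stor01 above: 888 - 777 = 111 GB still missing from avs3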
