PWX-18756 fix race in restart and new req processing #190

sulakshm · 2021-02-24T12:50:21Z

Issue:
Stuck requests from block layer but fuse has none.
Assertion with invalid uid seen.

JIRA:
https://portworx.atlassian.net/browse/PWX-18756

Root cause:
1/ rcu usage incorrect, code accessing protected variables naked causing stale values to be read.
fc->connected and fc->allow_disconnected though protected with rcu read and sync calls (not all places),
were never protected against compiler optimization by accessing them WRITE_ONCE(), READ_ONCE() macros.
There are many places where updates to these happen, hence removed them altogether and switched to use memory barriers instead.

2/ when new requests get submitted in disconnected state, uids are still allocated, request initialized and freed again.
This can all be avoided by early return.

NOTES:

[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  defaultdefault
        1       HIGH            raid0           381 GiB 25 GiB  Online  defaultdefault
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 11:58 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 11:58 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 11:58 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 11:58 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   VersionKernel                           OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              26 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  80 GiB
        Total Capacity  :  1.2 TiB

[root@ip-70-0-69-50 px-fuse]# systemctl stop portworx
[root@ip-70-0-69-50 px-fuse]# rmmod px
[root@ip-70-0-69-50 px-fuse]# rm px_version.[co]
rm: remove regular file ‘px_version.c’? y
rm: remove regular file ‘px_version.o’? y
[root@ip-70-0-69-50 px-fuse]# git pull
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Total 4 (delta 3), reused 4 (delta 3), pack-reused 0
Unpacking objects: 100% (4/4), done.
From github.com:portworx/px-fuse
   ea201f1..4b02f3e  ln/v2.7.0-patch -> origin/ln/v2.7.0-patch
Updating ea201f1..4b02f3e
Fast-forward
 dev.c | 9 ---------
 pxd.c | 6 ++++--
 2 files changed, 4 insertions(+), 11 deletions(-)
[root@ip-70-0-69-50 px-fuse]# make
Kernel version 5.4 supports blkmq driver model.
Kernel fast path enabled, current kernel version 5.4.12-1.el7.elrepo.x86_64 need minimum 3.10
echo "const char *gitversion = \"ln/v2.7.0-patch:4b02f3e620640de97351e69bfd9da78259402ffa\";" > px_version.c
make  -C /usr/src/kernels/5.4.12-1.el7.elrepo.x86_64  M=/root/srcs/src/github.com/portworx/px-fuse modules
make[1]: Entering directory `/usr/src/kernels/5.4.12-1.el7.elrepo.x86_64'
Kernel version 5.4 supports blkmq driver model.
Kernel fast path enabled, current kernel version 5.4.12-1.el7.elrepo.x86_64 need minimum 3.10
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/pxd.o
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/dev.o
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/px_version.o
  LD [M]  /root/srcs/src/github.com/portworx/px-fuse/px.o
Kernel version 5.4 supports blkmq driver model.
Kernel fast path enabled, current kernel version 5.4.12-1.el7.elrepo.x86_64 need minimum 3.10
  Building modules, stage 2.
  MODPOST 1 modules
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/px.mod.o
  LD [M]  /root/srcs/src/github.com/portworx/px-fuse/px.ko
make[1]: Leaving directory `/usr/src/kernels/5.4.12-1.el7.elrepo.x86_64'
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# insmod ./px.ko
[root@ip-70-0-69-50 px-fuse]# systemctl restart portworx
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  defaultdefault
        1       HIGH            raid0           381 GiB 12 GiB  Online  defaultdefault
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:15 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:15 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   VersionKernel                           OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              14 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  42 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  defaultdefault
        1       HIGH            raid0           381 GiB 12 GiB  Online  defaultdefault
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:15 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:15 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   VersionKernel                           OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              14 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  42 GiB
        Total Capacity  :  1.2 TiB
(reverse-i-search)`start': systemctl re^Cart portworx
[root@ip-70-0-69-50 px-fuse]# pxctl v l
ID      NAME    SIZE    HA      SHARED  ENCRYPTED       PROXY-VOLUME    IO_PRIORITY     STATUS  SNAP-ENABLED
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
## hit oom
[root@ip-70-0-69-50 px-fuse]# echo 30 > /sys/devices/pxd/1/timeout
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-1
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-2
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-3
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# for vol in $(pxctl v l -j | jq .[].id | tr -d \");  do pxctl v d -f $vol; done
Volume 836919802202142390 successfully deleted.
Volume 612052366174175389 successfully deleted.
Volume 160666015937413773 successfully deleted.
Volume 842182307206036125 successfully deleted.
Volume 569312837180066461 successfully deleted.
Volume 527481524870771127 successfully deleted.
Volume 719934882592650322 successfully deleted.
Volume 597098631762899521 successfully deleted.
Volume 382788275924146125 successfully deleted.
Volume 381215105176489032 successfully deleted.
Volume 921560239242663031 successfully deleted.
Volume 534614494852012384 successfully deleted.
Volume 607503559611190288 successfully deleted.
Volume 630255754048847981 successfully deleted.
Volume 164376266147608086 successfully deleted.
Volume 344115851846402266 successfully deleted.
Volume 507184879209100122 successfully deleted.
Volume 253250678035732140 successfully deleted.
Volume 343171988103495087 successfully deleted.
Volume 197663608152700596 successfully deleted.
Volume 1066154862962147587 successfully deleted.
Volume 512053139621507649 successfully deleted.
Volume 379813018554287919 successfully deleted.
Volume 119844971981445962 successfully deleted.
Volume 152375959836060148 successfully deleted.
Volume 11309031015561148 successfully deleted.
Volume 281686904715745291 successfully deleted.
Volume 240984244162047593 successfully deleted.
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0      453
       0        0
       0        0
       0        0
       0      214
       0        0
       0        0
       0        0
       0        0
       0      129
       1      210
       1      387
       0        1
       2      114
       1      256
       0        0
       0      247
       0        0
       0      234
       0        0
       0      366
       1      116
       1      249
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# ### hit oom
[root@ip-70-0-69-50 px-fuse]# echo 30 > /sys/devices/pxd/1/timeout
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0      453
       0        0
       0        0
       0        0
       0      214
       0        0
       0        0
       0        0
       0        0
       0      129
       1      210
       1      387
       0        1
       2      114
       1      256
       0        0
       0      247
       0        0
       0      234
       0        0
       0      366
       1      116
       1      249
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  defaultdefault
        1       HIGH            raid0           381 GiB 25 GiB  Online  defaultdefault
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:26 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:26 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   VersionKernel                           OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              26 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              28 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              28 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  83 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]# for vol in $(pxctl v l -j | jq .[].id | tr -d \");  do pxctl v d -f $vol; done
Volume 726979413421413154 successfully deleted.
Volume 570183114289491655 successfully deleted.
Volume 1001386735240115466 successfully deleted.
Volume 717699345993048202 successfully deleted.
Volume 887350118426723865 successfully deleted.
Volume 10572479190523077 successfully deleted.
Volume 1100003808439970068 successfully deleted.
Volume 704222129097173422 successfully deleted.
Volume 389073853284753603 successfully deleted.
Volume 113700620084412077 successfully deleted.
Volume 1073245171020293431 successfully deleted.
Volume 1073842049839578457 successfully deleted.
Volume 438959497027756748 successfully deleted.
Volume 759106970884226289 successfully deleted.
Volume 379141997930397359 successfully deleted.
Volume 169040391886579294 successfully deleted.
Volume 992644456314424250 successfully deleted.
Volume 517402392385723517 successfully deleted.
Volume 378911811916215535 successfully deleted.
Volume 753231858340370824 successfully deleted.
Volume 43117892067809681 successfully deleted.
Volume 865561370061464026 successfully deleted.
Volume 204161804573499944 successfully deleted.
Volume 570910950504955106 successfully deleted.
Volume 1135775733155949850 successfully deleted.
Volume 886769252060010227 successfully deleted.
Volume 423951072590475298 successfully deleted.
Volume 239242823161104052 successfully deleted.
Volume 501643584749479454 successfully deleted.
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-1                            [screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-2
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-3
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# pxctl v l
ID      NAME    SIZE    HA      SHARED  ENCRYPTED       PROXY-VOLUME    IO_PRIORITY     STATUS  SNAP-ENABLED
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  defaultdefault
        1       HIGH            raid0           381 GiB 13 GiB  Online  defaultdefault
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:26 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:26 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   VersionKernel                           OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              15 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              16 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              16 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  47 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# ## hit oom
[root@ip-70-0-69-50 px-fuse]# echo 30 > /sys/devices/pxd/1/timeout
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       1      220
       1        0
       0        0
       0        0
       0      309
       2      130
       0        0
       0        0
       2      105
       1      252
       0      193
       1      151
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0      121
       0        0
       0        0
       0      241
       0      390
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  defaultdefault
        1       HIGH            raid0           381 GiB 25 GiB  Online  defaultdefault
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:31 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:31 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:31 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:31 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   VersionKernel                           OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              27 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  81 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-1
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-2
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-3
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
[root@ip-70-0-69-50 px-fuse]#

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

This reverts commit 033aeed.

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sangleganesh · 2021-02-25T02:15:08Z

@prabirpaul or @maxkozlovsky can you please review this ?

maxkozlovsky

syncrhonize_rcu() is necessary. Otherwise the thread can check the values of connected/!allow_disconnected, find them false and proceed queueing request. synchronize_rcu() makes sure that all operations that found disconnected false will finish before synchronize_rcu() returns to the caller. READ_ONCE() is not really necessary here as far as I can tell, even if they change while being read it is only can go towards disconnected state. d do READ_ONCE() is necessary with a rcu protected pointer value to avoid operating on two different pointers.

The race seems to be calling fuse_end_queued_requests() while the allocation is in progress so they end up operating on the same request. Just moving the check for disconnected before allocation should be enough.

maxkozlovsky · 2021-02-25T04:25:08Z

pxd.c

@@ -2097,6 +2101,7 @@ static int pxd_control_open(struct inode *inode, struct file *file)
 	file->private_data = fc;

 	pxdctx_set_connected(ctx, true);
+	wmb();


memory barrier is redundant here since spin_unlock just happened which does memory barrier

allow_disconnected is outside lock. And this path is called only once. wmb() ensures values get committed and sure visible after this call.

maxkozlovsky · 2021-02-25T04:25:37Z

pxd.c

@@ -2120,6 +2125,7 @@ static int pxd_control_release(struct inode *inode, struct file *file)
 		pxd_printk("%s: not opened\n", __func__);
 	} else {
 		ctx->fc.connected = 0;
+		wmb();


there is a barrier inside schedule_delayed_work this is redundant

wmb even if redundant assures of the modified value.

maxkozlovsky · 2021-02-25T04:27:18Z

pxd.c

 	/* Let other threads see the value of allow_disconnected. */
-	synchronize_rcu();
+	wmb();


wmb by itself does not do much.

synchronize_rcu() has a stronger memory barrier guarantee. By the time synchronize_rcu() returns all threads including this one has done a memory barrier and will see the updated value of allow_disconnected.

It did not happen. Earlier the usage was with synchronize_rcu() and read_[lock|unlock]_rcu. It did not ensure updated values get reflected in the IO enqueue path and that was the reason new requests got enqueued while abort handling was running. Also access to variables without READ/WRITE_ONCE macros allows those access to be reordered during execution.

In any case, with the posted changes the original issue of invalid index and assertion is not seen anymore.

sulakshm · 2021-02-25T08:29:55Z

The race seems to be calling fuse_end_queued_requests() while the allocation is in progress so they end up operating on the same request. Just moving the check for disconnected before allocation should be enough.

How could this race be open if the values of allow_disconnected is consistent? This is precisely addressed by wmb().
rcu lock/synchronize semantics did not address this.

sulakshm · 2021-02-25T08:39:11Z

pxd.c

@@ -993,7 +994,8 @@ static void pxd_rq_fn(struct request_queue *q)
 			break;

 		/* Filter out block requests we don't understand. */
-		if (BLK_RQ_IS_PASSTHROUGH(rq)) {
+		if (BLK_RQ_IS_PASSTHROUGH(rq) ||
+				(!fc->allow_disconnected && !fc->connected)) {


early return based on allow_disconnected and connected state. This ensures no allocation and immediate freeing of requests happen later.

sulakshm · 2021-02-25T08:39:25Z

pxd.c


-	if (BLK_RQ_IS_PASSTHROUGH(rq))
+	if (BLK_RQ_IS_PASSTHROUGH(rq) ||
+		(!fc->allow_disconnected && !fc->connected))


early return based on allow_disconnected and connected state. This ensures no allocation and immediate freeing of requests happen later.

use READ/WRITE_ONCE macro to work with connected/allow_disconnected var Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm

Issue not seen even with posted changes.

sulakshm · 2021-02-25T10:38:40Z

dev.c

 	/*
 	 * Ensures checking the value of allow_disconnected and adding request to
 	 * queue is done atomically.
 	 */
 	rcu_read_lock();

-	if (force) {


only allow queueing request to userspace if either of connected or allow_disconnected is true, even for fastpath requests. So removed force option.

sulakshm · 2021-02-25T10:39:28Z

dev.c

 {
-	req->in.unique = fuse_get_unique(fc);
-	fc->request_map[req->in.unique & (FUSE_MAX_REQUEST_IDS - 1)] = req;
-


modified queue_request to also assign unique ids. queue_request should get called only if userspace enqueue conditions are met.

sulakshm · 2021-02-25T10:40:07Z

dev.c

-		}
-		rcu_read_unlock();
-	} else if (fc->connected || fc->allow_disconnected) {
+	if (READ_ONCE(fc->connected) || READ_ONCE(fc->allow_disconnected)) {


modified all connected and allow_disconnected to use READ/WRITE_ONCE macros so no compiler optimizations/reordering happens.

sulakshm · 2021-02-25T10:43:51Z

dev.c

@@ -222,24 +231,15 @@ static void request_end(struct fuse_conn *fc, struct fuse_req *req,
 	if (shouldfree) fuse_request_free(req);
 }

-void fuse_request_send_nowait(struct fuse_conn *fc, struct fuse_req *req, bool force)
+void fuse_request_send_nowait(struct fuse_conn *fc, struct fuse_req *req)


only allow queueing request to userspace if either of connected or allow_disconnected is true, even for fastpath requests. So removed force option.

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm · 2021-02-25T10:55:18Z

dev.c

@@ -141,7 +141,13 @@ static void fuse_put_unique(struct fuse_conn *fc, u64 uid)
 {
 	struct fuse_per_cpu_ids *my_ids;
 	int num_free;
-	int cpu = get_cpu();
+	int cpu;
+


possible to call put without get unique id, handle it.

maxkozlovsky

It's ok,
note though that READ_ONCE() does not actually inhibit reordering. READ_ONCE() is equivalent to

std::atomic_load(&val, std::memory_order_relaxed);

Any other loads and stores can be reordered around it. It does guarantee that the value will be loaded once and only once. If READ_ONCE() happens in a loop, it would be done once per loop iteration.

maxkozlovsky · 2021-02-25T17:44:06Z

pxd.c


-	if (BLK_RQ_IS_PASSTHROUGH(rq))
+	if (BLK_RQ_IS_PASSTHROUGH(rq) ||
+		(!READ_ONCE(fc->allow_disconnected) && !READ_ONCE(fc->connected)))


Is the duplicate check necessary?

fc->connected can be killed, checking just for allow_disconnected is sufficient.

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm · 2021-02-25T18:58:15Z

It's ok,
note though that READ_ONCE() does not actually inhibit reordering. READ_ONCE() is equivalent to

std::atomic_load(&val, std::memory_order_relaxed);

Any other loads and stores can be reordered around it. It does guarantee that the value will be loaded once and only once. If READ_ONCE() happens in a loop, it would be done once per loop iteration.

introducing READ_ONCE() ensures that these variables have concurrent updates and values do not get cached before the point of access. A necessity for rcu critical section to work, without which compiler can cache/reorder access of vars as needed.

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

…193)" This reverts commit c83c546.

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

…193)" This reverts commit 525a5ee.

Lakshmi Narasimhan Sundararajan added 6 commits February 24, 2021 15:41

patch race

033aeed

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

write mem barrier

58fa768

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

kill incorrect rcu usage

cdbc8e8

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

Revert "patch race"

a3c375f

This reverts commit 033aeed.

keep early return when not connected

ea201f1

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

kill rcu usage

4b02f3e

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm requested a review from prabirpaul February 24, 2021 12:50

wmb at control open/release as well

892c0c5

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm requested review from nikhil-bhupale, ram-infrac, raviprasadsn and sangleganesh February 25, 2021 01:40

sangleganesh requested a review from maxkozlovsky February 25, 2021 02:14

maxkozlovsky reviewed Feb 25, 2021

View reviewed changes

sulakshm commented Feb 25, 2021

View reviewed changes

using rcu and minor reorg on uniq id alloc

0a57c08

use READ/WRITE_ONCE macro to work with connected/allow_disconnected var Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm commented Feb 25, 2021

View reviewed changes

put back orig wmb

5141256

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm commented Feb 25, 2021

View reviewed changes

maxkozlovsky approved these changes Feb 25, 2021

View reviewed changes

Lakshmi Narasimhan Sundararajan added 3 commits February 25, 2021 23:50

address review comments

f257339

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

review comment

8ad1a20

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

fix compilation

c1c0c3c

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm merged commit 07f1b44 into v2.7.0 Feb 25, 2021

sulakshm added a commit that referenced this pull request Feb 26, 2021

PWX-18756 fix race in restart and new req processing (#190)

d93c073

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm mentioned this pull request Feb 26, 2021

PWX-18756 fix race in restart and new req processing (#190) #193

Merged

sulakshm added a commit that referenced this pull request Feb 26, 2021

PWX-18756 fix race in restart and new req processing (#190) (#193)

85b820d

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm added a commit that referenced this pull request Mar 12, 2021

PWX-18756 fix race in restart and new req processing (#190) (#193)

8e94429

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

prabirpaul pushed a commit that referenced this pull request Mar 12, 2021

PWX-18756 fix race in restart and new req processing (#190) (#193)

c83c546

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

sulakshm added a commit that referenced this pull request Mar 12, 2021

PWX-18756 fix race in restart and new req processing (#190) (#193)

380b4cc

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

prabirpaul added a commit that referenced this pull request Mar 12, 2021

Revert "PWX-18756 fix race in restart and new req processing (#190) (#…

581d754

…193)" This reverts commit c83c546.

prabirpaul pushed a commit that referenced this pull request Mar 12, 2021

PWX-18756 fix race in restart and new req processing (#190) (#193)

525a5ee

* address race between restart and new req processing Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>

prabirpaul added a commit that referenced this pull request Mar 12, 2021

Revert "PWX-18756 fix race in restart and new req processing (#190) (#…

879a7c9

…193)" This reverts commit 525a5ee.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PWX-18756 fix race in restart and new req processing #190

PWX-18756 fix race in restart and new req processing #190

sulakshm commented Feb 24, 2021

sangleganesh commented Feb 25, 2021

maxkozlovsky left a comment

maxkozlovsky Feb 25, 2021

sulakshm Feb 25, 2021 •

edited

Loading

maxkozlovsky Feb 25, 2021

sulakshm Feb 25, 2021

maxkozlovsky Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm commented Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm left a comment

sulakshm Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm Feb 25, 2021

maxkozlovsky left a comment •

edited

Loading

maxkozlovsky Feb 25, 2021

sulakshm Feb 25, 2021

sulakshm commented Feb 25, 2021

PWX-18756 fix race in restart and new req processing #190

PWX-18756 fix race in restart and new req processing #190

Conversation

sulakshm commented Feb 24, 2021

sangleganesh commented Feb 25, 2021

maxkozlovsky left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sulakshm Feb 25, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sulakshm commented Feb 25, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sulakshm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maxkozlovsky left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sulakshm commented Feb 25, 2021

sulakshm Feb 25, 2021 •

edited

Loading

maxkozlovsky left a comment •

edited

Loading