
PWX-18756 fix race in restart and new req processing #190

Merged
merged 12 commits into from
Feb 25, 2021
Conversation

sulakshm
Contributor

Issue:
Requests appear stuck at the block layer, but fuse holds none of them.
An assertion failure with an invalid uid is seen.

JIRA:
https://portworx.atlassian.net/browse/PWX-18756

Root cause:
1/ Incorrect RCU usage: the code accessed protected variables directly, so stale values could be read.
fc->connected and fc->allow_disconnected, though guarded with RCU read/sync calls (and not even in all places),
were never protected against compiler optimization via the WRITE_ONCE()/READ_ONCE() macros.
Updates to them happen in many places, so the RCU calls were removed altogether and replaced with memory barriers.

2/ When new requests are submitted in the disconnected state, uids are still allocated and the request is initialized, only to be freed again immediately.
All of this can be avoided by an early return.

NOTES:

[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  default default
        1       HIGH            raid0           381 GiB 25 GiB  Online  default default
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 11:58 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 11:58 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 11:58 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 11:58 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   Version Kernel                          OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              26 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  80 GiB
        Total Capacity  :  1.2 TiB

[root@ip-70-0-69-50 px-fuse]# systemctl stop portworx
[root@ip-70-0-69-50 px-fuse]# rmmod px
[root@ip-70-0-69-50 px-fuse]# rm px_version.[co]
rm: remove regular file ‘px_version.c’? y
rm: remove regular file ‘px_version.o’? y
[root@ip-70-0-69-50 px-fuse]# git pull
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Total 4 (delta 3), reused 4 (delta 3), pack-reused 0
Unpacking objects: 100% (4/4), done.
From github.com:portworx/px-fuse
   ea201f1..4b02f3e  ln/v2.7.0-patch -> origin/ln/v2.7.0-patch
Updating ea201f1..4b02f3e
Fast-forward
 dev.c | 9 ---------
 pxd.c | 6 ++++--
 2 files changed, 4 insertions(+), 11 deletions(-)
[root@ip-70-0-69-50 px-fuse]# make
Kernel version 5.4 supports blkmq driver model.
Kernel fast path enabled, current kernel version 5.4.12-1.el7.elrepo.x86_64 need minimum 3.10
echo "const char *gitversion = \"ln/v2.7.0-patch:4b02f3e620640de97351e69bfd9da78259402ffa\";" > px_version.c
make  -C /usr/src/kernels/5.4.12-1.el7.elrepo.x86_64  M=/root/srcs/src/github.com/portworx/px-fuse modules
make[1]: Entering directory `/usr/src/kernels/5.4.12-1.el7.elrepo.x86_64'
Kernel version 5.4 supports blkmq driver model.
Kernel fast path enabled, current kernel version 5.4.12-1.el7.elrepo.x86_64 need minimum 3.10
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/pxd.o
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/dev.o
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/px_version.o
  LD [M]  /root/srcs/src/github.com/portworx/px-fuse/px.o
Kernel version 5.4 supports blkmq driver model.
Kernel fast path enabled, current kernel version 5.4.12-1.el7.elrepo.x86_64 need minimum 3.10
  Building modules, stage 2.
  MODPOST 1 modules
  CC [M]  /root/srcs/src/github.com/portworx/px-fuse/px.mod.o
  LD [M]  /root/srcs/src/github.com/portworx/px-fuse/px.ko
make[1]: Leaving directory `/usr/src/kernels/5.4.12-1.el7.elrepo.x86_64'
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# insmod ./px.ko
[root@ip-70-0-69-50 px-fuse]# systemctl restart portworx
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  default default
        1       HIGH            raid0           381 GiB 12 GiB  Online  default default
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:15 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:15 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   Version Kernel                          OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              14 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  42 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  default default
        1       HIGH            raid0           381 GiB 12 GiB  Online  default default
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:15 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:15 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:15 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   Version Kernel                          OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              14 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              14 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  42 GiB
        Total Capacity  :  1.2 TiB
(reverse-i-search)`start': systemctl re^Cart portworx
[root@ip-70-0-69-50 px-fuse]# pxctl v l
ID      NAME    SIZE    HA      SHARED  ENCRYPTED       PROXY-VOLUME    IO_PRIORITY     STATUS  SNAP-ENABLED
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
## hit oom
[root@ip-70-0-69-50 px-fuse]# echo 30 > /sys/devices/pxd/1/timeout
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-1
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-2
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-3
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# for vol in $(pxctl v l -j | jq .[].id | tr -d \");  do pxctl v d -f $vol; done
Volume 836919802202142390 successfully deleted.
Volume 612052366174175389 successfully deleted.
Volume 160666015937413773 successfully deleted.
Volume 842182307206036125 successfully deleted.
Volume 569312837180066461 successfully deleted.
Volume 527481524870771127 successfully deleted.
Volume 719934882592650322 successfully deleted.
Volume 597098631762899521 successfully deleted.
Volume 382788275924146125 successfully deleted.
Volume 381215105176489032 successfully deleted.
Volume 921560239242663031 successfully deleted.
Volume 534614494852012384 successfully deleted.
Volume 607503559611190288 successfully deleted.
Volume 630255754048847981 successfully deleted.
Volume 164376266147608086 successfully deleted.
Volume 344115851846402266 successfully deleted.
Volume 507184879209100122 successfully deleted.
Volume 253250678035732140 successfully deleted.
Volume 343171988103495087 successfully deleted.
Volume 197663608152700596 successfully deleted.
Volume 1066154862962147587 successfully deleted.
Volume 512053139621507649 successfully deleted.
Volume 379813018554287919 successfully deleted.
Volume 119844971981445962 successfully deleted.
Volume 152375959836060148 successfully deleted.
Volume 11309031015561148 successfully deleted.
Volume 281686904715745291 successfully deleted.
Volume 240984244162047593 successfully deleted.
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0      453
       0        0
       0        0
       0        0
       0      214
       0        0
       0        0
       0        0
       0        0
       0      129
       1      210
       1      387
       0        1
       2      114
       1      256
       0        0
       0      247
       0        0
       0      234
       0        0
       0      366
       1      116
       1      249
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# ### hit oom
[root@ip-70-0-69-50 px-fuse]# echo 30 > /sys/devices/pxd/1/timeout
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0      453
       0        0
       0        0
       0        0
       0      214
       0        0
       0        0
       0        0
       0        0
       0      129
       1      210
       1      387
       0        1
       2      114
       1      256
       0        0
       0      247
       0        0
       0      234
       0        0
       0      366
       1      116
       1      249
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  default default
        1       HIGH            raid0           381 GiB 25 GiB  Online  default default
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:26 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:26 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   Version Kernel                          OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              26 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              28 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              28 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  83 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]# for vol in $(pxctl v l -j | jq .[].id | tr -d \");  do pxctl v d -f $vol; done
Volume 726979413421413154 successfully deleted.
Volume 570183114289491655 successfully deleted.
Volume 1001386735240115466 successfully deleted.
Volume 717699345993048202 successfully deleted.
Volume 887350118426723865 successfully deleted.
Volume 10572479190523077 successfully deleted.
Volume 1100003808439970068 successfully deleted.
Volume 704222129097173422 successfully deleted.
Volume 389073853284753603 successfully deleted.
Volume 113700620084412077 successfully deleted.
Volume 1073245171020293431 successfully deleted.
Volume 1073842049839578457 successfully deleted.
Volume 438959497027756748 successfully deleted.
Volume 759106970884226289 successfully deleted.
Volume 379141997930397359 successfully deleted.
Volume 169040391886579294 successfully deleted.
Volume 992644456314424250 successfully deleted.
Volume 517402392385723517 successfully deleted.
Volume 378911811916215535 successfully deleted.
Volume 753231858340370824 successfully deleted.
Volume 43117892067809681 successfully deleted.
Volume 865561370061464026 successfully deleted.
Volume 204161804573499944 successfully deleted.
Volume 570910950504955106 successfully deleted.
Volume 1135775733155949850 successfully deleted.
Volume 886769252060010227 successfully deleted.
Volume 423951072590475298 successfully deleted.
Volume 239242823161104052 successfully deleted.
Volume 501643584749479454 successfully deleted.
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-1
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-2
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-3
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# pxctl v l
ID      NAME    SIZE    HA      SHARED  ENCRYPTED       PROXY-VOLUME    IO_PRIORITY     STATUS  SNAP-ENABLED
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  default default
        1       HIGH            raid0           381 GiB 13 GiB  Online  default default
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:26 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:26 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:26 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   Version Kernel                          OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              15 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              16 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              16 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  47 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]#
[root@ip-70-0-69-50 px-fuse]# ## hit oom
[root@ip-70-0-69-50 px-fuse]# echo 30 > /sys/devices/pxd/1/timeout
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       1      220
       1        0
       0        0
       0        0
       0      309
       2      130
       0        0
       0        0
       2      105
       1      252
       0      193
       1      151
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0      121
       0        0
       0        0
       0      241
       0      390
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# cat /sys/block/pxd\!pxd*/inflight
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
       0        0
[root@ip-70-0-69-50 px-fuse]# pxctl status
Status: PX is operational
License: Trial (expires in 9 days)
Node ID: b3a7f832-6dea-4b42-aa9f-859540cc57aa
        IP: 70.0.69.50
        Local Storage Pool: 2 pools
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE   REGION
        0       HIGH            raid0           32 GiB  1.6 GiB Online  default default
        1       HIGH            raid0           381 GiB 25 GiB  Online  default default
        Local Storage Devices: 4 devices
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      32 GiB          24 Feb 21 12:31 UTC
        1:1     /dev/sde2       STORAGE_MEDIUM_SSD      125 GiB         24 Feb 21 12:31 UTC
        1:2     /dev/sdc        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:31 UTC
        1:3     /dev/sdd        STORAGE_MEDIUM_SSD      128 GiB         24 Feb 21 12:31 UTC
        total                   -                       413 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sde1       STORAGE_MEDIUM_SSD
Cluster Summary
        Cluster ID: MAN-PWX-18127-2.7.0
        Cluster UUID: 26c7b572-6f77-4977-94b4-1b82428209fc
        Scheduler: none
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName       StorageNode     Used    Capacity        Status  StorageStatus   Version Kernel                          OS
        70.0.69.50      b3a7f832-6dea-4b42-aa9f-859540cc57aa    N/A            Yes              27 GiB  413 GiB         Online  Up (This node)  2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.41.41      aa7dc493-b77d-4a41-8eff-be1ee87a2308    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
        70.0.84.111     5d76e438-ed30-409f-8319-c690bf62450b    N/A            Yes              27 GiB  413 GiB         Online  Up              2.7.0.0-83e99545.4.12-1.el7.elrepo.x86_64       CentOS Linux 7 (Core)
Global Storage Pool
        Total Used      :  81 GiB
        Total Capacity  :  1.2 TiB
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-1
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-2
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD slow-diff-3
[screen is terminating]
[root@ip-70-0-69-50 px-fuse]# screen -RD start-test
[detached from 6700.start-test]
[root@ip-70-0-69-50 px-fuse]#

Lakshmi Narasimhan Sundararajan added 6 commits February 24, 2021 15:41
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
This reverts commit 033aeed.
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
@sulakshm sulakshm requested a review from prabirpaul February 24, 2021 12:50
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
@sangleganesh

@prabirpaul or @maxkozlovsky can you please review this?


@maxkozlovsky maxkozlovsky left a comment


synchronize_rcu() is necessary. Otherwise a thread can check the values of connected/!allow_disconnected, find them false, and proceed to queue the request. synchronize_rcu() makes sure that all operations that found disconnected false will finish before synchronize_rcu() returns to the caller. READ_ONCE() is not really necessary here as far as I can tell; even if the values change while being read, they can only move towards the disconnected state. READ_ONCE() is, however, necessary with an RCU-protected pointer value, to avoid operating on two different pointers.

The race seems to be fuse_end_queued_requests() being called while the allocation is in progress, so they end up operating on the same request. Just moving the check for disconnected before the allocation should be enough.

pxd.c Outdated
@@ -2097,6 +2101,7 @@ static int pxd_control_open(struct inode *inode, struct file *file)
file->private_data = fc;

pxdctx_set_connected(ctx, true);
wmb();


The memory barrier is redundant here, since a spin_unlock just happened, which already performs a memory barrier.

Contributor Author

@sulakshm sulakshm Feb 25, 2021


allow_disconnected is modified outside the lock, and this path is called only once. wmb() ensures the values get committed and are visible after this call.

pxd.c Outdated
@@ -2120,6 +2125,7 @@ static int pxd_control_release(struct inode *inode, struct file *file)
pxd_printk("%s: not opened\n", __func__);
} else {
ctx->fc.connected = 0;
wmb();


There is a barrier inside schedule_delayed_work, so this is redundant.

Contributor Author


wmb(), even if redundant, assures visibility of the modified value.

pxd.c Outdated
/* Let other threads see the value of allow_disconnected. */
synchronize_rcu();
wmb();


wmb() by itself does not do much.

synchronize_rcu() has a stronger memory-barrier guarantee: by the time synchronize_rcu() returns, all threads, including this one, have executed a memory barrier and will see the updated value of allow_disconnected.

Contributor Author


It did not happen. Earlier the usage was synchronize_rcu() paired with rcu_read_[lock|unlock](), and it did not ensure that updated values were reflected in the IO enqueue path; that was the reason new requests got enqueued while abort handling was running. Also, accessing the variables without the READ_ONCE/WRITE_ONCE macros allows those accesses to be reordered during execution.

In any case, with the posted changes the original issue of the invalid index and assertion is not seen anymore.

@sulakshm
Contributor Author

The race seems to be fuse_end_queued_requests() being called while the allocation is in progress, so they end up operating on the same request. Just moving the check for disconnected before the allocation should be enough.

How could this race stay open if the value of allow_disconnected is consistent? This is precisely what the wmb() addresses;
the rcu lock/synchronize semantics did not address it.

pxd.c Outdated
@@ -993,7 +994,8 @@ static void pxd_rq_fn(struct request_queue *q)
break;

/* Filter out block requests we don't understand. */
if (BLK_RQ_IS_PASSTHROUGH(rq)) {
if (BLK_RQ_IS_PASSTHROUGH(rq) ||
(!fc->allow_disconnected && !fc->connected)) {
Contributor Author


Early return based on the allow_disconnected and connected state. This ensures no allocation and immediate freeing of requests happens later.

pxd.c Outdated

if (BLK_RQ_IS_PASSTHROUGH(rq))
if (BLK_RQ_IS_PASSTHROUGH(rq) ||
(!fc->allow_disconnected && !fc->connected))
Contributor Author


Early return based on the allow_disconnected and connected state. This ensures no allocation and immediate freeing of requests happens later.

use READ/WRITE_ONCE macro to work with connected/allow_disconnected var

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
Contributor Author

@sulakshm sulakshm left a comment


The issue is not seen with the posted changes.

/*
* Ensures checking the value of allow_disconnected and adding request to
* queue is done atomically.
*/
rcu_read_lock();

if (force) {
Contributor Author


Only allow queueing a request to userspace if either connected or allow_disconnected is true, even for fastpath requests. So the force option has been removed.

{
req->in.unique = fuse_get_unique(fc);
fc->request_map[req->in.unique & (FUSE_MAX_REQUEST_IDS - 1)] = req;

Contributor Author


Modified queue_request to also assign the unique id. queue_request should only get called once the userspace enqueue conditions are met.

dev.c Outdated
}
rcu_read_unlock();
} else if (fc->connected || fc->allow_disconnected) {
if (READ_ONCE(fc->connected) || READ_ONCE(fc->allow_disconnected)) {
Contributor Author


Modified all accesses of connected and allow_disconnected to use the READ_ONCE/WRITE_ONCE macros, so the compiler cannot cache, tear, or fold these loads and stores.

@@ -222,24 +231,15 @@ static void request_end(struct fuse_conn *fc, struct fuse_req *req,
     if (shouldfree) fuse_request_free(req);
 }

-void fuse_request_send_nowait(struct fuse_conn *fc, struct fuse_req *req, bool force)
+void fuse_request_send_nowait(struct fuse_conn *fc, struct fuse_req *req)
Contributor Author
Only allow queueing a request to userspace if either connected or allow_disconnected is true, even for fastpath requests; the force option is therefore removed.

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
@@ -141,7 +141,13 @@ static void fuse_put_unique(struct fuse_conn *fc, u64 uid)
 {
     struct fuse_per_cpu_ids *my_ids;
     int num_free;
-    int cpu = get_cpu();
+    int cpu;

Contributor Author
It is possible for a put to be called without a prior get of the unique ID; handle that case.

@maxkozlovsky maxkozlovsky left a comment
It's ok,
note though that READ_ONCE() does not actually inhibit reordering. READ_ONCE() is equivalent to

std::atomic_load(&val, std::memory_order_relaxed);

Any other loads and stores can be reordered around it. It does guarantee that the value will be loaded once and only once. If READ_ONCE() happens in a loop, it would be done once per loop iteration.

pxd.c (outdated)

-    if (BLK_RQ_IS_PASSTHROUGH(rq))
+    if (BLK_RQ_IS_PASSTHROUGH(rq) ||
+        (!READ_ONCE(fc->allow_disconnected) && !READ_ONCE(fc->connected)))


Is the duplicate check necessary?

Contributor Author

The fc->connected check can be removed; checking just allow_disconnected is sufficient.

Lakshmi Narasimhan Sundararajan added 3 commits February 25, 2021 23:50
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
@sulakshm
Contributor Author

> It's ok,
> note though that READ_ONCE() does not actually inhibit reordering. READ_ONCE() is equivalent to
>
> std::atomic_load(&val, std::memory_order_relaxed);
>
> Any other loads and stores can be reordered around it. It does guarantee that the value will be loaded once and only once. If READ_ONCE() happens in a loop, it would be done once per loop iteration.

Introducing READ_ONCE() tells the compiler that these variables have concurrent updates, so their values do not get cached before the point of access. This is necessary for the RCU critical section to work; without it, the compiler can cache or combine accesses to these variables as it sees fit.

@sulakshm sulakshm merged commit 07f1b44 into v2.7.0 Feb 25, 2021
sulakshm added a commit that referenced this pull request Feb 26, 2021
* address race between restart and new req processing

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
sulakshm added a commit that referenced this pull request Feb 26, 2021
* address race between restart and new req processing

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
sulakshm added a commit that referenced this pull request Mar 12, 2021
* address race between restart and new req processing

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
prabirpaul pushed a commit that referenced this pull request Mar 12, 2021
* address race between restart and new req processing

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
sulakshm added a commit that referenced this pull request Mar 12, 2021
* address race between restart and new req processing

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
prabirpaul added a commit that referenced this pull request Mar 12, 2021
prabirpaul pushed a commit that referenced this pull request Mar 12, 2021
* address race between restart and new req processing

Signed-off-by: Lakshmi Narasimhan Sundararajan <lns@portworx.com>
prabirpaul added a commit that referenced this pull request Mar 12, 2021