Description
Describe the bug (required)
Extra partition replicas are found after performing balance operations. We have a space with 512 partitions * 3 replicas = 1536 partition replicas in total:
(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0 | "No valid partition" | "ttos_3p3r:206" |
| "store2" | 9779 | "ONLINE" | 24 | "ttos_3p3r:24" | "ttos_3p3r:340" |
| "store3" | 9779 | "ONLINE" | 303 | "ttos_3p3r:303" | "ttos_3p3r:319" |
| "store4" | 9779 | "ONLINE" | 99 | "ttos_3p3r:99" | "ttos_3p3r:329" |
| "store5" | 9779 | "ONLINE" | 86 | "ttos_3p3r:86" | "ttos_3p3r:342" |
| "Total" | | | 512 | "ttos_3p3r:512" | "ttos_3p3r:1536" |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 7345/7948 us)
Thu, 06 Jan 2022 15:33:25 CST
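For reference, this layout corresponds to a space created with something like the following statement (a sketch; the vid_type is an assumption, since only partition_num and replica_factor are stated above):

CREATE SPACE ttos_3p3r (partition_num = 512, replica_factor = 3, vid_type = FIXED_STRING(32));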
After some balance operations we found that the total of 1536 partition replicas grew to 1537 and then to 1538:
(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0 | "No valid partition" | "ttos_3p3r:175" |
| "store2" | 9779 | "ONLINE" | 81 | "ttos_3p3r:81" | "ttos_3p3r:342" |
| "store3" | 9779 | "ONLINE" | 31 | "ttos_3p3r:31" | "ttos_3p3r:330" |
| "store4" | 9779 | "ONLINE" | 307 | "ttos_3p3r:307" | "ttos_3p3r:336" |
| "store5" | 9779 | "ONLINE" | 93 | "ttos_3p3r:93" | "ttos_3p3r:354" |
| "Total" | | | 512 | "ttos_3p3r:512" | "ttos_3p3r:1537" |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 6975/7626 us)
Thu, 06 Jan 2022 16:24:45 CST
(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0 | "No valid partition" | "ttos_3p3r:153" |
| "store2" | 9779 | "ONLINE" | 114 | "ttos_3p3r:114" | "ttos_3p3r:349" |
| "store3" | 9779 | "ONLINE" | 3 | "ttos_3p3r:3" | "ttos_3p3r:335" |
| "store4" | 9779 | "ONLINE" | 275 | "ttos_3p3r:275" | "ttos_3p3r:342" |
| "store5" | 9779 | "ONLINE" | 120 | "ttos_3p3r:120" | "ttos_3p3r:359" |
| "Total" | | | 512 | "ttos_3p3r:512" | "ttos_3p3r:1538" |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 7809/8397 us)
Thu, 06 Jan 2022 16:39:33 CST
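In principle the duplicated partition can be located with SHOW PARTS, which lists the peers of every partition in the current space; a part reporting four peers would be the one holding the extra replica:

SHOW PARTS;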
Later we also found that two storaged instances had gone offline:
(root@nebula) [ttos_3p3r]> show hosts;
+----------+------+-----------+--------------+----------------------+------------------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution |
+----------+------+-----------+--------------+----------------------+------------------------+
| "store1" | 9779 | "OFFLINE" | 0 | "No valid partition" | "ttos_3p3r:120" |
| "store2" | 9779 | "ONLINE" | 24 | "ttos_3p3r:24" | "ttos_3p3r:361" |
| "store3" | 9779 | "ONLINE" | 321 | "ttos_3p3r:321" | "ttos_3p3r:347" |
| "store4" | 9779 | "OFFLINE" | 0 | "No valid partition" | "ttos_3p3r:354" |
| "store5" | 9779 | "ONLINE" | 128 | "ttos_3p3r:128" | "ttos_3p3r:356" |
| "Total" | | | 473 | "ttos_3p3r:473" | "ttos_3p3r:1538" |
+----------+------+-----------+--------------+----------------------+------------------------+
Got 6 rows (time spent 11798/12566 us)
Thu, 06 Jan 2022 16:58:36 CST
We found that the two offline storaged processes were still running and burning very high CPU. store1 CPU usage:
top - 11:31:08 up 8 days, 23:16, 1 user, load average: 112.18, 108.17, 107.48
Tasks: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 58.9 us, 15.0 sy, 0.0 ni, 23.9 id, 0.0 wa, 0.0 hi, 2.2 si, 0.0 st
MiB Mem : 257583.7 total, 88205.3 free, 76023.6 used, 93354.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 179741.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14 root 20 0 9514504 109448 36152 R 3801 0.0 48245:06 nebula-storaged
1 root 20 0 1104 4 0 S 0.0 0.0 0:01.07 docker-init
7 root 20 0 3976 2960 2716 S 0.0 0.0 0:00.00 run-storage.sh
12 root 20 0 12176 4328 3408 S 0.0 0.0 0:00.00 sshd
15 root 20 0 2508 520 460 S 0.0 0.0 0:00.00 sleep
5602 root 20 0 13580 8868 7452 S 0.0 0.0 0:00.01 sshd
5613 root 20 0 4240 3444 2976 S 0.0 0.0 0:00.00 bash
5618 root 20 0 6180 3296 2752 R 0.0 0.0 0:00.00 top
store2 CPU usage:
top - 11:33:10 up 8 days, 23:18, 1 user, load average: 108.27, 107.95, 107.48
Tasks: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 61.4 us, 16.6 sy, 0.0 ni, 19.5 id, 0.0 wa, 0.0 hi, 2.5 si, 0.0 st
MiB Mem : 257583.7 total, 88130.5 free, 76069.1 used, 93384.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 179696.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14 root 20 0 9795.5m 596444 36496 R 3586 0.2 39474:19 nebula-storaged
1 root 20 0 1104 4 0 S 0.0 0.0 0:01.08 docker-init
7 root 20 0 3976 2996 2748 S 0.0 0.0 0:00.00 run-storage.sh
12 root 20 0 12176 4184 3264 S 0.0 0.0 0:00.00 sshd
15 root 20 0 2508 588 524 S 0.0 0.0 0:00.00 sleep
6604 root 20 0 13584 8892 7472 S 0.0 0.0 0:00.02 sshd
6615 root 20 0 4240 3412 2952 S 0.0 0.0 0:00.00 bash
6620 root 20 0 6180 3280 2732 R 0.0 0.0 0:00.00 top
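For anyone digging into this, the per-thread CPU usage and thread stacks can be inspected like this (PID 14 is the storaged PID from the top output above; new.txt is the dump grepped below):

top -H -p 14          # per-thread view of the busy storaged process
pstack 14 > new.txt   # dump all thread stacks for offline analysis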
pstack shows that many threads within storaged are trying to acquire a folly::RWSpinLock:
root@store4:~# cat new.txt | grep -i spinlock
#3 0x000000000353ae1b in folly::RWSpinLock::lock_shared()+74 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:212
#4 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 0x000000000353af9d in folly::RWSpinLock::try_lock_shared()+107 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:281
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
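Counting the contended frames in the dump gives a rough measure of how many threads are stuck on the lock:

grep -c 'RWSpinLock::lock_shared' new.txt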
We failed to re-enable the storaged log (we had disabled it at startup); stderr.txt shows:
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045600.14!Could not create logging file: Too many open files
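The "Too many open files" errors suggest the process has exhausted its file-descriptor limit, which can be checked against the actual descriptor count (again assuming PID 14 from above):

grep 'open files' /proc/14/limits   # the nofile limit for the process
ls /proc/14/fd | wc -l              # descriptors actually in use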
The storaged also failed to exit after receiving SIGTERM. Stack dump before issuing SIGTERM:
Stack dump after issuing SIGTERM:
Your Environments (required)
- OS: uname -a
- Compiler: g++ --version or clang++ --version
- CPU: lscpu
- Commit id: 1147e89
How To Reproduce (required)
Steps to reproduce the behavior:
- create a cluster with 3 storaged instances
- create a space with 512 partitions * 3 replicas on those 3 storages
- add two new storages, balance data across all 5 storages, and wait until the job finishes
- remove one storage with a balance-remove operation
We find that the storage cluster goes insane during the last step; the corresponding nGQL is sketched below.
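A sketch of the nGQL behind steps 3 and 4, assuming 2.x-style balance syntax (newer versions submit balance as jobs, so the exact statements may differ; the removed host here is hypothetical):

# step 3: after registering the two new storaged, spread parts across all five
BALANCE DATA;
SHOW JOBS;                        # wait until the balance job finishes
# step 4: shrink the cluster by one storage via balance
BALANCE DATA REMOVE store5:9779;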
Expected behavior
The total number of partition replicas should stay at 512 * 3 = 1536 throughout the balance, and no storaged instance should go offline or hang on SIGTERM.