Skip to content

Raft deadlock #3147

Closed
Closed
@kikimo

Description

@kikimo

Describe the bug (must be provided)

Raft deadlock after stress test.

Your Environments (must be provided)

  • OS: uname -a
  • Compliler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id 650776b (v2.6.0)

How To Reproduce(must be provided)

Steps to reproduce the behavior:

  1. 3store + 1meta + 1graph
  2. keep triggering leader change and start 1024 concurrent clients sending add edge request
  3. run for about two minutes

stop the stresser, we see one storage instance offline:

(root@nebula) [(none)]> show hosts;
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| Host            | Port  | Status    | Leader count | Leader distribution  | Partition distribution |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 59007 | "ONLINE"  | 1            | "ttos_3p3r:1"        | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 37549 | "ONLINE"  | 0            | "No valid partition" | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 56039 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "Total"         |       |           | 1            | "ttos_3p3r:1"        | "ttos_3p3r:3"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
Got 4 rows (time spent 1249/11228 us)

Tue, 19 Oct 2021 16:23:00 CST

pstack shows that raft of that instance is blocked on raftLock_:

Thread 61 (Thread 0x7f4d317ff700 (LWP 3413758) "IOThreadPool11"):
#0  0x00007f4d531c354d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4d531bee9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f4d531bed68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000002a3863a in __gthread_mutex_lock (__mutex=0x7f4d384be150) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/x86_64-vesoft-linux/bits/gthr-default.h:748
#4  0x0000000002a3893c in std::mutex::lock (this=0x7f4d384be150) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_mutex.h:103
#5  0x0000000002a3e41b in std::lock_guard<std::mutex>::lock_guard (this=0x7f4d317fbbb8, __m=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_mutex.h:162
#6  0x0000000003d557a0 in nebula::raftex::RaftPart::processHeartbeatRequest (this=0x7f4d384be010, req=..., resp=...) at /root/src/wwl/nebula/src/kvstore/raftex/RaftPart.cpp:1656
#7  0x0000000003d955dc in nebula::raftex::RaftexService::async_eb_heartbeat (this=0x7f4d528323b0, callback=..., req=...) at /root/src/wwl/nebula/src/kvstore/raftex/RaftexService.cpp:220
#8  0x0000000003e2b971 in nebula::raftex::cpp2::RaftexServiceAsyncProcessor::process_heartbeat<apache::thrift::CompactProtocolReader, apache::thrift::CompactProtocolWriter> (this=0x7f4d0491bae0, req=..., serializedRequest=..., ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/src/interface/gen-cpp2/RaftexService.tcc:220
#9  0x0000000003e2742a in nebula::raftex::cpp2::RaftexServiceAsyncProcessor::setUpAndProcess_heartbeat<apache::thrift::CompactProtocolReader, apache::thrift::CompactProtocolWriter> (this=0x7f4d0491bae0, req=..., serializedRequest=..., ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/src/interface/gen-cpp2/RaftexService.tcc:198
#10 0x0000000003e294eb in apache::thrift::detail::ap::process_pmap<apache::thrift::CompactProtocolReader, nebula::raftex::cpp2::RaftexServiceAsyncProcessor> (proc=0x7f4d0491bae0, pmap=..., req=..., serializedRequest=..., ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/third-party/install/include/thrift/lib/cpp2/GeneratedCodeHelper.h:891
#11 0x0000000003e266f3 in apache::thrift::detail::ap::process<nebula::raftex::cpp2::RaftexServiceAsyncProcessor> (processor=0x7f4d0491bae0, req=..., serializedRequest=..., protType=apache::thrift::protocol::T_COMPACT_PROTOCOL, ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/third-party/install/include/thrift/lib/cpp2/GeneratedCodeHelper.h:919
#12 0x0000000003e2539e in nebula::raftex::cpp2::RaftexServiceAsyncProcessor::processSerializedRequest (this=0x7f4d0491bae0, req=..., serializedRequest=..., protType=apache::thrift::protocol::T_COMPACT_PROTOCOL, context=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/src/interface/gen-cpp2/RaftexService.cpp:102
#13 0x00000000042f8644 in apache::thrift::Cpp2Connection::requestReceived(std::unique_ptr<apache::thrift::HeaderServerChannel::HeaderRequest, std::default_delete<apache::thrift::HeaderServerChannel::HeaderRequest> >&&) ()
#14 0x00000000043042cc in apache::thrift::HeaderServerChannel::messageReceived(std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >&&, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> >&&) ()
#15 0x00000000042c54f6 in apache::thrift::Cpp2Channel::read(wangle::HandlerContext<int, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, apache::thrift::transport::THeader*> >*, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >) ()
#16 0x00000000042d2c73 in wangle::ContextImpl<apache::thrift::Cpp2Channel>::read(std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >) ()
#17 0x00000000042d2b65 in wangle::ContextImpl<apache::thrift::FramingHandler>::fireRead(std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >) ()
#18 0x00000000042d37e5 in apache::thrift::FramingHandler::read(wangle::HandlerContext<std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >, std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > >*, folly::IOBufQueue&) ()
#19 0x00000000042cb56d in non-virtual thunk to wangle::ContextImpl<apache::thrift::FramingHandler>::read(folly::IOBufQueue&) ()
#20 0x00000000042cb450 in wangle::ContextImpl<apache::thrift::TAsyncTransportHandler>::fireRead(folly::IOBufQueue&) ()
#21 0x00000000045db8ff in folly::AsyncSocket::handleRead() ()
#22 0x00000000045d2a00 in folly::AsyncSocket::ioReady(unsigned short) ()
#23 0x00000000046c284f in ?? ()
#24 0x00000000046c316f in event_base_loop ()
#25 0x00000000045ec7fe in folly::EventBase::loopBody(int, bool) ()
#26 0x00000000045ecca6 in folly::EventBase::loop() ()
#27 0x00000000045ee756 in folly::EventBase::loopForever() ()
#28 0x0000000004578389 in folly::IOThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()
#29 0x0000000004585f59 in void folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) ()
#30 0x0000000002a3e85c in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f4d33410390) at /root/src/wwl/nebula/build/third-party/install/include/folly/Function.h:400
#31 0x0000000002a4ea06 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0x7f4d33410390) at /root/src/wwl/nebula/build/third-party/install/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#32 0x0000000002a60400 in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:60
#33 0x0000000002a5994a in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__fn=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:95
#34 0x0000000002aa52c2 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f4d33410390) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:234
#35 0x0000000002aa47b7 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f4d33410390) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:243
#36 0x0000000002aa1a54 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f4d33410380) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:186
#37 0x0000000004b4043f in execute_native_thread_routine ()
#38 0x00007f4d531bcea5 in start_thread () from /lib64/libpthread.so.0
#39 0x00007f4d52ee59fd in clone () from /lib64/libc.so.6
Thread 60 (Thread 0x7f4d325ff700 (LWP 3413757) "IOThreadPool10"):

strace shows that thread 3413758(the thread raft task resides) is blocked on futex forever:

# strace -p 3413758
strace: Process 3413758 attached
futex(0x7f4d384be150, FUTEX_WAIT_PRIVATE, 2, NULL

full pstack output is provided below:

3413698.txt

Expected behavior

all cluster peers still running. healthly。

Additional context

Provide logs and configs, or any other context to trace the problem.

Metadata

Metadata

Labels

type/bugType: something is unexpected

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions