Skip to content

Conversation

xdustinface
Copy link

This is inspired by this CI test failure https://gitlab.com/dashpay/dash/-/jobs/614306340.

p2p-instantsend.py failed with a Segmentation fault

https://gitlab.com/dashpay/dash/-/jobs/614306340#L3201

node5 2020-06-28 03:20:00.184954 (mocktime: 2014-12-04 17:17:21) Posix Signal: Segmentation fault
  0#: (0x56BC7A5A) stl_vector.h:466                  - std::vector<unsigned long long, std::allocator<unsigned long long> >::operator=(std::vector<unsigned long long, std::allocator<unsigned long long> >&&)
  1#: (0x56BC7A5A) stacktraces.cpp:779               - HandlePosixSignal
  2#: (0xF76DEBC0) <unknown-file>                    - ???
  3#: (0xF76B6B06) <unknown-file>                    - ???
  4#: (0x56AF007B) mutex:110                         - std::recursive_mutex::lock()
  5#: (0x56AF007B) sync.h:59                         - AnnotatedMixin<std::recursive_mutex>::lock()
  6#: (0x56AF007B) std_mutex.h:267                   - std::unique_lock<CCriticalSection>::lock()
  7#: (0x56AF007B) sync.h:128                        - CCriticalBlock::Enter(char const*, char const*, int)
  8#: (0x56AF007B) sync.h:149                        - CCriticalBlock::CCriticalBlock(CCriticalSection&, char const*, char const*, int, bool)
  9#: (0x56AF007B) net.h:294                         - ForEachNode<CConnman::CFullyConnectedOnly, llmq::CDKGSession::RelayInvToParticipants(const CInv&) const::<lambda(CNode*)>&>
 10#: (0x56AF007B) net.h:304                         - ForEachNode<llmq::CDKGSession::RelayInvToParticipants(const CInv&) const::<lambda(CNode*)> >
 11#: (0x56AF007B) quorums_dkgsession.cpp:1314       - llmq::CDKGSession::RelayInvToParticipants(CInv const&) const
 12#: (0x56AF2B49) std_function.h:694                - function<llmq::CDKGSession::ReceiveMessage(const uint256&, const llmq::CDKGContribution&, bool&)::<lambda(llmq::CDKGDebugMemberStatus&)> >
 13#: (0x56AF2B49) quorums_dkgsession.cpp:288        - llmq::CDKGSession::ReceiveMessage(uint256 const&, llmq::CDKGContribution const&, bool&)
 14#: (0x56B42EA3) quorums_dkgsessionhandler.cpp:485 - bool llmq::ProcessPendingMessageBatch<llmq::CDKGContribution, 23>(llmq::CDKGSession&, llmq::CDKGPendingMessages&, unsigned int)
 15#: (0x56B30914) std_function.h:303                - _M_invoke
 16#: (0x56B2B1BD) quorums_dkgsessionhandler.cpp:204 - llmq::CDKGSessionHandler::WaitForNextPhase(llmq::QuorumPhase, llmq::QuorumPhase, uint256 const&, std::function<bool ()> const&)
 17#: (0x56B2C17E) atomic_base.h:396                 - std::__atomic_base<unsigned long long>::load(std::memory_order) const
 18#: (0x56B2C17E) util.h:154                        - LogAcceptCategory
 19#: (0x56B2C17E) quorums_dkgsessionhandler.cpp:322 - llmq::CDKGSessionHandler::HandlePhase(llmq::QuorumPhase, llmq::QuorumPhase, uint256 const&, double, std::function<void ()> const&, std::function<bool ()> const&)
 20#: (0x56B2E637) std_function.h:275                - std::_Function_base::~_Function_base()
 21#: (0x56B2E637) std_function.h:389                - std::function<void ()>::~function()
 22#: (0x56B2E637) quorums_dkgsessionhandler.cpp:546 - llmq::CDKGSessionHandler::HandleDKGRound()
 23#: (0x56B2EC58) quorums_dkgsessionhandler.cpp:586 - llmq::CDKGSessionHandler::PhaseHandlerThread()
 24#: (0x56B2F376) thread:186                        - _M_run
 25#: (0x571877CD) <unknown-file>                    - ???
 26#: (0xF76B43BD) <unknown-file>                    - ???
 27#: (0xF74A0E16) <unknown-file>                    - ???

My understanding of this issue:

phaseHandlerThread.join() is only called in the destructor of CDKGSessionHandler.

CDKGSessionHandler::~CDKGSessionHandler()
{
stopRequested = true;
if (phaseHandlerThread.joinable()) {
phaseHandlerThread.join();
}
}

This happens here

llmq::DestroyLLMQSystem();

which is after the networking threads have been stopped already which happens some lines above due to

g_connman.reset();

So it seems like if CDKGSessionHandler::PhaseHandlerThread has pending messages its possible that the code before the next ShutdownRequested() check ends up in CDKGSession::ReceiveMessage where it tries to relay a message even though the networking threads are stopped already. Unfortunately i couldn't figure the exact line where the segmentation fault happens but as trace ends with a LOCK after ForEachNode<llmq::CDKGSession::RelayInvToParticipants(const CInv&) const::<lambda(CNode*)> > it does probably end somewhere in here

g_connman->ForEachNode([&](CNode* pnode) {
bool relay = false;
if (pnode->qwatch) {
relay = true;
} else if (!pnode->verifiedProRegTxHash.IsNull() && relayMembers.count(pnode->verifiedProRegTxHash)) {
relay = true;
}
if (relay) {
pnode->PushInventory(inv);
}

If i would have to choose one it would be

pnode->PushInventory(inv);

If someone has an exact line and the explanation why please let me know!

No matter what line it is.. i would say this just happens because the dkg threads do run longer than the networking threads and this will be fixed with this PR due to an earlier join() for all dkg threads which happens here

llmq::StopLLMQSystem();
before stopping the network threads.

Questions i still have now

  1. Which line is it and why?
  2. What is the point of the thread pool (messageHandlerPool) which i removed from CDKGSessionManager in f3db650? For me this looked like a leftover or some unfinished stuff unless this is some crazy magic which has an impact which i can't see yet. So i removed it. If there is some reason to revoke this please tell me!

@xdustinface
Copy link
Author

Oh, while looking at the open PRs i just noticed this one here seems to replace #3296 as it solves the issue pointed out there.

Copy link

@UdjinM6 UdjinM6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. One suggestion, pls see c25a37c

@UdjinM6 UdjinM6 requested a review from codablock July 13, 2020 20:49
UdjinM6
UdjinM6 previously approved these changes Jul 13, 2020
Copy link

@UdjinM6 UdjinM6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly tested ACK, seems to be working with no issues. Would really like to hear @codablock's thoughts to make sure we don't throw the baby out with the bathwater here somehow :)

@UdjinM6 UdjinM6 added this to the 16 milestone Jul 13, 2020
@codablock
Copy link

Concept ACK. Just one review comment :)

Copy link
Member

@PastaPastaPasta PastaPastaPasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK, waiting for #3609 merged and rebase

@xdustinface xdustinface force-pushed the pr-llmq-dkg-threads branch from 1bb2c61 to ef66d06 Compare July 14, 2020 22:07
@xdustinface
Copy link
Author

Rebased!

llmqDb(_llmqDb),
blsWorker(_blsWorker)
{
for (const auto& qt : Params().GetConsensus().llmqs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why's this called qt? maybe quorum or quorumThread would be clearer (I'm not actually sure what exactly this variable represents)

Copy link
Author

@xdustinface xdustinface Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why its called like that i just moved it around a bit. But it represents a key/value pair from the Consensus::Params::llmqs map which contains LLMQType, LLMQParams pairs for the available LLMQs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we make it not auto so it's clearer what this is referring to. IMO we should only really be using auto where by the name of it and the surrounding code it's obvious. I guess it's obvious in the sense you can look at llmqs and find what that is...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess it's obvious in the sense you can look at llmqs and find what

Hm.. so for me that auto style isn't an issue at all because the declaration of llmqs is only a click away. But thats obviously a personal preference thing and at the end anyway kind of unrelated to this PR though because it's used like this all over the place :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's fine how it is. Just makes code review easier when it's obvious what variables mean

Copy link
Member

@PastaPastaPasta PastaPastaPasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

@PastaPastaPasta PastaPastaPasta requested a review from UdjinM6 July 15, 2020 17:38
Copy link

@UdjinM6 UdjinM6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

@UdjinM6 UdjinM6 merged commit d157718 into dashpay:develop Jul 16, 2020
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Aug 23, 2020
dashpay#3601)

* llqm: Fix thread handling in CDKGSessionManager and CDKGSessionHandler

* llmq: Removed unused thread_pool from CDKGSessionManager

* Tweak `CDKGSessionHandler::StartThread()`

* llmq: Simplify CDKGSessionHandler's thread naming

* llmq: Make sure CDKGSessionHandler uses a valid LLMQ type

Co-Authored-By: UdjinM6 <UdjinM6@users.noreply.github.com>

Co-authored-by: UdjinM6 <UdjinM6@users.noreply.github.com>
@PastaPastaPasta
Copy link
Member

backported in #3670

gades pushed a commit to cosanta/cosanta-core that referenced this pull request Mar 11, 2022
dashpay#3601)

* llqm: Fix thread handling in CDKGSessionManager and CDKGSessionHandler

* llmq: Removed unused thread_pool from CDKGSessionManager

* Tweak `CDKGSessionHandler::StartThread()`

* llmq: Simplify CDKGSessionHandler's thread naming

* llmq: Make sure CDKGSessionHandler uses a valid LLMQ type

Co-Authored-By: UdjinM6 <UdjinM6@users.noreply.github.com>

Co-authored-by: UdjinM6 <UdjinM6@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants