Skip to content

Conversation

@spoonincode
Copy link
Contributor

@spoonincode spoonincode commented Apr 25, 2025

A state_history session runs two corountines: one to read from the websocket and one to write to the websocket. A failure on either corountine will cause it to complete as well as indicate to the other countinue that it should complete as well. Completion of either corountine ends up here,

void check_coros_done(std::exception_ptr e) {
//the only exception that should have bubbled out of the coros is a bad_alloc, bubble it up further. No need to bother
// with the rest of the cleanup: we'll be shutting down soon anyway due to bad_alloc
if(e)
std::rethrow_exception(e);
//coros always return on the session's strand
if(--coros_running == 0)
on_done(this);
}

Since the timing of those completions can be indeterminate, a simple counter is used to detect that both have completed. When that occurs session indicates externally that it is done and it's ready to be destroyed. "done" means the session is no longer using any of its members such as the stream, etc -- sounds safe to destroy right? This done indication winds up here,
[this](session_base* conn) {
app().executor().post(priority::high, exec_queue::read_write, [conn, this]() {
connections.erase(connections.find(conn));
});

Where a small piece of work is submitted to the main thread to remove the session from connections.

While this erasure is queued up on the main thread, the main thread may be doing other work that results in the on_accepted_block() signal firing. This will cause state_history to,

for(const std::unique_ptr<session_base>& c : connections)
c->block_applied(block->block_num());
}

Remember the erasure is still queued up on the main thread, so the "done" session is still in this list. block_applied() is very simple and calls awake_if_idle()
void block_applied(const chain::block_num_type applied_block_num) {
//indicates a fork being applied for already-sent blocks; rewind the cursor
if(applied_block_num < next_block_cursor)
next_block_cursor = applied_block_num;
awake_if_idle();
}

void awake_if_idle() {
boost::asio::dispatch(strand, [this]() {
wake_timer.cancel_one();
});
}

Critically, awake_if_idle() post()s to the "done" session's strand.

Now 1 of 3 things can happen based on timing:

  1. The wake_timer.cancel_one() posted to the "done" session's strand completes prior to the erasure on main thread. This is fine.
  2. The erasure occurs on the main thread prior to the strand starting wake_timer.cancel_one(). This is also fine -- the strand is destroyed and any pending work is dropped.
  3. The erasure occurs on the main thread simultaneously to the strand executing wake_timer.cancel_one() (or also during any phase where the strand's internal code is otherwise running). This will result in undefined behavior as the session and its members (including the strand) will be in various stages of destruction.

To resolve this problem, can drain any potential pending queued wake_timer.cancel_one() prior to destruction. Since we're blocking the main thread during draining, we can be sure no new calls to block_applied() occur.

The premise above points to the same line that #1419 fails on, so will mark this resolving it.

// allow main thread to drain the strand before destruction -- some awake_if_idle() post()s may be inflight
void drain_strand() {
boost::asio::post(strand, boost::asio::use_future([](){})).get();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be post_async_task() on 1.2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also,
It might look like we can just do this in the dtor and avoid a special call (which is certainly ugly), but we can't: during nodeos shutdown the thread pools will be stopped before destruction takes place. In that case, trying to drain the strand will never complete since the strand is stopped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the nice one-liner.

@spoonincode spoonincode linked an issue Apr 25, 2025 that may be closed by this pull request
// allow main thread to drain the strand before destruction -- some awake_if_idle() post()s may be inflight
void drain_strand() {
boost::asio::post(strand, boost::asio::use_future([](){})).get();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the nice one-liner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

State history race on disconnect

5 participants