Skip to content

Conversation

@gtenev
Copy link
Contributor

@gtenev gtenev commented Oct 12, 2016

This is a fix to prevent destroying of the DLL<> structure and the following iteration over Http2ConnectionState::stream_list to fall into an infinite loop while holding a lock, which leads to cache updates to start failing.

@atsci
Copy link

atsci commented Oct 12, 2016

Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/876/ for details.

@atsci
Copy link

atsci commented Oct 12, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/984/ for details.

@zwoop zwoop added HTTP/2 Backport Marked for backport for an LTS patch release labels Oct 12, 2016
@zwoop zwoop added this to the 6.2.1 milestone Oct 12, 2016
@zwoop
Copy link
Contributor

zwoop commented Oct 12, 2016

Just a heads up, but this PR is against 6.2.x branch, the code has diverged, so will need one PR for 6.2.x and one for master. @gtenev Please make a PR for master at your earliest convenience as well, but sounds like they will be so different that we need two reviews.

@atsci
Copy link

atsci commented Oct 12, 2016

Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/878/ for details.

@atsci
Copy link

atsci commented Oct 12, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/986/ for details.

@zwoop
Copy link
Contributor

zwoop commented Oct 12, 2016

Try again [approve ci].

@atsci
Copy link

atsci commented Oct 12, 2016

Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/879/ for details.

@atsci
Copy link

atsci commented Oct 12, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/987/ for details.

will commit as part of TS-4935
@atsci
Copy link

atsci commented Oct 12, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/990/ for details.

@atsci
Copy link

atsci commented Oct 12, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/882/ for details.

@@ -0,0 +1 @@
creation_time = 1476307057
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, committed by mistake

@atsci
Copy link

atsci commented Oct 13, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1000/ for details.

@atsci
Copy link

atsci commented Oct 13, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/892/ for details.

if (deleted_from_active_streams(stream)) {
return;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added this check to 7.0 via the TS-4813 fix. Though as a direct "if (!stream_list.in(stream))" check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, needed that for calling delete_stream() on already deleted streams for my “catch-all-delete-stream” calls (6.2.1).

Also grouped some of the DLL<> + client_streams_count updates in separate functions.

Looking at how the master changed since I started working on this I may need to revert that to be consistent with master.

}

SCOPED_MUTEX_LOCK(lock, this->mutex, this_ethread());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to grab a lock here? Given the mutex sharing between vc and session and ConnectionState, we should already be holding this mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are sure DLL<> is always protected by a lock then I must have really misunderstood this previous comment on TS-4916 where we suspected DLL<> to be “manipulated by simultaneous threads”.

That would mean that at least in one thread was not holding the right lock. In that case would not that mean that “rearranging some of the stream_count book keeping” would rather hide the problem than to fix it?

Trying to help based on the comment decided to trace few paths in the source code that may not be holding ConnectionState lock (theoretically) and grabbed the lock on the common path closest to the structures that needed protection (which based on my understanding should not be a problem if the thread is already holding the lock).

Actually never noticed the race condition in my debugging so I am going to remove this line from the PR and I will consider limiting my future changes to the issue I am trying to fix.

{
if (stream->get_state() == HTTP2_STREAM_STATE_CLOSED) {
this->delete_stream(stream);
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were we leaking a stream without a delete call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I may have overdone it here.

I thought I would add this deletion because few lines below there is deleting if state == HTTP2_STREAM_STATE_CLOSED. send_data_frames() is called not only from do_io_close(), and calling delete_stream() on already deleted stream is OK (with this change).

static_cast<Http2ClientSession *>(parent)->connection_state.send_data_frames(this);

// Make sure the stream is deleted at this point since next step is self destroy.
this->delete_stream();
Copy link
Member

@shinrich shinrich Oct 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear to me that this delete_stream is necessary either. The parent is notified to release the stream via the "static_cast<Http2ClientSession *>(parent)->connection_state.release_stream(this);" in the Http2Stream handler. But that might be just 7.0. There are some fixes in this area that are marked for backport to 6.2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what actually made sure we delete the stream from the DLL<> before triggering destroy() (before leaving do_io_close()).

In the version 6.2.1 we have send_data_frames() delete the stream from DLL<> on HTTP2_STREAM_STATE_CLOSED.

Then in a later version we added HTTP2_STREAM_STATE_HALF_CLOSED_LOCAL.

What caused the destroying of DLL<> in our case was HTTP2_STREAM_STATE_HALF_CLOSED_REMOTE (which does not cause deletion in any version).

It seemed to me we have been always vulnerable to this problem despite the fixes. That is why I thought I would add this “catch-all-delete-stream” line here for all the current and future missing states. After this point we are going to destroy() the stream regardless of its state anyway.

@shinrich
Copy link
Member

Not clear we want to move this to 7.0. The main changes have already been applied (check to see if stream is still in list before deleting and appropriately signal that the stream should be removed from the Session/ConnState stream list). May want to look at the backport of TS-4507 to 6.2.

@gtenev
Copy link
Contributor Author

gtenev commented Oct 14, 2016

I was actually thinking to propose just adding a “catch-all-delete-stream” call in destroy() before THREAD_FREE() in 7.0.

It seems to me that the point of this Jira TS-4813 / this PR got somehow lost so I would like to reiterate.

Missing a stream deletion from DLL<> before calling THREAD_FREE() COULD lead to breaking the DLL<> and then if that happens DLL<> traversals ALWAYS ends up in infinite loop.

Host became dysfunctional pretty often because of this (time to live 1-3 days per host) and it took long time to debug/prove.

This is my (practical?) attempt to make sure we don’t get stuck in this debugging again regardless of past/current/future “delete stream” bugs (as long us we use THREAD_ALLOC_INIT/THREAD_FREE + DLL<> of course).

I wanted to propose a "catch-all-delete-stream" safety net in both 6.2.1 and 7.0.0 because I think it would be hard to guarantee that the next commit would not introduce this weakness again.

This is my first read of the H2 code so I can see how the proposed fix can be sub-optimal and will gladly change it, I also hope I provided enough reasoning for best results :).

I am pretty confident in my debugging data/conclusions but just by reading the code or by sorting through Jiras that “may help” I can only guess if the new 7.0 version or back-ports to 6.2.1 definitely fix the issue until I test/verify.

We could close this PR:

  • If we are confident all “fixes in this area” from 7.0 can be back-ported to 6.2.1 and that the issue will be fixed
  • AND If adding the proposed safety net does not make any/enough sense

Cheers!

@gtenev
Copy link
Contributor Author

gtenev commented Oct 18, 2016

Chatted with @shinrich offline and she is going to mark related fixes TS-4813 and TS-4507 for backport from 7.0.0 to 6.2.1, which should take care of the "missing to delete the stream cases".

Submitted a new PR against master #1117 to add the "safety net" delete stream call in destroy() before THREAD_FREE() and backport to 6.2.1 (changes are not 6.2.1 specific anymore).

@gtenev gtenev closed this Oct 18, 2016
@zwoop zwoop modified the milestone: 6.2.1 May 4, 2017
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 23, 2026
This updates our Catch2 version from v2.13.8 to the latest v3.9.1. It
also transitions us from the deprecated monolithic header file to the
new library model using CMake via FetchContent.

This has several advantages:

* This uses the library model which the authors of Catch2 advocate for.
  The monolithic header we used before this patch is not "the primarily
  supported option".
* Since not all of catch.hpp is built for every UT, compilation times
  are quicker.
* Since the new v3 version provides a main function definition via
  Catch2::Catch2WithMain, we no longer need the empty unit_test_main.cc
  files we had before.
* The updated Catch v3 is purported to execute its checks more
  efficiently (according to the changelog).

As a part of this transition, this removes
src/records/unit_tests/unit_test_main_on_eventsystem.cc since that was
added with apache#7704 for DynamicStats which has since been removed. In v2,
this simply didn't run any tests. In v3, this failed because no tests
were run.

(cherry picked from commit 36f9095)

Co-authored-by: Brian Neradt <brian.neradt@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport Marked for backport for an LTS patch release HTTP/2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants