-
Notifications
You must be signed in to change notification settings - Fork 851
TS-4916: Fix for an H2-infinite-loop deadlock #1100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/876/ for details. |
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/984/ for details. |
|
Just a heads up, but this PR is against 6.2.x branch, the code has diverged, so will need one PR for 6.2.x and one for master. @gtenev Please make a PR for master at your earliest convenience as well, but sounds like they will be so different that we need two reviews. |
|
Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/878/ for details. |
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/986/ for details. |
|
Try again [approve ci]. |
|
Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/879/ for details. |
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/987/ for details. |
will commit as part of TS-4935
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/990/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/882/ for details. |
iocore/aio/.diags.log.meta
Outdated
| @@ -0,0 +1 @@ | |||
| creation_time = 1476307057 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed, committed by mistake
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1000/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/892/ for details. |
| if (deleted_from_active_streams(stream)) { | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, added this check to 7.0 via the TS-4813 fix. Though as a direct "if (!stream_list.in(stream))" check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, needed that for calling delete_stream() on already deleted streams for my “catch-all-delete-stream” calls (6.2.1).
Also grouped some of the DLL<> + client_streams_count updates in separate functions.
Looking at how the master changed since I started working on this I may need to revert that to be consistent with master.
| } | ||
|
|
||
| SCOPED_MUTEX_LOCK(lock, this->mutex, this_ethread()); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to grab a lock here? Given the mutex sharing between vc and session and ConnectionState, we should already be holding this mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are sure DLL<> is always protected by a lock then I must have really misunderstood this previous comment on TS-4916 where we suspected DLL<> to be “manipulated by simultaneous threads”.
That would mean that at least in one thread was not holding the right lock. In that case would not that mean that “rearranging some of the stream_count book keeping” would rather hide the problem than to fix it?
Trying to help based on the comment decided to trace few paths in the source code that may not be holding ConnectionState lock (theoretically) and grabbed the lock on the common path closest to the structures that needed protection (which based on my understanding should not be a problem if the thread is already holding the lock).
Actually never noticed the race condition in my debugging so I am going to remove this line from the PR and I will consider limiting my future changes to the issue I am trying to fix.
| { | ||
| if (stream->get_state() == HTTP2_STREAM_STATE_CLOSED) { | ||
| this->delete_stream(stream); | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were we leaking a stream without a delete call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I may have overdone it here.
I thought I would add this deletion because few lines below there is deleting if state == HTTP2_STREAM_STATE_CLOSED. send_data_frames() is called not only from do_io_close(), and calling delete_stream() on already deleted stream is OK (with this change).
| static_cast<Http2ClientSession *>(parent)->connection_state.send_data_frames(this); | ||
|
|
||
| // Make sure the stream is deleted at this point since next step is self destroy. | ||
| this->delete_stream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't clear to me that this delete_stream is necessary either. The parent is notified to release the stream via the "static_cast<Http2ClientSession *>(parent)->connection_state.release_stream(this);" in the Http2Stream handler. But that might be just 7.0. There are some fixes in this area that are marked for backport to 6.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what actually made sure we delete the stream from the DLL<> before triggering destroy() (before leaving do_io_close()).
In the version 6.2.1 we have send_data_frames() delete the stream from DLL<> on HTTP2_STREAM_STATE_CLOSED.
Then in a later version we added HTTP2_STREAM_STATE_HALF_CLOSED_LOCAL.
What caused the destroying of DLL<> in our case was HTTP2_STREAM_STATE_HALF_CLOSED_REMOTE (which does not cause deletion in any version).
It seemed to me we have been always vulnerable to this problem despite the fixes. That is why I thought I would add this “catch-all-delete-stream” line here for all the current and future missing states. After this point we are going to destroy() the stream regardless of its state anyway.
|
Not clear we want to move this to 7.0. The main changes have already been applied (check to see if stream is still in list before deleting and appropriately signal that the stream should be removed from the Session/ConnState stream list). May want to look at the backport of TS-4507 to 6.2. |
|
I was actually thinking to propose just adding a “catch-all-delete-stream” call in It seems to me that the point of this Jira TS-4813 / this PR got somehow lost so I would like to reiterate. Missing a stream deletion from Host became dysfunctional pretty often because of this (time to live 1-3 days per host) and it took long time to debug/prove. This is my (practical?) attempt to make sure we don’t get stuck in this debugging again regardless of past/current/future “delete stream” bugs (as long us we use I wanted to propose a "catch-all-delete-stream" safety net in both 6.2.1 and 7.0.0 because I think it would be hard to guarantee that the next commit would not introduce this weakness again. This is my first read of the H2 code so I can see how the proposed fix can be sub-optimal and will gladly change it, I also hope I provided enough reasoning for best results :). I am pretty confident in my debugging data/conclusions but just by reading the code or by sorting through Jiras that “may help” I can only guess if the new 7.0 version or back-ports to 6.2.1 definitely fix the issue until I test/verify. We could close this PR:
Cheers! |
|
Chatted with @shinrich offline and she is going to mark related fixes TS-4813 and TS-4507 for backport from 7.0.0 to 6.2.1, which should take care of the "missing to delete the stream cases". Submitted a new PR against master #1117 to add the "safety net" delete stream call in |
This updates our Catch2 version from v2.13.8 to the latest v3.9.1. It also transitions us from the deprecated monolithic header file to the new library model using CMake via FetchContent. This has several advantages: * This uses the library model which the authors of Catch2 advocate for. The monolithic header we used before this patch is not "the primarily supported option". * Since not all of catch.hpp is built for every UT, compilation times are quicker. * Since the new v3 version provides a main function definition via Catch2::Catch2WithMain, we no longer need the empty unit_test_main.cc files we had before. * The updated Catch v3 is purported to execute its checks more efficiently (according to the changelog). As a part of this transition, this removes src/records/unit_tests/unit_test_main_on_eventsystem.cc since that was added with apache#7704 for DynamicStats which has since been removed. In v2, this simply didn't run any tests. In v3, this failed because no tests were run. (cherry picked from commit 36f9095) Co-authored-by: Brian Neradt <brian.neradt@gmail.com>
This is a fix to prevent destroying of the DLL<> structure and the following iteration over Http2ConnectionState::stream_list to fall into an infinite loop while holding a lock, which leads to cache updates to start failing.