Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: avoid reset after end_stream #14106

Merged
merged 8 commits into from
Dec 1, 2020

Conversation

snowp
Copy link
Contributor

@snowp snowp commented Nov 19, 2020

Signed-off-by: Snow Pettersen snowp@lyft.com

For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md

Risk Level: Medium
Testing: New UT
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Contributor Author

snowp commented Nov 19, 2020

@alyssawilk This attempts to fix the issue with double ends that were happening in one of the integration tests in #13095 that I mentioned to you over Slack a while ago. I wasn't able to figure out why we get a end_stream=true and a reset, but at least this prevents it from reaching the router. LMK what you think

@snowp
Copy link
Contributor Author

snowp commented Nov 23, 2020

Interestingly this seems to trigger an ASAN heap-use-after-free when I try to modify the upstream_request_ field, indicating that the TcpUpstream keeps getting onUpstreamData callbacks after it has been destroyed. Will try to figure out how this happens

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this (and for figuring out asan!)
One thought I think isn't covered by tests today but I'd rather fix in advance...

@@ -87,6 +87,11 @@ void TcpUpstream::resetStream() {

void TcpUpstream::onUpstreamData(Buffer::Instance& data, bool end_stream) {
upstream_request_->decodeData(data, end_stream);
// This ensures that if we get a reset after end_stream we won't propagate two
// "end streams" to the upstream_request_.
if (end_stream) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once we have upstream filters, will it be possible to get an upstream "end stream "before we send headers? I think it could be the case, so I think we're going to want to do null pointer checks for upstream_request in few other places, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I'll go over all the functions

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@@ -45,6 +46,11 @@ void TcpUpstream::encodeData(Buffer::Instance& data, bool end_stream) {

Envoy::Http::Status TcpUpstream::encodeHeaders(const Envoy::Http::RequestHeaderMap&,
bool end_stream) {
if (!upstream_request_) {
// TODO(snowp): Should this return something else in this case?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, seems like an error case to me.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
// Verifies that if we send headers after the upstream has been reset nothing crashes.
TEST_F(TcpUpstreamTest, HeadersAfterReset) {
tcp_upstream_->onEvent(Network::ConnectionEvent::LocalClose);
EXPECT_TRUE(tcp_upstream_->encodeHeaders(request_, false).ok());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this fail given the status error?

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Thanks for driving this one though!

@snowp snowp merged commit 1c06967 into envoyproxy:master Dec 1, 2020
@@ -480,8 +480,8 @@ void UpstreamRequest::clearRequestEncoder() {
// Before clearing the encoder, unsubscribe from callbacks.
if (upstream_) {
parent_.callbacks()->removeDownstreamWatermarkCallbacks(downstream_watermark_manager_);
parent_.callbacks()->dispatcher().deferredDelete(std::move(upstream_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive by: @snowp @alyssawilk did you check carefully that the destructor of the various upstreams isn't referencing anything that can go away? We have had subtle bugs in this area and in general the use of deferred delete can be a little precarious.

Also, question: is this change defensive or is there a real case that causes this? It seems like we shouldn't be calling reset at all after we receive end stream? Why doesn't end_stream cause upstream_ to get nulled out, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original motivation for this PR was to fix an issue I ran into in #13095 where a reset would follow an end_stream=true callback and both of these events would be propagated to the router/UpstreamRequest. This ran into an issue when resetting the upstream_request_ as part of the router, hence the deferred deletions. It does feel like there's something weird going on here, but this seemed to be the easiest fix to what was happening on the other PR. Happy to discuss more (and even revert if we think this is the wrong way to handle it).

The remaining changes here are defensive changes in anticipation of the upstream FM changes, at least I don't have a clear sense of when this would be a problem, maybe @alyssawilk does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: deferred delete, it's mainly just making sure the upstream can be deleted after the router part. Is there any reference in the destructor that might cause a timing issue? For example timers getting cancelled, etc. This can be somewhat subtle and needs to be audited.

re: the change itself, it does seem to me like this is not the correct fix, so we can at least open a tracking issue to try to figure out what state is not getting reset properly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through both the HTTP and TCP upstream neither of them has a dtor that does anything but a null check, and the only thing that would be destructed as part of it is the ConnectionData handle. The TcpConnectionData seems to indicate that it's resilient to deferred deletion ordering so I suspect this is fine:

    ~TcpConnectionData() override {
      // Generally it is the case that TcpConnectionData will be destroyed before the
      // ActiveTcpClient. Because ordering on the deferred delete list is not guaranteed in the
      // case of a disconnect, make sure parent_ is valid before doing clean-up.
      if (parent_) {
        parent_->clearCallbacks();
      }
    }

I definitely had not done my due diligence here so, so thanks for pointing this out.

I'll file an issue and try to dig a bit deeper if I can find the time

mpuncel added a commit to mpuncel/envoy that referenced this pull request Dec 2, 2020
* master: (70 commits)
  upstream: avoid reset after end_stream in TCP HTTP upstream (envoyproxy#14106)
  bazelci: add fuzz coverage (envoyproxy#14179)
  dependencies: allowlist CVE-2020-8277 to prevent false positives. (envoyproxy#14228)
  cleanup: replace ad-hoc [0, 1] value types with UnitFloat (envoyproxy#14081)
  Update docs for skywalking tracer (envoyproxy#14210)
  Fix some errors in the switch statement when decode dubbo response (envoyproxy#14207)
  Windows: enable tests and envoy-static.exe pdb file (envoyproxy#13688)
  http: add Kill Request HTTP filter (envoyproxy#14170)
  dependencies: fix release_dates error behavior. (envoyproxy#14216)
  thrift filter: support skip decoding data after metadata in the thrift message (envoyproxy#13592)
  update cares (envoyproxy#14213)
  docs: clarify behavior of hedge_on_per_try_timeout (envoyproxy#12983)
  repokitteh: add support for randomized auto-assign. (envoyproxy#14185)
  [grpc] validate grpc config for illegal characters (envoyproxy#14129)
  server: Return nullopt when process_context is nullptr (envoyproxy#14181)
  [Windows] Fix thrift proxy tests (envoyproxy#13220)
  kafka: add missing unit tests (envoyproxy#14195)
  doc: mention gperftools explicitly in PPROF.md (envoyproxy#14199)
  Removed `--use-fake-symbol-table` option. (envoyproxy#14178)
  filter contract: clarification around local replies (envoyproxy#14193)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
snowp added a commit to snowp/envoy that referenced this pull request Dec 31, 2020
…nvoyproxy#14106)"

This reverts commit 1c06967.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
mattklein123 pushed a commit that referenced this pull request Jan 2, 2021
…14106)" (#14551)

This reverts commit 1c06967.

Signed-off-by: Snow Pettersen <snowp@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants