Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: hds_integration_test failures with tsan #12184

Closed
jmarantz opened this issue Jul 20, 2020 · 19 comments
Closed

test: hds_integration_test failures with tsan #12184

jmarantz opened this issue Jul 20, 2020 · 19 comments

Comments

@jmarantz
Copy link
Contributor

jmarantz commented Jul 20, 2020

On master as of today, about 20% failure rate. They may have a common root cause but there are different classes of symptoms. E.g. here are 4 different ones: a short assert-failure and some very lengthy tsan reports that are in gist.

[2020-07-20 18:10:50.211][313][critical][assert] [./test/integration/fake_upstream.h:365] assert failure: parented_ || allow_unexpected_disconnects_. Details: An queued upstream connection was torn down without being associated with a fake connection. Either manage the connection via waitForRawConnection() or waitForHttpConnection(), or set_allow_unexpected_disconnects(true).
 See https://github.com/envoyproxy/envoy/blob/master/test/integration/README.md#unparented-upstream-connections

tsan failures:

@jmarantz
Copy link
Contributor Author

jmarantz commented Jul 20, 2020

Repro instructions:

bazel test --config=libc++ //test/integration:hds_integration_test  --runs_per_test=100

@mattklein123
Copy link
Member

From a very quick look at this test, it seems like there is a race condition where after calling cleanupHdsConnection Envoy can make a new HDS connection so we have to either handle spurious connections there or shutdown HDS somehow? @htuch @alyssawilk do you have any context here?

@mattklein123
Copy link
Member

Hmm nevermind I see hds_upstream_->set_allow_unexpected_disconnects(true);, will keep looking.

@jmarantz
Copy link
Contributor Author

FWIW I find this fails 98% of the time if I switch to simulated time, via:

diff --git a/test/integration/hds_integration_test.cc b/test/integration/hds_integration_test.cc
index f5fc80cdd5..7aa3b2b658 100644
--- a/test/integration/hds_integration_test.cc
+++ b/test/integration/hds_integration_test.cc
@@ -26,7 +26,8 @@ namespace Envoy {
 namespace {
 
 // TODO(jmarantz): switch this to simulated-time after debugging flakes.
-class HdsIntegrationTest : public Grpc::VersionedGrpcClientIntegrationParamTest,
+class HdsIntegrationTest : public Event::TestUsingSimulatedTime,
+                           public Grpc::VersionedGrpcClientIntegrationParamTest,
                            public HttpIntegrationTest {
 public:
   HdsIntegrationTest() : HttpIntegrationTest(Http::CodecClient::Type::HTTP1, ipVersion()) {}

I think this might be because this test uses gRPC which does not (afaik) have injectable time. But maybe this change offers a way to get more consistent failures, which might be easier to repro and debug.

@mattklein123
Copy link
Member

Yeah I'm having trouble reproing. I will try again tomorrow.

@mattklein123
Copy link
Member

In looking at the traces, I think the problem has to be:

server_health_check_specifier_.mutable_cluster_health_checks(0)
->mutable_health_checks(0)
->mutable_timeout()
->set_seconds(0);
server_health_check_specifier_.mutable_cluster_health_checks(0)
->mutable_health_checks(0)
->mutable_timeout()
->set_nanos(100000000); // 0.1 seconds

Envoy can close the connection while the connection is being popped off of the connection queue. There must be some race condition in the framework here. I can look more tomorrow unless @alyssawilk has any quick ideas. It's possible that recent event loop changes from @antoniovicente have made this race more likely though IIRC the events in question here aren't getting delayed to the next event loop iteration but I don't remember the details.

@mattklein123
Copy link
Member

(Most relevant tsan trace: https://gist.github.com/jmarantz/edc6a4349344c7620f98b00180e69ed7)

@antoniovicente
Copy link
Contributor

I think that the Envoy::Event::FileEventImpl::assignEvents in the call stack indicates that we are dealing with real events, not fake events.

I think the sequence here is something along the lines of:

  1. proxy creates upstream connection
  2. main thread calls waitForHttpConnection to capture the fake http upstream connection
  3. proxy times out the connection and sends a fin
  4. fake upstream detects the FIN and closes the fake connection

There are no synchronization primitives protecting the fake http upstream and its IO handle. TSAN detects a cross thread read/write interaction between steps 2 and 4 above.

The cross thread interactions in FakeUpstream are complex. We end up doing a fair number of operations in the main test thread while other operations happen in the backend thread. IIRC mutex use in FakeUpstream is very minimal.

@mattklein123
Copy link
Member

Yeah agreed this is complicated. From looking at the TSAN trace it actually seems like to fix this we would need to block the fake upstream event loop from running or we have to do more operations on the fake upstream thread. I don't think there is an easy fix here. I will think about it more tomorrow.

@mattklein123
Copy link
Member

(The best fix would be to switch the timeout tests to simulated time @jmarantz :) )

@jmarantz
Copy link
Contributor Author

jmarantz commented Jul 21, 2020

I think most of the tests I saw fail in real time involved direct calls to do timed waits on condvars from grpc, which lacks injectable time, afaict.

@alyssawilk
Copy link
Contributor

FWIW many tsan failures are just because an assertion fails, and the test doesn't handle things if the unexpected thing occurs.
Looks like something is non-deterministic about the healthcheckEndpoints waitForHttpConnection on 79, but I'm not sure what.

Josh since you can repro, can you dump trace logs of a failure or two?

@mattklein123
Copy link
Member

@alyssawilk see #12184 (comment) and #12184 (comment). It's a race condition around connection timeout/close. I think I have an idea on how to fix. I will try to do it today.

mattklein123 added a commit that referenced this issue Jul 21, 2020
Part of #12184

Signed-off-by: Matt Klein <mklein@lyft.com>
@htuch
Copy link
Member

htuch commented Jul 21, 2020

CC @drewsortega

mattklein123 added a commit that referenced this issue Jul 21, 2020
Part of #12184

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Member

@drewsortega I think you are being volunteered to work on this further. :)

I'm able to repro more issues under TSAN that I haven't had a chance to fix yet:

TEST_P(HdsIntegrationTest, TestUpdateMessage) {

This one is getting stuck in the wait loop in the test. I haven't debugged it.

TEST_P(HdsIntegrationTest, TestUpdateChangesTimer) {

This test is fundamentally broken. It relies on real time guarantees which are not satisfied in CI and under the sanitizers. We either need to switch to SimTime, drop the test, or test it some other way.

I also suspect

TEST_P(HdsIntegrationTest, TestDefaultTimer) {
is broken in a similar way but I haven't seen it flake.

cc @htuch

@htuch
Copy link
Member

htuch commented Jul 22, 2020

FWIW, I think we discussed in at least one PR the desire for simulated time, there was some 10x slowdown that prevented this from being used at the time, see #5977.

@drewsortega
Copy link
Contributor

Sure, I can work on this one further. :)
I'm touching a lot of hds code right now, so it seems reasonable for me to work on this alongside.

KBaichoo pushed a commit to KBaichoo/envoy that referenced this issue Jul 30, 2020
Part of envoyproxy#12184

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
scheler pushed a commit to scheler/envoy that referenced this issue Aug 4, 2020
Part of envoyproxy#12184

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: scheler <santosh.cheler@appdynamics.com>
@xdzhai
Copy link
Contributor

xdzhai commented Aug 19, 2020

Drew tries many times, but was not able to reproduce the flakiness even if run 1000 times with tsan.
I also had a try and seems not reproducible. But I was able to reproduce it with the code at 2020-07-27, the log shows failure in waitForGrpcMessage().

There was a commit on 8/13: testing: fix multiple race conditions in simulated time tests (#12527) which improved the code of fake_upstream and probably has fixed the flakiness.

@mattklein123
Copy link
Member

FWIW I've seen failures on main branch since the above fix, on ARM release, which tends to run very fast. I will take a look and see if I can repro.

antoniovicente pushed a commit to antoniovicente/envoy that referenced this issue Sep 30, 2020
Part of envoyproxy#12184

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>
lambdai pushed a commit that referenced this issue Oct 16, 2020
* hds: fix integration test flakes (#12214)

Part of #12184

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>

* Switch to a tsan-instrumented libc++ for tsan tests (#12134)

This fixes #9784 and re-enables vhds_integration_test

Risk Level: Low, but will most likely increase memory usage

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>

Signed-off-by: Antonio Vicente <avd@google.com>

* test: shard hds_integration_test (#12482)

This should avoid TSAN timeout flakes.

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>

* test: shard http2_integration_test (#11939)

This should mitigate TSAN timeout.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Antonio Vicente <avd@google.com>

* test: fix http2_integration_test flake (#12450)

Fixes #12442

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>

* Kick CI

Signed-off-by: Antonio Vicente <avd@google.com>

Co-authored-by: Matt Klein <mklein@lyft.com>
Co-authored-by: Dmitri Dolguikh <ddolguik@redhat.com>
Co-authored-by: Lizan Zhou <lizan@tetrate.io>
istio-testing pushed a commit to istio/envoy that referenced this issue Jan 8, 2021
* docs: kick-off 1.15.1 release. (envoyproxy#12166)

Signed-off-by: Piotr Sikora <piotrsikora@google.com>

* tls: update BoringSSL-FIPS to 20190808. (envoyproxy#12170)

Signed-off-by: Piotr Sikora <piotrsikora@google.com>

* test: Exclude wasm_vm_test from CI by making it a "manual" test. (#207)

The wee v8 build times out in CI under --config=asan because the machine the job is scheduled on is too small.

Signed-off-by: Antonio Vicente <avd@google.com>

* [v1.15] http: header map security fixes for duplicate headers (#197) (#200)

Previously header matching did not match on all headers for
non-inline headers. This patch changes the default behavior to
always logically match on all headers. Multiple individual
headers will be logically concatenated with ',' similar to what
is done with inline headers. This makes the behavior effectively
consistent. This behavior can be temporary reverted by setting
the runtime value "envoy.reloadable_features.header_match_on_all_headers"
to "false".

Targeted fixes have been additionally performed on the following
extensions which make them consider all duplicate headers by default as
a comma concatenated list:
1) Any extension using CEL matching on headers.
2) The header to metadata filter.
3) The JWT filter.
4) The Lua filter.
Like primary header matching used in routing, RBAC, etc. this behavior
can be disabled by setting the runtime value
"envoy.reloadable_features.header_match_on_all_headers" to false.

Finally, the setCopy() header map API previously only set the first
header in the case of duplicate non-inline headers. setCopy() now
behaves similiarly to the other set*() APIs and replaces all found
headers with a single value. This may have had security implications
in the extauth filter which uses this API. This behavior can be disabled
by setting the runtime value
"envoy.reloadable_features.http_set_copy_replace_all_headers" to false.

Fixes https://github.com/envoyproxy/envoy-setec/issues/188

Signed-off-by: Matt Klein <mklein@lyft.com>

* backport to v1.15: Fix Kafka Repository Location (#223)

Update mirror used to fetch kafka dependency to a valid, working mirror.

Based on envoyproxy#13025
Resolves envoyproxy#13011

Signed-off-by: Antonio Vicente <avd@google.com>

* release: cutting 1.15.1 (#217)

Signed-off-by: Antonio Vicente <avd@google.com>

* docs: Fix release notes for v1.15.1 release. (envoyproxy#13318)

Signed-off-by: Antonio Vicente <avd@google.com>

* Backport flaky test and tsan fixes to releases/v1.15 branch (envoyproxy#13337)

* hds: fix integration test flakes (envoyproxy#12214)

Part of envoyproxy#12184

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>

* Switch to a tsan-instrumented libc++ for tsan tests (envoyproxy#12134)

This fixes envoyproxy#9784 and re-enables vhds_integration_test

Risk Level: Low, but will most likely increase memory usage

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>

Signed-off-by: Antonio Vicente <avd@google.com>

* test: shard hds_integration_test (envoyproxy#12482)

This should avoid TSAN timeout flakes.

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>

* test: shard http2_integration_test (envoyproxy#11939)

This should mitigate TSAN timeout.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Antonio Vicente <avd@google.com>

* test: fix http2_integration_test flake (envoyproxy#12450)

Fixes envoyproxy#12442

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Antonio Vicente <avd@google.com>

* Kick CI

Signed-off-by: Antonio Vicente <avd@google.com>

Co-authored-by: Matt Klein <mklein@lyft.com>
Co-authored-by: Dmitri Dolguikh <ddolguik@redhat.com>
Co-authored-by: Lizan Zhou <lizan@tetrate.io>

* docs: kick off v1.15.3-dev (envoyproxy#13695)

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* 1.15: CI fixes backport (envoyproxy#13697)

Backport following commits to 1.15:
748b2ab (mac ci: try ignoring update failure (envoyproxy#13658), 2020-10-20)
f95f539 (ci: various improvements (envoyproxy#13660), 2020-10-20)
73d78f8 (ci: use multiple stage (envoyproxy#13557), 2020-10-15)
b7a4756 (ci: use azp for api and go-control-plane sync (envoyproxy#13550), 2020-10-14)
876a6bb (ci use azp to sync filter example (envoyproxy#13501), 2020-10-12)
a0f31ee (ci: use azp to generate docs (envoyproxy#13481), 2020-10-12)

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Co-authored-by: asraa <asraa@google.com>

* 1.15: fix CI script (envoyproxy#13724)

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* Prevent SEGFAULT when disabling listener (envoyproxy#13515) (envoyproxy#13903)

This prevents the stop_listening overload action from causing
segmentation faults that can occur if the action is enabled after the
listener has already shut down.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* proxy protocol: set downstreamRemoteAddress on StreamInfo (envoyproxy#14131) (envoyproxy#14169)

This fixes a regression which resulted in the downstreamRemoteAddress
on the StreamInfo for a connection not having the address supplied by
the proxy protocol filter, but instead having the address of the
directly connected peer.

This issue does not affect HTTP filters.

Fixes envoyproxy#14087

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* ci: temproray disable vhds_integration_test in TSAN (envoyproxy#12067) (envoyproxy#14217)

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* tcmalloc changed and the data coming out of tcmalloc::MallocExtension::GetNumericProperty("generic.current_allocated_bytes") (envoyproxy#14165)

Commit Message: tcmalloc changed and the data coming out of tcmalloc::MallocExtension::GetNumericProperty("generic.current_allocated_bytes") no longer appears to be deterministic, even in unthreaded tests. So disable exact mem checks till we sort that out
Additional Description:
Risk Level: low
Testing: just thread_local_store_test
Docs Changes: n/a
Release Notes: n/a

no longer appears to be deterministic, even in unthreaded tests. So disable exact mem checks till we sort that out

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

Co-authored-by: Joshua Marantz <jmarantz@google.com>

* backport to v1.15: connection: Remember transport socket read resumption requests and replay them when re-enabling read. (envoyproxy#13772) (envoyproxy#14173)

* connection: Remember transport socket read resumption requests and replay them when re-enabling read. (envoyproxy#13772)

Fixes SslSocket read resumption after readDisable when processing the SSL record that contains the last bytes of the HTTP message

Signed-off-by: Antonio Vicente <avd@google.com>

* backport to 1.15: udp: properly handle truncated/dropped datagrams (envoyproxy#14122) (envoyproxy#14166)

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
Co-authored-by: Matt Klein <mklein@lyft.com>
Co-authored-by: Christoph Pakulski <christoph@tetrate.io>

* backport to 1.15: vrp: allow supervisord to open its log file (envoyproxy#14066) (envoyproxy#14280)

Commit Message: Allow supervisord to open its log file
Additional Description:
Change the default location of the log file and give supervisord
permissions to write to it.

Risk Level: low
Testing: built image locally
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* rel 1.15: close release 1.15.3 (envoyproxy#14303)

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* Kick off rel 1.15.4. (envoyproxy#14323)

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* backport to 1.15: http: fix datadog and squash handling of responses without body (envoyproxy#13328) (envoyproxy#14458)

Commit Message: Fixing bugs in datadog and sqaush where unexpected bodyless responses would crash Envoy
Risk Level: low
Testing: new unit tests, updated certs
Docs Changes: n/a
Release Notes: inline
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
Co-authored-by: alyssawilk <alyssar@chromium.org>

* backport 1.15: http: fixing a bug with IPv6 hosts (envoyproxy#14273)

Fixing a bug where HTTP parser offsets for IPv6 hosts did not include [] and Envoy assumed it did.
This results in mis-parsing addresses for IPv6 CONNECT requests and IPv6 hosts in fully URLs over HTTP/1.1

Risk Level: low
Testing: new unit, integration tests
Docs Changes: n/a
Release Notes: inline

Signed-off-by: Shikugawa <rei@tetrate.io>
Co-authored-by: alyssawilk <alyssar@chromium.org>

* backport to 1.15: tls: fix detection of the upstream connection close event. (envoyproxy#13858) (envoyproxy#14568)

Fixes envoyproxy#13856.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

Co-authored-by: Piotr Sikora <piotrsikora@google.com>
Co-authored-by: antonio <avd@google.com>
Co-authored-by: Matt Klein <mklein@lyft.com>
Co-authored-by: Dmitri Dolguikh <ddolguik@redhat.com>
Co-authored-by: Lizan Zhou <lizan@tetrate.io>
Co-authored-by: Christoph Pakulski <christoph@tetrate.io>
Co-authored-by: asraa <asraa@google.com>
Co-authored-by: Joshua Marantz <jmarantz@google.com>
Co-authored-by: Rei Shimizu <Shikugawa@gmail.com>
Co-authored-by: alyssawilk <alyssar@chromium.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants