Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test: TestNewExporter_collectorConnectionDiesThenReconnects #1527

Closed
MrAlias opened this issue Feb 11, 2021 · 12 comments · Fixed by #1816 or #2329
Closed

Flaky test: TestNewExporter_collectorConnectionDiesThenReconnects #1527

MrAlias opened this issue Feb 11, 2021 · 12 comments · Fixed by #1816 or #2329
Assignees
Labels
bug Something isn't working
Milestone

Comments

@MrAlias
Copy link
Contributor

MrAlias commented Feb 11, 2021

Reporting to be failing about %5 of the time on Go 1.15 but not seen on 1.14.

Version tested: e50a1c8

% docker run -v $(pwd)/opentelemetry-go:/otsrc --rm -w /otsrc/exporters/otlp/otlpgrpc golang:1.15 go test -run=TestNewExporter_collectorConnectionDiesThenReconnects
go: downloading google.golang.org/grpc v1.35.0
go: downloading github.com/gogo/protobuf v1.3.2
punya@obsidian ~ % docker run -v $(pwd)/opentelemetry-go:/otsrc --rm -w /otsrc/exporters/otlp/otlpgrpc golang:1.15 go test -run=TestNewExporter_collectorConnectionDiesThenReconnects -count 100 -v
go: downloading google.golang.org/grpc v1.35.0
go: downloading github.com/stretchr/testify v1.7.0
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c
go: downloading github.com/pmezard/go-difflib v1.0.0
go: downloading github.com/davecgh/go-spew v1.1.0
go: downloading google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013
go: downloading github.com/golang/protobuf v1.4.2
go: downloading golang.org/x/net v0.0.0-20201021035429-f5854403a974
go: downloading golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f
go: downloading google.golang.org/protobuf v1.25.0
go: downloading golang.org/x/text v0.3.3
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
    otlp_integration_test.go:182: 
                Error Trace:    otlp_integration_test.go:182
                Error:          Received unexpected error:
                                rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:43939: connect: connection refused"
                Test:           TestNewExporter_collectorConnectionDiesThenReconnects
--- FAIL: TestNewExporter_collectorConnectionDiesThenReconnects (0.73s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
    otlp_integration_test.go:182: 
                Error Trace:    otlp_integration_test.go:182
                Error:          Received unexpected error:
                                rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:36879: connect: connection refused"
                Test:           TestNewExporter_collectorConnectionDiesThenReconnects
--- FAIL: TestNewExporter_collectorConnectionDiesThenReconnects (0.49s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
    otlp_integration_test.go:182: 
                Error Trace:    otlp_integration_test.go:182
                Error:          Received unexpected error:
                                rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:36283: connect: connection refused"
                Test:           TestNewExporter_collectorConnectionDiesThenReconnects
--- FAIL: TestNewExporter_collectorConnectionDiesThenReconnects (0.24s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
    otlp_integration_test.go:182: 
                Error Trace:    otlp_integration_test.go:182
                Error:          Received unexpected error:
                                rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:40819: connect: connection refused"
                Test:           TestNewExporter_collectorConnectionDiesThenReconnects
--- FAIL: TestNewExporter_collectorConnectionDiesThenReconnects (0.24s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
    otlp_integration_test.go:182: 
                Error Trace:    otlp_integration_test.go:182
                Error:          Received unexpected error:
                                rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:37631: connect: connection refused"
                Test:           TestNewExporter_collectorConnectionDiesThenReconnects
--- FAIL: TestNewExporter_collectorConnectionDiesThenReconnects (0.49s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.89s)
=== RUN   TestNewExporter_collectorConnectionDiesThenReconnects
--- PASS: TestNewExporter_collectorConnectionDiesThenReconnects (0.90s)
FAIL
exit status 1
FAIL    go.opentelemetry.io/otel/exporters/otlp/otlpgrpc        87.372s
@MrAlias MrAlias added the bug Something isn't working label Feb 11, 2021
@punya
Copy link
Member

punya commented Feb 11, 2021

Actually, I see it failing 5% of the time on 1.14 as well. It's just that in one particular CI run, 1.15 failed and 1.14 passed. There are other CI runs where the opposite happens too.

@MrAlias
Copy link
Contributor Author

MrAlias commented Feb 12, 2021

Possible duplicate of #1524

@XSAM
Copy link
Member

XSAM commented Mar 13, 2021

I saw such errors happen in CI again 🙈. Based commit: 8b1be11

--- FAIL: TestNewExporter_collectorConnectionDiesThenReconnects (0.76s)
    otlp_integration_test.go:248: 
        	Error Trace:	otlp_integration_test.go:248
        	Error:      	Received unexpected error:
        	            	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:49312: connect: connection refused"
        	Test:       	TestNewExporter_collectorConnectionDiesThenReconnects
FAIL

@MrAlias
Copy link
Contributor Author

MrAlias commented Apr 14, 2021

The offending test waits for the reconnect loop to re-establish the connection but never yields being the scheduled goroutine. Adding a runtime.Gosched call appears to help this greatly. Planning to submit a change with this.

@MrAlias
Copy link
Contributor Author

MrAlias commented Apr 14, 2021

There seems to be a correlation with the testing system. In the past month this error has occurred with the following frequency:

OS Go version # of failures
Mac 1.14 14
Mac 1.15 16
Ubuntu 1.14 4
Ubuntu 1.15 1

@MrAlias
Copy link
Contributor Author

MrAlias commented Apr 14, 2021

I think we can update the mock collector to wrap the underlying TCP listener and signal when a new connection is actually made. This will allow the tests to deterministically wait for the connection of timeout at the CLI specified timeout.

@MrAlias

This comment has been minimized.

MrAlias added a commit to MrAlias/opentelemetry-go that referenced this issue Apr 15, 2021
MrAlias added a commit to MrAlias/opentelemetry-go that referenced this issue Apr 15, 2021
@MrAlias
Copy link
Contributor Author

MrAlias commented Apr 15, 2021

I seem to have resolved this issue in b17df01

I created #1815 as a test runner to validate this. Without this commit, the failure present regularly: https://github.com/open-telemetry/opentelemetry-go/runs/2356749118

With the commit, the tests have succeeded 200+ times without a failure: https://github.com/open-telemetry/opentelemetry-go/pull/1815/checks?check_run_id=2357037389

That commit needs to be cleaned up and I'll submit it as a PR.

@MrAlias
Copy link
Contributor Author

MrAlias commented Apr 21, 2021

It looks like this issue is not resolved. I'm still seeing failures on my arch linux system as well as the Ubuntu test runners (Go 1.14)

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 20, 2021

The grpc ClientConn "handles errors on established connections by re-resolving the name and reconnecting". Having this reconnection logic in the exporter seems duplicative and unnecessary. We should look into removing it and in the process remove this test.

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 20, 2021

The grpc ClientConn "handles errors on established connections by re-resolving the name and reconnecting". Having this reconnection logic in the exporter seems duplicative and unnecessary. We should look into removing it and in the process remove this test.

Initial manual testing of this approach seems to validate the connection itself handles retries and reconnection. It appears this is an approach worth pursuing.

@MrAlias
Copy link
Contributor Author

MrAlias commented Nov 4, 2021

PoC I plan to fix into a valid PR to remove this test #2329

@pellared pellared added this to the untracked milestone Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment