Remove hard exit hack #1280

Merged: 55 commits from hard-exit-travis into master on Jun 23, 2021
Conversation

@natebosch (Member) commented Jun 19, 2020

Closes #599

Fixes #1278

Copy the `_loopback` code from `package:multi_server_socket` but model
it as a `List<ServerSocket>` instead of a single server. Remove the
unnecessary arguments for handling anything other than port 0.
See what fails...
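As a rough illustration of that approach (a sketch only; `bindLoopback` and the exact fallback behavior are assumptions based on what `package:multi_server_socket` does, not the actual code in this PR):

```dart
import 'dart:io';

/// Bind an ephemeral port on IPv4 loopback, then try to bind the same port
/// on IPv6 loopback, returning a plain list of servers instead of a single
/// wrapper object.
Future<List<ServerSocket>> bindLoopback() async {
  final v4 = await ServerSocket.bind(InternetAddress.loopbackIPv4, 0);
  final servers = <ServerSocket>[v4];
  try {
    servers.add(
        await ServerSocket.bind(InternetAddress.loopbackIPv6, v4.port));
  } on SocketException {
    // IPv6 may not be available on this machine; IPv4 alone is fine.
  }
  return servers;
}
```

Modeling the result as a `List<ServerSocket>` means every server in the list has to be closed explicitly later, which is the recurring theme in the rest of this thread.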
@grouma (Member) commented Jun 19, 2020

Note that this hack fixes flakiness, so you may need to run Travis a few times.

@natebosch (Member, Author)
I'm wondering if the hack covered up the problem in #1278 (comment).

I'm also wondering if the fix could give us some clue about why we have needed this hack. The bug that I fixed with f355958 is an unhandled async error. But since that error never surfaces, I'm not sure where it's disappearing to, or whether the error is somehow preventing the VM from shutting down.

I'm not 100% sure that the unhandled async error I fixed there makes a real difference, though, or whether it's something else going on.

If there is no exception and we never close the servers, they can hold
the process open.

I'm not sure if this fully explains the flakiness on Travis...
@natebosch (Member, Author)
I wonder if there were multiple things that could trigger this.

It seems to me like the node socket servers not getting shut down should cause the VM to stay open, but the fact that this is the happy case and not the error case makes me wonder why it doesn't always cause a problem.

} catch (_) {
unawaited(server.close().catchError((_) {}));
rethrow;
}
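To make the concern concrete, here is a hedged sketch of that shape (the names are illustrative, not the runner's code): the `catch` closes the server on failure, but nothing closes it on the success path, so the still-open socket can keep the process alive.

```dart
import 'dart:async';
import 'dart:io';

Future<void> serveUntilDone(
    ServerSocket server, Future<void> Function(Socket) handle) async {
  try {
    await for (final socket in server) {
      await handle(socket);
    }
  } catch (_) {
    // Only the error path cleans up; `.ignore()` stands in for the quoted
    // `unawaited(...).catchError(...)` pattern.
    server.close().ignore();
    rethrow;
  }
  // The happy path falls through without closing `server`. A `finally`
  // block (or an explicit close here) would avoid the leak.
}
```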

We seem to be able to reliably trigger this behavior with a bug in the node platform which causes an unhandled async error.

That error would be effectively swallowed at

// However, users don't think of load tests as "tests", so the error isn't
// helpful for them.
if (liveTest.suite.isLoadSuite) return;

But I don't know why having that error surface there specifically would be a problem.

@natebosch (Member, Author)
Ugh, fixing the new node issue does not fix things overall; we can still hit flakes.

@natebosch (Member, Author)
Skipping the test pkgs/test/test/runner/hybrid_test.dart looks like it may resolve things. It's flaky, so it's hard to be sure, but I've had a few runs through Travis without seeing the problem surface.

There was one run with a failure that looks really similar: in that case the test runner subprocess did emit the correct output, but then never exited, so it was killed when the test timed out. This doesn't explain the outer timeouts that need to be killed by Travis.

https://travis-ci.org/github/dart-lang/test/jobs/700382066

If I run this test locally I never see a failure or a case where the test runner hangs; 25 successes in a row so far.

I might verify that restoring the test restores the flakiness, then see if I can narrow it down to any single test case.

After these hacks there is no `Invoker` retained in hello world test
cases.

- Don't pass through executable arguments. This isn't safe for things
  like `--enable-vm-service`.
- Eagerly resolve some static futures.
- Add an extra delay for paranoia, and print when observatory should be
  "settled".
- Don't use an AsyncMemo for closing the runner. Refactor away from
  async/await to old-style future chaining (sketched after this list). This
  prevents the `_awaiter` field on the Future from holding on to the entire
  closure context and preventing GC.
- Avoid reusing a static final future. Change to a getter which reads
  from a variable if it has already been resolved (also sketched below).
- Fix some bugs around double newlines in observatory message.
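Two of those items are easier to see in code. A rough sketch under assumed names (`Runner`, `config`, and `_loadConfig` are illustrative, not the test runner's actual API): explicit future chaining avoids the compiler-generated `_awaiter` state retaining the whole closure context, and a getter backed by a plain field avoids pinning a resolved value behind a `static final Future`.

```dart
import 'dart:async';

class Runner {
  final StreamController<void> _controller = StreamController<void>();
  Future<void>? _closeMemo;

  /// Old-style chaining instead of `async`/`await`, so no suspended async
  /// machinery (the `_awaiter` field mentioned above) retains this object
  /// and its closure context until the future completes.
  Future<void> close() => _closeMemo ??= _controller.close().then((_) {});
}

/// Instead of a `static final Future<String>`, keep the resolved value in a
/// plain variable and only fall back to the async load when it is unset.
String? _cachedConfig;
Future<String> get config async => _cachedConfig ??= await _loadConfig();

// Hypothetical loader standing in for whatever produces the real value.
Future<String> _loadConfig() async => 'config';
```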
Hack around bugs in the VM service protocol with uncaught errors.
@natebosch (Member, Author)
Hmm, I had hoped it was those ReceivePort instances, but it looks like maybe not. The same type of problem is still happening, so I wonder if there are multiple root causes.

18:29 +99 ~3 -1: test/runner/coverage_test.dart: with the --coverage flag, gathers coverage for Chrome tests [E]                                                                                       
  TimeoutException after 0:00:30.000000: Test timed out after 30 seconds. See https://pub.dev/packages/test#timeouts
  package:test_api/src/backend/invoker.dart 318:28  Invoker._handleError.<fn>
  dart:async                                        _CustomZone.run
  package:test_api/src/backend/invoker.dart 316:10  Invoker._handleError
  package:test_api/src/backend/invoker.dart 272:9   Invoker.heartbeat.<fn>.<fn>
  dart:async                                        _CustomZone.run
  package:test_api/src/backend/invoker.dart 271:38  Invoker.heartbeat.<fn>
  
Process `dart bin/test.dart` was killed with SIGKILL in a tear-down. Output:
    00:00 +0: compiling test.dart
    Compiled 9,441,162 characters Dart to 1,107,621 characters JavaScript in 8.02 seconds
    Dart file /tmp/dart_test_SXVRVD/runInBrowser.dart compiled to JavaScript: ../dart_test_HHBTDN/test_ZOUPUH/test.dart.browser_test.dart.js
    
    00:00 +0: test 1
    00:00 +1: All tests passed!
22:21 +130 ~3 -1: Some tests failed. 

@natebosch (Member, Author)
When I was last looking at this I had found some receive ports that needed to be closed, and confirmed that closing them resolves an existing issue.
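(For context on why that matters, a minimal sketch, with an invented `receiveOnce` helper rather than the runner's code: any `ReceivePort` that is never closed keeps the VM's event loop alive, which is exactly the kind of thing that makes the process hang at exit.)

```dart
import 'dart:async';
import 'dart:isolate';

/// Waits for a single reply and then closes the port. Without the
/// `port.close()` call, the open port keeps the VM alive even after all
/// other work has finished.
Future<Object?> receiveOnce(void Function(SendPort replyTo) start) {
  final port = ReceivePort();
  final completer = Completer<Object?>();
  port.listen((message) {
    completer.complete(message);
    port.close(); // release the port so the VM can shut down
  });
  start(port.sendPort);
  return completer.future;
}
```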

The other failure that was showing up even after the fix hasn't reproduced for me after 3 runs so far on GitHub Actions. I don't know if it was that migration, some other fix in the test runner since I last checked, or whether it just hasn't shown up as a flake yet.

I think it makes sense to merge this for now, and if flakes show up again we can restore the hack pending further investigation.

@natebosch natebosch marked this pull request as ready for review June 23, 2021 21:04
@natebosch natebosch merged commit d8a0918 into master Jun 23, 2021
@natebosch natebosch deleted the hard-exit-travis branch June 23, 2021 21:36
natebosch added a commit that referenced this pull request Mar 6, 2024
The `IsolateChannel` should close the receive port when we call
`outerChannel.sink.close`, which is also added to the `cleanupCallbacks`.

This was originally added in #1280 to try to fix cases where the test
runner could hang after running some tests, specifically on the test
runner CI when running the test as a subprocess. That PR went through
changes closing a number of ports speculatively, and this was one that
was likely not necessary.
natebosch added a commit that referenced this pull request Mar 7, 2024
The `IsolateChannel` should close the receive port when we call
`outerChannel.sink.close`, which is also added to the `cleanupCallbacks`.

This was originally added in #1280 to try to fix cases where the test
runner could hang after running some tests, specifically on the test
runner CI when running the test as a subprocess. It looks like when it
was introduced there was no call to `outerChannel.sink.close()`. A
cleanup callback to close the sink was added in #1941, which made this
one unnecessary.
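Roughly, the cleanup being described (a sketch assuming `package:stream_channel`'s `IsolateChannel`; the `cleanupCallbacks` parameter here is a stand-in for the runner's own hook): closing the channel's sink is what releases the underlying `ReceivePort`, so a separate explicit `receivePort.close()` callback is redundant.

```dart
import 'dart:isolate';

import 'package:stream_channel/isolate_channel.dart';

void wireUp(List<Future<void> Function()> cleanupCallbacks) {
  final receivePort = ReceivePort();
  final outerChannel = IsolateChannel<Object?>.connectReceive(receivePort);

  // Closing the sink tears down the channel, which also closes
  // `receivePort`, so no extra `receivePort.close()` callback is needed.
  cleanupCallbacks.add(() => outerChannel.sink.close());
}
```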
Successfully merging this pull request may close these issues.

pkg/test VM tests flaky timeout on Travis