fix StopAndCancel after ongoing Stop #376
Conversation
client_test.go
Outdated
	close(jobDoneChan)
	return nil
}))
setupStopAndCancel := func(t *testing.T) (*Client[pgx.Tx], *testBundle) {
If a test case ends up needing its own setup function, what do you think about denesting it to the top instead of wiring it in deeper and deeper? It might seem minor, but it feels cleaner, requires a lot less indentation, and makes the test case name easier to address because there are fewer segments to find.
We actually already had one for StopAndCancel, and it was essentially redundant with these new clauses. I extracted these and merged it all into a top-level block.
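To illustrate the denesting idea, here's a minimal sketch of what a top-level setup helper looks like. The types and names (`setupStopAndCancel`, `testBundle`) mirror the ones under review, but the bodies are simplified stand-ins, not River's real code:

```go
package main

// testingT is the small slice of *testing.T this sketch needs, so the helper
// can be exercised outside of `go test` as well.
type testingT interface {
	Helper()
}

// Simplified stand-ins for the real types in client_test.go.
type client struct{ stopped chan struct{} }
type testBundle struct{ jobDoneChan chan struct{} }

// setupStopAndCancel is hoisted to the top level alongside the other setup
// helpers, instead of being declared inside a nested t.Run closure. Each
// stop-related test case calls it directly, keeping subtests one level deep.
func setupStopAndCancel(t testingT) (*client, *testBundle) {
	t.Helper()
	return &client{stopped: make(chan struct{})},
		&testBundle{jobDoneChan: make(chan struct{})}
}
```

The payoff is mostly structural: shallower indentation and shorter, easier-to-target test case names.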
client_test.go
Outdated
select {
case <-client.Stopped():
	t.Fatal("expected client to not be stopped yet")
case <-time.After(500 * time.Millisecond):
Instead of my usual broad repartee about sleep statements in tests, I'll try to quantify it a little more by showing how slow River's test suite is getting.
Here's a test run of one of my largest packages at work, encompassing the entire API layer:
$ gotestsum ./server/api
✓ server/api (2.155s)
DONE 1445 tests in 6.285s
Here's River's top level package tests running:
$ gotestsum .
✓ . (12.517s)
=== Skipped
=== SKIP: . Test_Client_Maintenance/Reindexer (0.00s)
client_test.go:2532: Reindexer is disabled for further development
DONE 448 tests, 1 skipped in 13.731s
It's a little subjective, but I'd consider the package being tested at the top significantly more complex than River. It hits the DB on ~every test, has roughly 25x the database models (~100 compared to River's 4), and 4x the LOCs in the package itself (it'd be a >10x difference if you factored in all dependent packages as well):
$ find . -name '*.go' -maxdepth 1 | xargs wc -l
13180 total
$ find ./server/api -name '*.go' -maxdepth 1 | xargs wc -l
57358 total
At first glance, it has 3x the number of test cases and runs twice as fast, but that's not quite right -- it's a lot more code, so its compile phase is much longer. Looking at only test run time, it actually has 3x the number of test cases and runs 6x faster than River's test suite (2.155s vs 12.517s), with ~zero intermittency problems.
I'd have to do a little more legwork to prove it, but I think the main difference is the number of sleep statements that we've got everywhere. (With the test DB manager being the other major candidate.)
With this sleep statement, the test case requires a minimum of 500 ms to run. And latency aside, it also opens the door to intermittency problems -- is 500 ms a good number for both CI and local? If so, what about 250 ms? If that's okay, how about 125 ms? We're kind of just picking an arbitrary number and hoping it's long enough.
I find the "negative" sleeps (e.g. WaitOrTimeout) a little less objectionable because they should be ~instant unless there's a failure, but the "affirmative" sleeps that just bake in a fixed amount of time for every run should ideally be eradicated IMO.
Yeah, I'd like to avoid these as well. Do you have an idea of how else to test this bit reliably and quickly? Specifically, making sure the client is not yet stopped and where I need to call Stop() in another goroutine.
With this sleep statement, the test case requires a minimum of 500 ms to run. And latency aside, it also opens the door to intermittency problems -- is 500 ms a good number for both CI and local? If so, what about 250 ms? If that's okay, how about 125 ms? We're kind of just picking an arbitrary number and hoping it's long enough.
On this point, the intermittency is a little different with these cases where we're trying to make sure that something hasn't happened than the cases where we're trying to wait to ensure that something does happen. The reason is that any intermittency will at least surface itself as a flaky test that sometimes fails on something that should never happen, so that alone indicates something broken.
Of course it would be preferable to be able to test this condition in such a way that is ~instant and 100% deterministic—I'm just not sure how to make that work cleanly with the primitives we've exposed here. For example, how can we make sure that we're waiting until the Stop() has been initiated? Would you want to expand the surface area of BaseStartStop to expose this so it's available in tests? I think doing so would require some redesign of those internals given the way its mutex locks are currently structured.
For example, how can we make sure that we're waiting until the Stop() has been initiated? Would you want to expand the surface area of BaseStartStop to expose this so it's available in tests? I think doing so would require some redesign of those internals given the way its mutex locks are currently structured.
I'm definitely not against this. The start/stop should only present the most minimal viable interface publicly, but we can add a second interface that reveals a few more internals for test-related purposes. I can look into this separately.
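As a rough sketch of that second, test-only interface, one option is to signal initiation before the mutex is taken so tests can observe an in-progress stop. All names here (testStartStopper, StopInitiated) are hypothetical, not River's actual BaseStartStop internals:

```go
package main

import "sync"

// startStopper is the minimal public surface.
type startStopper interface {
	Stop()
}

// testStartStopper is a second, test-oriented interface that exposes when a
// stop has been initiated. (Hypothetical names, not River's actual API.)
type testStartStopper interface {
	startStopper
	StopInitiated() <-chan struct{}
}

type baseStartStop struct {
	mu            sync.Mutex
	once          sync.Once
	stopInitiated chan struct{}
}

func newBaseStartStop() *baseStartStop {
	return &baseStartStop{stopInitiated: make(chan struct{})}
}

// Stop signals initiation *before* taking the mutex, so tests can observe a
// stop that's still in progress without blocking on the lock.
func (b *baseStartStop) Stop() {
	b.once.Do(func() { close(b.stopInitiated) })
	b.mu.Lock()
	defer b.mu.Unlock()
	// ... actual shutdown work would happen here ...
}

func (b *baseStartStop) StopInitiated() <-chan struct{} { return b.stopInitiated }
```

A test could then select on StopInitiated() instead of sleeping, making the "has Stop begun yet?" check instant and deterministic.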
This regression was introduced in #272, which I noticed because our demo app is not currently cancelling active jobs gracefully before exiting. This commit fixes the behavior by always cancelling the work context upon `StopAndCancel()`, regardless of whether or not another shutdown is in progress.
@brandur while working on some stuff in the demo tonight, I realized that the hard stop was no longer working correctly if initiated after a soft Stop() was ongoing. I believe a regression or at least a change in behavior occurred with #272, because the shutdown logic in the demo app used to properly handle these shutdown scenarios.

Perhaps this isn't exactly a "bug", though it did use to work this way. In the past you didn't need to cancel the context for Stop() prior to calling StopAndCancel(), but now it appears that is required. This is because the start/stop service has a mutex which is held while Stop() is ongoing, and so a subsequent concurrent call to StopAndCancel() cannot proceed until that first call has been "released" by cancelling its context.

I didn't have time to get any further on this tonight but figured you might have thoughts.