
fix StopAndCancel after ongoing Stop #376

Merged · 2 commits merged into master from bg-broken-test-stop-and-cancel-after-stop on Jun 9, 2024

Conversation

@bgentry (Contributor) commented May 30, 2024

@brandur while working on some stuff in the demo tonight, I realized that the hard stop was no longer working correctly if initiated after a soft Stop() was ongoing. I believe a regression or at least a change in behavior occurred with #272, because the shutdown logic in the demo app used to properly handle these shutdown scenarios.

Perhaps this isn't exactly a "bug", though it did work this way in the past: you didn't need to cancel the context for Stop() prior to calling StopAndCancel(), but now it appears that is required. This is because the start/stop service holds a mutex while Stop() is ongoing, so a subsequent concurrent call to StopAndCancel() cannot proceed until that first call has been "released" by cancelling its context.

I didn't have time to get any further on this tonight but figured you might have thoughts.
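
For context, a minimal sketch of the two-phase shutdown pattern in question (the signal handling here is illustrative, not the demo app's exact code):

```go
package demo

import (
	"context"
	"log"
	"os"
	"os/signal"
	"time"
)

// handleShutdown sketches the two-phase shutdown: a soft Stop on the first
// signal, escalating to StopAndCancel on the second. The interface stands in
// for an already-started *river.Client.
func handleShutdown(client interface {
	Stop(context.Context) error
	StopAndCancel(context.Context) error
}) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, os.Interrupt)

	<-sigCh // first Ctrl-C: soft stop, let in-flight jobs finish
	softCtx, softCancel := context.WithCancel(context.Background())
	defer softCancel()
	go func() {
		if err := client.Stop(softCtx); err != nil {
			log.Printf("soft stop: %v", err)
		}
	}()

	<-sigCh // second Ctrl-C: hard stop, cancel running jobs
	hardCtx, hardCancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer hardCancel()
	if err := client.StopAndCancel(hardCtx); err != nil {
		log.Printf("hard stop: %v", err)
	}
}
```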

@bgentry requested a review from brandur May 30, 2024 03:49
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from d605319 to 23a3960 May 30, 2024 20:11
@bgentry changed the title from "add broken test case for StopAndCancel after ongoing Stop" to "fix StopAndCancel after ongoing Stop" May 30, 2024
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from 23a3960 to 30c3d0b May 30, 2024 20:12
@bgentry marked this pull request as ready for review May 30, 2024 20:12
@bgentry added the "bug" (Something isn't working) label May 30, 2024
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from 30c3d0b to be91b7e May 30, 2024 20:14
client_test.go Outdated
```go
close(jobDoneChan)
return nil
}))
setupStopAndCancel := func(t *testing.T) (*Client[pgx.Tx], *testBundle) {
```
Contributor

If a test case ends up needing its own setup function, what do you think about denesting it to the top level instead of wiring it in deeper and deeper? It might seem minor, but it feels cleaner, requires a lot less indentation, and makes the test case name easier to address because there are fewer segments to find.
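
Concretely, something like this (a hypothetical sketch; names invented for illustration):

```go
package river

import "testing"

// Before: a case-specific setup function wired into an already-nested subtest.
func Test_Client_Stop(t *testing.T) {
	t.Run("StopAndCancel", func(t *testing.T) {
		setupStopAndCancel := func(t *testing.T) { t.Helper() /* client setup elided */ }

		t.Run("AfterOngoingStop", func(t *testing.T) {
			setupStopAndCancel(t)
			// ... assertions ...
		})
	})
}

// After: denested to its own top-level test -- less indentation, and the
// case is addressable with one fewer segment in `go test -run`.
func Test_Client_StopAndCancel(t *testing.T) {
	setup := func(t *testing.T) { t.Helper() /* client setup elided */ }

	t.Run("AfterOngoingStop", func(t *testing.T) {
		setup(t)
		// ... assertions ...
	})
}
```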

Contributor Author

We actually already had one for StopAndCancel and it was essentially redundant with these new clauses. I extracted these and merged it all into a top level block.

client_test.go Outdated
```go
select {
case <-client.Stopped():
	t.Fatal("expected client to not be stopped yet")
case <-time.After(500 * time.Millisecond):
```
Contributor

Instead of my usual broad repartee about sleep statements in tests, I'll try to quantify it a little more by showing how slow River's test suite is getting.

Here's a test run of one of my largest packages at work, encompassing the entire API layer:

```
$ gotestsum ./server/api
✓  server/api (2.155s)

DONE 1445 tests in 6.285s
```

Here's River's top level package tests running:

```
$ gotestsum .
✓  . (12.517s)

=== Skipped
=== SKIP: . Test_Client_Maintenance/Reindexer (0.00s)
    client_test.go:2532: Reindexer is disabled for further development

DONE 448 tests, 1 skipped in 13.731s
```

It's a little subjective, but I'd consider the package being tested at the top significantly more complex than River. It hits the DB on ~every test, has well over an order of magnitude more database models than River (~100 models compared to 4), and 4x the LOC just in the package itself (it'd be a >10x difference if you factored in all dependent packages as well):

```
$ find . -name '*.go' -maxdepth 1 | xargs wc -l
   13180 total
$ find ./server/api -name '*.go' -maxdepth 1 | xargs wc -l
   57358 total
```

At first glance, it has 3x the number of test cases and runs twice as fast, but that's not quite right -- it's a lot more code, so its compile phase is much longer. Looking at only test run time, it actually has 3x the number of test cases and runs 6x faster than River's test suite (2.155s vs 12.517s), with ~zero intermittency problems.

I'd have to do a little more legwork to prove it, but I think the main difference is the number of sleep statements that we've got everywhere (with the test DB manager being the other major candidate).

With this sleep statement, the test case requires a minimum of 500 ms to run. And latency aside, it also opens the door to intermittency problems -- is 500 ms a good number for both CI and local? If so, what about 250 ms? If that's okay, how about 125 ms? We're kind of just picking an arbitrary number and hoping it's long enough.

I find the "negative" sleeps (e.g. WaitOrTimeout) a little less objectionable because they should be ~instant unless there's a failure, but the "affirmative" sleeps that just bake in a fixed amount of time for every run should ideally be eradicated IMO.
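
To make the contrast concrete, a rough sketch (the hand-rolled helper below stands in for a WaitOrTimeout-style primitive; it is not River's actual helper):

```go
package river

import (
	"testing"
	"time"
)

// "Affirmative" sleep: every run pays the full 500 ms, and the number is a
// guess that has to hold on both CI and local machines.
func assertNotStoppedSleep(t *testing.T, stopped <-chan struct{}) {
	t.Helper()
	select {
	case <-stopped:
		t.Fatal("expected client to not be stopped yet")
	case <-time.After(500 * time.Millisecond):
	}
}

// "Negative" wait: ~instant when the expected event fires; it only burns
// its timeout budget when something is actually broken.
func waitOrTimeout[T any](t *testing.T, ch <-chan T) T {
	t.Helper()
	select {
	case v := <-ch:
		return v
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting on channel")
		panic("unreachable")
	}
}
```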

Contributor Author

Yeah, I'd like to avoid these as well. Do you have an idea of how else to test this bit reliably and quickly? Specifically, making sure the client is not yet stopped, in a case where I need to call Stop() in another goroutine.

Contributor Author
@bgentry commented Jun 3, 2024

> With this sleep statement, the test case requires a minimum of 500 ms to run. And latency aside, it also opens the door to intermittency problems -- is 500 ms a good number for both CI and local? If so, what about 250 ms? If that's okay, how about 125 ms? We're kind of just picking an arbitrary number and hoping it's long enough.

On this point, the intermittency risk is a little different in these cases, where we're trying to make sure that something hasn't happened, than in the cases where we're waiting to ensure that something does happen. The reason is that any intermittency here will at least surface as a flaky test that sometimes fails on something that should never happen, and that alone indicates something is broken.

Of course it would be preferable to be able to test this condition in a way that's ~instant and 100% deterministic -- I'm just not sure how to make that work cleanly with the primitives we've exposed here. For example, how can we make sure that we're waiting until the Stop() has been initiated? Would you want to expand the surface area of BaseStartStop to expose this so it's available in tests? I think doing so would require some redesign of those internals, given the way its mutex locks are currently structured.

Contributor

> For example, how can we make sure that we're waiting until the Stop() has been initiated? Would you want to expand the surface area of BaseStartStop to expose this so it's available in tests? I think doing so would require some redesign of those internals, given the way its mutex locks are currently structured.

I'm definitely not against this. The start/stop service should only present the most minimal viable interface publicly, but we can add a second interface that reveals a few more internals for test-related purposes. I can take a look into this separately.
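
Hypothetically, the split might look something like this (nothing below exists today; signatures are invented for illustration):

```go
package startstop

import "context"

// The minimal public surface stays as-is.
type Service interface {
	Start(ctx context.Context) error
	Stop()
	Stopped() <-chan struct{}
}

// A second, test-facing interface could expose shutdown phases so a test can
// deterministically sequence a StopAndCancel against an in-flight Stop,
// instead of sleeping and hoping the Stop has begun.
type serviceWithTestSignals interface {
	Service

	// StopInitiated returns a channel that's closed the moment a stop
	// begins, before any mutex-guarded teardown completes.
	StopInitiated() <-chan struct{}
}
```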

@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch 2 times, most recently from 9bd4062 to 0f44f2b June 3, 2024 16:19
@bgentry requested a review from brandur June 4, 2024 16:58
bgentry added 2 commits June 9, 2024 14:25
This regression was introduced in #272, which I noticed because our demo
app is not currently cancelling active jobs gracefully before exiting.

This commit fixes the behavior by always cancelling the work context
upon `StopAndCancel()`, regardless of whether or not another shutdown is
in progress.
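
In spirit, the fix amounts to something like this (a toy model -- field and method names here are invented and simplified, not River's actual internals):

```go
package river

import "context"

// toyClient models just enough of the client to show the idea of the fix.
type toyClient struct {
	workCancel context.CancelFunc // cancels the context that running jobs see
	stopped    chan struct{}      // closed once shutdown fully completes
}

// Stop performs a soft stop: wait for in-flight jobs to finish.
func (c *toyClient) Stop(ctx context.Context) error {
	select {
	case <-c.stopped:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// StopAndCancel escalates to a hard stop. The fix: cancel the work context
// unconditionally and *before* waiting on shutdown, so an already-ongoing
// soft Stop can no longer delay the cancellation of running jobs.
func (c *toyClient) StopAndCancel(ctx context.Context) error {
	c.workCancel()
	return c.Stop(ctx)
}
```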
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from 0f44f2b to 3f8df29 June 9, 2024 19:25
@bgentry enabled auto-merge (squash) June 9, 2024 19:26
@bgentry merged commit 75302b2 into master Jun 9, 2024
10 checks passed
@bgentry deleted the bg-broken-test-stop-and-cancel-after-stop branch June 9, 2024 19:29
@bgentry mentioned this pull request Jun 13, 2024
bgentry added a commit that referenced this pull request Jun 13, 2024
Prepare version 0.7.0 for release. This includes the bugfix from #376
(allow `StopAndCancel` to be called when a `Stop` is already ongoing),
and new features from #383 (allow customizing default `MaxAttempts` at
the client level) and #390 (`JobDelete` + `JobDeleteTx` APIs).
bgentry added a commit that referenced this pull request Jun 14, 2024
Prepare version 0.7.0 for release. This includes the bugfix from #376
(allow `StopAndCancel` to be called when a `Stop` is already ongoing),
and new features from #383 (allow customizing default `MaxAttempts` at
the client level) and #390 (`JobDelete` + `JobDeleteTx` APIs).