
fix StopAndCancel after ongoing Stop #376

Merged · 2 commits merged into master from bg-broken-test-stop-and-cancel-after-stop on Jun 9, 2024

Conversation

@bgentry (Contributor) commented May 30, 2024

@brandur while working on some stuff in the demo tonight, I realized that the hard stop was no longer working correctly if initiated after a soft Stop() was ongoing. I believe a regression or at least a change in behavior occurred with #272, because the shutdown logic in the demo app used to properly handle these shutdown scenarios.

Perhaps this isn't exactly a "bug", though it did work this way in the past: you didn't need to cancel the context for Stop() prior to calling StopAndCancel(), but now it appears that is required. This is because the start/stop service holds a mutex while Stop() is ongoing, so a subsequent concurrent call to StopAndCancel() cannot proceed until that first call has been "released" by cancelling its context.

I didn't have time to get any further on this tonight but figured you might have thoughts.
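
For context, a minimal sketch of the two-phase shutdown pattern in question (the signal handling here is illustrative, not the demo app's exact code):

```go
package demo

import (
	"context"
	"log"
	"os"
	"os/signal"
	"time"
)

// handleShutdown sketches the two-phase shutdown: a soft Stop on the first
// signal, escalating to StopAndCancel on the second. The interface stands in
// for an already-started *river.Client.
func handleShutdown(client interface {
	Stop(context.Context) error
	StopAndCancel(context.Context) error
}) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, os.Interrupt)

	<-sigCh // first Ctrl-C: soft stop, let in-flight jobs finish
	softCtx, softCancel := context.WithCancel(context.Background())
	defer softCancel()
	go func() {
		if err := client.Stop(softCtx); err != nil {
			log.Printf("soft stop: %v", err)
		}
	}()

	<-sigCh // second Ctrl-C: hard stop, cancel running jobs
	hardCtx, hardCancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer hardCancel()
	if err := client.StopAndCancel(hardCtx); err != nil {
		log.Printf("hard stop: %v", err)
	}
}
```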

@bgentry requested a review from brandur May 30, 2024 03:49
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from d605319 to 23a3960 May 30, 2024 20:11
@bgentry changed the title from "add broken test case for StopAndCancel after ongoing Stop" to "fix StopAndCancel after ongoing Stop" May 30, 2024
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from 23a3960 to 30c3d0b May 30, 2024 20:12
@bgentry marked this pull request as ready for review May 30, 2024 20:12
@bgentry added the "bug" (Something isn't working) label May 30, 2024
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from 30c3d0b to be91b7e May 30, 2024 20:14
client_test.go Outdated
```go
close(jobDoneChan)
return nil
}))
setupStopAndCancel := func(t *testing.T) (*Client[pgx.Tx], *testBundle) {
```
Contributor

If a test case ends up needing its own setup function, what do you think about denesting it to the top level instead of wiring it in deeper and deeper? It might seem minor, but it feels cleaner, requires a lot less indentation, and makes the test case name easier to address because there are fewer segments to find.
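
Concretely, something like this (a hypothetical sketch; names invented for illustration):

```go
package river

import "testing"

// Before: a case-specific setup function wired into an already-nested subtest.
func Test_Client_Stop(t *testing.T) {
	t.Run("StopAndCancel", func(t *testing.T) {
		setupStopAndCancel := func(t *testing.T) { t.Helper() /* client setup elided */ }

		t.Run("AfterOngoingStop", func(t *testing.T) {
			setupStopAndCancel(t)
			// ... assertions ...
		})
	})
}

// After: denested to its own top-level test -- less indentation, and the
// case is addressable with one fewer segment in `go test -run`.
func Test_Client_StopAndCancel(t *testing.T) {
	setup := func(t *testing.T) { t.Helper() /* client setup elided */ }

	t.Run("AfterOngoingStop", func(t *testing.T) {
		setup(t)
		// ... assertions ...
	})
}
```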

Contributor Author

We actually already had one for StopAndCancel and it was essentially redundant with these new clauses. I extracted these and merged it all into a top level block.

client_test.go Outdated
```go
select {
case <-client.Stopped():
	t.Fatal("expected client to not be stopped yet")
case <-time.After(500 * time.Millisecond):
```
Contributor

Instead of my usual broad repartee about sleep statements in tests, I'll try to quantify it a little more by showing how slow River's test suite is getting.

Here's a test run of one of my largest packages at work, encompassing the entire API layer:

```
$ gotestsum ./server/api
✓  server/api (2.155s)

DONE 1445 tests in 6.285s
```

Here's River's top level package tests running:

```
$ gotestsum .
✓  . (12.517s)

=== Skipped
=== SKIP: . Test_Client_Maintenance/Reindexer (0.00s)
    client_test.go:2532: Reindexer is disabled for further development

DONE 448 tests, 1 skipped in 13.731s
```

It's a little subjective, but I'd consider the package being tested at the top significantly more complex than River. It hits the DB on ~every test, has well over an order of magnitude more database models than River (~100 models compared to 4), and 4x the LOC just in the package itself (it'd be a >10x difference if you factored in all dependent packages as well):

```
$ find . -name '*.go' -maxdepth 1 | xargs wc -l
   13180 total
$ find ./server/api -name '*.go' -maxdepth 1 | xargs wc -l
   57358 total
```

At first glance, it has 3x the number of test cases and runs twice as fast, but that's not quite right -- it's a lot more code, so its compile phase is much longer. Looking at only test run time, it actually has 3x the number of test cases and runs 6x faster than River's test suite (2.155s vs 12.517s), with ~zero intermittency problems.

I'd have to do a little more legwork to prove it, but I think the main difference is the number of sleep statements that we've got everywhere (with the test DB manager being the other major candidate).

With this sleep statement, the test case requires a minimum of 500 ms to run. And latency aside, it also opens the door to intermittency problems -- is 500 ms a good number for both CI and local? If so, what about 250 ms? If that's okay, how about 125 ms? We're kind of just picking an arbitrary number and hoping it's long enough.

I find the "negative" sleeps (e.g. WaitOrTimeout) a little less objectionable because they should be ~instant unless there's a failure, but the "affirmative" sleeps that just bake in a fixed amount of time for every run should ideally be eradicated IMO.
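
To make the contrast concrete, a rough sketch (the hand-rolled helper below stands in for a WaitOrTimeout-style primitive; it is not River's actual helper):

```go
package river

import (
	"testing"
	"time"
)

// "Affirmative" sleep: every run pays the full 500 ms, and the number is a
// guess that has to hold on both CI and local machines.
func assertNotStoppedSleep(t *testing.T, stopped <-chan struct{}) {
	t.Helper()
	select {
	case <-stopped:
		t.Fatal("expected client to not be stopped yet")
	case <-time.After(500 * time.Millisecond):
	}
}

// "Negative" wait: ~instant when the expected event fires; it only burns
// its timeout budget when something is actually broken.
func waitOrTimeout[T any](t *testing.T, ch <-chan T) T {
	t.Helper()
	select {
	case v := <-ch:
		return v
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting on channel")
		panic("unreachable")
	}
}
```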

Contributor Author

Yeah, I'd like to avoid these as well. Do you have an idea of how else to test this bit reliably and quickly? Specifically, making sure the client is not yet stopped, in a case where I need to call Stop() in another goroutine.

Contributor Author
@bgentry commented Jun 3, 2024

> With this sleep statement, the test case requires a minimum of 500 ms to run. And latency aside, it also opens the door to intermittency problems -- is 500 ms a good number for both CI and local? If so, what about 250 ms? If that's okay, how about 125 ms? We're kind of just picking an arbitrary number and hoping it's long enough.

On this point, the intermittency risk is a little different in these cases, where we're trying to make sure that something hasn't happened, than in the cases where we're waiting to ensure that something does happen. The reason is that any intermittency here will at least surface as a flaky test that sometimes fails on something that should never happen, and that alone indicates something is broken.

Of course it would be preferable to be able to test this condition in a way that's ~instant and 100% deterministic -- I'm just not sure how to make that work cleanly with the primitives we've exposed here. For example, how can we make sure that we're waiting until the Stop() has been initiated? Would you want to expand the surface area of BaseStartStop to expose this so it's available in tests? I think doing so would require some redesign of those internals, given the way its mutex locks are currently structured.

Contributor

> For example, how can we make sure that we're waiting until the Stop() has been initiated? Would you want to expand the surface area of BaseStartStop to expose this so it's available in tests? I think doing so would require some redesign of those internals, given the way its mutex locks are currently structured.

I'm definitely not against this. The start/stop service should only present the most minimal viable interface publicly, but we can add a second interface that reveals a few more internals for test-related purposes. I can take a look into this separately.
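
Hypothetically, the split might look something like this (nothing below exists today; signatures are invented for illustration):

```go
package startstop

import "context"

// The minimal public surface stays as-is.
type Service interface {
	Start(ctx context.Context) error
	Stop()
	Stopped() <-chan struct{}
}

// A second, test-facing interface could expose shutdown phases so a test can
// deterministically sequence a StopAndCancel against an in-flight Stop,
// instead of sleeping and hoping the Stop has begun.
type serviceWithTestSignals interface {
	Service

	// StopInitiated returns a channel that's closed the moment a stop
	// begins, before any mutex-guarded teardown completes.
	StopInitiated() <-chan struct{}
}
```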

@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch 2 times, most recently from 9bd4062 to 0f44f2b June 3, 2024 16:19
@bgentry requested a review from brandur June 4, 2024 16:58
bgentry added 2 commits June 9, 2024 14:25
This regression was introduced in #272, which I noticed because our demo
app is not currently cancelling active jobs gracefully before exiting.

This commit fixes the behavior by always cancelling the work context
upon `StopAndCancel()`, regardless of whether or not another shutdown is
in progress.
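
In spirit, the fix amounts to something like this (a toy model -- field and method names here are invented and simplified, not River's actual internals):

```go
package river

import "context"

// toyClient models just enough of the client to show the idea of the fix.
type toyClient struct {
	workCancel context.CancelFunc // cancels the context that running jobs see
	stopped    chan struct{}      // closed once shutdown fully completes
}

// Stop performs a soft stop: wait for in-flight jobs to finish.
func (c *toyClient) Stop(ctx context.Context) error {
	select {
	case <-c.stopped:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// StopAndCancel escalates to a hard stop. The fix: cancel the work context
// unconditionally and *before* waiting on shutdown, so an already-ongoing
// soft Stop can no longer delay the cancellation of running jobs.
func (c *toyClient) StopAndCancel(ctx context.Context) error {
	c.workCancel()
	return c.Stop(ctx)
}
```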
@bgentry force-pushed the bg-broken-test-stop-and-cancel-after-stop branch from 0f44f2b to 3f8df29 June 9, 2024 19:25
@bgentry enabled auto-merge (squash) June 9, 2024 19:26
@bgentry merged commit 75302b2 into master Jun 9, 2024
10 checks passed
@bgentry deleted the bg-broken-test-stop-and-cancel-after-stop branch June 9, 2024 19:29
@bgentry mentioned this pull request Jun 13, 2024
bgentry added a commit that referenced this pull request Jun 13, 2024
Prepare version 0.7.0 for release. This includes the bugfix from #376
(allow `StopAndCancel` to be called when a `Stop` is already ongoing),
and new features from #383 (allow customizing default `MaxAttempts` at
the client level) and #390 (`JobDelete` + `JobDeleteTx` APIs).
bgentry added a commit that referenced this pull request Jun 14, 2024
Prepare version 0.7.0 for release. This includes the bugfix from #376
(allow `StopAndCancel` to be called when a `Stop` is already ongoing),
and new features from #383 (allow customizing default `MaxAttempts` at
the client level) and #390 (`JobDelete` + `JobDeleteTx` APIs).