Better sandbox lifecycle management#95

Merged
ftes merged 3 commits into ftes:main from nathanl:sandbox_owner
Nov 14, 2025
Conversation

@nathanl (Contributor) commented Nov 6, 2025

Previously, I needed to add Process.sleep/1 calls at the ends of some of my tests to allow any LiveViews that were still querying via the sandbox to shut down before the test process ended; otherwise I would see connection ownership errors in the logs. This PR makes two changes to remove that need.

  1. Use `Ecto.Adapters.SQL.Sandbox.start_owner!/2` so the sandbox owner is a separate process rather than the test process, and use an `on_exit/1` callback to terminate the sandbox after the test process terminates, as shown here.
  2. Since that only adds a small delay before the sandbox is terminated, add a `sandbox_owner_shutdown_delay` option so users can specify a number of additional milliseconds to wait before shutting the sandbox down.
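
The two changes together might look something like this, as a minimal sketch: the module name `MyAppWeb.SandboxCase` and the plumbing are assumptions for illustration, not the actual PR diff, but `start_owner!/2`, `stop_owner/1`, and `on_exit/1` are the standard Ecto/ExUnit APIs.

```elixir
# Sketch only: hypothetical case template showing the approach described above.
defmodule MyAppWeb.SandboxCase do
  use ExUnit.CaseTemplate

  alias Ecto.Adapters.SQL.Sandbox

  setup context do
    # 1. A separate process (not the test process) owns the connection,
    #    so it survives the test process's termination.
    pid = Sandbox.start_owner!(MyApp.Repo, shared: not context.async)

    on_exit(fn ->
      # 2. Optionally wait so dangling LiveViews can finish their queries
      #    before the owner goes away.
      delay = context[:sandbox_owner_shutdown_delay] || 0
      Process.sleep(delay)
      Sandbox.stop_owner(pid)
    end)

    :ok
  end
end
```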

Item 2 feels a bit smelly TBH, because it's a global (not per-test) option, and only guesswork can find the right value. But given that Playwright tests can't manage the LV processes in any way, I don't know a better solution, and in my case a 100 ms delay seems fine.

@nathanl (Contributor, Author) commented Nov 6, 2025

@ftes what do you think? I don't love it, but I love my Process.sleep/1 per test even less. All I want is stable, clean test runs, even if they're slower.

@ftes (Owner) commented Nov 7, 2025

Slack discussion

```elixir
  doc:
    "Define custom Playwright [selector engines](https://playwright.dev/docs/extensibility#custom-selector-engines)."
],
sandbox_owner_shutdown_delay: [
```
@ftes (Owner) Nov 7, 2025:

If we add this key to `setup_keys` below (line 130), this could be overridden per test/describe/suite via ExUnit's `@tag sandbox_owner_shutdown_delay: x`.

Wdyt?
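
Per-test usage would then look something like this (a sketch; the case module name is hypothetical, the tag name follows the suggestion above):

```elixir
defmodule MyAppWeb.SlowPageTest do
  # Hypothetical case module providing the Playwright + sandbox setup.
  use MyAppWeb.PlaywrightCase, async: true

  # This test's LiveViews take longer to wind down, so wait 250 ms
  # before terminating the sandbox owner.
  @tag sandbox_owner_shutdown_delay: 250
  test "renders the dashboard", %{conn: conn} do
    # ...
  end
end
```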

@nathanl (Contributor, Author):

I love the idea. Will try it out.

@nathanl (Contributor, Author) Nov 7, 2025:

This works well and makes using this much nicer, since we don't have to keep increasing the global value for the test that has the longest-running dangling processes. 😌

@ftes (Owner) commented Nov 7, 2025

Hey Nathan, this is a highly welcome PR!

Can we avoid the slowdown by stopping the sandbox owner in a separate, unlinked process, so it runs asynchronously instead of blocking the next test run?

I guess the overall question is: do we care if stopping the sandbox fails?
Or more specifically: how likely is it to fail, and what happens to the rest of the test suite run if it does?

@ftes (Owner) commented Nov 7, 2025

The existing implementation was copied from Wallaby.

Given that Wallaby has been around a lot longer, and the standard Wallaby.Feature implementation suffers from the same problems, it might be worthwhile to look for existing solutions/workarounds for Wallaby.

@nathanl (Contributor, Author) commented Nov 7, 2025

> Can we avoid the slowdown by stopping the sandbox owner in a separate, unlinked, process?

I did some tinkering, but every way I tried this created errors, and I think I finally realized why. The `on_exit/2` docs say:

> If on_exit/2 is called inside setup/1 or inside a test, it's executed in a blocking fashion after the test exits and before running the next test. This means that no other test from the same test case will be running while the on_exit/2 callback for a previous test is running.

If we use a spawn/1 inside our on_exit/1 code, that guarantee is broken; the on_exit/1 finishes, the next test starts, and the sandbox owner process is still running, maybe in "shared" mode if we're not doing async tests. So the processes that are spun up for the next test start using that old sandbox connection, only to have it die soon after.

Bottom line: the sandbox must be terminated before the `on_exit/1` callback finishes. But having the sleep time configurable per test makes that less painful than a single global value.
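
The broken pattern described above looks roughly like this (a sketch, with `delay` and `owner_pid` assumed to be bound earlier in setup):

```elixir
# Anti-pattern: spawning inside on_exit/1 defeats its blocking guarantee.
on_exit(fn ->
  # This fn returns immediately, so the next test can start while the
  # old sandbox owner is still alive (possibly in :shared mode). The
  # next test's processes may then check out the dying connection.
  spawn(fn ->
    Process.sleep(delay)
    Ecto.Adapters.SQL.Sandbox.stop_owner(owner_pid)
  end)
end)
```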

@nathanl (Contributor, Author) commented Nov 7, 2025

Regarding Wallaby: we had lots of flakiness with our Wallaby tests in the past and eventually abandoned them. (I say "we", but it was mostly another team, so I can't speak to the details.) So maybe a different approach here is good.

@nathanl (Contributor, Author) commented Nov 7, 2025

Since we seem to be on the same page and I'm up to 7 commits, I'm going to squash it down to one clean one.

```elixir
defp maybe_start_sandbox_owner(repo, context, config) do
  case start_sandbox_owner(repo, context) do
    {:ok, pid} ->
      delay = context[:sandbox_shutdown_delay] || config[:sandbox_shutdown_delay]
```
@nathanl (Contributor, Author):

It will be `nil` in the context if not specified via tag.

@ftes (Owner):

`config[:sandbox_owner_shutdown_delay]` should be sufficient.
It is filled with the tag value from the ExUnit context, and falls back to the global config otherwise (see `Case.do_setup`).

@nathanl (Contributor, Author):

oh, nice, thanks

> Delay in milliseconds before shutting down the Ecto sandbox owner after a test ends. Use this to allow LiveViews and other processes in your app time to stop using database connections before the sandbox owner is terminated. Default is 0 (immediate shutdown).
@nathanl (Contributor, Author):

This renders fine in ExDoc and is easier than using `<>`.

[screenshot of the rendered docs]


In that case, you may encounter ownership errors like:

```
** (DBConnection.OwnershipError) cannot find owner for ...
```
@nathanl (Contributor, Author):

I love that this renders like an error in the docs. 😄

[screenshot of the rendered error]

@ftes (Owner):

That is cool!

@ftes (Owner) commented Nov 10, 2025

> If we use a spawn/1 inside our on_exit/1 code, that guarantee is broken; the on_exit/1 finishes, the next test starts, and the sandbox owner process is still running, maybe in "shared" mode if we're not doing async tests. So the processes that are spun up for the next test start using that old sandbox connection, only to have it die soon after.

Is this only an issue in shared mode?
If so: Can't we just use the existing (no-op) code for shared mode - since we don't need sandboxing there at all?

I'll try to find some time to tinker a bit myself.

@ftes (Owner) commented Nov 10, 2025

Can you reproduce the original error and validate the fix in an example repo (e.g. mix phx.new + mix phx.gen.auth)?

@nathanl (Contributor, Author) commented Nov 10, 2025

> Is this only an issue in shared mode?

The issue of having a subsequent test re-use the old test's sandbox is, yes. So in async tests we can have on_exit spawn a process to shut down the sandbox. I pushed that change.
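
The async-only spawn might look something like this (a sketch of the change described, not the exact diff; `pid`, `delay`, and `context` are assumed to be bound in setup):

```elixir
on_exit(fn ->
  stop = fn ->
    Process.sleep(delay)
    Ecto.Adapters.SQL.Sandbox.stop_owner(pid)
  end

  if context.async do
    # Safe: async tests never share this owner, so the next test can
    # start while this one's sandbox winds down in the background.
    spawn(stop)
  else
    # Shared mode: must block, or the next test's processes could pick
    # up the old shared owner's connection just before it dies.
    stop.()
  end
end)
```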

> Can't we just use the existing (no-op) code for shared mode - since we don't need sandboxing there at all?

I'm not sure what you mean here. We do need sandboxing both in shared mode and outside it, it's just a difference in how the sandbox is shared between the test and other processes.
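
That difference can be sketched with the standard Ecto sandbox API (this is generic Ecto usage, not this PR's code; `MyApp.Repo` is a placeholder):

```elixir
alias Ecto.Adapters.SQL.Sandbox

# Async tests: the owner's connection is private; only processes
# explicitly allowed (e.g. via Sandbox.allow/3) may use it.
pid = Sandbox.start_owner!(MyApp.Repo, shared: false)

# Non-async tests: shared mode lets any process use this owner's
# connection, which is why a lingering owner can leak into the next test.
pid = Sandbox.start_owner!(MyApp.Repo, shared: true)
```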

@nathanl (Contributor, Author) commented Nov 10, 2025

> Can you reproduce the original error and validate the fix in an example repo (e.g. mix phx.new + mix phx.gen.auth)?

I'll see what I can do.

@nathanl (Contributor, Author) commented Nov 10, 2025

@ftes see Query page for a demo

Previously, the test process itself was made the owner of the sandbox
connection, which meant the sandbox would be terminated when the test
terminated.

Unlike some other kinds of tests, Playwright tests are not linked to the
LiveViews and other processes that are being tested, so those processes
don't terminate with the test and may go on trying to use the sandbox
connection after the test terminates. If they do, they will raise
ownership errors.

Switch to `Ecto.Adapters.SQL.Sandbox.start_owner!/2`, creating a
separate process to own the sandbox connection, which is shut down after
the test process terminates using `on_exit/1`. This reduces ownership
errors that may occur after the test terminates.

By default, this happens with no delay, but `sandbox_shutdown_delay` can
be set in config or via tag to wait that many milliseconds before
terminating the sandbox during `on_exit/1`.
@ftes (Owner) left a review comment:

I've added some Ecto tests.
Maybe we can improve those as a follow-up.

Either way: Thanks so much Nathan!

@ftes ftes merged commit 3b54699 into ftes:main Nov 14, 2025
1 check passed
@ftes (Owner) commented Nov 14, 2025

I've added some Ecto-based tests based on your example (ftes/phoenix_test_playwright_example#5).

If you have time to look at them, that would be great.

@nathanl nathanl deleted the sandbox_owner branch November 14, 2025 14:42
@nathanl (Contributor, Author) commented Nov 14, 2025

Thank you!

```elixir
pid = Sandbox.start_owner!(repo, shared: !context.async)
{:ok, pid}
rescue
  _ -> {:error, :probably_already_started}
```
@ftes (Owner):

@nathanl Do you remember if this blanket rescue clause was just a guess, or handling actual errors you saw?

I've got a report of this rescue masking checkout errors.

@ftes (Owner):

FYI removing the blanket rescue for now in 0a8538c
